SC860_160_056 / FW860.50
05/03/18
Impact: Availability
Severity: SPE
New features and functions
- Support was added to allow V9R910 and later HMC levels to
query Live Partition Mobility (LPM) performance data after an LPM
operation.
- Support was added to the Advanced System Management
Interface (ASMI) to provide customer control over speculative
execution in response to CVE-2017-5753 and CVE-2017-5715
(collectively known as Spectre) and CVE-2017-5754 (known as
Meltdown). The ASMI "System Configuration/Speculative
Execution Control" provides two options that can only be set when the
system is powered off:
1) Speculative execution controls to mitigate user-to-kernel and
user-to-user side-channel attacks. This mode is designed for
systems that need to mitigate exposures of the hypervisor, operating
systems, and user application data to untrusted code. This
mode is set as the default.
2) Speculative execution fully enabled: This optional mode is
designed for systems where the hypervisor, operating system, and
applications can be fully trusted.
Note: Enabling this option could expose the system to
CVE-2017-5753, CVE-2017-5715, and CVE-2017-5754. This includes
any partitions that are migrated (using Live Partition Mobility) to
this system.
- Support was added to allow a periodic data capture from the
PCIe3 I/O expansion drawer (with feature code #EMX0) cable card links.
- On systems with an IBM i partition, support was added
for multipliers for IBM i MATMATR fields that are limited to four
characters. When server metrics are retrieved through IBM i
MATMATR calls and a value is greater than 9999 GB, MATMATR uses an
architected "multiplier" field to handle the extended value. For
example, 10,000 GB can be represented as 5,000 GB with a multiplier
of 2, so '5000' and '2' are returned in the quantity and multiplier
fields, respectively. The IBM i OS also requires a PTF to support the
MATMATR field multipliers.
- On systems with redundant service processors, a health
check was added for the state of the secondary service processor to
verify it matches the state of the primary service processor. If
the state of the secondary service processor is an unexpected value
such as in termination, an SRC is logged and a call home is made for
the service processor FRU that has failed.
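The MATMATR quantity and multiplier arithmetic described above can be illustrated with a minimal Python sketch. Only the decode direction (quantity times multiplier) and the 10,000 GB example come from the text. The function names and the rule used to pick a multiplier (the smallest one that divides the value evenly) are assumptions made for illustration, not the firmware's documented behavior.

```python
from typing import Tuple

MAX_FIELD = 9999  # largest value that fits in a four-character field


def encode_matmatr(total_gb: int) -> Tuple[int, int]:
    """Split total_gb into (quantity, multiplier), each at most 9999.

    Assumption: the smallest multiplier that divides total_gb evenly is
    chosen. The real firmware may select multipliers differently.
    """
    if total_gb < 0:
        raise ValueError("total_gb must be non-negative")
    for multiplier in range(1, MAX_FIELD + 1):
        quantity, remainder = divmod(total_gb, multiplier)
        if remainder == 0 and quantity <= MAX_FIELD:
            return quantity, multiplier
    raise ValueError("value cannot be represented exactly in this sketch")


def decode_matmatr(quantity: int, multiplier: int) -> int:
    """Recover the real value from the two four-character fields."""
    return quantity * multiplier


if __name__ == "__main__":
    quantity, multiplier = encode_matmatr(10000)
    print(quantity, multiplier, decode_matmatr(quantity, multiplier))
```

A caller that reads these fields only needs the decode step; values of 9999 GB or less come back with a multiplier of 1 under this sketch.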
System firmware changes that affect all systems
- DEFERRED: A problem was fixed for a PCIe3 I/O expansion drawer (with feature code
#EMX0) where control path stability issues may cause certain SRCs to be
logged. Systems using copper cables may log SRC B7006A87 or
similar SRCs, and the fanout module may fail to become active.
Systems using optical cables may log SRC B7006A22 or similar
SRCs. For this problem, the errant I/O drawer may be recovered by
a re-IPL of the system.
- A problem was fixed for error logs being collected twice by
the HMC, potentially causing an extra call home for an issue that was
already resolved. This problem was caused by a failover to the
backup service processor whose error log was missing the
acknowledgement from the HMC that error logs had been collected.
This resulted in the error logs being copied onto the HMC as PELs
for a second time.
- A problem was fixed in which deconfigured-resource records
can become malformed and cause the loss of service processor for both
redundant and non-redundant service processor systems. These
failures can occur during or after firmware updates to the FW860.40,
FW860.41, or FW860.42 levels. The complete loss of service
processor results in the loss of HMC (or FSP stand-alone) management of
the server and loss of any further error logging. The server
itself will continue to run. Without the fix, the loss of the
service processor could happen within one month of the deconfiguration
records being encountered. It is highly recommended to install
the fix. Recovery from the problem, once encountered, requires a
full server AC power cycle and clearing of deconfiguration records to
avoid reoccurrence. Clearing deconfiguration records exposes the
server to repeat hardware failures and possible unplanned outages.
- A problem was fixed for the guard reminder processing of
guarded FRUs and error logs that can cause a system power off to hang
and time out with a service processor reset.
- A problem was fixed for a system termination that can occur
when doing a concurrent code update from the FW860.30 level with a
clock card deconfigured in the system. Without the fix, this
problem can be avoided by repairing the clock card prior to the code
update or by doing a disruptive code update.
- A problem was fixed for a Coherent Accelerator Processor
Proxy (CAPP) unit hardware failure that caused a hypervisor hang with
SRC B7000602. This failure is very rare and can only occur during
the early IPL of the hypervisor, before any partitions are
started. A re-IPL will recover from the problem.
- A problem was fixed for a Live Partition Mobility migration
hang that could occur if one of its VIOS Mover Service Partitions
(MSPs) goes into a failover at the start of the LPM operation.
This problem is rare because it requires an MSP error to force an MSP
failover at the very start of the LPM migration to get the LPM timing
error. The LPM hang can be recovered by using the "migrlpar -o s"
and "migrlpar -o r" commands on the HMC.
- A problem was fixed for incorrect low affinity scores for a
partition reported from the HMC "lsmemopt" command when a partition has
filled an entire drawer. A low score indicates the placement is
poor but in this case the placement is actually good. More
information on affinity scores for partitions and the Dynamic Platform
Optimizer can be found at the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hat/p8hat_dpoovw.htm.
- A problem was fixed to allow the management console to
display the Active Memory Mirroring (AMM) licensed capability.
Without the fix, the AMM licensed capability of a server will always
show as "off" on the management console, even when it is present.
- A problem was fixed for a rare hypervisor hang for systems
with shared processors with a sharing mode of uncapped. If this
hang occurs, all partitions of the system will become unresponsive and
the HMC will go to an "Incomplete" state.
- A problem was fixed for a Live Partition Mobility migration
abort that could occur if one of its VIOS Mover Service Partitions
(MSPs) goes into a failover during the LPM operation. This
problem is rare because it requires an MSP error to force an MSP failover
during the LPM migration to get the LPM timing error. The LPM
abort can be recovered by retrying the LPM migration.
- A problem was fixed for the FRU callouts for the BA188001
and BA188002 EEH errors to include the PCI Host Bridge (PHB) FRU which
had been excluded. For the P8 systems, these rare errors will
more typically isolate to the processor instead of the adapter or slot
planar. In the pre-P8 systems, the I/O planar also included
the PHB, but for P8 systems, the PHB was moved to the processor complex.
- A problem was fixed for an internal error in the
SR-IOV adapter firmware that resets the adapter and logs a B400FF01
reference code. This error happens in rare cases when there are
multiple partitions actively running traffic through the adapter and a
subset of the partitions are shut down hard. The error causes a
temporary disruption of traffic but recovery from the error is
automatic with no user intervention needed.
This fix updates adapter firmware to 10.2.252.1931, for the following
Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K,
and EN0L.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for the wrong Redfish method (PATCH or
POST) passed for a valid Uniform Resource Identifier (URI) causing an
incorrect error message of "501 - Not Implemented". With the
fix, the message returned is "Invalid Method on URI" which is more
helpful to the user.
- A problem was fixed for SRC call home reminders for bad
FRUs causing service processor dumps with SRC B181E911 and
reset/reloads. This occurred if the FRU callout was missing a
CCIN number in the error log. This can happen because some error
logs only have "Symbolic FRUs" and these were not being handled
correctly.
- A problem was fixed for a PCIe3 I/O expansion drawer
(with feature code #EMX0) failing to initialize during the IPL
with SRC B7006A88 logged. The error is infrequent. The
errant I/O drawer can be recovered by a re-IPL of the system.
- A problem was fixed for SR-IOV adapter firmware updates
using the HMC GUI or CLI so that only one SR-IOV adapter is rebooted
at a time. If multiple adapters are updated at the same time, the HMC
error message HSCF0241E may occur: "HSCF0241E Could not read
firmware information from SR-IOV device ...". This fix prevents
the system network from being disrupted by the SR-IOV adapter updates
when redundant configurations are being used for the network. The
problem can be circumvented by using the HMC GUI to update the SR-IOV
firmware one adapter at a time using the following steps:
https://www.ibm.com/support/knowledgecenter/en/8247-22L/p8efd/p8efd_updating_sriov_firmware.htm
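For the LPM migration hang described earlier in this list (an MSP failover at the very start of the operation), the text names "migrlpar -o s" and "migrlpar -o r" as the recovery commands. The sketch below only builds those two command lines in Python so they can be reviewed before being run on the HMC. The "-m" (managed system) and "-p" (partition name) arguments are not in the text above and are assumed from the HMC migrlpar command reference; the helper names and the example system and partition names are hypothetical.

```python
import shlex
from typing import List


def lpm_hang_recovery_commands(managed_system: str, partition: str) -> List[List[str]]:
    """Return the stop and recover commands, in the order given in the notes.

    Assumption: -m and -p select the managed system and the partition.
    Check them against the migrlpar documentation for your HMC level.
    """
    stop = ["migrlpar", "-o", "s", "-m", managed_system, "-p", partition]
    recover = ["migrlpar", "-o", "r", "-m", managed_system, "-p", partition]
    return [stop, recover]


def format_command(argv: List[str]) -> str:
    """Quote each argument so the line can be pasted into an HMC shell."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # "sys1" and "lpar1" are placeholder names.
    for argv in lpm_hang_recovery_commands("sys1", "lpar1"):
        print(format_command(argv))
```

The sketch deliberately does not execute anything; the stop must complete before the recover is issued, so running the two lines by hand and checking the result of each is the safer choice.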
System firmware changes that affect certain systems
- On systems with a shared processor pool, a very rare problem was fixed for the hypervisor
not responding to partition requests such as power off and Live
Partition Mobility (LPM). This error is caused by a request for a
guard of a failed processor (when there are not any available spare
processors) that has hung.
- On systems with mirrored memory running IBM i partitions, a
problem was fixed for un-mirrored nodal memory errors in the partition
that also caused the system to crash. With the fix, the
memory failure is isolated to the impacted partition, leaving the rest
of the system unaffected. This fix improves on an earlier fix
delivered for IBM i memory errors in FW840.60 by handling
the errors in nodal memory.
- On systems with Huge Page (16 GB) memory enabled for an AIX
partition, a problem was fixed for the OS failing to boot with an
0607 SRC displayed. This error occurs on systems with
FW860.40, FW860.41 or FW860.42 installed. To circumvent the
problem, disable Huge Pages for the AIX partition. For
information on viewing and setting values for AIX huge-page memory
allocation, see the following link in the IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hat/p8hat_aixviewhgpgmem.htm
- On systems with an IBM i partition, a problem was fixed for
64 bytes overwritten in a portion of the IBM i Main Storage Dump
(MSD). Approximately 64 bytes are overwritten just beyond the 17
MB (0x11000000) address on P8 systems. This problem is cosmetic
as the dump is still readable for problem diagnostics and no customer
operations are affected by it.
- On systems with a partition with a Fibre Channel Adapter
(FCA) or a Fibre Channel over Ethernet (FCoE) adapter, a problem
was fixed for bootable disks attached to the FCA or FCoE adapter not
being seen in the System Management Services (SMS) menus for selection
as boot devices. This problem is likely to occur if the only I/O
device in the partition is an FCA or FCoE adapter. If other I/O
devices are present, the problem may still occur if the FCA or
FCoE is the first adapter discovered by SMS. A work-around
to this problem is to define a virtual Ethernet adapter in the
partition profile. The virtual adapter does not need to have any
physical backing device, as just having the VLAN defined is
sufficient to avoid the problem. The FCA has feature codes #EN0A,
#EN0B, #EN0F, #EN0G, #EN0Y, #EN12, #5729, #5774, #5735, and
#5723. The FCoE adapter has feature codes #5708, #EN0H,
#EN0J, #EN0K, and #EN0L.
- On systems with a partition with a USB 3.0 controller, a
problem was fixed for a partition boot failure. The USB 3.0
controller adapter card has feature code #EC45 or #EC46. The
boot failure is triggered by a fault in the USB controller but instead
of just the USB controller failing, the entire partition
fails. With the fix, the failure is limited to the USB
controller.
- On a system in a Power Enterprise Pool (PEP) with Mobile
Resources, a problem was fixed for Mobile Resources not being
restored after an IPL. The missing resources can be started
temporarily with Trial COD or some other methods, or the PEP
recovery steps can be used to get the Mobile Resources restored.
For more information, see the Change CoD Pool command on the HMC:
https://www.ibm.com/support/knowledgecenter/en/POWER8/p8edm/chcodpool.html.