VL910
For Impact, Severity, and other firmware definitions, please
refer to the 'Glossary of firmware terms' URL below:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
The complete Firmware Fix History for this Release Level can be
reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/VL-Firmware-Hist.html
|
VL910_135_127 / FW910.30
04/25/19 |
Impact: Data
Severity: HIPER
New features and functions
- An option was added
to the SMS Remote IPL (RIPL) menus to enable or disable the UDP
checksum calculation for any device type. Previously, this
checksum option was only available for logical LAN devices, but it is now
extended to all device types. The default is for the UDP checksum
calculation to be done, but if this calculation causes errors for the
device, it can be turned off with the new option.
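For context, the UDP checksum that this option enables or disables is the standard 16-bit one's-complement sum defined in RFC 768, computed over an IPv4 pseudo-header, the UDP header, and the payload. The sketch below is a generic Python illustration of that calculation, not the SMS firmware code; the function names and parameters are assumptions made for the example.

```python
import socket
import struct


def ones_complement_sum16(data: bytes) -> int:
    """Fold data into a 16-bit one's-complement sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return total


def udp_checksum(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                 payload: bytes) -> int:
    """Return the RFC 768 checksum for a UDP datagram carried over IPv4."""
    length = 8 + len(payload)  # the UDP header is 8 bytes long
    pseudo_header = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
                     + struct.pack("!BBH", 0, 17, length))  # 17 = UDP protocol
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = ~ones_complement_sum16(pseudo_header + header + payload) & 0xFFFF
    # A computed zero is sent as all ones; zero on the wire means "no checksum".
    return checksum or 0xFFFF


def udp_checksum_ok(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                    payload: bytes, received: int) -> bool:
    """Validate a received checksum; zero means the sender did not compute one."""
    if received == 0:
        return True
    return udp_checksum(src_ip, dst_ip, src_port, dst_port, payload) == received
```

A receiver that validates this way rejects any datagram whose checksum field does not match, which is why an option to skip the calculation can help when a device reports checksum errors.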
System firmware changes that affect all systems
- HIPER/Non-Pervasive:
A problem was fixed to address potential scenarios that could
result in undetected data corruption.
- DEFERRED: A
problem was fixed for the USB port having
the wrong location code assigned. The "P1-T4-L1 USB DVD R/RW or
RAM Drive" location code should be "P1-T3-L1". The USB DVD
still works correctly, but reported location codes, such as those in error logs,
will be wrong. A previous fix for this
problem in FW910.20 did not have the hypervisor portion of the fix, so
the error still occurred after the fix was applied.
This problem only pertains to IBM Power System models S914 (9009-41A),
S924 (9009-42A), and H924 for SAP HANA (9223-42H).
- DEFERRED:PARTITION_DEFERRED:
A problem was fixed for repeated CPU DLPAR remove operations by Linux
(Ubuntu, SUSE, or RHEL) OSes possibly resulting in a partition
crash. No specific SRCs or error logs are reported.
The problem can occur on any DLPAR CPU remove operation if running on
Linux. The occurrence is intermittent and rare. The
partition crash may result in one or more of the following console
messages (in no particular order):
1) Bad kernel stack pointer addr1 at addr2
2) Oops: Bad kernel stack pointer
3) ******* RTAS CALL BUFFER CORRUPTION *******
4) ERROR: Token not supported
This fix does not activate
until there is a reboot of the partition.
- A problem was fixed for an intermittent IPL failure
with SRC B181E540 logged with fault signature " ex(n2p1c0) (L2FIR[13])
NCU Powerbus data timeout". No FRU is called out. The error
may be ignored and the reIPL is successful. The error occurs very
infrequently.
- A problem was fixed for an IPMI core dump and SRC
B1818601 logged intermittently when an IPMI session is closed. A
flood of B1818A03 SRCs may be logged after the error occurs. The
IPMI server is not impacted and a call home is reported for the
problem. There is no service outage for the IPMI users because of
this.
- A problem was fixed for systems which were running at low
processor frequencies and voltages because the High Frequency Trading
(HFT) Policy had been selected (thereby disabling the On-Chip Controller
(OCC)) but without IBM Support assisting to set the core nest
frequencies to a maximum level. Without the extra manual steps to
set the core frequencies, the system defaults to Safe mode (lowest
frequency and voltage) because it is running without the OCC.
With the fix, the High Frequency Policy menu is hidden in the Advanced
System Management Interface (ASMI) so that only the IBM Support
representative can set the HFT mode while also setting the core
frequencies to the maximum value that can be sustained on that specific
system.
- A problem was fixed for a PCIe Hub checkstop with SRC
B138E504 logged that fails to guard the errant processor chip.
With the fix, the problem hardware FRU is guarded so there is not a
recurrence of the error on the next IPL.
- A problem was fixed for an incorrect SRC of B1810000 being
logged when a firmware update fails because of Entitlement Key
expiration. The error displayed on the HMC and in the OS is
correct and meaningful. With the fix, for this firmware update
failure the correct SRC of B181309D is now logged.
- A problem was fixed for deconfigured FRUs that showed as
Unit Type of "Unknown" in the Advanced System Management Interface
(ASMI). The following FRU type names will be displayed if
deconfigured (shown here is a description of the FRU type as well):
DMI: Processor to Memory Buffer Interface
MC: Memory Controller
MFREFCLK: Multi Function Reference Clock
MFREFCLKENDPT: Multi function reference clock end point
MI: Processor to Memory Buffer Interface
NPU: Nvidia Processing Unit
OBUS_BRICK: OBUS
SYSREFCLKENDPT: System reference clock end point
TPM: Trusted Platform Module
- A problem was fixed for certain SR-IOV adapters where SRC
B400FF01 errors are seen during configuration of the adapter into
SR-IOV mode or updating adapter firmware. This fix updates the
adapter firmware to 11.2.211.37 for the following Feature
Codes: EN15, EN17, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38,
EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates:
https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for DDR4 2933 MHz and 3200 MHz DIMMs
not defaulting to the 2666 MHz speed on a new DIMM plug, thus
preventing the system from IPLing.
- A problem was fixed for IPMI sessions in the service
processor causing a flood of B181A803 informational error logs on
registry read fails for IPv6 and IPv4 keywords. These error logs
do not represent a real problem and may be ignored.
- A problem was fixed for the HMC in some instances reporting
a VIOS partition as an AIX partition. The VIOS partition can be
used correctly even when it is misidentified.
- A problem was fixed for shared processor partitions going
unresponsive after changing the processor sharing mode of a
dedicated processor partition from "allow when partition is active" to
either "allow when partition is inactive" or "never". This
problem can be circumvented by avoiding disabling processor sharing
when active on a dedicated processor partition. To recover if the
issue has been encountered, enable "processor sharing when active" on
the dedicated partition.
- A problem was fixed for intermittent PCIe correctable
errors which would eventually threshold and cause SRC B7006A72 to be
logged. PCIe performance degradation or temporary loss of
one or more PCIe IO slots could also occur resulting in SRCs B7006970
or B7006971.
- A problem was fixed for I/O adapters not recovering from
multiple, concurrent low-level EEH errors, resulting in a Permanent EEH
error with SRC BA2B000D logged or SRC BA188002 and B7006A22
logged. The affected adapters can be recovered by a re-IPL of the
system. With the fix, the adapters are able to reset and recover
from the simultaneous error conditions. The problem frequency is
low because it requires a second error on a slot that is already frozen
with an error and going into a reset.
- A problem was fixed for hypervisor error logs issued during
the IPL missing the firmware version. This happens on every IPL
for logs generated during the early part of the IPL.
- A problem was fixed for a continuous logging of B7006A28
SRCs after the threshold limit of PCIe Advanced Error Reporting (AER)
correctable errors. The error log flooding can cause error buffer
wrapping and other performance issues.
- A problem was fixed for an error in deleting a partition
with the virtualized Trusted Platform Module (vTPM) enabled and SRC
B7000602 logged. When this error occurs, the encryption process
in the hypervisor may become unusable. The problem can be
recovered from with a re-IPL of the system.
- A problem was fixed in Live Partition Mobility (LPM) of a
partition to a shared processor pool, which results in the partition
being unable to consume uncapped cycles on the target system. To
prevent the issue from occurring, partitions can be migrated to the
default shared processor pool and then dynamically moved to the desired
shared processor pool. To recover from the issue, use DLPAR to
add or remove a virtual processor to/from the affected partition,
dynamically move the partition between shared processor pools, reboot
the partition, or re-IPL the system.
- A problem was fixed for informational (INF) errors for the
PCIe Hub (PHB) at a threshold limit causing the I/O slots to go
non-operational. The system I/O can be recovered with a
re-IPL.
- A problem was fixed for errors in the PCIe Host Bridge
(PHB) performance counters collected by the 24x7 performance monitor.
- A problem was fixed for partitions becoming unresponsive or
the HMC not being able to communicate with the system after a processor
configuration change or a partition power on and off.
- A new SRC of B7006A74 was added for PHB LEM 62 errors
that had surpassed a threshold in the path of the #EMX0
expansion drawer. This replaces the SRC B7006A72 to have a
correct callout list. Without the fix, when B7006A72 is logged
against a PCIe slot in the CEC containing a cable card, the FRUs in the
full #EMX0 expansion drawer path should be considered (use the B7006A8B
FRU callout list as a reference).
- A problem was fixed for eight or more simultaneous Live
Partition Mobility (LPM) migrations to the same system possibly failing
in validation with the HMC error message of "HSCL0273 A command that
was targeted to the managed system has timed out". The problem
can be circumvented by doing the LPM migrations to the same system in
smaller batches.
- A problem was fixed for a boot failure using an N_PORT ID
Virtualization (NPIV) LUN for an operating system that is installed on
a disk of 2 TB or greater, and having a device driver for the disk that
adheres to a non-zero allocation length requirement for the "READ
CAPACITY 16". The IBM partition firmware had always used an
invalid zero allocation length for the return of data and that had been
accepted by previous device drivers. Now some of the newer device
drivers are adhering to the specification and needing an allocation
length of non-zero to allow the boot to proceed.
- A problem was fixed for a possible boot failure from an
ISO/IEC 13346 formatted image, also known as Universal Disk Format
(UDF).
UDF is a profile of the specification known as ISO/IEC 13346 and is an
open vendor-neutral file system for computer data storage for a broad
range of media such as DVDs and newer optical disc formats. The
failure is infrequent and depends on the image. In rare cases,
the boot code erroneously fails to find a file in the current
directory. If the boot fails on a specific image, the boot of
that image will always fail without the fix.
- A problem was fixed for an intermittent IPL failure
with B181345A, B150BA22, BC131705, BC8A1705, or BC81703
logged with a processor core called out. This is a rare error and
does not have a real hardware fault, so the processor core can be
unguarded and used again on the next IPL.
- A problem was fixed for informational logs flooding
the error log if a "Get Sensor Reading" is not working.
- A problem was fixed which caused network traffic failures
for Virtual Functions (VFs) operating in non-promiscuous multicast
mode. In non-promiscuous mode, when a VF receives a frame, it
will drop it unless the frame is addressed to the VF's MAC address, or
is a broadcast or multicast addressed frame. With the problem, the
VF drops the frame even though it is multicast, thereby blocking the
network traffic, which can result in ping failures and impact other
network operations. To recover from the issue, turn multicast
promiscuous on. This may cause some unwanted multicast traffic to
flow to the partition.
- A problem was fixed for a hypervisor task getting
deadlocked if partitions are powered on at the same time that SR-IOV is
being configured for an adapter. With this problem, workloads
will continue to run but it will not be possible to change the
virtualization configuration or power partitions on and off. This
error can be recovered by doing a re-IPL of the system with a scheduled
outage.
- A problem was fixed for hypervisor tasks getting deadlocked
that cause the hypervisor to be unresponsive to the HMC (this shows as
an incomplete state on the HMC) with SRC B200F011 logged. This is
a rare timing error. With this problem, OS workloads will
continue to run but it will not be possible for the HMC to interact
with the partitions. This error can be recovered by doing a re-IPL
of the system with a scheduled outage.
- A problem was fixed for broadcast bootp installs or boots
that fail with a UDP checksum error.
- A problem was fixed for failing to boot from an AIX mksysb
backup on a USB RDX drive with SRCs logged of BA210012, AA06000D, and
BA090010. The boot error does not occur if a serial console is
used to navigate the SMS menus.
- A problem was fixed for error recovery from loss of VPD for
FRUs caused by a stuck I2C bus. When this problem occurs, there
is a flood of B1561312 SRCs with fault signature "
IVPD_REASON_IIC_FDAL_READ_FAIL errno 72". This is a rare problem that
occurs if the I2C slave gets stuck low for some reason. To
recover from this problem, A/C power cycle the system. With the
fix, the I2C bus is reset so the VPD reads for the FRU can be retried
without user intervention until successful.
- A security bypass vulnerability problem was fixed in
the service processor secure socket layer (SSL) which could allow an
attacker to make unauthorized reads on a rejected SSL connection. The
Common Vulnerabilities and Exposures issue number is CVE-2017-3737.
- A security problem was fixed in the service processor
Network Security Services (NSS) services which, with a
man-in-the-middle attack, could provide false completion or errant
network transactions or exposure of sensitive data from intercepted SSL
connections to ASMI, Redfish, or the service processor message
server. The Common Vulnerabilities and Exposures issue number is
CVE-2018-12384.
- A security problem was fixed in the service processor
OpenSSL support that could cause secured sockets to hang, disrupting
HMC communications for system management and partition
operations. The Common Vulnerabilities and Exposures issue number
is CVE-2018-0732.
- A security problem was fixed in the service processor TCP
stack that would allow a Denial of Service (DoS) attack with TCP
packets modified to trigger time and calculation expensive calls.
By sending specially modified packets within ongoing TCP sessions with
the Management Consoles, this could lead to a CPU saturation and
possible reset and termination of the service processor.
The Common Vulnerabilities and Exposures issue number is CVE-2018-5390.
- A security problem was fixed in the service processor TCP
stack that would allow a Denial of Service (DoS) attack by allowing
very large IP fragments to trigger time and calculation expensive calls
in packet reassembly. This could lead to a CPU saturation and
possible reset and termination of the service processor.
The Common Vulnerabilities and Exposures issue number is
CVE-2018-5391. With the fix, changes were made to lower the IP
fragment thresholds to invalidate the attack.
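The receive-filtering rule described in the Virtual Function (VF) multicast item above (a frame is kept only if it is addressed to the VF's MAC address, or is a broadcast or multicast frame) can be sketched as follows. This is an illustrative Python model, not adapter firmware; the function names are assumptions, and the multicast test relies on the standard Ethernet convention that the least significant bit of the first destination octet marks a group address.

```python
BROADCAST_MAC = bytes.fromhex("ffffffffffff")


def parse_mac(text: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' into six raw bytes."""
    mac = bytes.fromhex(text.replace(":", ""))
    if len(mac) != 6:
        raise ValueError(f"not a 48-bit MAC address: {text!r}")
    return mac


def vf_accepts_frame(dest_mac: bytes, vf_mac: bytes) -> bool:
    """Model of the expected (fixed) non-promiscuous filtering rule."""
    if dest_mac == vf_mac:
        return True  # unicast frame addressed to this VF
    if dest_mac == BROADCAST_MAC:
        return True  # broadcast frame
    # Group bit set in the first octet: multicast frame.
    return bool(dest_mac[0] & 0x01)
```

With the problem described above, the last branch effectively returned false, so multicast frames were dropped along with unrelated unicast traffic.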
|
VL910_127_127 / FW910.21
03/18/19 |
Impact: Data
Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive:
DISRUPTIVE: A problem was fixed where, under certain
conditions, a Power Management Reset (PM Reset) event may result in
undetected data corruption. PM Resets occur under various
scenarios such as power management mode changes between Dynamic
Performance and Maximum Performance, Concurrent FW updates, power
management controller recovery procedures, or system boot.
- A problem was fixed for a system terminating if there was
even one predictive or recoverable SRC. For this problem, all
hardware SRCs logged are treated as terminating SRCs. For this
behavior to occur, the initial service processor boot from the AC power
off state failed to complete cleanly, instead triggering an internal
reset (a rare error), leaving some parts of the service processor
not initialized. This problem can be recovered by doing an AC
power cycle, or concurrently on an active system with the assistance of
IBM support.
|
VL910_122_089 / FW910.20
12/12/18 |
Impact: Data
Severity: HIPER
New features and functions
- Support was
enabled for eRepair spare lane deployment for fabric and memory buses.
System firmware changes that affect all systems
- HIPER/Non-Pervasive:DEFERRED:
A problem was fixed for a potential problem with I/O that could
result in undetected data corruption.
- DEFERRED: A
problem was fixed for DASD VRM reduced stability margins leading to a
possible system shutdown due to temperature component aging over a long
period of time. The DASD VRM is not updated with the fix until
after the system IPLs from a powered off state. It is recommended
that this fix be activated as soon as possible but fix activation
should not be delayed for more than three months maximum.
- DEFERRED: A
problem was fixed for PCIe and SAS adapters in slots attached to a PLX
(PCIe switch) failing to initialize and not being found by the
Operating System. The problem should not occur on the first IPL
after an AC power cycle, but subsequent IPLs may experience the problem.
- DEFERRED: A
problem was fixed for the PCIe3 I/O expansion drawer (#EMX0) links to
improve stability. Intermittent training failures on the
links occurred during the IPL with SRC B7006A8B logged. With the
fix, the link settings were changed to lower the peak link signal
amplification to bring the signal level into the middle of the
operating range, thus improving the high margin to reduce link training
failures. The system must be re-IPLed for the fix to activate.
Without the fix, the system can be powered off and then re-IPLed to
restore the PCIe links.
- DEFERRED:
A problem was fixed for concurrent maintenance operations for PCIe
expansion drawer cable cards and PCI adapters that could cause
loss of system hardware information in the hypervisor with these side
effects: 1) partition secure boots could fail with SRC BA540100
logged; 2) Live Partition Mobility (LPM) migrations could be blocked;
3) SR-IOV adapters could be blocked from going into shared mode; 4)
Power Management services could be lost; and 5) warm re-IPLs of the
system can fail. The system can be recovered by powering off and
then IPLing again.
- DEFERRED: A
problem was fixed for predictive error logs occurring on the IPL
following a DIMM error recovery. These logs, related to failed
memory scrubbing, have the following "Signature Description":
"mba(n0p15c1) () ERROR: command complete analysis failed". These
error logs do not indicate a hardware problem and may be ignored.
- A problem was fixed for link speed for PCIe Generation 4
adapters showing as "unknown" in the Advanced System Management
Interface (ASMI) PCIe Hardware Topology menu.
- A problem was fixed for differential memory interface (DMI)
lane sparing to prevent shutting down a good lane on the TX side of the
bus when a lane has been spared on the RX side of the bus. If
the XBUS or DMI bus runs out of spare lanes, it can checkstop the
system, so the fix helps use these resources more efficiently.
- A problem was fixed for IPL failures with SRC
BC50090F when replacing Xbus FRUs. The problem occurs if VPD has
a stale bad lane record and that record does not exist on both ends of
the bus.
- A problem was fixed for a firmware update concurrent remove
and activate that fails in the hypervisor during the activate with SRC
B7000AFF. To recover the system, do a re-IPL and it will be at
the correct firmware level that is expected for the remove operation.
- A problem was fixed for a flood of BC130311 SRCs that could
occur when changing Energy Scale Power settings, if the Power
Management is in a reset loop because of errors.
- A problem was fixed for SR-IOV adapter workloads being
suspended with SRC B400FF01 logged while an internal reset of SR-IOV
virtual function in the hypervisor occurs. This problem is
infrequent and caused by heavy workloads for the adapter or vNIC
failovers. The workloads resume after the virtual function reset
without user intervention.
- A problem was fixed for SR-IOV VFs, where a VF configured
with a PVID priority may be presented to the OS with an incorrect
priority value.
- A problem was fixed for the creation of a vNIC adapter that
may show the MAC address twice and cause confusion. For the AIX
OS, the duplicate MAC address shows on the entstat output. No
recovery is needed for this error except to ignore the extra MAC
address in the ethernet adapter status.
- A problem was fixed to reduce the time to reach a "failed"
status on an SR-IOV adapter for certain persistent errors.
Without the fix, the adapter spends an extended period of time in the "not
ready" state, eventually reaching the "failed" state. With
the fix, the adapter is able to go to the "failed" state in less than
30 seconds for the persistent fault.
- A problem was fixed for an SR-IOV Virtual Function (VF)
configured with a PVID that fails to function correctly after a
VF reset. It will allow the receiving of untagged frames but not
be able to transmit the untagged frames.
- A problem was fixed for an SMS ping failure for an SR-IOV
adapter Virtual Function (VF) with a non-zero Port VLAN ID
(PVID). This failure may occur after the partition with the
adapter has been booted to AIX, and then rebooted back to SMS.
Without the fix, residue information from the AIX boot is retained for
the VF that should have been cleared.
- A problem was fixed for SRCs B400FF01 and B200F011
experienced for false SR-IOV adapter errors during Live Partition
Mobility (LPM) migrations of a logical partition with vNIC
clients. The SR-IOV adapter does recover from the errors but
there is a delay in the adapter communications while the adapter
recovers. These errors can be ignored when evaluating the outcome
of an LPM migration.
- A problem was fixed for partition SMS menus to display
certain network adapters that were unviewable and not usable as boot
and install devices after a microcode update. The problem network
adapter is still present and usable at the OS. The adapters with
this problem have the following feature codes: EN0A, EN0B, EN0H,
EN0J, EN0K, EN0L, EN15, EL5B, EL38, EL3C, EL56, and EL57.
- A problem was fixed for a Logical LAN (l-lan) device
failing to boot when there is a UDP packet checksum error. With
the fix, there is a new option when configuring a l-lan port in SMS to
enable or disable the UDP checksum validation. If the adapter is
already providing the checksum validation, then the l-lan port needs to
have its validation disabled.
- A problem was fixed for Hostboot error log IDs (EID)
getting reused from one IPL to the next, resulting in error logs
getting suppressed (missing) for new problems on the subsequent
IPLs if they have a re-used EID that was already present in the service
processor error logs.
- A problem was fixed for error log truncation with SRC
B1818A12 logged for the error. This problem occurs only rarely
when creating a combined error log entry that exceeds the error log
entry maximum size. With the fix, these types of combinations are
not done if too large, and two error logs are written instead.
- A problem was fixed for coherent accelerator processor
proxy (CAPP) unit errors being called out as CEC hardware
Subsystem instead of PROCESSOR_UNIT.
- A problem was fixed for a Self Boot Engine (SBE)
recoverable error at runtime causing the system to go into Safe Mode.
- A problem was fixed for an IPL that ends with the HMC in
the "Incomplete" state with SRCs B182951C and A7001151 logged.
Partitions may start and can continue to run without the HMC services
available. In order to recover the HMC session, a re-IPL of
the system is needed (however, partition workloads could continue
running uninterrupted until the system is intentionally re-IPLed at a
scheduled time). This problem occurs very rarely.
- A problem was fixed for a system failure with SRC B700F103
that can occur if a shared-mode SR-IOV adapter is moved from a
high-performance slot to a lower performance slot. This
problem can be avoided by disabling shared mode on the SR-IOV adapter;
moving the adapter; and then re-enabling shared mode.
- A problem was fixed for a rare Live Partition Mobility
migration hang with the partition left in VPM (Virtual Page Mode) which
causes performance concerns. This error is triggered by a
migration failover operation occurring during the migration state of
"Suspended" and there has to be insufficient VASI buffers available to
clear all partition state data waiting to be sent to the migration
target. Migration failovers are rare and the migration state of
"Suspended" is a migration state lasting only a few seconds for most
partitions, so this problem should not be frequent. On the HMC,
there will be an inability to complete either a migration stop or a
recovery operation. The HMC will show the partition as migrating
and any attempt to change that will fail. The system must be
re-IPLed to recover from the problem.
- A problem was fixed for Linux or AIX partitions crashing
during a firmware assisted dump or when using Linux kexec to restart
with a new kernel. This problem was more frequent for the Linux
OS with kdump failing with "Kernel panic - not syncing: Attempted to
kill init" in some cases.
- A problem was fixed for an SR-IOV adapter vNIC configuration
error that did not provide a proper SRC to help resolve the issue of
the boot device not pinging in SMS due to maximum transmission unit
(MTU) size mismatch in the configuration. The use of a vNIC
backing device does not allow configuring VFs for jumbo frames when the
Partition Firmware configuration for the adapter (as specified on the
HMC) does not support jumbo frames. When this happens, the vNIC
adapter will fail to ping in SMS and thus cannot be used as a boot
device. With the fix, the vNIC driver configuration code is
now checking the vNIC login (open) return code so it can issue an SRC
when the open fails for an MTU issue (such as jumbo frame
mismatch) or for some other reason. A jumbo frame is an Ethernet
frame with a payload greater than the standard MTU of 1,500 bytes and
can be as large as 9,000 bytes.
- A problem was fixed for the USB port having the wrong
location code assigned. The "P1-T4-L1 USB DVD R/RW or RAM Drive"
location code should be "P1-T3-L1". The USB DVD still works
correctly, but reported location codes, such as those in error logs,
will be wrong.
This problem only pertains to IBM Power System models S914 (9009-41A),
S924 (9009-42A), and H924 for SAP HANA (9223-42H).
- A problem was fixed for SR-IOV adapter dumps hanging with
low-level EEH events causing failures on VFs of other non-target SR-IOV
adapters.
- A problem was fixed for preventing loss of function on an
SR-IOV adapter with an 8MB adapter firmware image if it is placed into
SR-IOV shared mode. The 8MB image is not supported at the
FW910.20 firmware level. With the fix, the adapter with the 8MB
image is rejected with an error without an attempt to load the older
4MB image on the adapter which could damage it. This problem
affects the following SR-IOV adapters: #EC2R/#EC2S with CCIN
58FA; #EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for adapters in slots attached to a PLX
(PCIe switch) failing with SRCs B7006970 and BA188002 when a
second and subsequent errors on the PLX failed to initiate PLX
recovery. For this infrequent problem to occur, it requires a
second error on the PLX after recovery from the first error.
- A problem was fixed for an intermittent IPL failure with
SRCs B150BA40 and B181BA24 logged. The system can be
recovered by IPLing again. The failure is caused by a memory
buffer misalignment, so it represents a transient fault that should
occur only rarely.
- A problem was fixed for system termination for a re-IPL
with power on with SRC B181E540 logged. The system can be
recovered by powering off and then IPLing. This problem occurs
infrequently and can be avoided by powering off the system between IPLs.
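The jumbo frame mismatch described in the vNIC configuration item above can be illustrated with a small check. This is a hypothetical Python sketch, not the vNIC driver logic: the function names and parameters are assumptions, and the only values taken from the text are the standard MTU of 1,500 bytes and the 9,000-byte upper bound for jumbo frames.

```python
STANDARD_MTU = 1500   # bytes, standard Ethernet payload size from the text
MAX_JUMBO_MTU = 9000  # bytes, largest jumbo payload mentioned in the text


def is_jumbo(mtu: int) -> bool:
    """True when the MTU exceeds the standard payload size."""
    return STANDARD_MTU < mtu <= MAX_JUMBO_MTU


def vnic_mtu_mismatch(vf_mtu: int, firmware_supports_jumbo: bool) -> bool:
    """True when a VF requests jumbo frames the configuration cannot support."""
    if not 0 < vf_mtu <= MAX_JUMBO_MTU:
        raise ValueError(f"MTU out of range for this sketch: {vf_mtu}")
    return is_jumbo(vf_mtu) and not firmware_supports_jumbo
```

A mismatch of this kind is the condition under which, per the text, the vNIC open fails and the fixed firmware now issues an SRC instead of failing silently.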
System firmware changes that affect certain systems
- On a system with a Cloud Management Console and an HMC Cloud
Connector, a problem was fixed for memory leaks in the Redfish server
causing Out of Memory (OOM) resets of the service processor.
- On a system with an IBM i partition, a problem was fixed
for a DLPAR force-remove of a physical IO adapter from an IBM i
partition and a simultaneous power off of the partition causing the
partition to hang during the power off. To recover the partition
from the error, the system must be re-IPLed. This problem is rare
because there is only a 2-second timing window for the DLPAR and power
off to interfere with each other.
- For systems with a shared memory partition, a problem
was fixed for Live Partition Mobility (LPM) migration hang after a
Mover Service Partition (MSP) failover in the early part of the
migration. To recover from the hang, a migration stop command
must be given on the HMC. Then the migration can be retried.
- For a shared memory partition, a problem was fixed
for Live Partition Mobility (LPM) migration failure to an indeterminate
state. This can occur if the Mover Service Partition (MSP)
has a failover that occurs when the migrating partition is in the state
of "Suspended." To recover from this problem, the partition must
be shut down and restarted.
- On a system with an AMS partition, a problem was fixed for
a Live Partition Mobility (LPM) migration failure when migrating from
P9 to a pre-FW860 P8 or P7 system. This failure can occur if the
P9 partition is in dedicated memory mode, and the Physical Page Table
(PPT) ratio is explicitly set on the HMC (rather than keeping the
default value) and the partition is then transitioned to Active Memory
Sharing (AMS) mode prior to the migration to the older system.
This problem can be avoided by using dedicated memory in the partition
being migrated back to the older system.
- On a system with an active IBM i partition, a problem was
fixed for an SPCN firmware download to the PCIe3 I/O expansion drawer
(feature #EMX0) Chassis Management Card (CMC) that could possibly get
stuck in a pending state. This failure is very unlikely as it
would require a concurrent replacement of the CMC card that is loaded
with an SPCN level that is older than 2015 (01MEX151012a). The
failure with the SPCN download can be corrected by a re-IPL of the
system.
|
VL910_115_089 / FW910.11
10/17/18 |
Impact: Availability
Severity: SPE
System firmware changes that affect all systems
- DEFERRED:
A problem was fixed for an incorrect power on sequence for the PCI
PERST signal for I/O adapters. This signal is used to indicate to
the I/O adapters when the reference clock for the device has become
valid and, with the problem, that valid indication may arrive before
the clock is ready. In rare cases, this could intermittently
result in unexpected behavior from the I/O devices such as adapter PCIe
links not training or the adapter not being available for the Operating
System after an IPL. This problem can be recovered from by a
re-IPL of the system.
- A problem was fixed for system dumps failing with a kernel panic on
the service processor because of an out of memory condition. Without
the fix, the system dump may be tried again after the reset of the
service processor as the reset would have cleaned up the memory usage.
- A problem was fixed for recovered (correctable) errors
during the IPL being logged as Predictive Errors. There are no
customer actions required for the recovered errors. With the fix,
the corrected errors are marked as "RECOVERED" and logged as
Informational.
- A problem was fixed for an Emergency Power Off Warning
(EPOW) IPL failure that would occur on a loss of a power supply or a
missing power supply. With the fix, the EPOW error will not occur
on the IPL as long as there is one functional power supply available
for the system.
System firmware changes that affect certain systems
- On systems which do not have an HMC attached, a problem was fixed
for a firmware update initiated from the Operating System (OS) from
FW910.00, FW910.01 or FW910.10 to FW910.11 that caused a system crash
one hour after the code update completed. This does not fix the case
of the OS initiated firmware update back to earlier FW910.XX levels
from FW910.11 which can still result in a crash of the system. Do not
initiate a code update from FW910.11 to a lesser FW910 level via the
OS. Use only HMC or USB methods of code update for this case. If an
HMC or USB code update is not an option, please contact IBM support.
- On a system with a partition that has had processors dynamically
removed, a problem was fixed for a partition dump IPL that may
experience unexpected behavior including system crashes. This problem
may be circumvented by stopping and re-starting the partition with the
removed processors prior to requesting a partition dump.
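The restriction on OS-initiated updates in this list (no update from
FW910.11 to a lesser FW910 level via the OS) can be written as a small
pre-check. This is only a sketch under stated assumptions: the function
names are hypothetical, the "FWnnn.nn" level format is inferred from the
levels named in these notes, and the check encodes only the one case
described here, not any other update rule.

```python
import re
from typing import Tuple


def parse_level(level: str) -> Tuple[int, int]:
    """Split a level string such as 'FW910.11' into (release, service_pack)."""
    match = re.fullmatch(r"FW(\d{3})\.(\d{2})", level)
    if match is None:
        raise ValueError(f"unrecognized firmware level: {level!r}")
    return int(match.group(1)), int(match.group(2))


def is_blocked_os_downgrade(current: str, target: str) -> bool:
    """Return True only for the case the notes warn about:
    an OS-initiated update from FW910.11 to a lesser FW910 level."""
    cur_release, cur_sp = parse_level(current)
    tgt_release, tgt_sp = parse_level(target)
    return (
        (cur_release, cur_sp) == (910, 11)
        and tgt_release == 910
        and tgt_sp < 11
    )


if __name__ == "__main__":
    for cur, tgt in [("FW910.11", "FW910.10"), ("FW910.10", "FW910.11")]:
        verdict = "use HMC or USB instead" if is_blocked_os_downgrade(cur, tgt) else "not the blocked case"
        print(f"{cur} -> {tgt}: {verdict}")
```

A False result means only that the update is not the specific case
described above; it is not a statement that the update is safe.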
|
VL910_107_089 / FW910.10
09/05/18 |
Impact: Availability
Severity: SPE
System firmware changes that may require customer actions
prior to the firmware update
- DEFERRED: On
a system with a partition with dedicated processors that are set to
allow processor sharing with "Allow when partition is active" or "Allow
always", a problem was fixed for a concurrent firmware update
from FW910.01 that may cause the system to hang. This fix is
deferred, so it is not active until after the next IPL of the system,
so precautions must be taken to protect the system. Perform the
following steps to determine if your system has a partition with
dedicated processors that are set to share. If these partitions
exist, change them to not share processors while active; or shut down
the affected partitions; or do a disruptive update to put on this
service pack.
1) From the HMC command line, run: lssyscfg -r sys -F name
2) For each system whose firmware you intend to update, issue the
following HMC command:
lshwres -m <System Name> --level lpar -r proc -F lpar_name,curr_sharing_mode,pend_sharing_mode
replacing <System Name> with the name as displayed by the first
command.
3) Scan the output for "share_idle_procs_active" or
"share_idle_procs_always". This identifies the affected
partitions.
4) Take one of the three options below to install this firmware
level:
a) If affected partitions are found, change the lpar to "never allow"
or "allow when partition is inactive" in the lpar settings, and set the
value back to its original value after the code update. These
changes are concurrent when performed on the lpar settings and not in
the profile.
b) Or, shut down partitions identified in step 3. Proceed
with concurrent code update. Then restart the partitions.
c) Or, apply the firmware update disruptively (power off system
and install) to prevent a possible system hang.
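Step 3 above (scanning the lshwres output for the two sharing modes)
can be sketched as a short script. This is an illustration, not an IBM
tool: it assumes the command prints one comma-separated line per
partition in the field order given to -F, and the partition names in
the sample are hypothetical.

```python
# Sharing modes that step 3 says identify an affected partition.
AFFECTED_MODES = {"share_idle_procs_active", "share_idle_procs_always"}


def find_affected_partitions(lshwres_output: str) -> list:
    """Return names of partitions whose current or pending sharing mode
    is one of the affected modes.

    Assumes each line has exactly three comma-separated fields:
    lpar_name,curr_sharing_mode,pend_sharing_mode
    """
    affected = []
    for line in lshwres_output.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, curr_mode, pend_mode = fields
        if curr_mode in AFFECTED_MODES or pend_mode in AFFECTED_MODES:
            affected.append(name)
    return affected


# Hypothetical sample output, for illustration only.
sample = """\
lpar_a,share_idle_procs_active,share_idle_procs_active
lpar_b,keep_idle_procs,keep_idle_procs
lpar_c,uncap,uncap
lpar_d,keep_idle_procs,share_idle_procs_always
"""

if __name__ == "__main__":
    print(find_affected_partitions(sample))  # ['lpar_a', 'lpar_d']
```

The sketch checks the pending mode as well as the current mode because
step 3 scans the whole output line for either string.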
New features and functions
- A change was
made to improve IPL performance for a system with a new DIMM installed
or for a system doing its first IPL. The performance is gained by
decreasing the amount of time used in memory diagnostics, reducing IPL
time by as much as 15 minutes, depending on the amount of memory
installed.
- Support was added for 24x7 data collection from the On-Chip
Controller sensors.
- Support was added to correctly isolate TOD faults with
appropriate callouts and failover to the backup topology, if
needed, and to reconfigure a backup topology to maintain TOD
redundancy.
- Support was disabled for erepair spare lane deployment for
fabric and memory buses. By not using the FRU spare hardware for
an erepair, the affected FRUs may have to be replaced sooner.
Prior to this change, the spare lane deployment caused extra error
messages during runtime diagnostics. When the problems with spare
lane deployment are corrected, this erepair feature will be enabled
again in a future service pack.
System firmware changes that affect all systems
- A security problem was fixed in the DHCP client on the
service processor for an out-of-bound memory access flaw that could be
used by a malicious DHCP server to crash the DHCP client process.
The Common Vulnerabilities and Exposures issue number is CVE-2018-5732.
- DEFERRED: A
problem was fixed for PCIe link stability errors during the IPL for the
PCIe3 I/O Expansion Drawer (Feature code #EMX0) with Active Optical
Cables (AOCs). One or more of the following SRCs may be logged at
completion of IPL: B7006A72, B7006A8B, B7006971, and 10007900.
The fix improves PCIe link stability for this feature.
- DEFERRED: A
problem was fixed for an erroneous SRC
11007610 being logged when hot-swapping CEC fans. This SRC may be
logged if there is more than a two-minute delay between removing
the old fan and installing the new fan. The error log may be
ignored.
- DEFERRED: A
problem was fixed for a hot plug
of a new 1400W power supply that fails to turn on. The
problem is intermittent, occurring more frequently for the cases where
the hot plug insertion action was too slow and maybe at a slight angle
(insertion not perfectly straight). Without the fix, after
a hot plug has been attempted, ensure the power supply LEDs are
on. If the LEDs are not on, retry the plug of the power
supply using a faster motion while keeping the angle of insertion
straight.
- DEFERRED: A
problem was fixed for a host reset of the
Self Boot Engine (SBE). Without the fix, the reset of the SBE
will hang during error recovery and that will force the system into
Safe Mode. Also, a post dump IPL of the system after a
system Terminate Immediate will not work with a hung SBE, so a re-IPL
of the system will be needed to recover it.
- A problem was fixed for an enclosure LED not being lit when
there is a fault on a FRU internal to the enclosure that does not have an LED
of its own. With the fix, the enclosure LED is lit if any FRUs
within the enclosure have a fault.
- A problem was fixed for DIMMs that have VPP shorted to
ground not being called out in the SRC 11002610 logged for the power
fault. The frequency of this problem should be rare.
- A problem was fixed for the Advanced System Management
Interface (ASMI) option for resetting the system to factory
configuration for not returning the Speculative Execution setting to
the default value. The reset to factory configuration does not
change the current value for Speculative Execution. To restore
the default, ASMI must be used manually to set the value. This
problem only pertains to the IBM Power System H922 for SAP HANA
(9223-22H) and the IBM Power System H924 for SAP HANA (9223-42H).
- A problem was fixed for the system early power warning
(EPOW) to be issued when only three of the four power supplies are
operational (instead of waiting for all four power supplies to go down).
- A problem was fixed for a failing VPP voltage regulator
possibly damaging DIMMs with too high of a voltage level. With the
fix, the voltage to the DIMMs is shut down if there is a problem with
the voltage regulator to protect the DIMMs.
- A problem was fixed for an unplanned power down of the
system with SRC UE 11002600 logged when an unsupported device was
plugged into the service processor USB 2.0 ports on either of the
slots P1-C1-T1 or P1-C1-T2. This happened when a USB 3.0
DVD drive was plugged into the USB 2.0 slot and caused an overcurrent
condition. The USB 3.0 device was incorrectly not downward
compatible with the USB 2.0 slot. With the fix, such incompatible
devices will cause an informational log but will not cause a power off
of the system.
- A problem was fixed for the On-Chip Controller not being able
to sense the current draw for the 12V PCIe adapters that are plugged
into channel 0 (CH0) of the APSS. CH0 was not enabled meaning
anything plugged into those connectors would not be included in the
total server power calculation which could impact power capping.
The system could run at higher power than expected without CH0 being
monitored.
- A problem was fixed for the TPM card LED so that it is
activated correctly.
- A problem was fixed for VRMs drawing current over the
specification. This occurred whenever heavy work loads went above
372 amps with WOF enabled. At 372 amps, a rollover to value "0"
for the current erroneously occurred and this allowed the frequency of
the processors in the system to exceed the normally expected values.
- A problem was fixed for Dynamic Memory Deallocation (DMD)
failing for memory configurations of 3 or 6 Memory Controller (MC)
channels per group. An error message of "Invalid MCS per group
value" is logged with SRC BC23E504 for the problem. If DMD was
working correctly for the installed memory but then began failing at a
later time, it may have been triggered by a guard of a DIMM which
resulted in a memory configuration that is susceptible to the problem
with DMD.
- A problem was fixed for a system with CPU part number
2CY058 and CCIN 5C25 to achieve a slightly more optimum frequency
for one specific EnergyScale Mode, Dynamic Performance Mode.
- A problem was fixed for a missing memory throttle
initialization that in a rare case could lead to an emergency shutdown
of the system. The missing initialization could cause the DIMMs
to oversubscribe to the power supplies in the rare failure mode where
the On-Chip Controller (OCC) fails to start and the Safe Mode default
memory throttle values are too high to stop the memory from overusing
the power from the power supplies. This could cause a power fault
and an emergency shutdown of the system.
- A problem was fixed for a memory translation error that
causes a request for a page of memory to be de-allocated to be
ignored in Dynamic Memory Deallocation (DMD). This misses the
opportunity to proactively relocate a partition to good memory and
running on bad memory may eventually cause a crash of the partition.
- A problem was fixed for an extraneous error log with
SRC BC50050A that has no deconfigured FRU. There was a recovered
error for a single bit in memory that requires no user action.
The BC50050A error log should be ignored.
- A problem was fixed for Hostboot error logs reusing
EID numbers for each IPL. This may cause a predictive error log
to go missing for a bad FRU that is guarded during the IPL. If
this happens, the FRU should be replaced based on the presence of the
guard record.
- A problem was fixed for a rare non-correctable memory
error in the service processor Self Boot Engine (SBE) causing a
Terminate Immediate (TI) for the system instead of recovering from the
error. With the fix, the SBE is working such that all SBE errors
are recoverable and do not affect the system work loads. This SBE
memory provides support for On-Chip Controller (OCC) tasks to the
service processor SBE but it is not related to the system memory used
for the hypervisor and host partition tasks.
- A problem was fixed for extraneous Predictive Error
logs of SRC B181DA96 and SRC BC8A1A39 being logged if the Self Boot
Engine (SBE) halts and restarts when the system host OS is
running. These error logs can be ignored as the SBE
recovers without user intervention.
- A problem was fixed for error logging for rare Low
Pin Count (LPC) link errors between the Host processor and the Self
Boot Engine (SBE). The LPC was changed to time out instead of
hanging on an LPC error, providing helpful debug data for the LPC error
instead of a system checkstop and Hostboot crash.
- A problem was fixed for the reset of the Self Boot
Engine (SBE) at run time to resolve SBE errors without impacting
the hypervisor or the running partitions.
- A problem was fixed for the ODL link in Open CAPI in
the case where ODL Link 1 (ODL1) is used and ODL Link 0 (ODL0) is not
used. As a circumvention, the errors are resolved if ODL 0 is
used instead, or in conjunction with the ODL1.
- A problem was fixed for the wrong DIMM being called out on
an over-temperature error with a SRC B1xx2A30 error log.
- A problem was fixed for adding a non-cable PCIe card
into a slot that was previously occupied by a PCIe3 Optical or Copper
Cable Adapter for the PCIe3 Expansion Drawer. The new PCIe card
could fail with an I2C error with SRC BC100706 logged.
- A problem was fixed for call home data for On-Chip
Controller (OCC) error log sensor data being off in alignment by one
sensor. By visually shifting the data, the valid data values can
still be determined from the call home logs.
- A problem was fixed for slow hardware dumps that include
failed processor cores that have no clock signal. The dump
process was waiting for core responses and had to wait for a time-out
for each chip operation, causing dumps to take several hours.
With the fix, the core is checked for a proper clock, and if one does
not exist, the chip operations to that core are skipped to speed up the
hardware dump process significantly.
- A problem was fixed for ipmitool not being able to set the
system power limit when the power limit is not activated with the
standard option. With the fix, the ipmitool user can
activate the power limit with "dcmi power activate" and then set the
power limit with "dcmi power set_limit xxxx" where "xxxx" is the new
power limit in Watts.
- A problem was fixed for the OBUS to make it OpenCAPI
capable by increasing its frequency from 1563 MHz to 1611 MHz.
- A problem was fixed for a Workload Optimized Frequency
(WOF) reset limit failure not providing an Unrecoverable Error (UE) and
a callout for the problem processor. When the WOF reset limit is
reached and failed, WOF is disabled and the system is not running at
optimized frequencies.
- A problem was fixed for the callout of SRC BA188002 so it
does not display three trailing extra garbage characters in the location
code for the FRU. The string is correct up to the line ending
white space, so the three extra characters after that should be
ignored. This problem is intermittent and does not occur for all
BA188002 error logs.
- A problem was fixed for the callout of scan ring failures
with SRC BC8A285E and SRC BC8A2857 logged but with no callout for the
bad FRU.
- A problem was fixed for the On-Chip Controller (OCC)
possibly timing out and going to Safe Mode when a system is changed
from the default maximum performance mode (Workload Optimized Frequency
(WOF) enabled) to nominal mode (WOF disabled) and then back to maximum
performance (WOF enabled again). Normal performance can be
recovered with a re-IPL of the system.
- A problem was fixed for the periodic guard reminder causing
a reset/reload of the service processor when it found a symbolic FRU
with no CCIN value in the list of guarded FRUs for the
system. When the periodic guard reminder runs,
every 30 days by default, this problem can cause recoverable errors on
the service processor but with no interruption to the workloads on the
running partitions.
- A problem was fixed for a wrong SubSystem being logged in
the SRC B7009xxxx for Secure Boot Errors. "I/O Subsystem" is
displayed instead of the correct SubSystem value of "System Hypervisor
Firmware".
- A problem was fixed for the lost recovery of a failed Self
Boot Engine (SBE). This may happen if the SBE recovery occurs
during a reset of the service processor. Not only is the recovery
lost, but the error log data for the SBE failure may also not be
written to the error log. If the SBE is failed and not recovered,
this can cause the post-dump IPL after a system Terminate
Immediate (TI) error to not be able to complete. To recover,
power off the system and IPL again.
- A problem was fixed for a missing SRC at the time runtime
diagnostics are lost and the Hostboot runtime services (HBRT) are put
into the failed state.
A B400F104 SRC is logged each time the HBRT hypervisor adjunct
crashes. On the fourth crash in one hour, HBRT is failed with no
further retries but no SRC is logged. Although a unique SRC is
not logged to indicate loss of runtime diagnostic capability, the
B400F104 SRC does include the HBRT adjunct partition ID for Service to
identify the adjunct.
- A problem was fixed for a Novalink enabled partition not
being able to release master from the HMC that results in error
HSCLB95B. To resolve the issue, run a rebuild managed server
operation on the HMC and then retry the release. This occurs when
attempting to release master from HMC after the first boot up of a
Novalink enabled partition if Master Mode was enforced prior to the
boot.
- A problem was fixed for a UE memory error causing an
entire LMB of memory to deallocate and guard instead of just one page
of memory.
- A problem was fixed for all variants (this was partially
fixed in an earlier release) for the SR-IOV firmware adapter updates
using the HMC GUI or CLI to only reboot one SR-IOV adapter at a
time. If multiple adapters are updated at the same time, the HMC
error message HSCF0241E may occur: "HSCF0241E Could not read
firmware information from SR-IOV device ...". This fix prevents
the system network from being disrupted by the SR-IOV adapter updates
when redundant configurations are being used for the network. The
problem can be circumvented by using the HMC GUI to update the SR-IOV
firmware one adapter at a time using the following steps: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
- A problem was fixed for a rare hypervisor hang caused by a
dispatching deadlock for two threads of a process. The system
hangs with SRC B17BE434 and SRC B182951C logged. This
failure requires high interrupt activity on a program thread that is
not currently dispatched.
- A problem was fixed for a Virtual Network Interface
Controller (vNIC) client adapter to prevent a failover when disabling
the adapter from the HMC. A failover to a new backing device
could cause the client adapter to erroneously appear to be active again
when it is actually disabled. This causes confusion and failures
on the OS for the device driver. This problem can only occur when
there is more than a single backing device for the vNIC adapter and if
commands are issued from the HMC to disable the adapter and enable
the adapter.
- A possible performance problem was fixed for workloads that
have a large memory footprint.
- A problem was fixed for error recovery in the timebase
facility to prevent an error in the system time. This is an
infrequent secondary error when the timebase facility has failed
and needs recovery.
- A problem was fixed for the HMC GUI and CLI interfaces
incorrectly showing SR-IOV updates as being available for certain
SR-IOV adapters when no updates are
available. This affects the following PCIe
adapters: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN
58FB; and #EC3L/#EC3M with CCIN 2CEC. The "Update
Available" indication in the HMC can be ignored if updates have already
been applied.
- A problem was fixed for the recovery of certain SR-IOV
adapters that fail with SRC B400FF05. This is
triggered by infrequent EEH errors in the adapter. In the
recovery process, the Virtual Function (VF) for the adapter
is rebuilt into the wrong state, preventing the adapter from
working. An HMC initiated disruptive resource dump of the adapter
can recover it. This problem affects the following PCIe
adapters: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN
58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for SR-IOV Virtual Functions (VFs)
halting transmission with an SRC B400FF01 logged when many logical
partitions with VFs are shut down at the same time the adapter is in
highly-active usage by a workload. The recovery process reboots
the failed SR-IOV adapter, so no user intervention is needed to restore
the VF.
- A problem was fixed for VLAN-tagged frames
being transmitted over SR-IOV adapter VFs when the packets should
instead have been discarded for some VF configuration settings on
certain SR-IOV adapters. This affects the following PCIe
adapters:
#EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; and
#EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for SR-IOV adapter hangs with a
possible SRC B400FF01 logged. This may cause a temporary network
outage while the SR-IOV adapter VF reboots to recover from the adapter
hang. This problem has been observed on systems with high
network traffic and with many VFs defined.
This fix updates adapter firmware to 1x.22.4021 for the
following Feature Codes: EC2R, EC2S, EC2T, EC2U, EC3L and EC3M.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for a large number (approximately
16,000) of DLPAR adds and removes of SR-IOV VFs to
cause a subsequent DLPAR add of the VF to fail with the
newly-added VF not usable. The large number of
allocations and deallocations caused a leak of a critical SR-IOV
adapter resource. The adapter and VFs may be recovered by an
SR-IOV adapter reset.
- A problem was fixed for a system boot hanging when
recoverable attentions occur on the non-master processor. With
the fix, the attentions on the non-master processor are deferred until
Symmetric multiprocessing (SMP) mode has been established (the point at
which the system is ready for multiple processors to run). This
allows the boot to complete but still have the non-master processor
errors recovered as needed.
- A problem was fixed for certain hypervisor error logs being
slow to report to the OS. The error logs affected are those
created by the hypervisor immediately after the hypervisor is started
and if there are more than 128 error logs from the hypervisor to be
reported. The error logs at the end of the queue take a long time
to be processed, and may make it appear as if error logs are not being
reported to the OS.
- A problem was fixed for a Self Boot Engine (SBE) reset
causing the On-Chip Controller (OCC) to force the system into
Safe Mode with a flood of SRC B150DAA0 and SRC B150DA8A written to the
error log as Information Events.
- A problem was fixed for the Redfish "Manager" request
returning duplicate object URIs for the same HMC. This can occur
if the HMC was removed from the managed system and then later added
back in. The Redfish objects for the earlier instances of the
same HMC were never deleted on the remove.
- A problem was fixed for a possible failure to the service
processor stop state when performing a platform dump. This
problem is specific to dumps being collected for HWPROC
checkstops, which are not common.
- A problem was fixed for SMS menus to limit reporting on the
NPIV and vSCSI configuration to the first 511 LUNs. Without the
fix, LUN 512 through the last configured LUN report with invalid
data. Configurations in excess of 511 LUNs are very rare, and it
is recommended for performance reasons (to be able to search for the boot
LUN more quickly) that the number of LUNs on a single target be
limited to less than 512.
- The following two errors in the SR-IOV adapter firmware
were fixed: 1) The adapter resets and there is a B400FF01
reference code logged. This error happens in rare cases when there
are multiple partitions actively running traffic through the adapter.
System firmware resets the adapter and recovers the system with no
user intervention required; 2) SR-IOV VFs with defined VLANs and an
assigned PVID are not able to ping each other.
This fix updates adapter firmware to 11.2.211.26 for the following
Feature Codes: EN15, EN17, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38,
EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for Field Core Override (FCO) cores
being allocated from a deconfigured processor, causing an IPL failure
with unusable cores. This problem only occurs during the Hostboot
reconfiguration loop in the presence of other processor failures.
- A problem was fixed for a failure in DDR4 RCD (Register
Clock Driver) memory initialization that causes half of the DIMM memory
to be unusable after an IPL. This is an intermittent problem
where the memory can sometimes be recovered by doing another IPL.
The error is not a hardware problem with the DIMM but it is an error in
the initialization sequence needed to get the DIMM ready for normal
operations. This supersedes an earlier fix delivered in FW910.01
that intermittently failed to correct the problem.
- A problem was fixed for IBM Product Engineering and Support
personnel not being able to easily determine planar jumper settings in
a machine in order to determine the best mitigation strategies for
various field problems that may occur. With the fix, an
Information Error log is provided on every IPL to provide the planar
jumper settings.
- A problem was fixed for the periodic guard reminder
function to not re-post error logs of failed FRUs on each IPL.
Instead, a reminder SRC is created to call home the list of FRUs that
have failed and require service. This puts the system back to the
original behavior of only posting one error log for each FRU that has
failed.
- For an HMC managed system, a problem was fixed for a rare,
intermittent NetsCMS core dump that could occur whenever the system is
doing a deferred shutdown power off. There is no impact to normal
operations as the power off completes, but there are extra error logs
with SRC B181EF88 and a service processor dump.
- A problem was fixed for a Hostboot hang due to deadlock
that can occur if there is an SBE dump in progress that fails. A
failure in the SBE dump can trigger a second SBE dump that deadlocks.
- A problem was fixed for dump performance by decreasing the
amount of time needed to perform dumps by 50%.
- A problem was fixed for an IPL hang that can occur for
certain rare processor errors, where the system is in a loop trying to
isolate the fault.
- A problem was fixed for an enclosure fault LED being stuck
on after a repair of a fan. This problem only occurs after the
second concurrent repair of a fan.
- A problem was fixed for SR-IOV adapters not showing up in
the device tree for a partition that autoboots or starts within a few
seconds of the hypervisor going ready. This problem can be
circumvented by delaying the boot of the partition for at least a
minute after the hypervisor has reached the standby state. If the
problem is encountered, the SR-IOV adapter can be recovered by
rebooting the partition, or by using DLPAR to remove and re-add the
SR-IOV adapter to the partition.
- A problem was fixed for a system crash with SRC B700F103
when there are many consecutive configuration changes in the LPARs to
delete old vNICs and create new vNICs, which exposed an infrequent
problem with lock ownership on a virtual I/O slot. There is a
one-to-one mapping or connection between vNIC adapter in the client
LPAR and the backing logical port in the VIOS, and the lock management
needs to ensure that the LPAR accessing the port has ownership to
it. In this case, the LPAR was trying to make usable a device it
did not own. The system should recover on the post dump IPL.
- A problem was fixed for a possible DLPAR add failure of a
PCIe adapter if the adapter is in the planar slot C7 or slot C6 on any
PCIe Expansion drawer fanout module. The problem is more common
if there are other devices or Virtual Functions (VFs) in the same LPAR
that use four interrupts, as this is a problem with the processing
order of the PCIe LSI interrupts.
- A problem was fixed for resource dumps that use the
selector "iomfnm" and options "rioinfo" or "dumpbainfo". This
combination of options for resource dumps always fails without the fix.
- A problem was fixed for missing FFDC data for SR-IOV
Virtual Function (VF) failures and for not allowing the full
architected five minute limit for a recovery attempt for the VF, which
should expand the number of cases where the VF can be recovered.
- A problem was fixed for missing error recovery for memory
errors in non-mirrored memory when reading the SR-IOV adapter firmware,
which could prevent the SR-IOV VF from booting.
- A problem was fixed for a possible system crash if an error
occurs at runtime that requires a FRU guard action. With the fix,
the guard action is restricted to the IPL where it is supported.
- A problem was fixed for an extremely rare IPL hang on a
false communications error to the power supply. Recovery is to
retry the IPL.
- A problem was fixed for the dump content type for HBMEM
(Hostboot memory) to be recognized instead of displaying "Dump Content
Type: not found".
- A problem was fixed for a system crash when an SR-IOV
adapter is changed from dedicated to shared mode with SRC B700FFF and
SRC B150DA73 logged. This failure requires that hypervisor
memory relocation be in progress on the system. This affects the
following PCIe adapters: #EC2R/#EC2S with CCIN 58FA;
#EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for a Live Partition Mobility (LPM)
migration of a partition with shared processors that has an unusable
shared processor that can result in failure of the target partition or
target system. This problem can be avoided by making sure all
shared processors are functional in the source partition before
starting the migration. The target partition or system can be
rebooted to recover it.
- A problem was fixed for hypervisor memory relocation and
Dynamic DMA Window (DDW) memory allocation used by I/O adapter slots
for some adapters where the DDW memory tables may not be fully
initialized between uses. Infrequently, this can cause an
internal failure in the hypervisor when moving the DDW memory for the
adapters. Examples of programs using memory relocation are
Live Partition Mobility (LPM) and the Dynamic Platform Optimizer (DPO).
- A problem was fixed for a partition or system termination
that may occur when shutting down or deleting a partition on a system
with a very large number of partitions (more than 400) or on a system
with fewer partitions but with a very large number of virtual adapters
configured.
- A problem was fixed for booting a large number of
LPARs with Virtual Trusted Platform Module (vTPM) capability, where some
partitions may post an SRC BA54504D time-out for taking too long to
start. With the fix, the time allowed to boot a vTPM LPAR is
increased. If a time-out occurs, the partition can be booted
again to recover. The problem can be avoided by auto-starting
fewer vTPM LPARs, or booting them a couple at a time to prevent
flooding the vTPM device server with requests that will slow the boot
time while the LPARs wait on the vTPM device server responses.
- A problem was fixed for a possible system crash.
- A problem was fixed for a UE B1812D62 logged when a PCI
card is removed between system IPLs. This error log can be
ignored.
- A problem was fixed for a USB code update failure if the USB
stick is plugged in during an AC power cycle. After the power cycle
completes, the code update will fail to start from the USB
device. As a circumvention, the USB device can be plugged in
after the service processor is in its ready state.
- A problem was fixed for a possible slower migration during
the Live Partition Mobility (LPM) resume stage. For a
migrating partition that does not have a high demand page rate, there
is minimal impact on performance. There is no need for customer
recovery as the migration completes successfully.
- A problem was fixed for firmware assisted dumps (fadump)
and Linux kernel crash dumps (kdump) where dump data is missing.
This can happen if the dumps are set up with chunks greater than 1
Gb in size. This problem can be avoided by setting up
fadump or kdump with multiple 1 Gb chunks.
- A problem was fixed for the I2C bus error logged with SRC
BC500705 and SRC BC8A0401 where the I2C bus was locked up. This
is an infrequent error. In rare cases, the TPM device may hold down the
I2C clock line longer than allowed, causing an error recovery that
times out and prevents the reset from working on all the I2C engine's
ports. A power off and power on of the system should clear the
bus error and allow the system to IPL.
- A problem was fixed for an intra-node, inter-processor
communication lane failure marked in the VPD, causing a secure boot
blacklist violation on the IPL and a processor to be deconfigured with
an SRC BC8A2813 logged.
- A problem was fixed to capture details of failed FRUs into
the dump data by delaying the deconfiguration of the FRUs for checkstop
and TI attentions.
- A problem was fixed for failed processor cores not being
guarded on a memory preserving IPL (re-IPL with CEC powered on).
- A problem was fixed for debug data missing in dumps for
cores which are off-line.
- A problem was fixed for L3 cache calling out an LRU Parity
error too quickly for hardware that is good. Without the fix,
ignore the L3FIR[28] LRU Parity errors unless they are very persistent
with 30 or more occurrences per day.
- A problem was fixed for not having a FRU callout when the
TPM card is missing and causes an IPL failure.
- A problem was fixed for the Advanced System Management
Interface (ASMI) displaying the IPv6 network prefix in decimal instead
of hex character values. The service processor command line
"ifconfig" can be used to see the IPv6 network prefix value in hex as a
circumvention to the problem.
- A problem was fixed for an On-Chip Controller (OCC) cache
fault causing a loss of the OCC for the system without the system
dropping into Safe mode.
- A problem was fixed for system dump failing to collect the
pu.perv SCOMs for chiplets c16 and above which correspond to EQ and EC
chiplets.
Also fixed was the missing SCOM data for the interrupt unit related
"c_err_rpt" registers.
- A problem was fixed for the PCIe topology reports having
slots missing in the "I/O Slot Locations" column in the row for the bus
representing a PCIe switch. This only occurs when the C49
or C50 slots are bifurcated (a slot having two channels).
Bifurcation is done if an NVME module is in the slot or if the slot is
empty (for certain levels of backplanes).
- A problem was fixed for Live Partition Mobility (LPM)
failing along with other hypervisor tasks, but the partitions continue
to run. This is an extremely rare failure where a re-IPL is
needed to restore HMC or NovaLink connections to the partitions, or to
do any system configuration changes.
- A problem was fixed for a system termination during a
concurrent exchange of an SR-IOV adapter that had VFs assigned to
it. For this problem, the OS failed to release the VFs but the
error was not returned to the HMC. With the fix, the FRU exchange
gracefully aborts without impacting the system for the case where the
VFs on the SR-IOV adapter remain active.
- A possible performance problem was fixed for partitions
with shared processors that had latency in the handling of the
escalation interrupts used to switch the processor between tasks.
The effect of this is that, while the processor is kept busy, some
tasks might hold the processor longer than they should because the
interrupt is delayed, while others run slower than normal.
- A problem was fixed for a system termination that may occur
with B111E504 logged when starting a partition on a system with a very
large number of partitions (more than 400) or on a system with fewer
partitions but with a very large number of virtual adapters configured.
- A problem was fixed for a system termination that may occur
with a B150DA73 logged when a memory UE is encountered in a partition
when the hypervisor touches the memory. With the fix, the touch
of memory by the hypervisor is a UE tolerant touch and the system is
able to continue running.
- A problem was fixed for fabric errors such as cable pulls
causing checkstops. With the fix, the PBAFIR are changed to
recoverable attentions, allowing the OCC to be reset to recover from
such faults.
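The L3 cache item above gives a rule of thumb for systems without the
fix: L3FIR[28] LRU Parity errors can be ignored unless there are 30 or
more occurrences per day. The following is a minimal sketch of that
check, not part of the firmware or any IBM tool. It assumes that "per
day" means a rolling 24-hour window and that the error timestamps have
already been extracted from the error logs; the function and constant
names are hypothetical.

```python
from datetime import datetime, timedelta

PERSISTENCE_THRESHOLD = 30   # occurrences per day, taken from the note above
WINDOW = timedelta(days=1)   # assumption: "per day" is a rolling 24-hour window


def is_persistent(timestamps):
    """Return True if any rolling 24-hour window holds 30 or more events."""
    events = sorted(timestamps)
    start = 0
    for end, current in enumerate(events):
        # Move the window start forward until it is within 24 hours
        # of the current event.
        while current - events[start] >= WINDOW:
            start += 1
        if end - start + 1 >= PERSISTENCE_THRESHOLD:
            return True
    return False
```

A result of False suggests the errors fall under the "ignore" case
described above; a result of True suggests they are persistent enough
to warrant attention.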
System firmware changes that affect certain systems
- A problem was fixed to remove a SAS battery LED from ASMI
that does not exist. This problem only pertains to the
S914 (9009-41A), S924 (9009-42A), and H924 for SAP HANA (9223-42H) models.
- On a system with an AIX partition, a problem was
fixed for a partition time jump that could occur after doing an AIX
Live Update. This problem could occur if the AIX Live Update
happens after a Live Partition Mobility (LPM) migration to the
partition. AIX applications using the timebase facility could
observe a large jump forwards or backwards in the time reported by the
timebase facility. A circumvention to this problem is to
reboot the partition after the LPM operation prior to doing the AIX
Live Update. An AIX fix is also required to resolve this
problem. The issue will no longer occur when this firmware update
is applied on the system that is the target of the LPM operation and
the AIX partition performing the AIX Live Update has the appropriate
AIX updates installed prior to doing the AIX Live Update.
- On a Linux or IBM i partition which has just
completed a Live Partition Mobility (LPM) migration, a problem was
fixed for a VIO adapter hang when it stops processing interrupts.
For this problem to occur, prior to the migration the adapter must have
had an interrupt outstanding where the interrupt source was disabled.
- On systems with an IBM i partition, support was added
for multipliers for IBM i MATMATR fields that are limited to four
characters. When Server metrics are retrieved via IBM MATMATR calls
and a value does not fit in four characters (for example, a system
that contains more than 9999 GB), the architected MATMATR "multiplier"
field is used. For example, 10,000 GB can be represented as 5,000 GB
with a multiplier of 2, so '5000' and '2' are returned in the quantity
and multiplier fields, respectively. The IBM i OS also requires a PTF
to support the MATMATR field multipliers.
- On a system with an IBM i partition with more than 64 shared
processors assigned to it, a problem was fixed for a system
termination or other unexpected behavior that may occur during a
partition dump. Without the fix, the problem can be avoided by
limiting the IBM i partition to 64 or fewer shared processors.
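The quantity and multiplier arithmetic described in the MATMATR item
above can be illustrated with a short sketch. This is not the firmware
or IBM i implementation. It only reproduces the documented example
(10,000 GB returned as '5000' and '2'). The function names, the choice
of the smallest multiplier that fits, and the truncating division are
all assumptions made for illustration.

```python
MAX_QUANTITY = 9999  # largest value that fits in a four-character field


def encode_with_multiplier(total_gb):
    """Return (quantity, multiplier) with quantity <= MAX_QUANTITY."""
    if total_gb < 0:
        raise ValueError("total_gb must be non-negative")
    multiplier = 1
    # Assumption: use the smallest multiplier whose quantity fits.
    while total_gb // multiplier > MAX_QUANTITY:
        multiplier += 1
    # Assumption: integer division truncates any remainder.
    return total_gb // multiplier, multiplier


def decode_with_multiplier(quantity, multiplier):
    """Recover the represented value from the two fields."""
    return quantity * multiplier
```

With these assumptions, encode_with_multiplier(10000) returns (5000, 2),
matching the example in the text, and values of 9999 GB or less keep a
multiplier of 1.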
|
VL910_089_089 / FW910.01
05/30/18 |
Impact: Security
Severity: HIPER
Response for Recent Security Vulnerabilities
- HIPER/Pervasive:
DISRUPTIVE: In response to recently reported security
vulnerabilities, this firmware update is being released to address
Common Vulnerabilities and Exposures issue number CVE-2018-3639. In
addition, Operating System updates are required in conjunction with
this FW level for CVE-2018-3639.
System firmware changes that affect all systems
- HIPER/Pervasive:
A firmware change was made to address a rare case where a memory
correctable error on POWER9 servers may result in an undetected
corruption of data.
- A problem was fixed for Live Partition Mobility (LPM) to
prevent an error in the hardware page translation table for a migrated
page that could result in an invalid operation on the target
system. This is a rare timing problem with the logic used to
invalidate an entry in the memory page translation table.
- A problem was fixed for a hung ethernet port on the service
processor. This hang prevents TCP/IP network traffic from the
management console and the Advanced System Management Interface (ASMI)
browsers. It makes it appear as if the service processor is
unresponsive and can be confused with a service processor in the
stopped state. An A/C power cycle would recover a hung ethernet
adapter.
- A problem was fixed for partition hangs or aborts during a
Live Partition Mobility (LPM) or Dynamic Platform Optimizer (DPO)
operation. This is a rare problem with a small timing window for
it to occur in the hypervisor task dispatching. The partition can
be rebooted to recover from the problem.
- A problem was fixed for service processor static IP
configurations failing after several minutes with SRC B1818B3E.
While the ethernet adapter is in the failed state, its IP address
will not respond to pings. This problem occurs any time a static IP
configuration is enabled on the service processor. Dynamic Host
Configuration Protocol (DHCP) dynamic IPs can be used to provide the
service processor network connections. To recover from the problem,
the other ethernet adapter (either eth0 or eth1) should be in the
default DHCP configuration, which allows the failing adapter to be
reconfigured with a dynamic IP.
- A problem was fixed for the system going to ultra turbo
mode after an On-Chip Controller (OCC) reset. This could result
in a power supply over current condition. This problem can happen
when the system is running a heavy workload and then a power mode
change is requested or some error happens that takes the OCC into a
reset.
- A problem was fixed for Workload Optimized Frequency (WOF)
where parts may have been manufactured with bad IQ data that requires
filtering to prevent WOF from being disabled.
- A problem was fixed for transactional memory that could
result in a wrong answer for processes using it. This is a rare
problem requiring L2 cache failures that can affect the process
determining correctly if a transaction has completed.
- A problem was fixed for a change in the IP address of the
service processor causing the system On-Chip Controller (OCC) to go
into Safe mode with SRC B18B2616 logged. In Safe mode, the system
is running with reduced performance and with fans running at high
speed. Normal performance can be restored concurrently by a
reset/reload of the service processor using the ASMI soft reset
option. Without the fix, whenever the IP address of the service
processor is changed, a soft reset of the service processor should be
done to prevent OCC from going into Safe mode.
- A problem was fixed for the recovery for optical link
failures in the PCIe expansion drawer with feature code #EMX0.
The recovery failure occurs when there are multiple PCIe link failures
simultaneously with the result that the I/O drawers become unusable
until the CEC is restarted. The hypervisor will have xmfr entries
with "Sw Cfg Op FAIL" identified. With the fix, the errors will
be isolated to the PCIe link and the I/O drawer will remain operational.
- A problem was fixed for a system aborting with SRC B700F105
logged when starting a partition that the user had changed from
P8 or P9 compatibility mode to P7 compatibility mode. This problem
is intermittent and the partition in question had to have an immediate
shutdown done prior to the change in compatibility mode for the problem
to happen. To prevent this problem when it is known that a
compatibility mode is going to change to P7 mode, allow the partition
to shut down normally before making the change. If an immediate
shut down of the partition is necessary and the compatibility mode has
to be changed to P7, then the CEC should be powered off and then
re-IPLed before making the change to prevent an unplanned outage of the
system.
- A problem was fixed for a logical partition hang or
unpredictable behavior due to lost interrupts with BCxxE504 logged when
memory is relocated by the hypervisor because of predictive memory
failures. This problem is not frequent because it requires memory
failing and the recovery action of relocating memory away from the
failing DIMMs being taken. To circumvent this failure, if memory
failure has occurred, the system may be re-IPLed to allow normal memory
allocations from the available memory, so the partitions do not have to
run on relocated memory.
- A problem was fixed for a failure in DDR4 RCD (Register
Clock Driver) memory initialization that causes half of the DIMM memory
to be unusable after an IPL. This is an intermittent problem
where the memory can sometimes be recovered by doing another IPL.
The error is not a hardware problem with the DIMM but it is an error in
the initialization sequence needed to get the DIMM ready for normal
operations.
System firmware changes that affect certain systems
- DEFERRED:
On a system with only a single processor core configured, a problem
was fixed for poor I/O performance. This problem may be circumvented
by configuring a second processor core. This additional processor
core does not have to be used by the partition.
- On systems that are not managed by an HMC, a problem was
fixed to enable FSP call home. Without the fix, every attempt by
the service processor to call home an error log fails.
- A problem was fixed for Dynamic Power Saver Mode on a
system with low-CPU utilization that had reduced performance
unexpectedly. This only happens for workloads that are restricted
to a single core or using just a single core because of low-CPU
utilization. This problem can be circumvented by running the
system in maximum performance mode by using ASMI to enable Fixed
Maximum Frequency mode.
|
VL910_073_059 / FW910.00
03/20/18 |
Impact: New
Severity: New
New Features and Functions
|