VL910_107_089 / FW910.10
09/05/18
Impact: Availability
Severity: SPE
System firmware changes that may require customer actions
prior to the firmware update
- DEFERRED: On
a system with a partition with dedicated processors that are set to
allow processor sharing with "Allow when partition is active" or "Allow
always", a problem was fixed for a concurrent firmware update
from FW910.01 that may cause the system to hang. This fix is
deferred, so it is not active until after the next IPL of the system,
so precautions must be taken to protect the system. Perform the
following steps to determine if your system has a partition with
dedicated processors that are set to share. If these partitions
exist, change them to not share processors while active; or shut down
the affected partitions; or do a disruptive update to put on this
service pack.
1) From the HMC command line, run: lssyscfg -r sys -F name
2) For each system you intend to update firmware, issue the following
HMC command:
lshwres -m <System Name> --level lpar -r proc -F
lpar_name,curr_sharing_mode,pend_sharing_mode
replacing <System Name> with the name as displayed by the first
command.
3) Scan the output for "share_idle_procs_active" or
"share_idle_procs_always". This identifies the affected
partitions.
4) You need to take one of the three options below to install this
firmware level:
a) If affected partitions are found, change the lpar to "never allow" or
"allow when partition is inactive" in the lpar settings, and set the
value back to its original value after the code update. These
changes are concurrent when performed on the lpar settings and not in
the profile.
b) Or, shut down the partitions identified in step 3. Proceed
with the concurrent code update. Then restart the partitions.
c) Or, apply the firmware update disruptively (power off the system
and install) to prevent a possible system hang.
A scripted example combining steps 1 through 3 follows.
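The sketch below is a minimal example only, assuming the HMC restricted
shell permits simple loops; it just combines the lssyscfg and lshwres
commands shown above and filters their output for the two sharing modes
of interest:
# Sketch: list partitions whose dedicated processors are set to share
lssyscfg -r sys -F name | while read -r sys; do
  echo "System: $sys"
  lshwres -m "$sys" --level lpar -r proc \
    -F lpar_name,curr_sharing_mode,pend_sharing_mode | \
    grep -E "share_idle_procs_active|share_idle_procs_always"
done
Any partition reported by the grep must be handled with option a, b, or
c above before a concurrent update to this service pack.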
New features and functions
- A change was
made to improve IPL performance for a system with a new DIMM installed
or for a system doing its first IPL. The performance is gained by
decreasing the amount of time used in memory diagnostics, reducing IPL
time by as much as 15 minutes, depending on the amount of memory
installed.
- Support was added for 24x7 data collection from the On-Chip
Controller sensors.
- Support was added to correctly isolate TOD faults with
appropriate callouts and failover to the backup topology, if
needed, and to reconfigure a backup topology to
maintain TOD redundancy.
- Support was disabled for erepair spare lane deployment for
fabric and memory buses. By not using the FRU spare hardware for
an erepair, the affected FRUs may have to be replaced sooner.
Prior to this change, the spare lane deployment caused extra error
messages during runtime diagnostics. When the problems with spare
lane deployment are corrected, this erepair feature will be enabled
again in a future service pack.
System firmware changes that affect all systems
- A security problem was fixed in the DHCP client on the
service processor for an out-of-bound memory access flaw that could be
used by a malicious DHCP server to crash the DHCP client process.
The Common Vulnerabilities and Exposures issue number is CVE-2018-5732.
- DEFERRED: A
problem was fixed for PCIe link stability errors during the IPL for the
PCIe3 I/O Expansion Drawer (Feature code #EMX0) with Active Optical
Cables (AOCs). One or more of the following SRCs may be logged at
completion of IPL: B7006A72, B7006A8B, B7006971, and 10007900.
The fix improves PCIe link stability for this feature.
- DEFERRED: A
problem was fixed for an erroneous SRC
11007610 being logged when hot-swapping CEC fans. This SRC may be
logged if there is more than a two-minute delay between removing
the old fan and installing the new fan. The error log may be
ignored.
- DEFERRED: A
problem was fixed for a hot plug
of a new 1400W power supply that fails to turn on. The
problem is intermittent, occurring more frequently in cases where
the hot plug insertion action was too slow and possibly at a slight
angle (insertion not perfectly straight). Without the fix, after
a hot plug has been attempted, ensure the power supply LEDs are
on. If the LEDs are not on, retry plugging in the power
supply using a faster motion while keeping the angle of insertion
straight.
- DEFERRED: A
problem was fixed for a host reset of the
Self Boot Engine (SBE). Without the fix, the reset of the SBE
will hang during error recovery and that will force the system into
Safe Mode. Also, a post dump IPL of the system after a
system Terminate Immediate will not work with a hung SBE, so a re-IPL
of the system will be needed to recover it.
- A problem was fixed for an enclosure LED not being lit when
there is a fault on a FRU internal to the enclosure that does not have an LED
of its own. With the fix, the enclosure LED is lit if any FRUs
within the enclosure have a fault.
- A problem was fixed for DIMMs that have VPP shorted to
ground not being called out in the SRC 11002610 logged for the power
fault. The frequency of this problem should be rare.
- A problem was fixed for the Advanced System Management
Interface (ASMI) option for resetting the system to factory
configuration for not returning the Speculative Execution setting to
the default value. The reset to factory configuration does not
change the current value for Speculative Execution. To restore
the default, ASMI must be used manually to set the value. This
problem only pertains to the IBM Power System H922 for SAP HANA
(9223-22H) and the IBM Power System H924 for SAP HANA (9223-42H).
- A problem was fixed for the system early power warning
(EPOW) to be issued when only three of the four power supplies are
operational (instead of waiting for all four power supplies to go down).
- A problem was fixed for a failing VPP voltage regulator
possibly damaging DIMMs with too high a voltage level. With the
fix, the voltage to the DIMMs is shut down if there is a problem with
the voltage regulator, to protect the DIMMs.
- A problem was fixed for an unplanned power down of the
system with SRC UE 11002600 logged when an unsupported device was
plugged into the service processor USB 2.0 ports on either of the
slots P1-C1-T1 or P1-C1-T2. This happened when a USB 3.0
DVD drive was plugged into the USB 2.0 slot and caused an overcurrent
condition. The USB 3.0 device was, incorrectly, not downward
compatible with the USB 2.0 slot. With the fix, such incompatible
devices will cause an informational log but will not cause a power off
of the system.
- A problem was fixed for the On-Chip Controller not being able
to sense the current draw for the 12V PCIe adapters that are plugged
into channel 0 (CH0) of the APSS. CH0 was not enabled, meaning
anything plugged into those connectors would not be included in the
total server power calculation, which could impact power capping.
The system could run at higher power than expected without CH0 being
monitored.
- A problem was fixed for the TPM card LED so that it is
activated correctly.
- A problem was fixed for VRMs drawing current over the
specification. This occurred whenever heavy workloads went above
372 amps with WOF enabled. At 372 amps, a rollover to value "0"
for the current erroneously occurred and this allowed the frequency of
the processors in the system to exceed the normally expected values.
- A problem was fixed for Dynamic Memory Deallocation (DMD)
failing for memory configurations of 3 or 6 Memory Controller (MC)
channels per group. An error message of "Invalid MCS per group
value" is logged with SRC BC23E504 for the problem. If DMD was
working correctly for the installed memory but then began failing at a
later time, it may have been triggered by a guard of a DIMM which
resulted in a memory configuration that is susceptible to the problem
with DMD.
- A problem was fixed for a system with CPU part number
2CY058 and CCIN 5C25 to achieve a slightly more optimal frequency
for one specific EnergyScale mode, Dynamic Performance Mode.
- A problem was fixed for a missing memory throttle
initialization that in a rare case could lead to an emergency shutdown
of the system. The missing initialization could cause the DIMMs
to oversubscribe to the power supplies in the rare failure mode where
the On-Chip Controller (OCC) fails to start and the Safe Mode default
memory throttle values are too high to stop the memory from overusing
the power from the power supplies. This could cause a power fault
and an emergency shutdown of the system.
- A problem was fixed for a memory translation error that
causes a request to de-allocate a page of memory to be
ignored in Dynamic Memory Deallocation (DMD). This misses the
opportunity to proactively relocate a partition to good memory, and
running on bad memory may eventually cause a crash of the partition.
- A problem was fixed for an extraneous error log with
SRC BC50050A that has no deconfigured FRU. There was a recovered
error for a single bit in memory that requires no user action.
The BC50050A error log should be ignored.
- A problem was fixed for Hostboot error logs reusing
EID numbers for each IPL. This may cause a predictive error log
to go missing for a bad FRU that is guarded during the IPL. If
this happens, the FRU should be replaced based on the presence of the
guard record.
- A problem was fixed for a rare non-correctable memory
error in the service processor Self Boot Engine (SBE) causing a
Terminate Immediate (TI) for the system instead of recovering from the
error. With the fix, the SBE is changed such that all SBE errors
are recoverable and do not affect the system workloads. This SBE
memory provides support for On-Chip Controller (OCC) tasks to the
service processor SBE but it is not related to the system memory used
for the hypervisor and host partition tasks.
- A problem was fixed for extraneous Predictive Error
logs of SRC B181DA96 and SRC BC8A1A39 being logged if the Self Boot
Engine (SBE) halts and restarts when the system host OS is
running. These error logs can be ignored as the SBE
recovers without user intervention.
- A problem was fixed for error logging for rare Low
Pin Count (LPC) link errors between the host processor and the Self
Boot Engine (SBE). The LPC was changed to time out instead of
hanging on an LPC error, providing helpful debug data for the LPC error
instead of a system checkstop and Hostboot crash.
- A problem was fixed for the reset of the Self Boot
Engine (SBE) at run time to resolve SBE errors without impacting
the hypervisor or the running partitions.
- A problem was fixed for the ODL link in OpenCAPI in
the case where ODL Link 1 (ODL1) is used and ODL Link 0 (ODL0) is not
used. As a circumvention, the errors are resolved if ODL0 is
used instead of, or in conjunction with, ODL1.
- A problem was fixed for the wrong DIMM being called out on
an over-temperature error with a SRC B1xx2A30 error log.
- A problem was fixed for adding a non-cable PCIe card
into a slot that was previously occupied by a PCIe3 Optical or Copper
Cable Adapter for the PCIe3 Expansion Drawer.
The new PCIe card could fail with an I2C error with SRC BC100706
logged.
- A problem was fixed for call home data for On-Chip
Controller (OCC) error log sensor data being off in alignment by one
sensor. By visually shifting the data, the valid data values can
still be determined from the call home logs.
- A problem was fixed for slow hardware dumps that include
failed processor cores that have no clock signal. The dump
process was waiting for core responses and had to wait for a time-out
for each chip operation, causing dumps to take several hours.
With the fix, the core is checked for a proper clock, and if one does
not exist, the chip operations to that core are skipped to speed up the
hardware dump process significantly.
- A problem was fixed for ipmitool not being able to set the
system power limit when the power limit is not activated with the
standard option. With the fix, the ipmitool user can
activate the power limit with "dcmi power activate" and then set the
power limit with "dcmi power set_limit xxxx", where "xxxx" is the new
power limit in watts.
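For illustration only, a possible out-of-band command sequence is
sketched below; the service processor address, credentials, and the
1000-watt value are placeholders, and the exact dcmi argument forms may
vary by ipmitool version:
# Activate power limiting, then set and verify the limit
ipmitool -I lanplus -H <service processor address> -U <user> -P <password> dcmi power activate
ipmitool -I lanplus -H <service processor address> -U <user> -P <password> dcmi power set_limit limit 1000
ipmitool -I lanplus -H <service processor address> -U <user> -P <password> dcmi power get_limit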
- A problem was fixed for the OBUS to make it OpenCAPI
capable by increasing its frequency from 1563 MHz to 1611 MHz.
- A problem was fixed for a Workload Optimized Frequency
(WOF) reset limit failure not providing an Unrecoverable Error (UE) and
a callout for the problem processor. When the WOF reset limit is
reached and failed, WOF is disabled and the system is not running at
optimized frequencies.
- A problem was fixed for the callout of SRC BA188002 so it
does not display three extra trailing garbage characters in the location
code for the FRU. The string is correct up to the line-ending
white space, so the three extra characters after that should be
ignored. This problem is intermittent and does not occur for all
BA188002 error logs.
- A problem was fixed for the callout of scan ring failures
with SRC BC8A285E and SRC BC8A2857 logged but with no callout for the
bad FRU.
- A problem was fixed for the On-Chip Controller (OCC)
possibly timing out and going to Safe Mode when a system is changed
from the default maximum performance mode (Workload Optimized Frequency
(WOF) enabled) to nominal mode (WOF disabled) and then back to maximum
performance (WOF enabled again). Normal performance can be
recovered with a re-IPL of the system.
- A problem was fixed for the periodic guard reminder causing
a reset/reload of the service processor when it found a symbolic FRU
with no CCIN value in the list of guarded FRUs for the
system. Each time the periodic guard reminder is run,
every 30 days by default, this problem can cause recoverable errors on
the service processor but with no interruption to the workloads on the
running partitions.
- A problem was fixed for a wrong SubSystem being logged in
the SRC B7009xxxx for Secure Boot Errors. "I/O Subsystem" is
displayed instead of the correct SubSystem value of "System Hypervisor
Firmware".
- A problem was fixed for the lost recovery of a failed Self
Boot Engine (SBE). This may happen if the SBE recovery occurs
during a reset of the service processor. Not only is the recovery
lost, but the error log data for the SBE failure may also not be
written to the error log. If the SBE has failed and not recovered,
this can cause the post-dump IPL after a system Terminate
Immediate (TI) error to not be able to complete. To recover,
power off the system and IPL again.
- A problem was fixed for a missing SRC when runtime
diagnostics are lost and the Hostboot runtime services (HBRT) are put
into the failed state.
A B400F104 SRC is logged each time the HBRT hypervisor adjunct
crashes. On the fourth crash in one hour, HBRT is failed with no
further retries, but no SRC is logged. Although a unique SRC is
not logged to indicate loss of runtime diagnostic capability, the
B400F104 SRC does include the HBRT adjunct partition ID for Service to
identify the adjunct.
- A problem was fixed for a Novalink enabled partition not
being able to release master from the HMC that results in error
HSCLB95B. To resolve the issue, run a rebuild managed server
operation on the HMC and then retry the release. This occurs when
attempting to release master from HMC after the first boot up of a
Novalink enabled partition if Master Mode was enforced prior to the
boot.
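A sketch of that recovery from the HMC command line follows; the
command names and options are assumptions based on common HMC CLI usage
rather than taken from this document, so verify them against your HMC
level before use:
# Rebuild the managed system, then retry releasing the master role
chsysstate -m <System Name> -r sys -o rebuild
chcomgmt -m <System Name> -o relmaster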
- A problem was fixed for an UE memory error causing an
entire LMB of memory to deallocate and guard instead of just one page
of memory.
- A problem was fixed for all variants (this was partially
fixed in an earlier release) of the SR-IOV adapter firmware updates
using the HMC GUI or CLI so that only one SR-IOV adapter is rebooted at a
time. If multiple adapters are updated at the same time, the HMC
error message HSCF0241E may occur: "HSCF0241E Could not read
firmware information from SR-IOV device ...". This fix prevents
the system network from being disrupted by the SR-IOV adapter updates
when redundant configurations are being used for the network. The
problem can be circumvented by using the HMC GUI to update the SR-IOV
firmware one adapter at a time using the following steps: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
- A problem was fixed for a rare hypervisor hang caused by a
dispatching deadlock for two threads of a process. The system
hangs with SRC B17BE434 and SRC B182951C logged. This
failure requires high interrupt activity on a program thread that is
not currently dispatched.
- A problem was fixed for a Virtual Network Interface
Controller (vNIC) client adapter to prevent a failover when disabling
the adapter from the HMC. A failover to a new backing device
could cause the client adapter to erroneously appear to be active again
when it is actually disabled. This causes confusion and failures
on the OS for the device driver. This problem can only occur when
there is more than a single backing device for the vNIC adapter and if
commands are issued from the HMC to disable the adapter and enable
the adapter.
- A possible performance problem was fixed for workloads that
have a large memory footprint.
- A problem was fixed for error recovery in the timebase
facility to prevent an error in the system time. This is an
infrequent secondary error when the timebase facility has failed
and needs recovery.
- A problem was fixed for the HMC GUI and CLI interfaces
incorrectly showing SR-IOV updates as being available for certain
SR-IOV adapters when no updates are
available. This affects the following PCIe
adapters: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN
58FB; and #EC3L/#EC3M with CCIN 2CEC. The "Update
Available" indication in the HMC can be ignored if updates have already
been applied.
- A problem was fixed for the recovery of certain SR-IOV
adapters that fail with SRC B400FF05. This is
triggered by infrequent EEH errors in the adapter. In the
recovery process, the Virtual Function (VF) for the adapter
is rebuilt into the wrong state, preventing the adapter from
working. An HMC initiated disruptive resource dump of the adapter
can recover it. This problem affects the following PCIe
adapters: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN
58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for SR-IOV Virtual Functions (VFs)
halting transmission with a SRC B400FF01 logged when many logical
partitions with VFs are shut down while the adapter is under
highly-active usage by a workload. The recovery process reboots
the failed SR-IOV adapter, so no user intervention is needed to restore
the VF.
- A problem was fixed for VLAN-tagged frames
being transmitted over SR-IOV adapter VFs when the packets should
instead have been discarded for some VF configuration settings on
certain SR-IOV adapters. This affects the following PCIe
adapters:
#EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; and
#EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for SR-IOV adapter hangs with a
possible SRC B400FF01 logged. This may cause a temporary network
outage while the SR-IOV adapter VF reboots to recover from the adapter
hang. This problem has been observed on systems with high
network traffic and with many VFs defined.
This fix updates adapter firmware to 1x.22.4021 for the
following Feature Codes: EC2R, EC2S, EC2T, EC2U, EC3L and EC3M.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed where a large number (approximately
16,000) of DLPAR adds and removes of SR-IOV VFs could
cause a subsequent DLPAR add of a VF to fail, with the
newly-added VF not usable. The large number of
allocations and deallocations caused a leak of a critical SR-IOV
adapter resource. The adapter and VFs may be recovered by an
SR-IOV adapter reset.
- A problem was fixed for a system boot hanging when
recoverable attentions occur on the non-master processor. With
the fix, the attentions on the non-master processor are deferred until
Symmetric multiprocessing (SMP) mode has been established (the point at
which the system is ready for multiple processors to run). This
allows the boot to complete but still have the non-master processor
errors recovered as needed.
- A problem was fixed for certain hypervisor error logs being
slow to report to the OS. The error logs affected are those
created by the hypervisor immediately after the hypervisor is started
and when there are more than 128 error logs from the hypervisor to be
reported. The error logs at the end of the queue take a long time
to be processed, and may make it appear as if error logs are not being
reported to the OS.
- A problem was fixed for a Self Boot Engine (SBE) reset
causing the On-Chip Controller (OCC) to force the system into
Safe Mode with a flood of SRC B150DAA0 and SRC B150DA8A written to the
error log as Information Events.
- A problem was fixed for the Redfish "Manager" request
returning duplicate object URIs for the same HMC. This can occur
if the HMC was removed from the managed system and then later added
back in. The Redfish objects for the earlier instances of the
same HMC were never deleted on the remove.
- A problem was fixed for the service processor possibly
going to the stop state when performing a platform dump. This
problem is specific to dumps being collected for HWPROC
checkstops, which are not common.
- A problem was fixed for SMS menus to limit reporting on the
NPIV and vSCSI configuration to the first 511 LUNs. Without the
fix, LUN 512 through the last configured LUN report with invalid
data. Configurations in excess of 511 LUNs are very rare, and it
is recommended for performance reasons (to be able to search for the boot
LUN more quickly) that the number of LUNs on a single target be
limited to less than 512.
- The following two errors in the SR-IOV adapter firmware
were fixed: 1) The adapter resets and there is a B400FF01
reference code logged. This error
happens in rare cases when there are multiple partitions actively
running traffic through the adapter. System firmware resets the
adapter
and recovers the system with no
user-intervention required; 2) SR-IOV VFs with defined VLANs and an
assigned PVID are not able to ping each other.
This fix updates adapter firmware to 11.2.211.26 for the following
Feature Codes: EN15, EN17, EN0H,
EN0J, EN0M, EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for Field Core Override (FCO) cores
being allocated from a deconfigured processor, causing an IPL failure
with unusable cores. This problem only occurs during the Hostboot
reconfiguration loop in the presence of other processor failures.
- A problem was fixed for a failure in DDR4 RCD (Register
Clock Driver) memory initialization that causes half of the DIMM memory
to be unusable after an IPL. This is an intermittent problem
where the memory can sometimes be recovered by doing another IPL.
The error is not a hardware problem with the DIMM but it is an error in
the initialization sequence needed to get the DIMM ready for normal
operations. This supersedes an earlier fix delivered in FW910.01
that intermittently failed to correct the problem.
- A problem was fixed for IBM Product Engineering and Support
personnel not being able to easily determine planar jumper settings in
a machine in order to determine the best mitigation strategies for
various field problems that may occur. With the fix, an
Information Error log is provided on every IPL to provide the planar
jumper settings.
- A problem was fixed for the periodic guard reminder
function to not re-post errorlogs of failed FRUs on each IPL.
Instead, a reminder SRC is created to call home the list of FRUs that
have failed and require service. This puts the system back to the
original behavior of only posting one error log for each FRU that has
failed.
- For a HMC managed system, a problem was fixed for a rare,
intermittent NetsCMS core dump that could occur whenever the system is
doing a deferred shutdown power off. There is no impact to normal
operations as the power off completes, but there are extra error logs
with SRC B181EF88 and a service processor dump.
- A problem was fixed for a Hostboot hang due to deadlock
that can occur if there is a SBE dump in progress that fails. A
failure in the SBE dump can trigger a second SBE dump that deadlocks.
- A problem was fixed for dump performance by decreasing the
amount of time needed to perform dumps by 50%.
- A problem was fixed for an IPL hang that can occur for
certain rare processor errors, where the system is in a loop trying to
isolate the fault.
- A problem was fixed for an enclosure fault LED being stuck
on after a repair of a fan. This problem only occurs after the
second concurrent repair of a fan.
- A problem was fixed for SR-IOV adapters not showing up in
the device tree for a partition that auto-boots or starts within a few
seconds of the hypervisor going ready. This problem can be
circumvented by delaying the boot of the partition for at least a
minute after the hypervisor has reached the standby state. If the
problem is encountered, the SR-IOV adapter can be recovered by
rebooting the partition, or by using DLPAR to remove and add the SR-IOV
adapter to the partition.
- A problem was fixed for a system crash with SRC B700F103
when there are many consecutive configuration changes in the LPARs to
delete old vNICs and create new vNICs, which exposed an infrequent
problem with lock ownership on a virtual I/O slot. There is a
one-to-one mapping or connection between vNIC adapter in the client
LPAR and the backing logical port in the VIOS, and the lock management
needs to ensure that the LPAR accessing the port has ownership to
it. In this case, the LPAR was trying to make usable a device it
did not own. The system should recover on the post dump IPL.
- A problem was fixed for a possible DLPAR add failure of a
PCIe adapter if the adapter is in the planar slot C7 or slot C6 on any
PCIe Expansion drawer fanout module. The problem is more common
if there are other devices or Virtual Functions (VFs) in the same LPAR
that use four interrupts, as this is a problem with the processing
order of the PCIe LSI interrupts.
- A problem was fixed for resource dumps that use the
selector "iomfnm" and options "rioinfo" or "dumpbainfo". This
combination of options for resource dumps always fails without the fix.
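For reference, such a resource dump is typically requested from the HMC
command line along the lines of the sketch below; the startdump syntax
and the selector string format shown are assumptions and are not taken
from this document:
startdump -m <System Name> -t resource -r "iomfnm rioinfo"
startdump -m <System Name> -t resource -r "iomfnm dumpbainfo"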
- A problem was fixed for missing FFDC data for SR-IOV
Virtual Function (VF) failures and for not allowing the full
architected five minute limit for a recovery attempt for the VF, which
should expand the number of cases where the VF can be recovered.
- A problem was fixed for missing error recovery for memory
errors in non-mirrored memory when reading the SR-IOV adapter firmware,
which could prevent the SR-IOV VF from booting.
- A problem was fixed for a possible system crash if an error
occurs at runtime that requires a FRU guard action. With the fix,
the guard action is restricted to the IPL where it is supported.
- A problem was fixed for an extremely rare IPL hang on a
false communications error to the power supply. Recovery is to
retry the IPL.
- A problem was fixed for the dump content type for HBMEM
(Hostboot memory) to be recognized instead of displaying "Dump Content
Type: not found".
- A problem was fixed for a system crash when an SR-IOV
adapter is changed from dedicated to shared mode with SRC B700FFF and
SRC B150DA73 logged. This failure requires that hypervisor
memory relocation be in progress on the system. This affects the
following PCIe adapters: #EC2R/#EC2S with CCIN 58FA;
#EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for a Live Partition Mobility (LPM)
migration of a partition with shared processors that has an unusable
shared processor that can result in failure of the target partition or
target system. This problem can be avoided by making sure all
shared processors are functional in the source partition before
starting the migration. The target partition or system can be
rebooted to recover it.
- A problem was fixed for hypervisor memory relocation and
Dynamic DMA Window (DDW) memory allocation used by I/O adapter slots
for some adapters where the DDW memory tables may not be fully
initialized between uses. Infrequently, this can cause an
internal failure in the hypervisor when moving the DDW memory for the
adapters. Examples of programs using memory relocation are
Live Partition Mobility (LPM) and the Dynamic Platform Optimizer (DPO).
- A problem was fixed for a partition or system termination
that may occur when shutting down or deleting a partition on a system
with a very large number of partitions (more than 400) or on a system
with fewer partitions but with a very large number of virtual adapters
configured.
- A problem was fixed for when booting a large number of
LPARs with Virtual Trusted Platform Module (vTPM) capability, some
partitions may post a SRC BA54504D time-out for taking too long to
start. With the fix, the time allowed to boot a vTPM LPAR is
increased. If a time-out occurs, the partition can be booted
again to recover. The problem can be avoided by auto-starting
fewer vTPM LPARs, or booting them a couple at a time to prevent
flooding the vTPM device server with requests that will slow the boot
time while the LPARs wait on the vTPM device server responses.
- A problem was fixed for a possible system crash.
- A problem was fixed for a UE B1812D62 logged when a PCI
card is removed between system IPLs. This error log can be
ignored.
- A problem was fixed for USB code update failure if the USB
stick is plugged in during an AC power cycle. After the power cycle
completes, the code update will fail to start from the USB
device. As a circumvention, the USB device can be plugged in
after the service processor is in its ready state.
- A problem was fixed for a possible slower migration during
the Live Partition Mobility (LPM) resume stage. For a
migrating partition that does not have a high demand page rate, there
is minimal impact on performance. There is no need for customer
recovery as the migration completes successfully.
- A problem was fixed for firmware assisted dumps (fadump)
and Linux kernel crash dumps (kdump) where dump data is missing.
This can happen if the dumps are set up with chunks greater than 1
GB in size. This problem can be avoided by setting up
fadump or kdump with multiple 1 GB chunks.
- A problem was fixed for the I2C bus error logged with SRC
BC500705 and SRC BC8A0401 where the I2C bus was locked up. This
is an infrequent error. In rare cases, the TPM device may hold down the
I2C clock line longer than allowed, causing an error recovery that
times out and prevents the reset from working on all the I2C engine's
ports. A power off and power on of the system should clear the
bus error and allow the system to IPL.
- A problem was fixed for an intra-node, inter-processor
communication lane failure marked in the VPD, causing a secure boot
blacklist violation on the IPL and a processor to be deconfigured with
an SRC BC8A2813 logged.
- A problem was fixed to capture details of failed FRUs into
the dump data by delaying the deconfiguration of the FRUs for checkstop
and TI attentions.
- A problem was fixed for failed processor cores not being
guarded on a memory preserving IPL (re-IPL with CEC powered on).
- A problem was fixed for debug data missing in dumps for
cores which are off-line.
- A problem was fixed for L3 cache calling out a LRU Parity
error too quickly for hardware that is good. Without the fix,
ignore the L3FIR[28] LRU Parity errors unless they are very persistent
with 30 or more occurrences per day.
- A problem was fixed for not having a FRU callout when the
TPM card is missing and causes an IPL failure.
- A problem was fixed for the Advanced System Management
Interface (ASMI) displaying the IPv6 network prefix in decimal instead
of hex character values. The service processor command line
"ifconfig" can be used to see the IPv6 network prefix value in hex as a
circumvention to the problem.
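As an illustration of the decimal-versus-hex display (the group values
below are placeholders, not taken from this document), a decimal group
shown by ASMI can be converted to the expected hex form with printf:
# Example only: a prefix group displayed as 254.128 corresponds to fe80
printf '%02x%02x\n' 254 128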
- A problem was fixed for an On-Chip Controller (OCC) cache
fault causing a loss of the OCC for the system without the system
dropping into Safe mode.
- A problem was fixed for system dump failing to collect the
pu.perv SCOMs for chiplets c16 and above which correspond to EQ and EC
chiplets.
Also fixed was the missing SCOM data for the interrupt unit related
"c_err_rpt" registers.
- A problem was fixed for the PCIe topology reports having
slots missing in the "I/O Slot Locations" column in the row for the bus
representing a PCIe switch. This only occurs when the C49
or C50 slots are bifurcated (a slot having two channels).
Bifurcation is done if an NVME module is in the slot or if the slot is
empty (for certain levels of backplanes).
- A problem was fixed for Live Partition Mobility (LPM)
failing along with other hypervisor tasks, but the partitions continue
to run. This is an extremely rare failure where a re-IPL is
needed to restore HMC or Novalink connections to the partitions, or to
do any system configuration changes.
- A problem was fixed for a system termination during a
concurrent exchange of an SR-IOV adapter that had VFs assigned to
it. For this problem, the OS failed to release the VFs but the
error was not returned to the HMC. With the fix, the FRU exchange
gracefully aborts without impacting the system for the case where the
VFs on the SR-IOV adapter remain active.
- A possible performance problem was fixed for partitions
with shared processors that had latency in the handling of the
escalation interrupts used to switch the processor between tasks.
The effect of this is that, while the processor is kept busy, some
tasks might hold the processor longer than they should because the
interrupt is delayed, while others run slower than normal.
- A problem was fixed for a system termination that may occur
with B111E504 logged when starting a partition on a system with a very
large number of partitions (more than 400) or on a system with fewer
partitions but with a very large number of virtual adapters configured.
- A problem was fixed for a system termination that may occur
with a B150DA73 logged when a memory UE is encountered in a partition
when the hypervisor touches the memory. With the fix, the touch
of memory by the hypervisor is a UE tolerant touch and the system is
able to continue running.
- A problem was fixed for fabric errors such as cable pulls
causing checkstops. With the fix, the PBAFIR errors are changed to
recoverable attentions, allowing the OCC to be reset to recover from
such faults.
System firmware changes that affect certain systems
- A problem was fixed to remove from ASMI a SAS battery LED
that does not exist. This problem only pertains to the
S914 (9009-41A), S924 (9009-42A) and H924 for SAP HANA (9223-42H) models.
- On a system with an AIX partition, a problem was
fixed for a partition time jump that could occur after doing an AIX
Live Update. This problem could occur if the AIX Live Update
happens after a Live Partition Mobility (LPM) migration to the
partition. AIX applications using the timebase facility could
observe a large jump forwards or backwards in the time reported by the
timebase facility. A circumvention to this problem is to
reboot the partition after the LPM operation prior to doing the AIX
Live Update. An AIX fix is also required to resolve this
problem. The issue will no longer occur when this firmware update
is applied on the system that is the target of the LPM operation and
the AIX partition performing the AIX Live Update has the appropriate
AIX updates installed prior to doing the AIX Live Update.
- On a Linux or IBM i partition which has just
completed a Live Partition Mobility (LPM) migration, a problem was
fixed for a VIO adapter hang when it stops processing interrupts.
For this problem to occur, prior to the migration the adapter must have
had an interrupt outstanding where the interrupt source was disabled.
- On systems with an IBM i partition, support was added
for multipliers for IBM i MATMATR fields that are limited to four
characters. When retrieving Server metrics via IBM MATMATR calls,
and the system contains greater than 9999 GB, for example, MATMATR has
an architected "multiplier" field such that 10,000 GB can be
represented
by 5,000 GB * Multiplier of 2, so '5000' and '2' are returned in
the quantity and multiplier fields, respectively, to handle these
extended values. The IBM i OS also requires a PTF to support the
MATMATR field multipliers.
- On a system with an IBM i partition with more than 64 shared
processors assigned to it, a problem was fixed for a system
termination or other unexpected behavior that may occur during a
partition dump. Without the fix, the problem can be avoided by
limiting the IBM i partition to 64 or fewer shared processors.