SC840
For Impact, Severity, and other firmware definitions, please
refer to the 'Glossary of firmware terms' at the following URL:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
The following fix description table contains only the N (current)
and N-1 (previous) levels.
The complete Firmware Fix History (including HIPER descriptions) for
this Release Level can be reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html
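The level names used in the table below (for example SC840_177_056) can be split into fields when comparing the N and N-1 levels. The following is a minimal Python sketch, assuming the conventional reading of the name as platform and release (SC840), service pack level (177), and last disruptive service pack level (056). The function and field names are illustrative only and are not part of any IBM tool.

```python
import re
from typing import NamedTuple


class FirmwareLevel(NamedTuple):
    # Field names are assumptions made for this sketch.
    platform: str         # e.g. "SC"
    release: int          # e.g. 840
    service_pack: int     # e.g. 177
    last_disruptive: int  # e.g. 56


# An optional two-digit package prefix (as in "01SC840_177_056") is tolerated.
_LEVEL_RE = re.compile(r"^(?:\d{2})?([A-Z]{2})(\d{3})_(\d{3})_(\d{3})$")


def parse_level(name: str) -> FirmwareLevel:
    """Split a level name such as 'SC840_177_056' into its fields."""
    match = _LEVEL_RE.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware level name: {name!r}")
    platform, release, service_pack, last_disruptive = match.groups()
    return FirmwareLevel(platform, int(release), int(service_pack),
                         int(last_disruptive))


def is_newer(candidate: str, installed: str) -> bool:
    """Return True if 'candidate' is a later level than 'installed'."""
    new, old = parse_level(candidate), parse_level(installed)
    if new.platform != old.platform:
        raise ValueError("levels belong to different platforms")
    return (new.release, new.service_pack) > (old.release, old.service_pack)


print(parse_level("SC840_177_056"))
print(is_newer("SC840_177_056", "SC840_168_056"))  # True
```

Comparing the release and service pack fields as a tuple keeps the ordering correct across release boundaries as well as within one release.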
|
SC840_177_056 / FW840.60
09/29/17 |
Impact: Availability
Severity: SPE
System firmware changes that affect all systems
- A problem was fixed for a false 110026B1 (12V power good
failure) caused by an I2C bus write error for a LED state. This
error can be triggered by the fan LEDs changing state.
- A problem was fixed for a fan LED turning solid amber
when there is no fan fault, or when the fan fault is for a different
fan. This error can be triggered anytime a fan LED needs to
change its state. The fan LEDs can be recovered to a normal state
concurrently using the following link steps for a soft reset of the
service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for sporadic blinking amber LEDs for
the system fans with no SRCs logged. There was no problem with
the fans. The LED corruption occurred when two service processor
tasks attempted to update the LED state at the same time. The fan
LEDs can be recovered to a normal state concurrently using the
following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for the loss of Operations Panel
function 30 (displaying ethernet port HMC1 and HMC2 IP addresses)
after a concurrent repair of the Operations Panel.
Operations Panel function 30 can be restored concurrently using
the following link steps for a soft reset of the service
processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for a core dump of the rtiminit
(service processor time of day) process that logs an SRC B15A3303
and could invalidate the time on the service processor. If the
error occurs while the system is powered on, the hypervisor has the
master time and will refresh the service processor time, so no action
is needed for recovery. If the error occurs while the system is
powered off, the service processor time must be corrected on the
systems having only a single service processor. Use the following
steps from the IBM Knowledge Center to change the UTC time with the
Advanced System Management Interface: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/viewtime.htm.
- A problem was fixed for the "Minimum code level supported"
not being shown by the Advanced System Management Interface (ASMI)
when selecting
the "System Configuration/Firmware Update Policy" menu. The
message shown is "Minimum code level supported value has not been
set". The workaround to find this value is to use the ASMI
command line interface with the "registry -l cupd/MinMifLevel" command.
- A problem was fixed for a degraded PCI link causing a
Predictive SRC for a non-cacheable unit (NCU) store time-out that
occurred with SRC B113E540 or B181E450 and PRD signature
"(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB". With the fix,
the error is changed to be an Informational as the problem is not with
the processor core and the processor should not be replaced. The
solution for degraded PCI links is different from the fix for this
problem, but a re-IPL of the CEC or a reset of the PCI adapters could
help to recover the PCI links from their degraded mode.
- A problem was fixed for system node fans going to maximum
RPM speeds after a service processor failover that needed the On-Chip
Controllers (OCC) to be reloaded. Without the fix, the system
node fan speeds can be restored to normal speed by changing the Power
Mode in the Advanced System Management Interface (ASMI) using steps from the IBM
Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/areaa_pmms.htm.
After changing the Power Mode, wait about 10 minutes before changing
the Power Mode back to the original setting.
If the fix is applied concurrently and the fans are already in the
maximum RPM speed condition, the system node fan speeds can be
corrected by either changing the Power Mode as above, or using the HMC
to do an Administrative Failover (AFO).
- A problem was fixed for the System Attention LED failing to
light for an error failover for the redundant service processors with
an SRC B1812028 logged.
- A problem was fixed for a service processor reset triggered
by a spurious IIC interrupt request in the kernel. On
systems with a single service processor, the SRC B1817201 is displayed
on the Operator Panel. For systems with redundant service
processors, an error failover to the backup service processor
occurs. The problem is extremely infrequent and does not impact
processes on the running system.
- A problem was fixed for the service processor low-level
boot code always running off the same side of the flash image,
regardless of what side has been selected for boot (P-side or
T-side). Because this low-level boot code rarely changes, this
should not cause a problem unless corruption occurs in the flash image
of the boot code. This problem does not affect firmware
side-switches as the service processor initialization code
(higher-level code than the boot code) is running correctly from
the selected side. Without the fix, there is no recovery for boot
corruption for systems with a single service processor as the service
processor must be replaced.
- A problem was fixed for a system failure caused by Hostboot
problems with one node but where the other nodes are good. With
the fix, the node that is failing the Hostboot is deconfigured and the
system is able to IPL on the remaining nodes. To recover from
this problem, manually guard the node that is failing and re-IPL.
- A problem was fixed for help text in the Advanced System
Management Interface (ASMI) not informing the user that system fan
speeds would increase if the system Power Mode was changed to "Fixed
Maximum Frequency" mode. If ASMI panel function "System
Configuration->Power Management->Power Mode Setup" "Enable Fixed
Maximum Frequency mode" help is selected, the updated text states
"...This setting will result in the fans running at the maximum speed
for proper cooling."
- A problem was fixed for a Power Supply Unit (PSU) failure
with SRC 110015xF logged and a power supply fan callout
when doing a hot re-plug of a PSU. The power supply may be
made operational again by doing a dummy replace of the PSU that was
called out (keeping the same PSU for the replace operation). A
re-IPL of the system will also recover the PSU.
- A problem was fixed for recovery from clock card loss of
lock failures that resulted in a clock card FRU unnecessarily being
called out for repair. This error happened whenever there was a
loss of lock (PLL or CRC) for the clock card. With the fix,
firmware will not be calling out the failing clock card, but rather it
will be re-configured as the new backup clock card after doing a clock
card failover. Customers will see a benefit from improved system
availability by the avoidance of disruptive clock card repairs.
System firmware changes that affect certain systems
- DEFERRED: On systems using
PowerVM firmware, a problem was fixed to improve the link stability of
the PCIe3 I/O expansion drawer (#EMX0). The settings for the continuous
time linear equalizers (CTLE) were updated for all the PCIe adapters for
the PCIe links to the expansion drawer. The CEC must be re-IPLed
for the fix to activate.
- On systems using PowerVM firmware, a problem was
fixed for an intermittent service processor core dump and callout for
netsCommonMSGServer with SRC B181EF88. The HMC connection
to the service processor automatically recovers with a new session.
- On systems using PowerVM firmware with a Linux Little
Endian (LE) partition, a problem was fixed for system reset interrupts
returning the wrong values in the debug output for the NIP and MSR
registers. This problem reduces the ability to debug hung Linux
partitions using system reset interrupts. The error occurs every
time a system reset interrupt is used on a Linux LE partition.
- On systems using PowerVM firmware, a problem was fixed for
"Time Power On" enabled partitions not being capable of suspend and
resume operations. This means Live Partition Mobility (LPM) would
not be able to migrate this type of partition. As a workaround,
the partition could be transitioned to a "Non-time Power On" state and
then made capable of suspend and resume operations.
- On systems using PowerVM firmware, a problem was
fixed for Power Enterprise Pool (PEP) IFL processor assignments
causing an "Out of Compliance" for normal processor licenses. The
number of IFL processors purchased was first credited as satisfying any
"unreturned" PEP processor resources, thus potentially leaving the
system "Out Of Compliance" since IFL processors should not be taking
the place of the normal (expensive) processor usage. In this
situation, without the fix, the user will need to either purchase more
"expensive" non-IFL processors to satisfy the non-IFL workloads or
adjust the partitions to reduce the usage of non-IFL processors.
This is a very infrequent problem for the following reasons:
1) PEP processors are infrequently left "unreturned" for short periods
of time for specialized operations such as LPM migrations
2) The user would have to purchase IFL processors from IBM, which is
not a common occurrence.
3) The user would have to put in a COD key for IFL processors while a
PEP processor is still "unreturned"
- On systems using PowerVM firmware, a problem was fixed for
a Power Enterprise Pool (PEP) resource Grace Period being short by one
hour with 71 hours provided instead of 72 hours. The Grace Period
is provided when all PEP resources are assigned and the user
double-uses these resources (typically this is done for a Live
Partition Mobility (LPM) migration). This "borrowing" is
temporarily permitted in this case even if there are not enough
licenses to cover resources in both servers. The PEP goes into
"Approaching Out Of Compliance", indicating the user has a certain
amount of time to resolve this double-use. The problem here is that the
time length of this Grace Period lasts one hour less than stated.
For a 72-hour Grace Period (the standard setting), the user only gets
71 hours. The user sees "71 hours remaining" (correct) on first
display at start, then right away, if the user displays again, 70
hours is shown remaining. But thereafter, the Grace Period time
decrements correctly for the time remaining.
- On systems using PowerVM firmware, a problem was fixed for
Power Enterprise Pool (PEP) non-applicable error messages being
displayed when re-entering PEP XML files for PEP updates, in which one
of the XML operations calls for Conversion of Perm Resources to PEP
Resources. There is no error as the PEP key was accepted on the
first use. The following message may be seen on the HMC and can
be ignored: "...HSCL0520 A Mobile CoD processor conversion
code to convert 0 permanently activated processors to Mobile CoD
processors on the managed system has been entered. HSCL050F This
CoD code is not valid for your managed system. Contact your CoD
administrator."
- On systems using PowerVM firmware, a problem was fixed for
reboot retries for IBM i partitions such that the first load source I/O
adapter (IOA) is retried instead of bypassed after the first failed
attempt. The reboot retries are done for an hour before the
reboot process gives up. This error can occur if there is more
than one known load source, and the IOA of the first load source is
different from the IOA of the last load source. The error can be
circumvented by retrying the boot of the partition after the load
source device has become available.
- On systems using PowerVM firmware with mirrored memory
running IBM i partitions, a problem was fixed for memory fails in the
partition that also caused the system to crash. The system
failure will occur any time that IBM i partition memory towards the
beginning of the partition's assigned memory fails. With the fix,
the memory failure is isolated to the impacted partition, leaving the
rest of the system unaffected.
- On systems using PowerVM firmware, a problem was fixed for
failures deconfiguring SR-IOV Virtual Functions (VFs). This can
occur during Live Partition Mobility (LPM) migrations with HMC error
messages HSCLAF16, HSCLAF15, and HSCLB602 shown. This
results in an LPM migration failure and a system reboot is required to
recover the VFs for the I/O adapters. This error may occur more
frequently in cases where the I/O adapter has pending I/O at the time
of the deconfigure request for the VF.
- On systems using PowerVM firmware, a problem was fixed for
the incorrect reporting of the Universally Unique Identifier (UUID) to
the OS, which prevented the tracking of a partition as it moved within
a data center. The UUID value as seen on HMC or the NovaLink did
not match the value as displayed in the OS.
- On systems using PowerVM firmware, a problem
was fixed for a partition boot from a USB 3.0 device that has an error
log SRC BA210003. The error is triggered by an Open Firmware
entry to the trace buffer during the partition boot. The error
log can be ignored as the boot is successful to the OS.
- On systems using PowerVM firmware, a problem
was fixed for a partition boot fail or hang from a Fibre Channel device
having fabric faults. Some of the fabric errors returned by the
VIOS are not interpreted correctly by the Open Firmware VFC driver,
causing the hang instead of generating helpful error logs.
- On systems using PowerVM firmware, problems were
fixed for communication failures on adapters in SR-IOV shared mode:
1) A problem was fixed for SR-IOV adapters in shared mode for a
transmission stall or time out with SRC B400FF01 logged. The time
out happens during Virtual Function (VF) shutdowns and during Function
Level Resets (FLRs) with network traffic running.
2) A problem was fixed for an SR-IOV logical port whose Port VLAN ID
(PVID) changing from non-zero to zero causes a communication failure
under certain conditions. The communication failure only occurs
when a logical port's PVID is dynamically changed from non-zero to
zero. An SR-IOV logical port is an I/O device created for a
partition or a partition profile using the management console (HMC)
when a user intends for the partition to access an SR-IOV adapter
Virtual Function. The error can be recovered from by a reboot of
the partition.
These fixes update the adapter firmware to 10.2.252.1929 for the
following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M,
EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems using PowerVM firmware with PowerVM NovaLink, a
problem was fixed for the loss of a communications channel between the
hypervisor and the PowerVM NovaLink during a reset of the service
processor. Various NovaLink tasks, including deploy, could fail
with a "No valid host was found" error. With the fix, PowerVM
NovaLink prevents normal operations from being impacted by a reset of
the service processor.
- On systems using PowerVM firmware with PowerVM NovaLink, a
problem was fixed for returning to HMC-only management from
co-management when a NovaLink partition holding the master mode
is deleted. A circumvention is to release master mode before
deleting the NovaLink partition and then reconnect the disconnected
management console. Please refer to IBM Knowledge Center link "http://ibm.biz/novalink-kc" for
more information on the PowerVM NovaLink feature and changing the
master authority when doing co-management.
- On systems using PowerVM firmware with PowerVM NovaLink, a
problem was fixed for a master management console becoming disconnected
and blocking other management consoles from performing virtualization
changes. A circumvention is to use the HMC CLI on another management
console to request the master mode with the force option.
Please refer to IBM Knowledge Center link "http://ibm.biz/novalink-kc" for
more information on the PowerVM NovaLink feature and changing the
master authority when doing co-management.
- On systems using PowerVM firmware, a problem was fixed for
Power Enterprise Pool (PEP) busy errors from the system anchor card
when creating or updating a PEP pool. The error
returned by the HMC is "HSCL9015 The managed system cannot currently
process this operation. This condition is temporary. Please
try the operation again." To try again, the customer needs to
update the pool again. Typically on the second PEP update, the
code is accepted.
The problem is intermittent and occurs only rarely.
- On systems using PowerVM firmware, a problem was fixed for
an invalid date from the service processor causing the customer date
and time to go to the Epoch value (01/01/1970) without a warning or
chance for a correction. With the fix, the first IPL
attempted on an invalid date will be rejected with a message alerting
the user to set the time correctly in the service processor. If
the warning is ignored and the date/time is not corrected, the next IPL
attempt will complete to the OS with the time reverted to the Epoch
time and date. This problem is very rare but it has been known to
occur on service processor replacements when the repair step to set the
date and time on the new service processor was inadvertently skipped by
the service representative.
- On systems using PowerVM firmware, a problem was fixed for
a Power Enterprise Pool (PEP) system losing its assigned processor and
memory resources after an IPL of the system. This is an
intermittent problem caused by a small timing window that makes it
possible for the server to not get the IPL-time assignment of resources
from the HMC. If this problem occurs, it can be corrected by the
HMC to recover the pool without needing another IPL of the system.
- On systems using PowerVM firmware, a problem was
fixed for the error handling of EEH events for the SR-IOV Virtual
Functions (VFs) that can result in IPL failure with B7006971, B400FF05,
and BA210000 SRCs logged. In these cases, the partition console
stops at an OFDBG prompt. Also a DLPAR add of a VF may result in
a partition crash due to a 300 DSI exception because of a low-level EEH
event. A circumvention for the problem would be to debug the EEH
events which should be recovered errors and eliminate the cause of the
EEH events. With the fix, the EEH events still log Predictive
Errors but do not cause a partition failure.
- On systems using PowerVM firmware, a problem was fixed for
an error finding the partition load source that has a GPT format.
GUID Partition Table (GPT) is a standard for the layout of the
partition table on a physical storage device used in the server, such
as a hard disk drive or solid-state drive, using globally unique
identifiers (GUID). Other drives that are working may be using
the older master boot record (MBR) partition table format. This
problem occurs whenever load sources utilizing the GPT format occur in
other than the first entry of the boot table. Without the fix, a
GPT disk drive must be the first entry in the boot table to be able to
use it to boot a partition.
- On systems using PowerVM firmware, a problem was fixed for
an SRC BA090006 serviceable event log occurring whenever an attempt was
made to boot from an ALUA (Asymmetric Logical Unit Access)
drive. These drives are always busy by design and cannot be used
for a partition boot, but no service action is required if a user
inadvertently tries to do that. Therefore, the SRC was changed to
be an informational log.
- On systems using PowerVM firmware, a problem was fixed for
Live Partition Mobility (LPM) migrations from FW860.12 or later to the
FW840.50 level of firmware. Subsequent DLPAR add operations of Virtual
Adapters will fail with HMC error message HSCLAB2B, which contains text
similar to the following: "The operation to add a virtual NIC in
slot 8 on partition 9 failed. The requested amounts of slot(s) to be
added is 1 and the completed amount is 0." The AIX OS
standard error message with return code 3 is the following: "0931-007
You have specified an invalid drc_name." This issue affects
partitions installed with AIX 7.2 TL 1 and later. Not
affected by this issue are partitions installed with VIOS, IBM i, or
earlier levels of AIX. The error can be recovered by a reboot of
the affected partition.
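The GPT load source fix in the list above depends on telling a GPT disk apart from an MBR disk at every boot table entry, not only the first. The following is a minimal Python sketch of that distinction, assuming 512-byte logical sectors and using synthetic in-memory images. The function names and the boot table representation are hypothetical and do not reflect the firmware's actual code.

```python
SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def partition_table_format(first_sectors: bytes,
                           sector_size: int = SECTOR_SIZE) -> str:
    """Classify the first two sectors of a disk as 'GPT', 'MBR', or 'unknown'."""
    if len(first_sectors) < 2 * sector_size:
        raise ValueError("need at least the first two sectors")
    # The GPT header lives in LBA 1 and starts with the signature "EFI PART".
    if first_sectors[sector_size:sector_size + 8] == b"EFI PART":
        return "GPT"
    # An MBR ends LBA 0 with the boot signature 0x55 0xAA. GPT disks carry a
    # protective MBR with the same signature, so GPT is checked first.
    if first_sectors[510:512] == b"\x55\xaa":
        return "MBR"
    return "unknown"


def recognized_entries(boot_table):
    """Yield (index, format) for every entry with a recognized partition table."""
    for index, image in enumerate(boot_table):
        fmt = partition_table_format(image)
        if fmt != "unknown":
            yield index, fmt


# Synthetic images for illustration only.
mbr_image = bytearray(2 * SECTOR_SIZE)
mbr_image[510:512] = b"\x55\xaa"
gpt_image = bytearray(mbr_image)
gpt_image[SECTOR_SIZE:SECTOR_SIZE + 8] = b"EFI PART"

boot_table = [bytes(mbr_image), bytes(gpt_image)]
print(list(recognized_entries(boot_table)))  # [(0, 'MBR'), (1, 'GPT')]
```

Every entry is inspected in turn, so a GPT disk in the second position is still reported as a candidate load source.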
|
SC840_168_056 / FW840.50
04/21/17 |
Impact: Availability
Severity: SPE
New features and functions
- Support for the Advanced System Management Interface (ASMI) was
changed to allow the special characters "I", "O", and "Q" to be
entered for the serial number of the I/O Enclosure under the Configure
I/O Enclosure option. These characters have only rarely been found in
an IBM serial number, so typing in these characters will normally be an
incorrect action. However, the special character entry is no longer
blocked by ASMI, so it is able to support the exception case. Without
the enhancement, typing one of the special characters causes the
message "Invalid serial number" to be displayed.
- On systems using PowerVM firmware, support was added
for the Universally Unique IDentifier (UUID) property for each
partition. The UUID provides each partition with an identifier
that is persisted by the platform across partition reboots,
reconfigurations, OS reinstalls, partition migration, and
hibernation.
System firmware changes that affect all systems
- A problem was fixed for the setting that disables the periodic
notification for a call home error log SRC B150F138 for Memory Buffer
resources (membuf) from the Advanced System Management Interface (ASMI).
- A problem was fixed for incorrect callouts of the Power Management
Controller (PMC) hardware with SRC B1112AC4 and SRC B1112AB2 logged.
These extra callouts occur when the On-Chip Controller (OCC) has placed
the system in the safe state for a prior failure that is the real
problem that needs to be resolved.
- A problem was fixed for device time-outs during an IPL logged with
SRC B18138B4. This error is intermittent and no action is needed for
the error log. The service processor hardware server now allots more
time for the device transactions so that they can complete
without a time-out error.
- A problem was fixed for the Advanced System Management
Interface (ASMI) "System Service Aids => Error/Event Logs" panel not
showing the "Clear" and "Show" log options and also having a truncated
error log when there are a large number of error logs on the system.
- A problem was fixed for the failover to the backup PNOR on a
Hostboot Self Boot Engine (SBE) failure. Without the fix, the failed
SBE causes loss of processors and memory with B15050AD logged. With
the fix, the SBE is able to access the backup PNOR and IPL successfully
by deconfiguring the failing PNOR and calling it out as a failed FRU.
- A problem was fixed for System Vital Product Data (SVPD) FRUs
being guarded but not having a corresponding error log entry. This is
a failure to commit the error log entry that has occurred only rarely.
- A problem was fixed for a system going into safe mode
with SRC B1502616 logged as informational without a call home
notification. Notification is needed because the system is
running with reduced performance. If there are unrecoverable
error logs and any are marked with reduced performance and the system
has not been rebooted, then the system is probably running in safe mode
with reduced performance. With the fix, the SRC B1502616 is an
Unrecoverable Error (UE).
- A problem was fixed for the service processor boot
watch-dog timer expiring too soon during DRAM initialization in the
reset/reload, causing the service processor to go unresponsive.
On systems with a single service processor, the SRC B1817212 was
displayed on the control panel. For systems with redundant
service processors, the failing service processor was
deconfigured. To recover the failed service processor, the system
will need to be powered off with AC power removed during a regularly
scheduled system service action. This problem is intermittent and
very infrequent as most of the reset/reloads of the service processor
will work correctly to restore the service processor to a normal
operating state.
- A problem was fixed for host-initiated resets of the
service processor causing the system to terminate. A prior fix
for this problem did not work correctly because some of the
host-initiated resets were being translated to unknown reset types that
caused the system to terminate. With this new correction for
failed host-initiated resets, the service processor will still be
unresponsive but the system and partitions will continue to run.
On systems with a single service processor, the SRC B1817212 will be
displayed on the control panel. For systems with redundant
service processors, the failing service processor will be
deconfigured. To recover the failed service processor, the system
will need to be powered off with AC power removed during a regularly
scheduled system service action. This problem is intermittent and
very infrequent as most of the host-initiated resets of the service
processor will work correctly to restore the service processor to a
normal operating state.
- A problem was fixed for hardware dumps only collecting data
for the master processor if a run-time service processor failover had
occurred prior to the dump. Therefore, there would be only master
chip and master core data in the event of a core unit checkstop.
To recover to a system state that is able to do a full collection of
debug data for all processors and cores after a run-time failover, a
re-IPL of the system is needed.
- A problem was fixed for incorrect error messages from the
Advanced System Management Interface (ASMI) functions when the system
is powered on but in the "Incomplete State". For this
condition, ASMI was assuming the system was powered off because it
could not communicate to the PowerVM hypervisor. With the fix,
the ASMI error messages will indicate that ASMI functions have failed
because of the bad hypervisor connection instead of falsely stating
that the system is powered off.
- A problem was fixed for a single node failure on a
multi-node system preventing an IPL. The error occurred if
Hostboot hung on a node and timed out without calling out problem
hardware. With the fix, a service processor failover is used to
IPL on an alternate path to recover from the error. And an error
log has been added for the IPL timeout for the node with SRC B111BAAB
and a callout for the master processor and PNOR.
- A problem has been fixed for systems losing
performance and going into Safe mode (a power mode with reduced
processor frequencies intended to protect the system from over-heating
and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs
logged. This happened because of an On-Chip Controller
(OCC) internal queue overflow. The problem has only been observed for
systems running heavy workloads with maximum memory configurations
(where every DIMM slot is populated - size of DIMM does not matter),
but this may not be required to encounter the problem. Recovery
from Safe mode back to normal performance can
be done with a re-IPL of the system, or concurrently using the
following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
To check or validate that Safe mode is not active on the system will
require a dynamic celogin password from IBM Support to use the service
processor command line:
1) Log into ASMI as celogin with dynamic celogin password
generated by IBM Support
2) Select System Service Aids
3) Select Service Processor Command Line
4) Enter "tmgtclient --query_mode_and_function" from the command line
The first line of the output, "currSysPwrMode" should say "NOMINAL" and
this means the system is in normal mode and that Safe mode is not
active.
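The final check in the steps above (the first output line should report "currSysPwrMode" as "NOMINAL") could be scripted once the command output has been captured as text. The following is a minimal Python sketch. The exact layout of the "tmgtclient --query_mode_and_function" output is not given here, so the sample strings below are hypothetical, and the check only looks for the two tokens named in the steps.

```python
def safe_mode_inactive(command_output: str) -> bool:
    """Return True if the first non-empty line reports currSysPwrMode as NOMINAL."""
    for line in command_output.splitlines():
        if line.strip():
            # Only the first non-empty line is examined, per the steps above.
            return "currSysPwrMode" in line and "NOMINAL" in line
    return False  # empty output: cannot confirm normal mode


# Hypothetical captured outputs; the real layout may differ.
sample_normal = "currSysPwrMode: NOMINAL\notherField: 0\n"
sample_other = "currSysPwrMode: SOME_OTHER_MODE\n"

print(safe_mode_inactive(sample_normal))  # True
print(safe_mode_inactive(sample_other))   # False
```

The function returns False for empty output so that a failed capture is never mistaken for confirmation of normal mode.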
System firmware changes that affect certain systems
- On systems using PowerVM firmware, a problem was fixed for
cable card (PCIe3 Optical Cable Adapter for the PCIe3 Expansion Drawer)
capable PCI slots that fail during the IPL. Hypervisor I/O Bus
Interface UE B7006A84 is reported for each cable card capable PCI slot
that doesn't contain a cable card. PCI slots containing a cable card
will not report an error but will not be functional. The problem can
be resolved by doing a "power off/power on" re-IPL of the system. The
failure is triggered when the I2C devices used to detect the cable
cards do not come out of the power on reset process in the correct
state due to a race condition. The affected optical cable adapters
have feature codes #EJ05, #EJ07, and #EJ08 with CCINs 2B1C, 6B52, and
2CE2, respectively.
- On systems using PowerVM firmware, a problem was fixed for a
blank SRC in the LPA dump for user-initiated non-disruptive adjunct
dumps. The SRC is needed for problem determination and dump analysis.
- On systems using PowerVM firmware, a problem was fixed with
SR-IOV adapter error recovery where the adapter is left in a failed
state in nested error cases for some adapter errors. The probability
of this occurring is very low since the problem trigger is multiple
low-level adapter failures. With the fix, the adapter is recovered and
returned to an operational state.
- On systems using PowerVM firmware with PCIe adapters in Single
Root I/O Virtualization (SR-IOV) shared mode, a problem was fixed for
the hypervisor SR-IOV adjunct partition failing during the IPL with
SRCs B200F011 and B2009014 logged. The SR-IOV adjunct partition
successfully recovers after it reboots and the system is operational.
- On systems using PowerVM firmware, a problem was fixed for PCIe
Host Bridge (PHB) outages and PCIe adapter failures in the PCIe I/O
expansion drawer caused by error thresholds being exceeded for the LEM
bit [21] errors in the FIR accumulator. These are typically minor and
expected errors in the PHB that occur during adapter updates and do not
warrant a reset of the PHB and the PCIe adapter failures. Therefore,
the threshold LEM[21] error limit has been increased and the LEM fatal
error has been changed to a Predictive Error to avoid the outages for
this condition.
- On systems using PowerVM firmware with a large memory
configuration (greater than 8 TB), a problem was fixed for an SR-IOV
adjunct failure during the IPL, causing loss of SR-IOV function. The
large system memory space causes an overflow in the space calculations
for SR-IOV adapters in PCIe slots with Enlarged IO Capacity enabled.
The problem can be avoided by reducing the number of PCIe slots with
Enlarged IO Capacity enabled so it does not include adapters in SR-IOV
shared-mode. Another circumvention option is to move the SR-IOV
adapters to SR-IOV capable PCIe slots where Enlarged IO Capacity is
not enabled. Reducing system physical memory to below 8 TB will also
work as a circumvention.
- On systems using PowerVM firmware, a problem was fixed for Live
Partition Mobility (LPM) migrations from FW860.10 or FW860.11 to older
levels of firmware. Subsequent DLPAR of Virtual Adapters will fail with
HMC error message HSCL294C, which contains text similar to the
following: "0931-007 You have specified an invalid drc_name." This
issue affects partitions installed with AIX 7.2 TL 1 and later.
Partitions installed with VIOS, IBM i, or earlier levels of AIX are
not affected.
- On a system using PowerVM firmware running a Linux OS, a problem
was fixed for support for Coherent Accelerator Processor Interface
(CAPI) adapters. The CAPI-related RTAS h-calls for the CAPI devices
could not be made by the Linux OS, impacting the CAPI adapter
functionality and usability. This problem involves the following
adapters: the PCIe3 LP CAPI Accelerator Adapter with F/C #EJ16 that is
used on the S812L (8247-21L) and S822L (8247-22L) models; the PCIe3
CAPI FlashSystem Accelerator Adapter with F/C #EJ17 that is used on the
S814 (8286-41A) and S824 (8286-42A) models; and the PCIe3 CAPI
FlashSystem Accelerator Adapter with F/C #EJ18 that is used on the
S822 (8284-22A), E870 (9119-MME), and E880 (9119-MHE) models. This
problem does not pertain to PowerVM AIX partitions using CAPI adapters.
- On a system using PowerVM firmware, a problem was fixed for
corruption of the partition data in the service processor NVRAM during
a power off that causes the managed system to go into the HMC
"Recovery" error state. A circumvention for the error is to restore
partition data from the HMC. If using NovaLink to manage the
partition, a recovery can be done from the NovaLink backup. The error
is very infrequent but more likely to occur on an immediate power off
of the system. Using a delayed power off instead allows the
hypervisor to complete all pending operations before shutting down
cleanly.
- On systems using PowerVM firmware, a problem was fixed for
a group of shared processor partitions being able to exceed the
designated capacity placed on a shared processor pool. This error
can be triggered by using the DLPAR move function for the shared
processor partitions, if the pool has already reached its maximum
specified capacity. To prevent this problem from occurring when
making DLPAR changes when the pool is at the maximum capacity, do not
use the DLPAR move operation but instead break it into two steps:
DLPAR remove followed by DLPAR add. This gives enough time for
the DLPAR remove to be fully completed prior to starting the DLPAR add
request.
- On systems using PowerVM firmware, a problem was fixed for
NVRAM corruption and an HMC recovery state when using Simplified Remote
Restart partitions. The failing systems will have at least one
Remote Restart partition and on the failed IPL there will be a
B70005301 SRC with word 7 being 0X00000002.
- On systems using PowerVM firmware with an IBM i partition,
a problem was fixed for incorrect maximum performance reports based on
the wrong number of "maximum" processors for the system.
Certain performance reports that can be generated on IBM i systems
contain not only the existing machine information, but also "what-if"
information, such as "how would this system perform if it had all the
processors possible installed in this system". This "what-if"
report was in error because the maximum number of processors possible
was too high for the system.
- On systems using PowerVM firmware, a problem was fixed for
NVRAM corruption that can occur when deleting a partition that owns a
CAPI adapter, if that CAPI adapter is not assigned to another partition
before the system is powered off. On a subsequent IPL, the system
will come up in recovery mode if there is NVRAM corruption. To
recover, the partitions must be restored from the HMC. The
frequency of this error is expected to be rare. The CAPI adapters
have the following feature codes: #EC3E, #EC3F, #EC3L, #EC3M,
#EC3T, #EC3U, #EJ16, #EJ17, #EJ18, #EJ1A, and #EJ1B.
- On systems using PowerVM firmware, a problem was fixed to
improve link stability for the PCIe3 I/O expansion drawer (#EMX0). The
settings for the continuous time linear equalizers (CTLE) were updated
for all the PCIe adapters for the PCIe links to the expansion drawer.
The CEC must be re-IPLed for the fix to activate.
- On systems using PowerVM firmware, the following
problems were fixed for SR-IOV adapters:
1) Insufficient resources reported for an SR-IOV logical port configured
with promiscuous mode enabled and a Port VLAN ID (PVID) when creating a
new interface on the SR-IOV adapters.
2) Spontaneous dumps and reboot of the adjunct partition for SR-IOV
adapters.
3) Adapter enters a firmware loop when a single-bit ECC error is
detected. System firmware detects this condition as an adapter
command time out. System firmware will reset and restart the
adapter to recover the adapter functionality. This condition will
be reported as a temporary adapter hardware failure.
4) vNIC interfaces not being deleted correctly, causing SRC
B400FF01 to be logged and Data Storage Interrupt (DSI) errors with
failure on boot of the LPAR.
This set of fixes updates adapter firmware to 10.2.252.1926 for the
following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M,
EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems using PowerVM firmware, a problem was fixed for
partition boot failures and run time DLPAR failures when adding I/O
that log BA210000, BA210003, and/or BA210005 errors. The fix also
applies to run time failures configuring an I/O adapter following an
EEH recovery that log BA188001 events. The problem can impact
IBM i partitions running in any processor mode or AIX/Linux partitions
running in P7 (or older) processor compatibility modes. The
problem is most likely to occur when the system is configured in the
Manufacturing Default Configuration (MDC) mode. The trigger for
the problem is a race-condition between the hypervisor and the physical
operations panel with a very rare frequency of occurrence.
- On systems with maximum memory configurations (where every
DIMM slot is populated - size of DIMM does not matter), a problem
has been fixed for systems losing performance and going into Safe mode
(a power mode with reduced processor frequencies intended to protect
the system from over-heating and excessive power consumption) with
B1xx2AC3/B1xx2AC4 SRCs logged. This happened because of
On-Chip Controller (OCC) time out errors when collecting Analog Power
Subsystem Sweep (APSS) data, used by the OCC to tune the processor
frequency. This problem occurs more frequently on systems that
are running heavy workloads. Recovery
from Safe mode back to normal performance can be done with a re-IPL of
the system, or concurrently using the following link steps for a soft
reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
Checking that Safe mode is not active on the system requires a
dynamic celogin password from IBM Support to use the service
processor command line:
1) Log into ASMI as celogin with dynamic celogin password
generated by IBM Support
2) Select System Service Aids
3) Select Service Processor Command Line
4) Enter "tmgtclient --query_mode_and_function" from the command line
The first line of the output, "currSysPwrMode", should say "NOMINAL",
which means the system is in normal mode and Safe mode is not
active.
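Once the command output from step 4 has been captured as text, the check on its first line could be scripted. The following is only a minimal sketch: the exact layout of the tmgtclient output is not given here, so it assumes the first non-empty line contains the field name "currSysPwrMode" together with the mode value, and the function name safe_mode_inactive is hypothetical.

```python
def safe_mode_inactive(output: str) -> bool:
    """Return True if the first non-empty line reports NOMINAL power mode.

    Assumption: the first non-empty line of the captured output holds
    both the field name "currSysPwrMode" and its value.
    """
    for line in output.splitlines():
        if line.strip():
            # Only the first non-empty line is inspected.
            return "currSysPwrMode" in line and "NOMINAL" in line
    # Empty output: the mode cannot be confirmed.
    return False


if __name__ == "__main__":
    # Illustrative text only, not real tmgtclient output.
    sample = "currSysPwrMode: NOMINAL\n(further lines omitted)"
    print(safe_mode_inactive(sample))
```

Any result other than True should be treated as "not confirmed" rather than as proof that Safe mode is active.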
|
SC840_147_056 / FW840.40
10/28/16 |
Impact: Availability
Severity: SPE
|
SC840_139_056 / FW840.30
09/28/16 |
Impact: Availability
Severity: SPE |
SC840_132_056 / FW840.24
08/31/16 |
Impact: Availability
Severity: HIPER
System firmware changes that affect certain systems
- HIPER/Non-Pervasive: For a system using PowerVM firmware at a
FW840 level and having an AIX partition or VIOS partition at specific
back levels, a problem was fixed for PCI adapters not getting
configured in the OS. DVD boots hang with status code 518 when
attempts are made to boot off the AIX or VIOS DVD image. NIM installs
hang with status code 608. If the firmware is updated to 840_104
through 840_118 for a SAS booted system, the subsequent reboot will
hang with status code 554.
The failing AIX and VIOS levels are as follows:
AIX:
AIX 7100-02-06 - AIX 7100-02-07
AIX 6100-08-06 - AIX 6100-08-07
VIOS:
VIOS 2.2.2.6 - VIOS 2.2.2.70
Without the fix, the problem may be circumvented by upgrading AIX
to 7100-03-03 or 6100-09-03 and VIOS to 2.2.3.4.
Depending on which adapter is not configured, the error may result
in Defined devices, EEH errors, and/or failure to boot the partition
(if the failing adapter is the boot device). These errors may
also be seen for a rebooted partition after an LPM migration to FW840.
With the fix applied, the error state for some of the adapters in
the running OS may persist and it will be necessary to reboot the OS to
recover from those errors.
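To see whether a partition falls in the failing AIX or VIOS levels listed above, the ranges can be compared numerically. The sketch below is illustrative only: it assumes the listed ranges are inclusive and that level strings compare component by component as integers. The names AFFECTED, parse_level, and is_affected are hypothetical.

```python
import re

# Failing level ranges from the fix description (assumed inclusive).
AFFECTED = {
    "AIX": [((7100, 2, 6), (7100, 2, 7)), ((6100, 8, 6), (6100, 8, 7))],
    "VIOS": [((2, 2, 2, 6), (2, 2, 2, 70))],
}


def parse_level(level: str) -> tuple:
    """Split a level such as '7100-02-06' or '2.2.2.6' into integers."""
    return tuple(int(part) for part in re.split(r"[-.]", level.strip()))


def is_affected(os_name: str, level: str) -> bool:
    """Return True if the level lies inside one of the failing ranges."""
    version = parse_level(level)
    for low, high in AFFECTED.get(os_name.upper(), []):
        # Ignore extra trailing components (for example a build suffix).
        if low <= version[: len(low)] <= high:
            return True
    return False


if __name__ == "__main__":
    for os_name, level in [("AIX", "7100-02-06"), ("AIX", "7100-03-03"),
                           ("VIOS", "2.2.2.6"), ("VIOS", "2.2.3.4")]:
        print(os_name, level, is_affected(os_name, level))
```

The circumvention levels named above (AIX 7100-03-03 or 6100-09-03, VIOS 2.2.3.4) fall outside these ranges, so the sketch reports them as not affected.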
|
SC840_118_056 / FW840.23
07/28/16 |
Impact: Data
Severity: HIPER
System firmware changes that affect certain systems
- HIPER/NON-PERVASIVE: DEFERRED: On systems with DDR4 memory
installed, a problem was fixed for the handling of data errors in the
L4 cache.
If a data error occurs in the L4 cache of the memory buffer on an
affected system and it is pushed out to mainline memory, the data error
will not be correctly handled. A data error originating in
the L4 cache may result in incorrect data being stored into
memory. The DDR4 DRAM has feature code (FC) EM8Y for a 256GB 1600
MHz CDIMM.
At this firmware level, DDR4 and DDR3 memory cannot be mixed in the
system. At FW860.10, DDR4 and DDR3 can be mixed in a system, but
each system node must have either DDR3 or DDR4 only.
IBM strongly recommends that the customer plan an outage to
install the firmware fix immediately. Fix activation requires a
subsequent platform IPL following the installation of the firmware fix
to eliminate any exposure to this issue.
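The DDR3/DDR4 mixing rules stated above can be expressed as a simple configuration check. This is a hypothetical sketch, not an IBM tool: it models each system node as a set of DIMM type names and takes a flag saying whether the firmware level permits mixing across nodes (not permitted at this level, permitted at FW860.10 according to the text). The function name memory_mix_valid is an assumption.

```python
def memory_mix_valid(nodes, cross_node_mixing_supported: bool) -> bool:
    """Check a memory configuration against the stated mixing rules.

    nodes: list of sets, one per system node, e.g. [{"DDR3"}, {"DDR4"}].
    cross_node_mixing_supported: False at this firmware level, True at
    FW860.10 (per the fix description).
    """
    # A single node must never contain both memory types.
    if any(len(types) > 1 for types in nodes):
        return False
    # Collect every memory type used anywhere in the system.
    all_types = set().union(*nodes)
    if len(all_types) <= 1:
        return True
    # Different types on different nodes need firmware support.
    return cross_node_mixing_supported


if __name__ == "__main__":
    config = [{"DDR3"}, {"DDR4"}]
    print(memory_mix_valid(config, cross_node_mixing_supported=False))
    print(memory_mix_valid(config, cross_node_mixing_supported=True))
```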
|
SC840_113_056 / FW840.22
07/06/16 |
Impact: Availability
Severity: ATT |
SC840_111_056 / FW840.21
06/24/16 |
Impact: Availability
Severity: SPE
|
SC840_104_056 / FW840.20
05/31/16 |
Only Deferred fix descriptions are displayed for this service pack.
The complete Firmware Fix History for this Release Level can be
reviewed at the following url:
Impact: Availability
Severity: SPE
System firmware changes that affect all systems
- DEFERRED: A problem was fixed in the dynamic RAM (DRAM)
initialization to update the VREF on the DIMMs to the optimal settings
and to add an additional margin check test to improve the reliability
of the DRAM by screening out more marginal DIMMs before they can
result in a run-time memory fault.
System firmware changes that affect certain systems
- DEFERRED: On systems using PowerVM firmware, a performance
improvement was made by disabling the Hot/Cold Affinity (HCA) hardware
feature, which gathers memory usage statistics for consumption by
partition operating system memory management algorithms. The
statistics gathering can, in rare cases, cause performance to degrade.
The workloads that may experience issues are memory-intensive
workloads that have little locality of reference and thus cannot take
advantage of hardware memory cache. As a consequence, the problem
occurs very infrequently or not at all except for very specific
workloads in an HPC environment. This performance fix requires an IPL
of the system to activate it after it is applied.
- DEFERRED: On systems using 256GB DDR4 DIMMs, a problem was fixed
in the 3DS packaging that could result in a recoverable memory error.
This fix requires an IPL of the system to take effect. Any system with
DDR4 DIMMs should be re-IPLed at the next opportunity to do so after
applying this service pack to provide the best running conditions for
the DDR4 DIMMs for reliable operation.
|
SC840_087_056 / FW840.11
03/18/16 |
Impact: Availability
Severity: ATT |
SC840_079_056 / FW840.10
03/04/16 |
Impact: Availability
Severity: SPE |
SC840_056_056 / FW840.00
12/04/15 |
Impact: New
Severity: New
|