SC860
For Impact, Severity and other firmware definitions, please
refer to the 'Glossary of firmware terms' URL below:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
The following fix description table contains
only the N (current) and N-1 (previous) levels.
The complete firmware fix history for this release level can be
reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html
SC860_118_056 / FW860.40
11/08/17
Impact: Availability
Severity: SPE
New features and functions
- Support was added to the Advanced System Management
Interface (ASMI) for providing an "All of the above" cable validation
display option so that each individual cable option does not have to be
selected to get a full report on the cable status. Select
"System Service Aids -> Cable Validation -> Display Cable
Status", choose "All of the above", and click "Continue" to
see the status of all the cables.
System firmware changes that affect all systems
- A problem was fixed
for recovery from clock card loss of lock failures that resulted in a
clock card FRU unnecessarily being called out for repair. This
error happened whenever there was a loss of lock (PLL or CRC) for the
clock card. With the fix, the firmware will not be calling out
the failing clock card, but rather it will be reconfigured as the new
backup clock card after doing a clock card failover. Customers
will see a benefit from improved system availability by the avoidance
of disruptive clock card repairs.
- A problem was fixed for the "Minimum code level supported"
not being shown by the Advanced System Management Interface (ASMI) when
selecting the "System Configuration/Firmware Update Policy" menu.
The message shown is "Minimum code level supported value has not been
set". The workaround to find this value is to use the ASMI
command line interface with the "registry -l cupd/MinMifLevel" command.
- A problem was fixed for "sh: errl: not found" error
messages being written to the service processor console whenever the
Advanced System Management Interface (ASMI) was used to display error
logs. These messages did not cause any problems other than cluttering
the console output as seen in the service processor traces.
- A problem was fixed for the LineInputVoltage and
LastPowerOutputWatts being displayed in millivolts and milliwatts,
respectively, instead of volts and watts for the output from the
Redfish API for power properties for the chassis. The affected URL
is the following: "https://<fsp ip>/redfish/v1/Chassis/<id>/Power"
- A problem was fixed for system node fans going to maximum
RPM speeds after a service processor failover that needed the On-Chip
Controllers (OCC) to be reloaded. Without the fix, the system
node fan speeds can be restored to normal speed by changing the Power
Mode in the Advanced System Management Interface using steps from the
IBM Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/areaa_pmms.htm.
After changing the Power Mode, wait about 10 minutes before changing the
Power Mode back to the original setting.
If the fix is applied without rebooting the system, the system node fan
speeds can be corrected by either changing the Power Mode as above or
using the HMC to do an Administrative Failover (AFO).
- A problem was fixed for a Power Supply Unit (PSU) failure
of SRC 110015xF logged with a power supply fan call out
when doing a hot re-plug of a PSU. The power supply may be
made operational again by doing a dummy replace of the PSU that was
called out (keeping the same PSU for the replace operation). A
re-IPL of the system will also recover the PSU.
- A problem was fixed for the service processor low-level
boot code always running off the same side of the flash image,
regardless of what side has been selected for boot (P-side or
T-side). Because this low-level boot code rarely changes, this
should not cause a problem unless corruption occurs in the flash image
of the boot code. This problem does not affect firmware
side-switches as the service processor initialization code
(higher-level code than the boot code) is running correctly from the
selected side. Without the fix, there is no recovery for boot
corruption for systems with a single service processor as the service
processor must be replaced.
- A problem was fixed for a missing serviceable event from a
periodic call home reminder. This occurred if there was an FRU
deconfigured for the serviceable event.
- A problem was fixed for help text in the Advanced System
Management Interface (ASMI) not informing the user that system fan
speeds would increase if the system Power Mode was changed to "Fixed
Maximum Frequency" mode. If ASMI panel function "System
Configuration->Power Management->Power Mode Setup" "Enable Fixed
Maximum Frequency mode" help is selected, the updated text states
"...This setting will result in the fans running at the maximum speed
for proper cooling."
- A problem was fixed for a degraded PCI link causing a
Predictive SRC for a non-cacheable unit (NCU) store time-out that
occurred with SRC B113E540 or B181E450 and PRD signature
"(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB". With the fix,
the error is changed to be an Informational as the problem is not with
the processor core and the processor should not be replaced. The
solution for degraded PCI links is different from the fix for this
problem, but a re-IPL of the CEC or a reset of the PCI adapters could
help to recover the PCI links from their degraded mode.
- A problem was fixed for the validation of incorrect values in a
Redfish Patch on the "Chassis" "HugeDynamicDMAWindowSlotCount"
property. Without the fix, the user will not get proper
error messages when providing bad values to the patch.
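For clients that read the chassis Power resource from service processors both with and without the LineInputVoltage/LastPowerOutputWatts units fix above, a minimal Python sketch of normalizing the two affected properties follows. The payload shape (a "PowerSupplies" list) follows the standard Redfish Power schema; the helper name and the values_are_milli flag are assumptions for illustration, since this document does not describe how a client should detect the firmware level.

```python
# Sketch only: normalize the two properties that pre-fix firmware
# reported in millivolts and milliwatts.
AFFECTED = ("LineInputVoltage", "LastPowerOutputWatts")

def normalize_power(power, values_are_milli):
    """Return a copy of a Power payload with affected values in volts/watts.

    values_are_milli is a caller-supplied assumption (True for firmware
    without the fix); it is not something the Redfish payload reports.
    """
    supplies = []
    for ps in power.get("PowerSupplies", []):
        ps = dict(ps)  # shallow copy so the input is left untouched
        if values_are_milli:
            for key in AFFECTED:
                if ps.get(key) is not None:
                    ps[key] = ps[key] / 1000.0
        supplies.append(ps)
    result = dict(power)
    result["PowerSupplies"] = supplies
    return result

# Hypothetical sample payload, not captured from a real system.
sample = {"PowerSupplies": [{"LineInputVoltage": 230000,
                             "LastPowerOutputWatts": 412000}]}
fixed = normalize_power(sample, values_are_milli=True)
print(fixed["PowerSupplies"][0])
```

Passing values_are_milli=False returns the values unchanged, so the same code path can be used against firmware that already has the fix.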
System firmware changes that affect certain systems
- DEFERRED: On systems using PowerVM firmware, a problem was fixed for
DPO (Dynamic Platform Optimizer) operations taking a very long time and
degrading the performance of the server system. The problem is
triggered by a DPO operation being done on a system with unlicensed
processor cores and a very high I/O load. The fix involves using a
different lock type for the memory relocation activities (to prevent
lock contention between memory relocation threads and partition
threads) that is created at IPL time, so an IPL is needed to activate
the fix. More information on the DPO function can be found at the IBM
Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/8247-42L/p8hat/p8hat_dpoovw.htm
- On systems using PowerVM firmware, a problem was
fixed for an intermittent service processor core dump and a callout for
netsCommonMSGServer with SRC B181EF88. The HMC connection
to the service processor automatically recovers with a new session.
- On systems using PowerVM firmware, a problem was fixed
where the Power Enterprise Pool (PEP) grace period expired early, being
short by one hour. For example, 71 hours may be provided instead
of 72 hours in some cases. See https://www.ibm.com/support/knowledgecenter/en/POWER8/p8ha2/entpool_cod_compliance.htm
for more information about the PEP grace period.
- On systems using PowerVM firmware, a problem was fixed for
a concurrent firmware update failure with HMC error message
"E302F865-PHYPTooBusyToQuiesce". This error can occur when the
error log is full on the hypervisor and it cannot accept more error
logs from the service processor. But the service processor keeps
retrying the send of an error log, resulting in a "denial of service"
scenario where the hypervisor is kept busy rejecting the error logging
attempts. Without the fix, the problem may be circumvented by
starting a logical partition (if none are running) or by purging
the error logs on the service processor.
- On systems using PowerVM firmware with mirrored memory
running IBM i partitions, a problem was fixed for memory failures in the
partition that also caused the system to crash. The system
failure will occur any time that IBM i partition memory towards the
beginning of the partition's assigned memory fails. With the fix,
the memory failure is isolated to the impacted partition, leaving the
rest of the system unaffected.
- On systems using PowerVM firmware, a problem was fixed for
failures deconfiguring SR-IOV Virtual Functions (VFs). This can
occur during Live Partition Mobility (LPM) migrations with HMC error
messages of HSCLAF16, HSCLAF15 and HSCLB602 shown. This results
in an LPM migration failure and a system reboot is required to recover
the VFs for the I/O adapters. This error may occur more
frequently in cases where the I/O adapter has pending I/O at the time
of the deconfigure request for the VF.
- On systems using PowerVM firmware, a problem was fixed for
a vNIC client that has backing devices being assigned an active server
that was not the one intended by an HMC user failover for the client
adapter. This can happen only if the vNIC client adapter had
never been activated. A circumvention is to activate the client
OS and initialize the vNIC device (ifconfig "xxx" up) and an active
backing device will then be selected.
- On systems using PowerVM firmware, a problem was fixed for
partitions with more than 32TB memory failing to IPL with memory space
errors. This can occur if the logical memory block (LMB) size is
small as there is a memory loss associated with each LMB. The
problem can be circumvented by reducing the amount of partition memory
or increasing the LMB size to reduce the total number of LMBs needed
for the memory allocation.
- On systems using PowerVM firmware, a problem was
fixed for the error handling of EEH events for the SR-IOV Virtual
Functions (VFs) that can result in IPL failure with B7006971, B400FF05,
and BA210000 SRCs logged. In these cases, the partition console
stops at an OFDBG prompt. Also, a DLPAR add of a VF may result in
a partition crash due to a 300 DSI exception because of a low-level EEH
event. A circumvention for the problem would be to debug the EEH
events which should be recovered errors and eliminate the cause of the
EEH events. With the fix, the EEH events still log Predictive
Errors but do not cause a partition failure.
- On systems using PowerVM firmware, a problem was fixed for
Power Enterprise Pool (PEP) "not applicable" error messages being
displayed when re-entering PEP XML files for PEP updates, in which one
of the XML operations calls for Conversion of Perm Resources to PEP
Resources. There is no error as the PEP key was accepted on the
first use. The following message may be seen on the HMC and can
be ignored: "...HSCL0520 A Mobile CoD processor conversion
code to convert 0 permanently activated processors to Mobile CoD
processors on the managed system has been entered. HSCL050F This
CoD code is not valid for your managed system. Contact your CoD
administrator."
- On systems using PowerVM firmware, a problem was fixed for
Power Enterprise Pool (PEP) busy errors from the system anchor card
when creating or updating a PEP pool. The error
returned by the HMC is "HSCL9015 The managed system cannot currently
process this operation. This
condition is temporary. Please try the operation again." To
try again, the customer needs to update the pool again. Typically
on the second PEP update, the code is accepted.
The problem is intermittent and occurs only rarely.
- On systems using PowerVM firmware, a problem was fixed for
an invalid date from the service processor causing the customer date
and time to go to the Epoch value (01/01/1970) without a warning or
chance for a correction. With the fix, the first IPL
attempted on an invalid date will be rejected with a message alerting
the user to set the time correctly in the service processor. If
the warning is ignored and the date/time is not corrected, the next IPL
attempt will complete to the OS with the time reverted to the Epoch
time and date. This problem is very rare but it has been known to
occur on service processor replacements when the repair step to set the
date and time on the new service processor was inadvertently skipped by
the service representative.
- On systems using PowerVM firmware, a problem was fixed for
a Power Enterprise Pool (PEP) system losing its assigned processor and
memory resources after an IPL of the system. This is an
intermittent problem caused by a small timing window that makes it
possible for the server to not get the IPL-time assignment of resources
from the HMC. If this problem occurs, it can be corrected by using the
HMC to recover the pool without needing another IPL of the system.
- On systems using PowerVM firmware with PowerVM NovaLink, a
problem was fixed for a loss of a communications channel between the
hypervisor and the PowerVM NovaLink during a reset of the service
processor. Various NovaLink tasks, including deploy, could fail
with a "No valid host was found" error. With the fix, PowerVM
NovaLink prevents normal operations from being impacted by a reset of
the service processor.
- On systems using PowerVM firmware, a problem was fixed for
a rare system hang caused by a process dispatcher deadlock timing
window. If this problem occurs, the HMC will also go to an
"Incomplete" state for the managed system.
- On systems using PowerVM firmware, a problem
was fixed for communication failures on adapters in SR-IOV shared
mode. This communication failure only occurs when a logical
port's VLAN ID (PVID) is dynamically changed from non-zero to
zero. An SR-IOV logical port is an I/O device created for a
partition or a partition profile using the management console (HMC)
when a user intends for the partition to access an SR-IOV adapter
Virtual Function. The error can be recovered from by a reboot of
the partition.
This fix updates adapter firmware to 10.2.252.1929, for the following
Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K,
EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems using PowerVM firmware, a problem was fixed for
error logs not getting sent to the OS running in a
partition. This problem could occur if the error log buffer
was full in the hypervisor and then a re-IPL of the system
occurred. The error log full condition was persisting across the
re-IPL, preventing further logs from being sent to the OS.
- On systems using PowerVM firmware, a problem was fixed in
the text for the Firmware License agreement to correct a link that
pointed to a URL that was not specific to microcode licensing.
The message is displayed for a machine during its initial power
on. Once accepted, the message is not displayed again. The
fixed link in the licensing agreement is the following: http://www.ibm.com/support/docview.wss?uid=isg3T1025362.
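The LMB arithmetic behind the large-memory partition item above (partitions with more than 32TB failing to IPL when the LMB size is small) can be illustrated with a short Python sketch. The LMB sizes used here are example values chosen for illustration, and the per-LMB memory loss is not quantified in this document, so only the LMB counts are computed.

```python
MB = 1024 ** 2
TB = 1024 ** 4

def lmb_count(partition_bytes, lmb_bytes):
    """Number of LMBs needed to back a partition (rounded up)."""
    return -(-partition_bytes // lmb_bytes)  # ceiling division

# Example LMB sizes (assumed for illustration only).
for lmb_mb in (16, 64, 256):
    count = lmb_count(32 * TB, lmb_mb * MB)
    print(f"{lmb_mb:>3} MB LMB -> {count:,} LMBs for 32 TB")
```

Increasing the LMB size by a factor of 16 reduces the number of LMBs by the same factor, which is why a larger LMB size reduces the total per-LMB overhead for a given partition size.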
SC860_103_056 / FW860.30
06/30/17
Impact: Availability
Severity: SPE
New features and functions
- Support was added to the Redfish API to allow the ISO 8601
extended format for the time and date so that the date/time can be
represented as an offset from UTC (Coordinated Universal Time).
- Support was added to the Redfish API for power and thermal
properties for the chassis. The new URIs are as follows:
https://<fsp ip>/redfish/v1/Chassis/<id>/Power : Provides power supply data
https://<fsp ip>/redfish/v1/Chassis/<id>/Thermal : Provides fan data
Only the Redfish GET operation is supported for these resources.
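As an illustration of the ISO 8601 extended format with a UTC offset mentioned above, the following Python sketch formats and parses such a timestamp using only the standard library. The date and the -05:00 offset are example values, and the sketch does not show which Redfish property carries the date/time on the service processor.

```python
from datetime import datetime, timedelta, timezone

# Example local time at UTC-05:00 (values chosen for illustration).
local = datetime(2017, 6, 30, 12, 0, 0,
                 tzinfo=timezone(timedelta(hours=-5)))

stamp = local.isoformat()               # extended format with UTC offset
parsed = datetime.fromisoformat(stamp)  # round-trips the offset
as_utc = parsed.astimezone(timezone.utc)

print(stamp)
print(as_utc.isoformat())
```

The offset suffix lets a client recover the same instant in UTC without knowing the time zone configured on the system.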
System firmware changes that affect all systems
- A problem was fixed
for service actions with SRC B150F138 missing an Advanced System
Management Interface (ASMI) Deconfiguration Record. The
deconfiguration records make it easier to organize the repairs that are
needed for the system and they need to be consistent with the periodic
maintenance reminders that are logged for the failed FRUs.
- A problem was fixed for a false 1100026B1 (12V power good
failure) caused by an I2C bus write error for a LED state. This
error can be triggered by the fan LEDs changing state.
- A problem was fixed for a fan LED turning solid amber
when there is no fan fault, or when the fan fault is for a different
fan. This error can be triggered anytime a fan LED needs to
change its state. The fan LEDs can be recovered to a normal state
concurrently using the steps at the following link for a soft reset of the
service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for sporadic blinking amber LEDs for
the system fans with no SRCs logged. There was no problem with
the fans. The LED corruption occurred when two service processor
tasks attempted to update the LED state at the same time. The fan
LEDs can be recovered to a normal state concurrently using the
steps at the following link for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for a Redfish Patch on the "Chassis" or
"IBMEnterpriseComputerSystem" with empty data that caused a "500
Internal Server Error". Validation for the empty data case has
been added to prevent the server error.
- A problem was fixed for hardware dumps only collecting data
for the master processor if a run-time service processor failover had
occurred prior to the dump. Therefore, there would be only master
chip and master core data in the event of a core unit checkstop.
To recover to a system state that is able to do a full collection of
debug data for all processors and cores after a run-time failover, a
re-IPL of the system is needed.
- A problem was fixed for a Redfish Patch on power mode to
"MaxPowerSaver" that caused a "500 Internal Server Error" when
that power mode was not supported on the system. With the fix,
the Redfish server response is a list of the valid power modes that can be
used for the system.
- A problem was fixed for the loss of Operations Panel
function 30 (displaying ethernet port HMC1 and HMC2 IP addresses)
after a concurrent repair of the Operations Panel.
Operations Panel function 30 can be restored concurrently using
the steps at the following link for a soft reset of the service
processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for a core dump of the rtiminit
(service processor time of day) process that logs an SRC B15A3303
and could invalidate the time on the service processor. If the
error occurs while the system is powered on, the hypervisor has the
master time and will refresh the service processor time, so no action
is needed for recovery. If the error occurs while the system is
powered off, the service processor time must be corrected on the
systems having only a single service processor. Use the following
steps from the IBM Knowledge Center to change the UTC time with the
Advanced System Management Interface: https://www.ibm.com/support/knowledgecenter/en/POWER8/p8hby/viewtime.htm.
- A problem was fixed for the service processor boot
watch-dog timer expiring too soon during DRAM initialization in the
reset/reload, causing the service processor to go unresponsive.
On systems with a single service processor, the SRC B1817212 was
displayed on the control panel. For systems with redundant
service processors, the failing service processor was
deconfigured. To recover the failed service processor, the system
will need to be powered off with AC power removed during a regularly
scheduled system service action. This problem is intermittent and
very infrequent as most of the reset/reloads of the service processor
will work correctly to restore the service processor to a normal
operating state.
- A problem was fixed for host-initiated resets of the
service processor causing the system to terminate. A prior fix
for this problem did not work correctly because some of the
host-initiated resets were being translated to unknown reset types that
caused the system to terminate. With this new correction for
failed host-initiated resets, the service processor will still be
unresponsive but the system and partitions will continue to run.
On systems with a single service processor, the SRC B1817212 will be
displayed on the control panel. For systems with redundant
service processors, the failing service processor will be
deconfigured. To recover the failed service processor, the system
will need to be powered off with AC power removed during a regularly
scheduled system service action. This problem is intermittent and
very infrequent as most of the host-initiated resets of the service
processor will work correctly to restore the service processor to a
normal operating state.
- A problem was fixed for a service processor reset triggered
by a spurious IIC interrupt request in the kernel. On
systems with a single service processor, the SRC B1817201 is displayed
on the Operator Panel. For systems with redundant service
processors, an error failover to the backup service processor
occurs. The problem is extremely infrequent and does not impact
processes on the running system.
- A problem was fixed for the System Attention LED failing to
light for an error failover for the redundant service processors with
an SRC B1812028 logged.
- A problem was fixed for a system failure at run time with
SRC B111E450 corefir(55), after which the system could not re-IPL. A system node
should have been deconfigured for an ABUS error on a processor chip but
instead, the system was terminated. To recover from this problem,
manually guard the node containing the failed processor and then the
IPL will be successful.
- A problem was fixed for an incorrect Redfish error message
when trying to use the $metadata URI: "The resource at the
URI https://<systemip>/redfish/v1/%24metadata was not found.".
The "%24" is the URL-encoded form of "$" and is meaningless to the
reader, so it has been replaced with a "$"
in the error message. The Redfish $metadata URI is not supported.
- A problem was fixed for a system failure caused by Hostboot
problems on one node while the other nodes were good. With the
fix, the node that is failing the Hostboot is deconfigured and the
system is able to IPL on the remaining nodes. To recover from
this problem, manually guard the node that is failing and reIPL.
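The "%24" in the Redfish error message item above is the percent-encoded form of the "$" character. A minimal Python sketch using the standard library shows the round trip; the URI path is taken from the message quoted in that item.

```python
from urllib.parse import quote, unquote

path = "/redfish/v1/$metadata"

encoded = quote(path, safe="/")  # "$" is percent-encoded, "/" is kept
decoded = unquote(encoded)       # restores the original "$"

print(encoded)
print(decoded)
```

Decoding before display is what makes the path in an error message match what the user typed.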
System firmware changes that affect certain systems
- DEFERRED: On systems using PowerVM firmware, a problem was fixed to
improve the link stability of the PCIe3 I/O expansion drawer
(#EMX0). The settings
for the continuous time linear equalizers (CTLE) were updated for all
the PCIe adapters for the PCIe links to the expansion drawer. The
system must be re-IPLed for the fix to activate.
- On systems using
PowerVM firmware with a Linux Little Endian (LE) partition, a problem
was fixed for system reset interrupts returning the wrong values in the
debug output for the NIP and MSR registers. This problem reduces
the ability to debug hung Linux partitions using system reset
interrupts. The error occurs every time a system reset interrupt
is used on a Linux LE partition.
- On systems using PowerVM firmware, a problem was fixed for
"Time Power On" enabled partitions not being capable of suspend and
resume operations. This means Live Partition Mobility (LPM) would
not be able to migrate this type of partition. As a workaround,
the partition could be transitioned to a "Non-time Power On" state and
then made capable of suspend and resume operations.
- On systems using PowerVM firmware, a problem was fixed for
manual vNIC failovers (from the HMC, manually "Make the Backing Device
Active") so that the selected server was chosen for the failover,
regardless of its priority. Without the fix, the server chosen
for the vNIC failover is the one with the most favorable
priority.
There are two possible workarounds to the problem:
(1) Disable auto-priority-failover; Change priority to the server that
is needed as the target of the failover; Force the vNIC failover;
Change priority back to original setting.
(2) Or use auto-priority-failover and change the priority so the server
that is needed as the target of the failover is favored.
- On systems using PowerVM firmware, a problem was fixed for
extra error logs in the VIOS due to failovers taking place while the
client vNIC is inactive. The inactive client vNIC failovers are
skipped unless the force flag is on. Without the fix,
Enhanced Error Handling (EEH) Freeze/Temporary Error/Recovery logs
posted in the VIOS error log at client partition boot can be
ignored unless an actual problem is experienced.
- On systems using PowerVM firmware, a problem was fixed for
a Live Partition Mobility (LPM) migration abort and reboot on the
FW860 target CEC caused by a mismatched address space for the
source and target partition. The occurrence of this problem is
very rare and related to performance improvements made in the memory
management on the FW860 system that exposed a timing window in the
partition memory validation for the migration. The reboot of the
migrated partition recovers from the problem as the migration was
otherwise successful.
- On systems using PowerVM firmware, a problem was fixed for
reboot retries for IBM i partitions such that the first load source I/O
adapter (IOA) is retried instead of bypassed after the first failed
attempt. The reboot retries are done for an hour before the
reboot process gives up. This error can occur if there is more
than one known load source, and the IOA of the first load source is
different from the IOA of the last load source. The error can be
circumvented by retrying the boot of the partition after the load
source device has become available.
- On systems using PowerVM firmware, a problem was fixed for
adapters failing to transition to shared SR-IOV mode on the IPL after
changing the adapter from dedicated mode. This intermittent
problem could occur on systems using SR-IOV with very large memory
configurations.
- On systems using PowerVM firmware, a problem
was fixed for SR-IOV adapters in shared mode for a transmission stall
or time out with SRC B400FF01 logged. The time out happens during
Virtual Function (VF) shutdowns and during Function Level Resets (FLRs)
with network traffic running.
This fix updates adapter firmware to 10.2.252.1927, for the following
Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M, EN0N, EN0K,
EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- On systems with maximum memory configurations (where every
DIMM slot is populated - size of DIMM does not matter), a problem
has been fixed for systems losing performance and going into Safe mode
(a power mode with reduced processor frequencies intended to protect
the system from overheating and excessive power consumption) with
B1xx2AC3/B1xx2AC4 SRCs logged. This happened because of On-Chip
Controller (OCC) timeout errors when collecting Analog Power Subsystem
Sweep (APSS) data, used by the OCC to tune the processor
frequency. This problem occurs more frequently on systems that
are running heavy workloads. Recovery from Safe mode back to
normal performance can be done with a re-IPL of the system, or
concurrently using the steps at the following link for a soft reset of the
service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
Checking that Safe mode is not active on the system
requires a dynamic celogin password from IBM Support to use the service
processor command line:
1) Log into ASMI as celogin with dynamic celogin password
generated by IBM Support
2) Select System Service Aids
3) Select Service Processor Command Line
4) Enter "tmgtclient --query_mode_and_function" from the command line
The first line of the output, "currSysPwrMode", should say "NOMINAL",
which means the system is in normal mode and Safe mode is not
active.
- A problem has been fixed for systems losing
performance and going into Safe mode (a power mode with reduced
processor frequencies intended to protect the system from overheating
and excessive power consumption) with B1xx2AC3/B1xx2AC4 SRCs
logged. This happened because of an On-Chip Controller (OCC)
internal queue overflow. The problem has only been observed for systems
running heavy workloads with maximum memory configurations (where every
DIMM slot is populated - size of DIMM does not matter), but this may
not be required to encounter the problem. Recovery from Safe mode
back to normal performance can be done with a re-IPL of the system, or
concurrently using the steps at the following link for a soft reset of the
service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
Checking that Safe mode is not active on the system
requires a dynamic celogin password from IBM Support to use the service
processor command line:
1) Log into ASMI as celogin with dynamic celogin password
generated by IBM Support
2) Select System Service Aids
3) Select Service Processor Command Line
4) Enter "tmgtclient --query_mode_and_function" from the command line
The first line of the output, "currSysPwrMode", should say "NOMINAL",
which means the system is in normal mode and Safe mode is not
active.
- On systems using PowerVM firmware, a problem
was fixed for a partition boot from a USB 3.0 device that has an error
log SRC BA210003. The error is triggered by an Open Firmware
entry to the trace buffer during the partition boot. The error
log can be ignored as the boot is successful to the OS.
- On systems using PowerVM firmware, a problem
was fixed for a partition boot fail or hang from a Fibre Channel device
having fabric faults. Some of the fabric errors returned by the
VIOS are not interpreted correctly by the Open Firmware VFC driver,
causing the hang instead of generating helpful error logs.
- On systems with redundant service processors, a
problem was fixed for an extra SRC B150F138 logged for a power supply
that had already been replaced. The problem was triggered by a
service processor failover and an old power supply fault event that was
not cleared on the backup service processor. This caused the SRC
B150F138 to be logged for a second time. This problem can be
circumvented by clearing the error log associated with the bad FRU when
the FRU is replaced.
- On systems using PowerVM firmware, a problem was fixed for
a Power Enterprise Pool (PEP) resource Grace Period not being reset
when the server is in the "Out of Compliance" state and the resource
has been returned to put the server back in Compliance. The Grace
Period was not being reset after a double-commit of a resource (doing
a "remove" of an active resource) was resolved by restarting the
server with the double-committed resource. When the Grace Period ends,
the "double-committed" resources on the server must have been freed up
from use to prevent the server from going to "Out of Compliance".
If the user fails to free up the resource, the PEP is in an "Out of
Compliance" state, and the only PEP actions allowed are ones to free up
the double-commit. Once that is completed, the PEP is back In
Compliance. The loss of the Grace Period for the error makes it
difficult to move resources around in the PEP. Without the fix,
the user can "Add" another PEP resource to the server, and the
action of adding a PEP resource resets the Grace Period timer.
One could then "Remove" that one PEP resource just added, and then any
further "removes" of PEP resources would behave as expected with the
full Grace Period in effect.
- On systems using PowerVM firmware, a problem was
fixed for Power Enterprise Pool (PEP) IFL processor assignments
causing an "Out of Compliance" for normal processor licenses. The
number of IFL processors purchased was first credited as satisfying any
"unreturned" PEP processor resources, thus potentially leaving the
system "Out of Compliance" since IFL processors should not be taking
the place of the normal (expensive) processor usage. In this
situation, without the fix, the user will need to either purchase more
"expensive" non-IFL processors to satisfy the non-IFL workloads or
adjust the partitions to reduce the usage of non-IFL processors.
This is a very infrequent problem for the following reasons:
1) PEP processors are infrequently left "unreturned" for short periods
of time for specialized operations such as LPM migrations.
2) The user would have to purchase IFL processors from IBM, which is
not a common occurrence.
3) The user would have to put in a COD key for IFL processors while a
PEP processor is still "unreturned".
- On systems using PowerVM firmware, a problem
was fixed for a power off hanging at D200C1FF caused by a vNIC VF
failover error with SRC B200F011. The power off hang error is
infrequent because it requires that a VF failover error has occurred
first. The system can be recovered by using the power off
immediate option from the Hardware Management Console (HMC).
- On systems using PowerVM firmware, a problem was fixed for
the incorrect reporting of the Universally Unique Identifier (UUID) to
the OS, which prevented the tracking of a partition as it moved within
a data center. The UUID value as seen on the HMC or NovaLink did
not match the value as displayed in the OS.
- On systems using PowerVM firmware, a problem was fixed for
an error finding the partition load source that has a GPT format.
GUID Partition Table (GPT) is a standard for the layout of the
partition table on a physical storage device used in the server, such
as a hard disk drive or solid-state drive, using globally unique
identifiers (GUID). Other drives that are working may be using
the older master boot record (MBR) partition table format. This
problem occurs whenever a load source using the GPT format is in any
entry of the boot table other than the first. Without the fix, a
GPT disk drive must be the first entry in the boot table to be able to
use it to boot a partition.
- On systems using PowerVM firmware, a problem was fixed for
an SRC BA090006 serviceable event log occurring whenever an attempt was
made to boot from an ALUA (Asymmetric Logical Unit Access)
drive. These drives are always busy by design and cannot be used
for a partition boot, but no service action is required if a user
inadvertently tries to do that. Therefore, the SRC was changed to
be an informational log.
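As a rough illustration of the GPT versus MBR distinction in the load
source fix above, the following Python sketch classifies the first two
sectors of a disk image. It is not the firmware's boot table logic. It
assumes 512-byte sectors and relies only on the standard on-disk
markers: the "EFI PART" signature that starts a GPT header at LBA 1 and
the 0x55 0xAA boot signature that ends sector 0. The function name
partition_table_format and the synthetic images are hypothetical.

```python
def partition_table_format(disk: bytes, sector_size: int = 512) -> str:
    """Classify the start of a disk image as 'GPT', 'MBR', or 'unknown'."""
    # Check GPT first: a GPT disk also carries a protective MBR in
    # sector 0, so the MBR boot signature alone cannot rule out GPT.
    # A GPT header starts at LBA 1 with the 8-byte signature "EFI PART".
    if disk[sector_size:sector_size + 8] == b"EFI PART":
        return "GPT"
    # A legacy MBR ends sector 0 with the boot signature bytes 0x55 0xAA.
    if disk[510:512] == b"\x55\xaa":
        return "MBR"
    return "unknown"


def make_mbr_sector() -> bytes:
    """Build an otherwise empty 512-byte sector 0 with the boot signature."""
    sector = bytearray(512)
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)


# Synthetic images for illustration only (assumed 512-byte sectors).
mbr_only = make_mbr_sector() + bytes(512)
gpt_disk = make_mbr_sector() + b"EFI PART" + bytes(504)

print(partition_table_format(mbr_only))
print(partition_table_format(gpt_disk))
```

Checking for the GPT header before the MBR signature matters because
both kinds of disk normally end sector 0 with the same two bytes.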
|
SC860_082_056 / FW860.20
03/17/17 |
Impact: Availability
Severity: SPE
|
SC860_070_056 / FW860.12
01/13/17 |
Impact: Availability
Severity: SPE |
SC860_063_056 / FW860.11
12/05/16 |
Impact: N/A
Severity: N/A
- This Service Pack contained updates for MANUFACTURING
ONLY.
|
SC860_056_056 / FW860.10
11/18/16 |
Impact: New
Severity: New
System firmware changes that affect certain systems
- DISRUPTIVE: On systems using the PowerVM firmware, a problem was
fixed for an "Incomplete" state caused by initiating a resource dump
with selector macros from NovaLink (vio -dump -lp 1 -fr). The
failure causes the stack frame size of a communication process,
HVHMCCMDRTRTASK, to be exceeded, with a hypervisor page fault that
disrupts the NovaLink and/or HMC communications. The recovery action
is to re-IPL the CEC, but that will need to be done without the
assistance of the management console. For each partition that has an
OS running on the system, shut down the partition from the OS. Then
from the Advanced System Management Interface (ASMI), power off the
managed system. Alternatively, the system power button may also be
used to do the power off. If the management console Incomplete state
persists after the power off, the managed system should be rebuilt
from the management console. For more information on management
console recovery steps, refer to this IBM Knowledge Center link:
https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm.
The fix is disruptive because the size of the PowerVM hypervisor must
be increased to accommodate the over-sized stack frame of the failing
task.
- DEFERRED: On systems using the PowerVM firmware, a problem was
fixed for a CAPI function unavailable condition on a system with the
maximum number of CAPI adapters and partitions. Not enough bytes were
allocated for CAPI for the maximum configuration case. The problem
may be circumvented by reducing the number of active partitions or
CAPI adapters. The fix is deferred because the size of the hypervisor
must be increased to provide the additional CAPI space.
- DEFERRED: On systems using PowerVM firmware, a problem was fixed
for cable card capable PCI slots that fail during the IPL. Hypervisor
I/O Bus Interface UE B7006A84 is reported for each cable card capable
PCI slot that doesn't contain a PCIe3 Optical Cable Adapter for the
PCIe Expansion Drawer (feature code #EJ05). PCI slots containing a
cable card will not report an error but will not be functional. The
problem can be resolved by performing an AC cycle of the system. The
trigger for the failure is that the I2C devices used to detect the
cable cards do not come out of the power on reset process in the
correct state due to a race condition.
|