IBM POWER9 Systems LC Server Firmware
Applies to: AC922 (8335-GTG)
This document provides information about the installation of Licensed Machine or Licensed Internal Code, which is sometimes referred to generically as microcode or firmware.
This package provides firmware for the Power System AC922 (8335-GTG) server only.
The firmware level in this package is:
• OP910.70 / OP9_v2.0.15_1.38 / OpenBMC ibm-v2.3-476-r47
Note 1: If your current firmware level is earlier than OP910.22, you must first update to OP910.22 and then update to the later levels. If the update to OP910.22 is skipped, the code update will fail and leave the BMC inoperable. If this happens, contact IBM Support so that the BMC card can be replaced. The OP910.30 and later levels require more space for the BMC image, so the BMC needs the fix that increases the space for the BMC image before it is updated to these levels.
Note 2: Before updating to OP910.24 or a newer firmware level, ensure that the Linux OS is at RHEL 7.5-ALT LE with the third Z-stream or later, and that the NVIDIA CUDA driver for the NVIDIA Tesla GPUs on the system is at the recommended level 396.44 or later (the minimum level is 396.26). See "1.4 Required level for NVIDIA CUDA driver for the Tesla V100 GPU" for more information. After the firmware update, ensure that the BCM5719 Ethernet driver is updated to level 5719-v1.43 NCSI v1.4.22.0. See "1.5 Required Broadcom Ethernet driver level for the BCM5719". The complete set of update instructions covering the OS, CUDA driver, firmware, and Ethernet driver can be found in the readme guide on Fix Central called "WSP_CUDA_BCM5719_FWUPG_GUIDE.txt".
Note: After updating to this firmware level, you must manually validate that the SBE image is correct. Follow the steps in section "7.1 SBE Validation Steps" to complete the check before booting and using the system on the new firmware level.
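The stepping rule in Note 1 can be sketched in Python as follows. This is a minimal illustration, not an IBM tool: the function names are hypothetical, and it assumes that firmware levels are written in the "OP910.22" form used in this document.

```python
import re

def parse_level(level):
    """Split a level such as 'OP910.22' into comparable integers (910, 22)."""
    match = re.fullmatch(r"OP(\d+)\.(\d+)", level.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware level: {level!r}")
    return int(match.group(1)), int(match.group(2))

def update_path(current, target, bridge="OP910.22"):
    """Return the ordered list of levels to install to get from current to target.

    Hypothetical helper: it only encodes the Note 1 rule that systems below
    the bridge level must stop at the bridge level first.
    """
    cur, tgt, mid = parse_level(current), parse_level(target), parse_level(bridge)
    if tgt <= cur:
        return []  # nothing to install; downgrades are not planned here
    if cur < mid < tgt:
        return [bridge, target]
    return [target]
```

For example, update_path("OP910.20", "OP910.70") yields ["OP910.22", "OP910.70"], while a system already at OP910.22 or later gets the target level only.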
This section specifies the minimum ipmitool code level that the system firmware requires for managing the system. OpenPOWER requires ipmitool level v1.8.15 or later to run correctly on the OP910 firmware. The tool must be capable of establishing an IPMI v2 session with the IPMI support on the BMC.
Verify the ipmitool level on your Linux workstation using the following command:
bash-4.1$ ipmitool -V
ipmitool version 1.8.15
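The same check can be scripted. The sketch below assumes that "ipmitool -V" prints a line of the form shown above; the helper names are illustrative only.

```python
import re
import subprocess

MINIMUM_IPMITOOL = (1, 8, 15)  # minimum level stated in this document

def parse_ipmitool_version(output):
    """Return the version in 'ipmitool version X.Y.Z' output as a tuple of ints."""
    match = re.search(r"ipmitool version (\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError("no ipmitool version string found")
    return tuple(int(part) for part in match.group(1).split("."))

def meets_minimum(output, minimum=MINIMUM_IPMITOOL):
    """True when the reported version is at or above the minimum level."""
    return parse_ipmitool_version(output) >= minimum

def installed_ipmitool_output():
    """Run 'ipmitool -V' on this workstation and return what it prints."""
    result = subprocess.run(["ipmitool", "-V"], capture_output=True,
                            text=True, check=True)
    return result.stdout
```

On a workstation with ipmitool installed, meets_minimum(installed_ipmitool_output()) reports whether the installed level is sufficient.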
If you need to update or add ipmitool on your Linux workstation, you can compile ipmitool (current level 1.8.15) for Linux from the SourceForge sources as follows:
1.1.1 Download the ipmitool tar file from http://sourceforge.net/projects/ipmitool/ to your Linux system
1.1.2 Extract the tarball on the Linux system
1.1.3 cd to top-level directory
1.1.4 ./configure
1.1.5 make
1.1.6 ipmitool will be under src/ipmitool
You may also install the ipmitool package directly from the package repositories of your workstation's Linux distribution.
For specific fix level information on key components of IBM Power Systems LC and Linux operating systems, please refer to the documentation in the IBM Knowledge Center for the AC922 (8335-GTG):
https://www.ibm.com/support/knowledgecenter/POWER9/p9hdx/8335_gtg_landing.htm
If using xCAT on the host OS to do firmware updates, the minimum xCAT level that should be used is 2.13.4 because it has stability improvements for the firmware update process. See the xCAT 2.13.4 release notes below for more information.
https://github.com/xcat2/xcat-core/wiki/XCAT_2.13.4_Release_Notes
The Linux OS has a NVIDIA CUDA driver that must be at recommended level 396.44 or later, or minimum level 396.26 to be compatible with OP910.24. Without this driver, a GPU which has faulted and gone through a GPU reset can cause a Terminate Immediate (TI) for the system. The recommended level for the NVIDIA CUDA driver is level 396.44 to get ATS performance improvements.
The Power AC922 server delivers four Tesla V100 GPUs with NVLink, supported in two processor sockets.
Feature #EC4J provides the NVIDIA Tesla V100 GPU with NVLINK Air-Cooled (16 GB). CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce.
The Tesla CUDA driver can be downloaded from NVIDIA at the following link: https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/396.44/nvidia-driver-local-repo-rhel7-396.44-1.0-1.ppc64le.rpm&lang=us&type=Tesla
To search for the driver manually, go to http://www.nvidia.com/Download/index.aspx?lang=en-us and enter the following information:
Manually find drivers for my NVIDIA products.
Product Type: Tesla
Product Series: V-Series
Product: Tesla V100
Operating System: Linux POWER LE RHEL 7
CUDA Toolkit: 9.2
Language: English(US)
Search results:
Version: 396.44
Release Date: 2018.8.6
Operating System: Linux POWER LE RHEL 7
CUDA Toolkit: 9.2
Language: English (US)
File Size: 47.28 MB
The tools and driver images for updating the BCM5719 Ethernet adapter to NCSI level v1.4.22.0 are provided in Fix Central.
Use the steps provided in the WSP_CUDA_BCM5719_FWUPG_GUIDE.txt readme file to perform the needed updates.
I/O Adapter driver level before update:
Dual port BCM5719 with shared port with BMC (NCSI)
Adapter FW: 5719-v1.43 NCSI v1.3.12.0
I/O Adapter level after update:
firmware-version: 5719-v1.43 NCSI v1.4.22.0
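A quick way to tell whether an adapter still needs the update is to compare the NCSI level in its firmware-version string with the required level. The sketch below assumes the string comes from output such as that of "ethtool -i" for the adapter's interface; the function names are hypothetical.

```python
import re

REQUIRED_NCSI = (1, 4, 22, 0)  # NCSI level required after the update

def ncsi_level(ethtool_output):
    """Extract the NCSI level from a 'firmware-version:' line as a tuple of ints."""
    match = re.search(r"^firmware-version:.*\bNCSI v(\d+(?:\.\d+)+)",
                      ethtool_output, re.MULTILINE)
    if match is None:
        raise ValueError("no NCSI level found in firmware-version line")
    return tuple(int(part) for part in match.group(1).split("."))

def needs_ncsi_update(ethtool_output, required=REQUIRED_NCSI):
    """True when the adapter reports an NCSI level below the required one."""
    return ncsi_level(ethtool_output) < required
```

With the two levels listed above, the "before" string reports that an update is needed and the "after" string does not.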
Downgrading firmware from any given release level to an earlier release level is not recommended.
If you feel that it is necessary to downgrade the firmware on your system to an earlier release level, please contact your next level of support.
Concurrent Firmware Updates not available for LC servers.
Concurrent system firmware update is not supported on LC servers.
On rare occasions, an IPL may fail with a Hostboot timeout error and a message like "Unrecoverable Hardware Failure, (Critical) Hostboot procedure callout." This failure halts the IPL, and a re-IPL should fully recover the system. If the Auto-Reboot policy is enabled (the factory default setting), the system recovers and re-IPLs from the fault without user intervention. If the user has disabled the Auto-Reboot policy, the server remains in the halt state after such a failure, and the user must power the server off and on (or reboot it) to IPL again successfully. If the error persists, follow any service procedures provided with the error log, which may include an action to determine and apply the latest level of system firmware on the server. As a final action, contact IBM Support for assistance.
Here is an example of what would be seen in the system SEL list for the problem:
----Active Alerts----
Entry | ID | Timestamp | Serviceable | Severity | Message | eSEL contents
1 | FQPSPCR0023M | 2019-02-08 20:14:39 | Yes | Critical | Hostboot has become unresponsive | None
2 | FQPSPCR0023M | 2019-02-08 23:04:40 | Yes | Critical | Hostboot has become unresponsive | None
The IBM Knowledge Center document on this problem is titled "FQPSPCR0023M" and can be found at the following link: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9ia7/FQPSPCR0023M.htm
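When many alerts are present, the pipe-delimited listing can be filtered for this event. The sketch below assumes the seven-column layout of the example above; the column keys and function names are illustrative, not part of any IBM tool.

```python
COLUMNS = ["entry", "id", "timestamp", "serviceable", "severity",
           "message", "esel"]

def parse_alerts(text):
    """Parse pipe-delimited alert rows into dictionaries keyed by column name."""
    alerts = []
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != len(COLUMNS) or not fields[0].isdigit():
            continue  # skip the banner, the header row, and malformed lines
        alerts.append(dict(zip(COLUMNS, fields)))
    return alerts

def hostboot_timeouts(alerts, event_id="FQPSPCR0023M"):
    """Return only the alerts that carry the Hostboot-unresponsive event ID."""
    return [alert for alert in alerts if alert["id"] == event_id]
```

Applied to the example listing, hostboot_timeouts(parse_alerts(listing)) returns both entries with their timestamps.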
The Auto-Reboot policy can be read and changed to be enabled using REST APIs, if needed, with the following steps:
1) Establish a REST login session to the BMC and capture the session authentication to a cookie jar file (BMC_IP is the IP address of the BMC):
$ curl -c cjar -k -H "Content-Type: application/json" -X POST https://${BMC_IP}/login -d "{\"data\": [ \"root\", \"BMC root password\" ] }"
{
"data": "User 'root' logged in",
"message": "200 OK",
"status": "ok"
}
2) Read the current Auto-Reboot policy ("1" is enabled, "0" is disabled):
$ curl -b cjar -k -H "Content-Type: application/json" -X GET https://${BMC_IP}/xyz/openbmc_project/control/host0/auto_reboot/attr/AutoReboot
{
"data": 1,
"message": "200 OK",
"status": "ok"
}
3) If the Auto-Reboot policy is not "1" (enabled), it can be set to enabled using the following REST command:
$ curl -b cjar -k -H "Content-Type: application/json" -X PUT https://${BMC_IP}/xyz/openbmc_project/control/host0/auto_reboot/attr/AutoReboot -d '{"data":1}'
{
"data": null,
"message": "200 OK",
"status": "ok"
}
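The three REST steps above can also be driven from a script. The following is a sketch using only the Python standard library; it mirrors the curl commands (including the "-k" behavior of skipping certificate verification) and assumes the BMC responds with the JSON bodies shown above. The function names are hypothetical.

```python
import json
import ssl
import urllib.request
from http.cookiejar import CookieJar

AUTO_REBOOT_ATTR = ("/xyz/openbmc_project/control/host0/"
                    "auto_reboot/attr/AutoReboot")

def build_request(bmc_ip, path, method, payload=None):
    """Build a JSON request; payload is wrapped as {"data": payload}."""
    body = None if payload is None else json.dumps({"data": payload}).encode()
    return urllib.request.Request(
        f"https://{bmc_ip}{path}", data=body, method=method,
        headers={"Content-Type": "application/json"})

def make_opener():
    """Opener that keeps session cookies and, like curl -k, skips TLS checks."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.HTTPSHandler(context=context),
        urllib.request.HTTPCookieProcessor(CookieJar()))

def ensure_auto_reboot(bmc_ip, user, password):
    """Log in, enable Auto-Reboot if it is not already 1, return the final value."""
    opener = make_opener()

    def call(path, method, payload=None):
        with opener.open(build_request(bmc_ip, path, method, payload)) as resp:
            return json.load(resp)["data"]

    call("/login", "POST", [user, password])      # step 1
    if call(AUTO_REBOOT_ATTR, "GET") != 1:        # step 2
        call(AUTO_REBOOT_ATTR, "PUT", 1)          # step 3
    return call(AUTO_REBOOT_ATTR, "GET")
```

A call such as ensure_auto_reboot("192.0.2.10", "root", "BMC root password") would need a reachable BMC; the address here is a documentation placeholder.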
Use the following examples as a reference to determine whether your installation will be concurrent or disruptive.
For the LC server systems, the installation of system firmware is always disruptive.
The BMC and PNOR image tar files are used to update the primary side of the PNOR and the primary side of the BMC only, leaving the golden sides unchanged.
Additional files published:
1. WSP_CUDA_BCM5719_FWUPG_GUIDE.txt - readme file for update sequence of steps for OS, GPU and Ethernet drivers, and Firmware
2. fix_bcm_5719_crc.py - bcm5719 driver install script
3. python3_fix_bcm_5719_crc.py - bcm5719 driver install script (python3 version)
4. lnxfwupg.zip - Broadcom driver update files
5. nx1_ncsi_v1.4.22_PointDrop.zip - Broadcom NCSI driver image
Filename | Size | Checksum
obmc-witherspoon-ibm-v2.3-476-r47.ubi.mtd.tar | 24197120 | d8a55c8b3ce9798f82857c12d921b3b8
witherspoon-IBM-OP9_v2.0.15_1.38.pnor.squashfs.tar | 22927360 | cf2ad1867cb542146a526ff45d0baa0b
WSP_CUDA_BCM5719_FWUPG_GUIDE.txt | 6949 | d549a599b95280d4cb7995f3ea436b09
fix_bcm_5719_crc.py | 4344 | 9fa2c74a376aa7aa139deec13b865aa9
python3_fix_bcm_5719_crc.py | 4352 | 81ee6fa3f80c0ddb5f9ab594e65011c0
lnxfwupg.zip | 1398735 | 23cb464558fc532b24a3106f05bf2ac4
nx1_ncsi_v1.4.22_PointDrop.zip | 75049 | 5a1616b6f1af3ab3e11bc3940fac1c0c
Note: The checksum can be found by running the Linux/Unix/AIX md5sum command against the file (all 32 characters of the checksum are listed), for example: md5sum <filename>
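Where md5sum is not available, the same comparison can be made with a short script. This sketch uses the Python standard library hashlib module; the function names are illustrative.

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the 32-character hexadecimal MD5 digest of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large image files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_md5(path, expected):
    """True when the file's digest matches the published checksum."""
    return md5_of_file(path) == expected.strip().lower()
```

For example, verify_md5("obmc-witherspoon-ibm-v2.3-476-r47.ubi.mtd.tar", "d8a55c8b3ce9798f82857c12d921b3b8") should be true for an intact download of the BMC image.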
After a successful update to this firmware level, the PNOR components and BMC should be at the following levels.
To display the PNOR level, use the following BMC command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION"
To display the BMC level, use the following BMC command: "cat /etc/os-release"
Note: FRU information for the PNOR level does not show the updated levels via the fru command until the system has been booted once at the updated level.
PNOR firmware level: driver content
Display the PNOR firmware level using this command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION"
IBM-witherspoon-OP9_v2.0.15_1.38
op-build-v2.0.15-506-gad8e70f
buildroot-2018.05.1-9-gc99f2ee
skiboot-v6.0.24
hostboot-f3e13b8-pf8d7f51
occ-a07cae7
linux-4.17.12-openpower1-p1739bb2
petitboot-v1.7.5-p6d315b1
machine-xml-94a137f
hostboot-binaries-d0a77a4
capp-ucode-p9-dd2-v4
sbe-de3e1d7
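To confirm the PNOR level after an update without reading the listing by eye, the OP9 build tag can be pulled from the first line of the VERSION file. The sketch below assumes the file content looks like the listing above; the helper names are hypothetical.

```python
import re

EXPECTED_PNOR = "OP9_v2.0.15_1.38"  # PNOR level delivered by this package

def pnor_level(version_text):
    """Return the OP9 build tag found on the first non-empty line."""
    for line in version_text.splitlines():
        if line.strip():
            match = re.search(r"OP9_v[0-9._]+", line)
            if match is None:
                raise ValueError(f"no OP9 level in line: {line!r}")
            return match.group(0)
    raise ValueError("VERSION content is empty")

def pnor_is_current(version_text, expected=EXPECTED_PNOR):
    """True when the PNOR reports the expected level."""
    return pnor_level(version_text) == expected
```

The text passed in would be the output of the "cat" command shown above, captured from the BMC.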
openBMC level:
Display the BMC firmware level from an SSH session on the BMC using this command:
root@witherspoon:~# cat /etc/os-release
id: openbmc-phosphor
name: Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)
version: ibm-v2.3
version_id: ibm-v2.3-476-g2d622cb-r47-0-g1a24cf77d2
pretty_name: Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) ibm-v2.3
build_id: ibm-v2.3-476-g2d622cb-r47
OP910
OP910.70
OP9_v2.0.15_1.38 / BMC ibm-v2.3-476-r47
08/15/22
Impact: Availability Severity: SPE
A problem was fixed for a failed correctable error recovery for a DIMM that causes a flood of SRC BC81E580 error logs and also can prevent dynamic memory deallocation from occurring for a hard memory error. This is a very rare problem caused by an unexpected number of correctable error symbols for the DIMM in the per-symbol counter registers.
OP910.60
OP9_v2.0.15_1.34 / BMC ibm-v2.3-476-r46
12/01/21
Impact: Data Severity: HIPER
New features and functions
Linux-aspeed updated to version v5.4.119.
System firmware changes that affect all systems
HIPER/Non-Pervasive: A problem was fixed for a potential problem with I/O adapters that could result in undetected data corruption.
A security problem was fixed for the BMC that allows remote attackers to make observations that help to obtain sensitive information about the internal state of the network Random Number Generator (RNG). This vulnerability is CVE-2020-16166.
A security problem was fixed for the BMC for a flaw in ICMP packets in the Linux kernel that may allow an attacker to quickly scan open UDP ports. This flaw allows an off-path remote attacker to effectively bypass source port UDP randomization. This vulnerability is CVE-2020-25705.
A problem was fixed for excessive and continuous Self Boot Engine (SBE) timer requests that prevent the SBE from processing normal operation messages. With this fix, the number of continuous timer updates is limited and there is a wait for the timer expiry interrupt to restart sending timer requests.
A problem was fixed for Self Boot Engine (SBE) timer requests expiring immediately when re-trying canceled timer requests. This can cause delays in SBE operations and excessive messaging to the SBE.
A problem was fixed for not checking for busy timer messages to the Self Boot Engine (SBE) that can result in lost timer messages, causing the need to resend messages to the SBE.
A problem was fixed for the sensor for GPU memory temperature being unavailable. This is triggered if the OCC is unable to read a GPU memory temperature sensor at some point during the IPL.
A security problem was fixed for the BMC HTTPS web server that could allow an unauthenticated user to obtain sensitive information. The Common Vulnerabilities and Exposures issue number is CVE-2021-38960.
OP910.51
OP9_v2.0.15_1.30 / BMC ibm-v2.3-476-r32.2
09/30/21
Impact: Security Severity: HIPER
System firmware changes that affect all systems
HIPER/Pervasive: A security problem was fixed that allowed a network attacker to use specially crafted IPMI messages to bypass authentication and gain full control of the system. This is security vulnerability CVE-2021-39296.
OP910.50
OP9_v2.0.15_1.30 / BMC ibm-v2.3-476-r32
01/25/21
Impact: Availability Severity: SPE
System firmware changes that affect all systems
A problem was fixed for a system HMI that can occur if a GPU Address Translation Request (ATR) exceeds the time out period. With the fix, the timeout period was extended to allow worst case extra time for memory accesses if the data was not in the cache.
A problem was fixed for SRC BC70E540 being logged during memory diagnostics in the IPL with no hardware FRU being called out for replacement. This SRC has description "mcb(n0p0c1) (MCBISTFIR[12]) WAT_DEBUG_ATTN". This is a false error log and it may be ignored.
OP910.40
OP9_v2.0.15_1.19 / BMC ibm-v2.3-476-r32
05/13/20
Impact: Availability Severity: SPE
System firmware changes that affect all systems
A problem was fixed for a rare IPL failure with SRCs BC8A090F and BC702214 logged caused by an overflow of VPD repair data for the processor cores. A re-IPL of the system should recover from this problem.
A problem was fixed for a rare IPL failure with one of the following symptoms: 1) An eSEL is logged stating an IPMI timeout occurred. 2) A Hostboot timeout error occurs with a message like "Unrecoverable Hardware Failure, (Critical) Hostboot procedure callout." 3) An eSEL is logged: "FQPSPCR0023M | <timestamp> | Yes | Critical | Hostboot has become unresponsive | None". A re-IPL of the system should recover from this problem. The IBM Knowledge Center document on this problem is titled "FQPSPCR0023M" and can be found at the following link: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9ia7/FQPSPCR0023M.htm
A problem was fixed for failed FRUs associated with checkstops in hostboot not being guarded. This is an intermittent timing problem related to error log entries that have been created but not written or flushed yet at the time of the guard processing.
A problem was fixed for handling On-Chip Controller (OCC) UE errors so that the OCC can reset without terminating the system. Without the fix, the system will checkstop on any OCC UE that requires a reset of the OCC. A re-IPL of the system will recover from this error.
A problem was fixed for an IPL failure caused by an IPMI timeout. This is a rare problem and the re-IPL after the failure recovers from the problem. The eSEL for the IPMI timeout may be ignored.
A problem was fixed for intermittent recoverable errors on the LPC (Low Pin Count) bus being falsely shown as Unrecoverable Errors (UEs). This can occur for some recoverable errors that happen when writing to the serial console. Even though the recovery works, the status for the recovery is misread, causing it to be shown as unrecoverable.
A problem was fixed for a Linux host crash caused by a CPU hard lockup. This can occur when the BMC is rebooted and the normal unresponsiveness of the BMC during this reboot time triggers a time-out on the Pervasive Interconnect Bus (PIB) and a subsequent CPU lockup. With the fix, the host code will do PIB resets and retries until the BMC has completed its reboot and is responsive again to commands.
A problem was fixed for a GPU being reset from an error state remaining fenced and not usable.
A problem was fixed to prevent printing a null when parsing and formatting VPD data.
A problem was fixed for a hang in the OS reboot caused by a TOD failure.
A problem was fixed to increase the severity of abnormal reboot events in the error log.
A problem was fixed in IPMI for a case where a pointer continued to be used after it was freed, causing an intermittent fail if the memory was reused.
A problem was fixed for eSEL PANIC logs not always being sent to the BMC before taking the system down with the error, causing the loss of a log.
A problem was fixed for an OS boot that fails because the BMC is going through a reboot itself. The OS boot can fail when it needs to use BMC services when accessing the flash memory. This can happen if the BMC is not ready to receive commands. With the fix, the boot waits for the BMC to become ready instead of failing immediately on the errant flash access.
A problem was fixed for a fast reboot of the OS failing if VFs (Virtual Functions) were enabled and disabled before the reboot.
A change was made to enable OS software checkstops by default. This prevents hangs of multiple hours in failed reboots if the CPUs become stuck at the start of a kdump.
A problem was fixed for bad flashes caused by data size of memory to flash not being block aligned. This error can intermittently cause partial data to be written to the flash.
OP910.31
OP9_v2.0.14_1.2 / BMC ibm-v2.3-476-r32
04/05/19
Impact: Data Severity: HIPER
System firmware changes that affect all systems
HIPER/Pervasive: A problem was fixed where, under certain conditions, a Power Management Reset (PM Reset) event may result in undetected data corruption. PM Resets occur under various scenarios such as power management mode changes between Dynamic Performance and Maximum Performance, power management controller recovery procedures, or system boot.
A problem was fixed for false processor core failures with SRCs BC131705 and BC8A090F logged. To recover, reboot the system as these are cores intermittently falsely reporting as failed during the IPL.
A problem was fixed for IPMI power down and power on raw commands failing when issued in IPMI Restriction Mode. For this error, the host goes unresponsive with the following SEL list and message logged to the BMC GUI after the raw commands are issued from ipmitool:
root@powerkvm3-lp1:~# ipmitool -I lanplus -H 9.40.192.54 -P 0penBmc sel list
195 | 02/05/2019 | 09:48:19 | System Event #0x01 | Undetermined system hardware failure | Asserted
196 | 02/05/2019 | 10:36:54 | System Event #0x01 | Undetermined system hardware failure | Asserted
197 | 02/05/2019 | 10:40:53 | System Event #0x01 | Undetermined system hardware failure | Asserted
And the following error logged in the BMC GUI:
FQPSPCR0023M: Hostboot has become unresponsive
_PID=2868
MESSAGE=org.open_power.Host.Boot.Error.WatchdogTimedOut
A problem was fixed for not being able to change the "IPMI admin" password away from the BMC default. With the fix, the "IPMI admin" password can be changed using the ipmitool command.
Note: The "IPMI admin" password is independent of the "normal admin" password on the BMC, such as that used by the REST APIs. When REST is used to change the admin password, it is changing the "normal admin" user ID password, not the "IPMI admin" user ID password. Consider changing both of the "admin" user passwords to provide better security.
The following is an example of using the ipmitool command to change the "IPMI admin" user ID password (the "1" represents userid 1, which corresponds to the "IPMI admin" user):
ipmitool user set password 1 your-new-IPMI-admin-password
Support was added to recognize a port parameter in the URL path for the Preboot eXecution Environment (PXE) in the ethernet adapters. Without the fix, there could be PXE discovery failures if a port was specified in the URL for the PXE.
A problem was fixed for a skiboot hang that could occur rarely for an I2C request if the I2C bus is in error or locked by the On-Chip Controller (OCC).
A problem was fixed for "Unexpected TCE size" error messages when Linux tried the default P9 PHB4 page size and used the unsupported 2M and 1G page sizes. The TCE page size property is now set correctly with 4K, 64K, 16M, and 256M supported.
A problem was fixed for PCIe ECC protection in the response data path for POWER9 processor parts. With the fix, PCIe ECC errors detected from the adjacent AIB (Adapter Interface Board) receive data path escalate to a checkstop so that the defective parts can be replaced.
A problem was fixed for an intermittent rare processor core lock failure that is not a real hardware problem. The erroneous failure looks like this in the logs:
LOCK ERROR: Releasing lock we don't hold depth @0x30493d20 (state: 0x0000000000000001)
[13836.000173140,0] Aborting!
CPU 0000 Backtrace:
S: 0000000031c03930 R: 000000003001d840 ._abort+0x60
S: 0000000031c039c0 R: 000000003001a0c4 .lock_error+0x64
S: 0000000031c03a50 R: 0000000030019c70 .unlock+0x54
S: 0000000031c03af0 R: 000000003001a040 .drop_my_locks+0xf4
A problem was fixed for the power-capping range allowed for the user. Changes were made to allow the user to access the entire powercap range, with two minimums exported into the OS: soft power cap minimum "powercap-min" and the hard power cap minimum limit "powercap-hard-min".
A problem was fixed for an OS reboot after a shutdown that intermittently fails after the shutdown. This can happen if the BMC is not ready to receive commands. With the fix, the messages to the BMC are validated and retried as needed. To recover from this error, the system can be rebooted from the BMC interface.
A problem was fixed for a kernel hard lockup that could occur if IPMI synchronous messages were sent from the OS to the BMC while the BMC was rebooting. For these types of messages, a processor thread remains waiting in OPAL until a response is returned from the BMC.
A problem was fixed for a rare Nest Memory Management Unit (NMMU) hang calling out processor hardware incorrectly, masking the real cause of the problem, which is an NPU failure. The incorrect error messages take this form on the system:
3 | FQPSPPU0093G | 2018-10-01 01:25:40 | Yes | Warning | CPU 1 has exceeded a correctable error threshold
4 | FQPSPPU0093G | 2018-10-01 03:20:55 | Yes | Warning | CPU 0 has exceeded a correctable error threshold
5 | FQPSPAA0008M | 2018-10-01 04:35:40 | Yes | Critical | Hostboot procedure callout
A problem was fixed for intermittent user and user level privilege errors for the OS ipmipower command. The following error message is issued: "privilege level cannot be obtained for this user".
OP910.30
OP9_v2.0.11_1.8
02/13/19
Impact: Security Severity: SPE
New features and functions
Support has been re-enabled for erepair spare lane deployment for fabric and memory buses.
Support was added for increasing the number of BMC error logs from 100 to 200 and changing the error log to roll over old entries when full instead of stopping the logging of errors. Without this feature, the error log would get full at 100 entries and error logging would be stopped until some of the error logs were purged to make room for new entries.
System firmware changes that affect all systems
A security problem was fixed to prevent a buffer overflow when loading the boot image that could cause firmware corruption. The firmware mitigation adds additional checking of the initial boot firmware image's load size and terminates the boot if the size is too big. The Common Vulnerabilities and Exposures issue number is CVE-2018-1992.
A security problem was fixed to prevent host programs from being able to corrupt the BMC using the internal software bridges between the host and BMC. The Common Vulnerabilities and Exposures issue number is CVE-2019-6260.
A security problem was fixed to detect and prevent Self Boot Engine (SBE) SEEPROM corruption. The Common Vulnerabilities and Exposures issue number is CVE-2018-8931.
A security problem was fixed to prevent a firmware update causing an unsigned image to be activated. The Common Vulnerabilities and Exposures issue number is CVE-2018-13787.
A problem was fixed for an intermittent opal-prd crash that can happen on the host OS. This is the fault signature: "opal-prd[2864]: unhandled signal 11 at 0000000000029320 nip 0000000102012830 lr 0000000102016890 code 1"
A problem was fixed for a PCI Host Bridge (PHB) configuration write error that caused the incorrect PCIe device to be frozen. The fault will be attributed to the last device to have a memory-mapped I/O operation (MMIO). With this fix, the freeze action for PHB configuration write errors is disabled in order to not impact functional hardware.
A problem was fixed for diagnostic code trying to read sensor values for PCI Host Bridge (PHB) entries that are unused, which causes debug output to have incorrect values for the unused entries. With the fix, only the used entries are processed by the diagnostic code.
A problem was fixed for an IPL loop/hang with a fatal MCE exception log caused by a probe of a failed PCI Host Bridge (PHB) that had been guarded. This is an infrequent error because it requires a PHB to have previously failed. The exception log has the following format:
Fatal MCE at 000000003006ecd4 .probe_phb4+0x570
CFAR : 00000000300b98a0
<snip>
Aborting!
CPU 0018 Backtrace:
S: 0000000031cc37e0 R: 000000003001a51c ._abort+0x4c
S: 0000000031cc3860 R: 0000000030028170 .exception_entry+0x180
S: 0000000031cc3a40 R: 0000000000001f10 *
S: 0000000031cc3c20 R: 000000003006ecb0 .probe_phb4+0x54c
S: 0000000031cc3e30 R: 0000000030014ca4 .main_cpu_entry+0x5b0
S: 0000000031cc3f00 R: 0000000030002700 boot_entry+0x1b8
A problem was fixed for repetitive opening of BMC web sockets during a BMC boot, which caused either a websocket to be opened with no data sent or the inability to establish a new connection.
A problem was fixed for certain system boot failures not propagating to the BMC before the boot firmware shuts down. Some details of the error log may still appear in the console output trace, but the details will not be available with the BMC queries. This problem is timing dependent and intermittently possible depending on the timing of the shutdown path. However, immediate shutdowns exacerbate the problem and increase the chance it can occur.
A problem was fixed for an intermittent error message when activating firmware during a firmware update. This extraneous error message occurred with moderate frequency. It is an internal server 500 error message returned on the REST enumerate request. The error message can be ignored because there is no problem with the firmware activation.
A problem was fixed for an On-Chip Controller (OCC) read failure with ERRNO=19 during a power off of the system. This intermittent problem produces an extraneous error log that can be ignored because the power off is successful.
A problem was fixed for an intermittent power on failure with message "Error in mapper call to get service name". To recover from this problem, power cycle the BMC and try the boot again.
A problem was fixed for not being able to set the Power Supply Redundancy by using a REST API command. Without the fix, this was a read-only attribute.
A problem was fixed for a re-IPL failure of OPAL with BB821410 logged. This is an intermittent and infrequent error that can occur if Skiboot fails to get notified of the BMC mailbox shutdown prior to the re-IPL attempt. The problem can be circumvented by doing another IPL.
A problem was fixed for an intermittent IPL failure with BC131705 and BC8A1703 logged with a processor core called out. This is a rare error and does not have a real hardware fault, so the processor core can be unguarded and used again on the next IPL.
A problem was fixed for not always being able to change the BMC root password with the OpenBMC GUI. However, the root password can be changed using the BMC command line. This problem is very intermittent.
OP910.25
OP9_v1.19_1.192
09/25/18
Impact: Availability Severity: SPE
New features and functions
Support was disabled for erepair spare lane deployment for fabric and memory buses. By not using the FRU spare hardware for an erepair, the affected FRUs may have to be replaced sooner. Prior to this change, the spare lane deployment caused extra error messages during run-time diagnostics. When the problems with spare lane deployment are corrected, this erepair feature will be enabled again in a future service pack.
System firmware changes that affect all systems
A problem was fixed for an MSI-X checkstop in CAPI mode. This occurred intermittently when a DMA from the CAPI device targeted an address lower than 4GB and was mistaken for a 32-bit MSI operation. This is now avoided by disabling the 32-bit MSI when in CAPI mode.
A performance problem was fixed for certain cases of DMA operations from the GPU to an untranslated virtual memory location. With the fix, as much as a 10X performance improvement can occur for this type of DMA from the GPU.
A problem was fixed for L3 cache calling out a LRU Parity error too quickly for hardware that is still good. Without the fix, ignore the L3FIR[28] LRU Parity errors unless they are very persistent with 30 or more occurrences per day.
A problem was fixed for a processor core hang and checkstop during normal operations. This failure occurs only rarely on a race condition in the processor state machine.
A problem was fixed for a failure in DDR4 RCD (Register Clock Driver) memory initialization that causes half of the DIMM memory to be unusable after an IPL. This is an intermittent problem where the memory can sometimes be recovered by doing another IPL. The error is not a hardware problem with the DIMM but an error in the initialization sequence needed to get the DIMM ready for normal operations. This supersedes an earlier fix delivered in OP910.22 that intermittently failed to correct the problem.
OP910.24
OP9_v1.19.1.189
08/16/18
Impact: Availability Severity: SPE
New features and functions
Support was added for 24x7 On-Chip Controller (OCC) counter data collection. It allows a customer to monitor utilization and throughput of memory, buses and other system components. The data it collects is stored in system memory and the firmware provides a call interface for applications to read out this data.
Support was added for parity error checking of the GPU data on the NVLink Datalink Layer (NDL), providing earlier memory fault detection and recovery retries to eliminate transient faults.
System firmware changes that affect all systems
A problem was fixed for a GPU NVLINK writing out of range to a MMIO section of memory with byte-enabled writes that caused a machine check. With the fix, the out of range write is handled (detected) to cause a process core dump, but leaves the system in a usable state.
A problem was fixed for GPU workloads using unified memory with address translation service (ATS) sometimes hanging after resetting the GPUs. The trigger for the failure was putting the NPU in the fenced state via the "NPU Fence State" register with SCOM address 0x5011696. With the fix, the GPU fencing is handled using the NTL (NVLink Transaction Layer) reset register bits instead.
A problem was fixed for NPU log messages that were missing the CPU chip identifiers. With the fix, the CPU taking the HMI (Hypervisor Maintenance Interrupt) is listed along with the NPU FIR register values.
A problem was fixed for the On-Chip Controller (OCC) not being able to communicate with the GPUs for thermal monitoring or power capping. This means the GPUs could overheat or consume too much power for the configuration. The GPUs will continue to operate with the last power cap that was sent. The fans will increase to the maximum speed while in this mode where the OCC cannot read the GPU temperatures.
A problem was fixed for the On-Chip Controller inadvertently disabling the MMIO ATSD flush bits, thereby potentially reducing the performance of the address translation service (ATS) unified memory for the GPU.
A problem was fixed for user applications timing out on the GPU operation for accessing the address translation service (ATS) unified memory, causing an HMI and system termination. With the fix, the ATSD timeout has been disabled, so the user applications can wait for GPU read or write operations to be completed without regard for the time needed for the operation.
A problem was fixed for the SBE timer being stuck and unavailable to the host applications. This forces OPAL to use legacy timer loops for timers at the cost of additional processor bandwidth. Here are the messages that are logged for the problem that occurs on every boot:
[ 194.494559313,3] SBE: Timer stuck, falling back to OPAL pollers.
[ 194.494624185,3] SBE: You will likely have slower I2C and may have experienced increased jitter.
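To check whether a host on an older firmware level is affected, the OPAL message log can be searched for the first message. The following is a minimal sketch, assuming the log is exposed by Linux at /sys/firmware/opal/msglog; the function name sbe_timer_stuck is hypothetical and is not part of the firmware package.

```shell
# Report whether an OPAL message log contains the "SBE: Timer stuck" message.
# Assumption: the host exposes the OPAL log at /sys/firmware/opal/msglog.
# Pass a file name as the first argument to scan a saved copy instead.
sbe_timer_stuck() {
    log_file="${1:-/sys/firmware/opal/msglog}"
    if grep -q 'SBE: Timer stuck' "$log_file" 2>/dev/null; then
        echo "affected"
    else
        echo "not-detected"
    fi
}
```

Running "sbe_timer_stuck" with no argument on the host prints "affected" if the message is present in the log and "not-detected" otherwise.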
A problem was fixed for PCIe4 CX5 adapter performance with an increase of performance of 40% for DMA read requests. The adapter affected is the Mellanox CX5 PCIe4 100Gb IB CAPI with feature codes #EC62 with CCIN 2CF1 and #EC64 with CCIN 2CF2. Without the fix, each read request requires a retry to work.
A problem was fixed for user code running on a GPU that can perform invalid commands to the MMIO space and cause an HMI that brings down the system. With the fix, ill-formatted commands to the MMIO space from the GPU will not be processed as a fatal exception but responses will be set to 0xFFFFFFFF and the GPU will receive a normal response code. The user GPU application can look for the bad response and fail, but the system will continue running without taking an HMI, allowing all other workloads to continue normally.
A problem was fixed in Petitboot V1.7.2 for Petitboot exiting to the shell with xCAT genesis in the menu when trying to do a network boot. Petitboot was timing out when trying to access the FTP server but it was not doing the network re-queries necessary for a proper retry. If this error happens on a system, it can be made to boot with the following two steps: 1) Type the word "exit" and press the Enter key. This returns to the Petitboot menu. 2) Press the Enter key again to start the boot of the xCAT image.
|
OP9_v1.19.1.172 / OP910.22
06/22/18 | Impact: Data Severity: HIPER
New features and functions
Support was added to provide the processor VPD data for the serial number and part number on the host OS. The information can be found in the /proc/device-tree/vpd/root-node-vpd directory path. For example, the following directory path contains the serial-number file for a processor: "/proc/device-tree/vpd/root-node-vpd@a000/enclosure@1e00/backplane@800/processor@1000/serial-number".
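The serial-number files for all processors can be listed in one pass. The following is a sketch only: the function name list_processor_serials is hypothetical, and it assumes that each value is stored as a NUL-terminated string, as is usual for device tree properties.

```shell
# Print "<path>: <serial number>" for every processor node under the VPD tree.
# Assumption: serial-number files hold NUL-terminated strings, so the
# terminator is stripped before printing.
list_processor_serials() {
    vpd_root="${1:-/proc/device-tree/vpd}"
    find "$vpd_root" -type f -path '*processor@*' -name serial-number 2>/dev/null |
        sort |
        while IFS= read -r f; do
            printf '%s: %s\n' "$f" "$(tr -d '\0' < "$f")"
        done
}
```

On the host, "list_processor_serials" with no argument walks /proc/device-tree/vpd; a different root directory can be passed for testing.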
System firmware changes that affect all systems
HIPER/Pervasive: A firmware change was made to address a rare case where a memory correctable error on POWER9 servers may result in an undetected corruption of data.
A problem has been fixed to not guard processor cores on memory checkstop errors resulting from a GPU failure. If this problem occurs, the processor cores can be restored by manually clearing the guard records.
A problem has been fixed for the NPU register data logging to include critical information for NVLINK failures such as the CPU chip identifiers. This information is needed to be able to isolate the cause of the NVLINK faults.
A problem has been fixed for systems unexpectedly running with all processors at lower frequencies than would be expected for Workload Optimized Frequency (WOF) ultra-turbo mode. There was no eSEL or callout for the processor causing the error that disabled the WOF mode. With the fix, there is an eSEL and callout for the WOF fault that identifies the errant processor that needs to be replaced.
A problem has been fixed for a PCIe adapter running in CAPP mode having a missing MMIO Base Address Register (BAR) entry that causes a failure of the adapter and a fence off of two of the four ports of the adapter.
A problem has been fixed for a slow start up of a process that can occur when the system had been previously in an idle state.
A problem has been fixed for a TOD error that can cause a soft lockup of the kernel. A 'soft lockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds, without giving other tasks a chance to run. The current stack trace is displayed upon detection and, by default, the system will stay locked up.
A problem was fixed for a failure in DDR4 RCD (Register Clock Driver) memory initialization that causes half of the DIMM memory to be unusable after an IPL. This is an intermittent problem where the memory can sometimes be recovered by doing another IPL. The error is not a hardware problem with the DIMM but an error in the initialization sequence needed to get the DIMM ready for normal operations.
A problem was fixed for a processor core that cannot be awakened or a timeout in the On-Chip Controller (OCC) when switching Workload Optimized Frequency (WOF) modes from disabled to enabled. These errors can cause a reduction in performance by running with fewer cores or by running at the safe mode frequencies.
|
OP9_v1.19.1.160 / OP910.21
05/18/18 | Impact: Availability Severity: SPE
New features and functions
Support was added to enable Call Home eSELs, allowing system data such as On-Chip Controller (OCC) telemetry to be collected remotely.
Support has been removed from XIVE interrupt controller for the store EOI operation. Hardware has limitations which would require a sync after each store EOI to make sure the MMIO operations that change the ESB state are ordered. This would be performance prohibitive and the PCI Host Bridges (PHBs) do not support the synchronization.
System firmware changes that affect all systems
A problem was fixed for extraneous error logging and console messages for nonexistent NPU registers whenever a processor error occurs.
A problem was fixed for a false call out of a processor on an INTCQFIR[27]. This FIR bit should not call out the processor because the processor has not failed. The error is recoverable and should only serve as an early warning indication.
A problem was fixed for transactional memory that could result in a wrong answer for processes using it. This is a rare problem requiring L2 cache failures that can affect the process determining correctly if a transaction has completed.
A problem was fixed for Workload Optimized Frequency (WOF) where parts may have been manufactured with bad IQ data that requires filtering to prevent WOF from being disabled.
A problem was fixed for the opal-prd service consuming 100% of CPU during and after boot to the host. This is an infrequent intermittent problem that can be circumvented by a reboot of the system.
A problem was fixed for VRMs drawing current over the specification. This occurred whenever heavy workloads went above 372 amps with WOF enabled. At 372 amps, a rollover to value "0" for the current erroneously occurred and this allowed the frequency of the processors in the system to exceed the normally expected values.
A problem was fixed for the wrong DIMM being called out on over-temperature failures with B1xx2A30 errors logged. This should be a rare failure as it requires a DIMM to exceed its maximum specified operating temperature.
A problem has been fixed to clean up memory after a GPU has failed. This fix fences off failed GPUs on a GPU reset. The fencing ensures that access to memory behind the links will not lead to HMIs; instead, SUEs will be populated in the cache. Before installing this fix, the NVIDIA Tesla driver must be updated in the Linux OS to version level 396.26 as a prerequisite. Feature #EC4J provides the NVIDIA Tesla V100 GPU with NVLINK Air-Cooled (16 GB) that requires the updated driver. Without this driver update, a GPU that has faulted and gone through a GPU reset can cause a Terminate Immediate (TI) or HMI for the system. The Tesla CUDA driver can be obtained at the direct NVIDIA link of "http://www.nvidia.com/download/driverResults.aspx/134380/en-us":
TESLA DRIVER FOR LINUX POWER RHEL 7
Version: 396.26
Release Date: 2018.5.17
Operating System: Linux POWER LE RHEL 7
CUDA Toolkit: 9.2
Language: English (US)
File Size: 47.26 MB
A problem has been fixed to add part and serial numbers to the processors when accessed through the device tree.
A problem has been fixed to make the OS aware of the DARN random number generator at 0x00200000 (PPC_FEATURE2_DARN) and the SCV syscall at 0x00100000 (PPC_FEATURE2_SCV). Without this fix, these service constants are not defined in the OS userspace.
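The two constants are bit masks, so a capability word can be tested against them with a bitwise AND. The following sketch illustrates the check under the assumption that the capability word has already been obtained (for example, from the auxiliary vector of a process); the function name hwcap2_has and the sample values are hypothetical.

```shell
# Bit masks taken from the text above.
PPC_FEATURE2_DARN=0x00200000
PPC_FEATURE2_SCV=0x00100000

# Succeed (exit status 0) if the feature mask in $2 is set in the
# capability word in $1. Both arguments may be given in hex.
hwcap2_has() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# Example with a made-up capability word that has both bits set:
#   hwcap2_has 0x00300000 "$PPC_FEATURE2_DARN" && echo "DARN available"
```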
|
OP9_v1.19.1.154 / OP910.20
04/18/18 | Impact: Availability Severity: SPE
This Service Pack includes updates in response to Recent Security Vulnerabilities, New Features & Functions and System Firmware Updates. Details of each are below:
Response for Recent Security Vulnerabilities
In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue number CVE-2017-5754 with firmware initializations augmenting an earlier fix provided in FW level OP910.10. Operating System updates are required in conjunction with the new FW level for addressing CVE-2017-5754.
New features and functions (not related to above CVE)
Support for voltage-droop monitors (VDM) to provide for improved system reliability during periods of unstable voltage from the power supply. The P9 processor uses an adaptive clock strategy to reduce the system power usage during power supply droop events by embedding analog VDMs that direct a digital phase-locked loop (DPLL) to immediately reduce clock frequency in response to the droop event.
Support for Workload Optimized Frequency (WOF). This feature provides the maximum processor frequency in order to increase system performance based on workload characteristics.
Support was added for using "ipmitool mc info" from the host OS to get the BMC firmware level.
Support was added to increase the number of NPU2 register contents dumped for NVLINK Hypervisor Maintenance Interrupts (HMIs) and to add logging for the HMI actions.
Support was added to make the Self Boot Engine (SBE) fault indicator bits recoverable. This means that if an SBE SEEPROM error occurs, recovery action will be taken to prevent an IPL failure or system outage.
System firmware changes that affect all systems
A problem was fixed for an On-Chip Controller (OCC) not going active caused by a race condition in the initialization of the OCCs. This problem is intermittent and can be resolved by a re-IPL of the system.
A problem was fixed for the BMC journal file getting overwritten with network change notifications when there is an IPv6 router in the local subnet.
A problem was fixed for the BMC version fields not being set as shown by "ipmitool mc info" and the Petitboot System Information UI. The BMC can be accessed by SSH (Secure Shell) login and the following command run to show the BMC firmware level: "cat /etc/os-release". Look for the "VERSION=" string that has the BMC version identifier appended to it.
A problem was fixed for the display of the power supply output voltage that in one instance was showing as 390V instead of 12V. The voltage is at the right level but recent revisions of the power supply firmware had a change in how the output voltage was calculated, causing the displayed values to read too high.
A problem was fixed for VLAN ID showing as "Disabled" with the "ipmitool lan print 1" command after the VLAN was set in-band by the OS. The VLAN is set correctly and functional, but the display of the VLAN information, while initially correct, went to "Disabled" during the first minute after the operation.
A problem was fixed for no amber fault LEDs being lit (or SELs reported) for front or rear fan rotors that have an RPM of zero due to blockage or other hardware error.
A problem was fixed for the host failing during a reset of the BMC when a host to BMC message had a time out. This problem is rare as the host normally stays up and running when the BMC is reset.
A problem was fixed for multi-rotor failures in a fan not causing a system shutdown, making it possible for the system to fail from an overheat condition that could be destructive to other system FRUs. This problem is rare as it requires that more than one rotor fail in a system fan at the same time.
A problem was fixed for a change or enablement of the NTP time server not forcing a network time synchronization, potentially leaving the BMC local time different from the network time. This problem can be circumvented by a reset of the BMC.
A problem was fixed for a BMC reset causing the On-Chip Controller (OCC) to fail and the system going into Safe mode. This is an infrequent problem that is triggered if a BMC reset and an OCC reset happen at the same time such that the BMC is unable to respond to OCC messages, forcing the OCC into a failed state.
A problem was fixed for error log "BC8A2502 - IPMI::RC_INVALID_SENDRECV" occurring during the system IPL. The On-Chip Controller (OCC) error is automatically recovered, so the error log does not impact the system.
A problem was fixed for error log "BC8A2507 - IPMI::RC_SENSOR_NOT_PRESENT" that can occur on a system power on if the BMC was reset at system runtime previously. When the BC8A2507 error occurs, the host uses the default value for the sensor data. The problem will persist for the IPL until the BMC is reset.
A problem was fixed for a power supply FQPSPPW0034M error persisting with enclosure fault LEDs lit even after the power supply problem has been corrected. The fault can be triggered by a momentary loss of AC or by unplugging and plugging AC into the power supply.
A problem was fixed for Coherent Accelerator Processor Proxy (CAPP) mode for the PCI Host Bridge (PHB) to improve DMA write performance by enabling channel tag streaming for the PHB. With this enabled, the DMA write does not have to wait for a response before sending a new write command on the bus.
A problem was fixed for the Open-Power Flash tool "pflash" failing with a blocklevel_smart_erase error during a pflash. This problem is infrequent and is triggered if pflash detects a smart erase fits entirely within one erase block.
A problem was fixed in the Petitboot user interface to handle cursor mode arrow keys for the VT100 'application' cursor to prevent mis-interpreting an arrow key as an escape key in some situations. For more information on the VT100 cursor keys, see http://www.tldp.org/HOWTO/Keyboard-and-Console-HOWTO-21.html.
A problem was fixed in the Petitboot user interface to cancel the autoboot if the user has exited the Petitboot user interface. This prevents the user dropping to the shell and then having the machine boot on them instead of waiting until the user is ready for the boot.
A problem was fixed in the Petitboot parsing of manually-specified configuration files that caused the parser to create file paths relative to the downloaded file's path, not the original remote path.
A problem was fixed for a failure to IPL with SRC BC8A0506 logged for a Phase Lock Loop (PLL) error in the PCIe Host Bridge (PHB). This problem is very infrequent. The fix does the correct call out of the failed FRU, allowing the IPL to continue.
A problem was fixed for a system IPL hang that shows in the log as the host going to a quiesce state with the OS inactive. This is a rare problem that may be recovered by a power off and re-IPL of the system. This problem is triggered by a higher than normal level of interrupts from the Power Supply Unit (PSU).
A problem was fixed for the VPD serial number not being updated on the replacement of a planar. The VPD update failed with the following message: "ERROR: (ECMD): ecmd - 'putvpdkeyword' returned with error code 0x20300001 (ERROR OPENING DECODE FILE). ERROR: A problem occurred updating the serial number(OSYS:SS). Please see previous output for reason ".
A problem was fixed for the CAS latency calculation for memory to improve its accuracy to reduce the potential for DIMM failures due to memory timing errors. Column Access Strobe (CAS) latency is the delay time between the moment a memory controller tells the memory module to access a particular memory column on a RAM module, and the moment the data from the given array location is available on the module's output pins.
A problem was fixed for clearing DIMM guard records when there was a repair marked in the VPD and that prevented the DIMM from being unguarded. With the fix, the VPD mark will be cleared if the guard record is cleared for the FRU, allowing it to be enabled on the next IPL.
A problem was fixed for the Self Boot Engine (SBE) error identification on failure. The SRR0/SRR1/LR/Local FI2C registers are now extracted to allow the following SBE errors to be identified:
▪100 - Program interrupt, promoted
▪101 - Instruction storage interrupt, promoted
▪110 - Alignment interrupt, promoted
▪111 - Data storage interrupt, promoted
A problem was fixed for read margins to improve the margins on DIMMs, reducing the number of DIMM failure occurrences.
A problem was fixed for the Hostboot reset to enable error recovery of Hostboot through the reset path. Without the fix, the Self Boot Engine (SBE) fails to reboot on the Hostboot reset, preventing error recovery for the Hostboot failures.
A problem was fixed for a memory training error that could cause DIMMs to be marked as bad or memory ports to be deconfigured. This problem is rare and triggered by an incorrect internal voltage level.
A problem was fixed for a Phase Lock Loop (PLL) error causing a checkstop but not calling out and guarding the failed hardware. There is then a chance the failure will recur on the next IPL of the system.
A problem was fixed to reduce memory latency in memory blocks where bad memory bits have been marked.
A problem was fixed for time and date fields being zero in Hostboot error log entries for the time/date of the error occurrence.
A problem was fixed for an extraneous "MCBISTFIR[3]: broadcast out of sync" error during memory diagnostics if a Register Clock Driver (RCD) parity error occurs. The "broadcast out of sync" error should be ignored when isolating the RCD fault. This problem is triggered if the RCD parity error occurs while the DDR4 memory is in broadcast mode.
A problem was fixed for BC8A2AC5 and BC8A2AC4 errors that prevented the reading of On-Chip Controller (OCC) thermal readings from the Analog Power Subsystem Sweep (APSS) bus. This is a very rare problem.
A problem was fixed for a processor fault that caused the master processor core to guard and prevented an IPL of the system with SRC BC13E540. With the fix, the system will IPL up on the available processor cores. This error only occurs if the master core is faulted. Faults on the other cores are handled correctly and do not stop an IPL.
A problem was fixed for a flood of OPAL error messages that can occur for a processor fault. The message "CPU ATTEMPT TO RE-ENTER FIRMWARE" appears as a large group of messages that precedes the relevant error messages for the processor fault. A reboot of the system is needed to recover from this error.
|
OP9_v1.19_1.111 / OP910.10 | Impact: Availability Severity: SPE
This Service Pack includes updates in response to Recent Security Vulnerabilities, New Features & Functions and System Firmware Updates. Details of each are below:
Response for Recent Security Vulnerabilities
In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue numbers CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754. Operating System updates are required in conjunction with this FW level for CVE-2017-5753 and CVE-2017-5754.
New features and functions (not related to above CVEs)
Support was added for increasing the number of BMC error logs from 100 to 200 and changing the error log to roll over old entries when full instead of stopping the logging of errors. Without this feature, the error log would get full at 100 entries and error logging would be stopped until some of the error logs were purged to make room for new entries.
Support was added to enable power supply redundancy.
Support was added for air-cooled fan control to optimize fan speeds for the temperature conditions and to improve fan speed control to minimize fan speed oscillation.
Support was added for advanced power supply fault monitoring to improve fault isolation, error detection, and reliability.
Support was added for forcing a new dump type, Checkstop, if the host has a checkstop. Without this new dump, critical debug information is missing because the /var/lib/obmc_console.log does not
System firmware updates (not related to above CVEs)
A problem was fixed for intermittent processor core hangs that caused checkstops with code "NCU no response to snooped TLBIE".
A problem was fixed for fans being reported as "Nonfunctional". This error occurred during peak loads on the BMC that tripped a watchdog process, causing the fans to speed up to the maximum speed. An error in the fan recovery to normal speed resulted in the "Nonfunctional" status.
A problem was fixed for a processor replacement that caused extra cores to be reported as present that do not exist. This happens if the new processor has fewer cores than the processor that is being replaced. This problem can be recovered by doing a factory reset on the BMC.
A problem was fixed for GPU temperatures not being reported on systems that have maximum DIMM configurations for the memory. Without the fix, reducing the number of DIMMs plugged in would make available On-Chip Controller (OCC) slots for missing GPU temperatures to be reported.
A problem was fixed for the host time inadvertently changing when a BMC time change is requested in NTP mode with Split ownership. The problem can be recovered by IPLing to the host and the NTP server will correct the host time.
A problem was fixed for the host time skewing ahead in time after time ownership is split and the clock has been set from the BMC. The problem can be recovered by setting the correct time from the host.
A problem was fixed to reject the use of the path /org/openbmc on the REST API URIs. This affects the API /org/openbmc/sensors/host/PowerSupplyRedundancy which is no longer valid.
A problem was fixed for the BMC REST server going into a retry hang with the BMC becoming unresponsive when given a REST command with a bad data format. Without the fix, the REST server will repeatedly retry the bad command, causing a denial of service for all other users of the BMC.
A problem was fixed for an On-Chip Controller (OCC) read failure with ERRNO=11 during an IPL. This intermittent problem was caused by an overflow of the total system power value from the OCC. The system can be recovered by retrying the IPL.
A problem was fixed for the ECC error recovery. Error recovery was not working and the ECC errors would prevent the boot.
A problem was fixed for an intermittent power on failure with message "Error in mapper call to get service name". To recover from this problem, power cycle the BMC and try the boot again.
A problem was fixed for an On-Chip Controller (OCC) read failure with ERRNO=19 during a power off of the system. This intermittent problem is an extraneous error log and can be ignored, as the power off is successful.
A problem was fixed for an intermittent error message when activating firmware during a firmware update. This extraneous error message occurred with moderate frequency. It is an internal server error (500) message returned on the REST enumerate request. The error message can be ignored because there is no problem with the firmware activation.
A problem was fixed for the power button LED not blinking when in the standby state (not powered on). Without the fix, the power button always has a solid green LED, regardless of power on or power off state.
A problem was fixed for intermittent host checkstops caused by NCU and PCI time-out mismatches. PCI timeouts that are longer than NCU timeouts may cause checkstops on the host.
|
OP9_v1.19_1.94 / OP910.00 | Impact: New Severity: New New features and functions for MTM 8335-GTG: GA Level |
OS levels supported by the LC 8335 servers:
- Minimum level is Red Hat Enterprise Linux 7.5 for IBM Power LE (POWER9), also known as RHEL 7.5-ALT LE, with third Z-stream or later (https://access.redhat.com/errata/RHBA-2018:2467 has the needed kernel "kernel-alt-4.14.0-49.10.1.el7a.src.rpm" ). The recommended level is RHEL 7.5-ALT LE, with fourth Z-stream or later (https://access.redhat.com/errata/RHSA-2018:2772 has the needed kernel "kernel-alt-4.14.0-49.13.1.el7a.src.rpm" ).
This RHEL level has fixes for ATS (Address Translation Service) for improved performance for the GPU access of memory.
- NVIDIA Tesla CUDA recommended driver level 396.44 or later, or minimum driver level 396.26 from the CUDA 9.2 toolkit
- Broadcom Ethernet driver level for the BCM5719 I/O adapter of 5719-v1.43 NCSI v1.4.22.0 or later.
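The kernel requirement in the list above can be checked with a version-aware comparison. The following is a minimal sketch, assuming that "uname -r" reports the kernel release in the same form as the package name (for example 4.14.0-49.10.1.el7a, optionally followed by .ppc64le) and that the installed sort command supports the -V (version sort) option; the function name kernel_meets_minimum is hypothetical.

```shell
# Minimum kernel level from the third Z-stream erratum listed above.
MIN_KERNEL="4.14.0-49.10.1.el7a"

# Succeed if the kernel release in $1 is at or above the minimum in $2
# (default: MIN_KERNEL). The architecture suffix is removed first so that
# only the version and release fields are compared.
kernel_meets_minimum() {
    current="${1%.ppc64le}"
    minimum="${2:-$MIN_KERNEL}"
    lowest=$(printf '%s\n%s\n' "$minimum" "$current" | sort -V | head -n 1)
    [ "$lowest" = "$minimum" ]
}

# Example use on the host:
#   kernel_meets_minimum "$(uname -r)" || echo "Kernel is below the minimum level"
```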
IBM Power LC 8335 servers support Linux, which provides a UNIX-like implementation across many computer architectures. Linux supports almost all of the Power System I/O and the configurator verifies support on order. For more information about the software that is available on IBM Power Systems, see the Linux on IBM Power Systems website:
http://www.ibm.com/systems/power/software/linux/index.html
The Linux operating system is an open source, cross-platform OS. It is supported on every Power Systems server IBM sells. Linux on Power Systems is the only Linux infrastructure that offers both scale-out and scale-up choices.
A supported version of Linux on the Power LC 8335 is Red Hat Enterprise Linux 7.5 for IBM Power LE (POWER9) (RHEL 7.5-ALT LE).
For additional questions about the availability of this release and supported Power servers, consult the Red Hat Hardware Catalog at
https://access.redhat.com/products/red-hat-enterprise-linux/#addl-arch.
For more information about Linux on Power, see the Linux on Power developer center at https://developer.ibm.com/linuxonpower/
For information about the features and external devices that are supported by Linux, see this website:
http://www.ibm.com/systems/power/software/linux/index.html
Use one of the following commands at the Linux command prompt to determine the current Linux level:
•cat /proc/version
•uname -a
The output string from the command will provide the Linux version level.
The opal-prd package on the Linux system collects the OPAL Processor Recovery Diagnostics messages to log file /var/log/syslog. It is recommended that this package be installed if it is not already present as it will help with maintaining the system processors by alerting the users to processor maintenance when needed.
On Red Hat Linux, run the command "rpm -qa | grep -i opal-prd". The package is installed on your system if the rpm for opal-prd is found and displayed in the command output. This package provides a daemon to load and run the OpenPower firmware's Processor Recovery Diagnostics binary, which is responsible for run-time maintenance of Power hardware. If the package is not installed on your system, the following command can be run on Red Hat to install it:
sudo yum install opal-prd
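The check can also be scripted with a direct package query instead of filtering the full package list. This is a sketch under the assumption that the package is named exactly opal-prd; the function name opal_prd_status is hypothetical.

```shell
# Print "installed" if the opal-prd package is present, "missing" otherwise.
# "rpm -q <name>" exits with a non-zero status when the package is not installed.
opal_prd_status() {
    if rpm -q opal-prd >/dev/null 2>&1; then
        echo "installed"
    else
        echo "missing"
    fi
}

# Example use:
#   [ "$(opal_prd_status)" = "missing" ] && echo "Install opal-prd before continuing"
```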
To display the PNOR level, use the following BMC command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION"
To display the BMC level, use the following BMC command: "cat /etc/os-release".
Note: Run the "cat" commands after logging in to the BMC over SSH as root. The default password is 0penBmc (where 0 is the zero character).
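To pull only the BMC version identifier out of the os-release file, the "VERSION=" line can be filtered. The following is a sketch, assuming the usual os-release layout of one KEY="value" pair per line; the function name bmc_version is hypothetical.

```shell
# Print the value of the VERSION= line from an os-release file,
# with any surrounding double quotes removed.
bmc_version() {
    os_release="${1:-/etc/os-release}"
    sed -n 's/^VERSION=//p' "$os_release" | tr -d '"'
}
```

Run on the BMC, "bmc_version" prints the identifier that follows "VERSION=". Lines such as VERSION_ID= are not matched because the pattern requires the equals sign directly after VERSION.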
Follow the instructions on Fix Central. You must read and agree to the license agreement to obtain the firmware packages.
Note 1: If your current firmware level is less than OP910.22, you must first update to OP910.22 and then update to later levels. If the update to the firmware level OP910.22 is skipped, the BMC will fail on the code update and it will be dead. If this happens, IBM Support needs to be contacted so that the BMC card can be replaced. The OP910.30 and later levels require more space for the BMC image, so before updating to these levels, the BMC needs the fix that increases the space for BMC image.
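The rule in Note 1 can be expressed as a small planning helper. This is an illustrative sketch only: the function name update_path is hypothetical, and it assumes that firmware levels are written in the OP910.NN form used in this document.

```shell
# Print the firmware levels to install, in order, to get from the current
# level ($1) to the target level ($2, default OP910.70).
# Levels below OP910.22 must stop at OP910.22 first, as described in Note 1.
update_path() {
    current="$1"
    target="${2:-OP910.70}"
    cur_minor="${current#OP910.}"
    tgt_minor="${target#OP910.}"
    if [ "$cur_minor" -lt 22 ] && [ "$tgt_minor" -gt 22 ]; then
        echo "OP910.22 $target"
    else
        echo "$target"
    fi
}
```

For example, "update_path OP910.10" prints "OP910.22 OP910.70", and "update_path OP910.24" prints "OP910.70".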
Note 2: Before updating to the OP910.24 or later firmware levels, ensure that the Linux OS is at RHEL 7.5-ALT LE with the third Z-stream or later and the NVIDIA CUDA driver for the NVIDIA Tesla GPUs on the system is at the recommended driver level of 396.44 or later, or the minimum level 396.26. See "1.4 Required level for NVIDIA CUDA driver for the Tesla V100 GPU" for more information. After the firmware update, ensure that the BCM5719 Ethernet driver is updated to level 5719-v1.43 NCSI v1.4.22.0. See "1.5 Required Broadcom Ethernet driver level for the BCM5719". The complete set of update instructions covering the OS, CUDA driver, firmware, and Ethernet driver can be found in the readme guide on Fix Central called "WSP_CUDA_BCM5719_FWUPG_GUIDE.txt".
Note: After updating to this firmware level, it is necessary to do a manual check to validate that the SBE image is correct. Follow the steps in section "7.1 SBE Validation Steps" to complete the check before trying to boot and use the system on the new firmware level.
1.Power off the system if it is not already off
2.Perform BMC and PNOR update
3.Power on system to petitboot menu or OS login prompt
4.Power off the system
5.BMC and PNOR Images Verification Prerequisite:
▪Both must be using primary image, or
▪Both must be using Alternate image
6.Log in to OpenBMC
7.Start SBE Validation by running the command "sbe-validation.sh -verbose".
8.If validation is successful, there will be a SEL and the message "Core root of trust FW verified. No malicious code modification found".
9.If validation fails (three retries are done internally), the message "Core root of trust FW verification failed. Suspicious code found" is displayed.
10.If there is a failure, check the system power status. If system power is off, the system power issue must be verified and resolved, and then "sbe-validation.sh" must be run again.
11.Power off the system to complete the validation.
12.The validation takes about three minutes, with power on taking 30 seconds, SBE validation taking 130 seconds, and power off taking about 20 seconds.
root@witherspoon:~# date;sbe-validation.sh;date
Tue Nov 27 19:22:57 UTC 2018
Powering On Chassis.
Running Image Validation tool
SEEPROM corruption detection tool running on Linux at Tue Nov 27 19:23:28 2018
Scanning completed at Tue Nov 27 19:25:31 2018
Corruption was not detected
Powering Off Chassis
Tue Nov 27 19:25:45 UTC 2018
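The run above is bracketed by two "date" commands, so its elapsed time can be worked out from the two timestamps. The helper below is a sketch for same-day timestamps in HH:MM:SS form; the function names to_seconds and elapsed_seconds are hypothetical.

```shell
# Convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the number of seconds between two same-day timestamps ($1 start, $2 end).
elapsed_seconds() {
    echo $(( $(to_seconds "$2") - $(to_seconds "$1") ))
}
```

For the transcript above, "elapsed_seconds 19:22:57 19:25:45" gives 168 seconds, which is in line with the time estimate given in step 12.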
root@witherspoon:~# sbe-validation.sh -verbose
Powering On Chassis.
Running Image Validation tool
SEEPROM corruption detection tool running on Linux at Mon Oct 15 18:07:45 2018
Scanning completed at Mon Oct 15 18:08:13 2018
SEEPROM corruption detection tool running on Linux at Mon Oct 15 18:08:13 2018
Scanning completed at Mon Oct 15 18:08:42 2018
SEEPROM corruption detection tool running on Linux at Mon Oct 15 18:08:42 2018
Scanning completed at Mon Oct 15 18:09:10 2018
Corruption was detected. IBM recommends powering down the host. Call IBM service for corrective action
Powering Off Chassis
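If the console output of sbe-validation.sh is captured to a file or a variable, the result can be classified by searching for the two result lines shown in the samples above. The following Python sketch assumes that those lines ("Corruption was not detected" and "Corruption was detected") appear exactly as in the samples. The function names are hypothetical, and reading the number of "Scanning completed" lines as the number of scan attempts is an interpretation of the sample output, not documented behavior.

```python
import re

# Result lines copied from the sample outputs in this section.
PASS_MARKER = "Corruption was not detected"
FAIL_MARKER = "Corruption was detected"


def classify_sbe_output(text):
    """Return 'pass', 'fail', or 'unknown' for captured script output."""
    if PASS_MARKER in text:
        return "pass"
    if FAIL_MARKER in text:
        return "fail"
    return "unknown"


def count_scans(text):
    """Count the 'Scanning completed at ...' lines in the captured output."""
    return len(re.findall(r"^Scanning completed at ", text, flags=re.MULTILINE))
```

An "unknown" result means that neither line was found, for example because the script was interrupted. In that case, check the system power status as described in the steps and run the script again.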
Link: https://<bmc ip>/#/server-health/event-log
Login: curl -c cjar -k -X POST -H "Content-Type: application/json" -d '{"data": [ "root", "0penBmc" ] }' https://bmc_ip/login
Get Log: curl -c cjar -b cjar -k -H "Content-Type: application/json" -X GET https://bmc_ip/xyz/openbmc_project/logging/entry/enumerate
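The entries returned by the Get Log request can be filtered for the SBE validation result. The following Python sketch is a minimal example that works on an already parsed mapping of log entry paths to entry fields. It assumes the "Message" and "Timestamp" fields shown in the sample outputs in this section. It also assumes that, in the full response, this mapping is found under a top-level "data" key, which should be confirmed on your system. The function names are hypothetical.

```python
# Message values copied from the sample log entries in this section.
PASS_MSG = "xyz.openbmc_project.SBE.SEEPROM.Error.ValidationPass"
FAIL_MSG = "xyz.openbmc_project.SBE.SEEPROM.Error.ValidationFail"


def sbe_events(entries):
    """Return (timestamp, path, status) tuples for SBE validation entries.

    `entries` maps a log entry path to a dict of its fields. Entries
    with any other Message value are ignored.
    """
    results = []
    for path, entry in entries.items():
        if not isinstance(entry, dict):
            continue
        message = entry.get("Message")
        if message == PASS_MSG:
            status = "pass"
        elif message == FAIL_MSG:
            status = "fail"
        else:
            continue
        results.append((entry.get("Timestamp", 0), path, status))
    return sorted(results)


def latest_sbe_status(entries):
    """Return 'pass' or 'fail' for the newest SBE entry, or None if absent."""
    events = sbe_events(entries)
    return events[-1][2] if events else None
```

For example, after parsing the response with json.loads, pass the mapping (assumed here to be response["data"]) to latest_sbe_status and confirm that the result is "pass" before booting the system on the new firmware level.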
Output for a successful run:
"/xyz/openbmc_project/logging/entry/5": {
"AdditionalData": [
"_PID=16231"
],
"Description": "Firmware security validation passed",
"EventID": "FQPSPSE0066I",
"Id": 1,
"Message": "xyz.openbmc_project.SBE.SEEPROM.Error.ValidationPass",
"Purpose": "xyz.openbmc_project.Software.Version.VersionPurpose.BMC",
"Resolved": false,
"Severity": "xyz.openbmc_project.Logging.Entry.Level.Informational",
"Timestamp": 1543343550693,
"Version": "ibm-v2.3-476-g2d622cb-r22-0-g7878dc8",
"associations": [] }
Output for a failed run:
"/xyz/openbmc_project/logging/entry/17": {
"AdditionalData": [
"_PID=1903"
],
"Description": "An internal BMC error occurred",
"EventID": "None",
"Id": 17,
"Message": "xyz.openbmc_project.SBE.SEEPROM.Error.ValidationFail",
"Purpose": "xyz.openbmc_project.Software.Version.VersionPurpose.BMC",
"Resolved": false,
"Severity": "xyz.openbmc_project.Logging.Entry.Level.Error",
"Timestamp": 1539626950885,
"Version": "ibm-v2.3-476-g2d622cb-r7-0-gd7d4a34",
"associations": [] }
The updating and upgrading of system firmware depends on several factors, such as the current firmware that is installed and which operating system is running on the system.
These scenarios and the associated installation instructions are comprehensively outlined in the firmware section of Fix Central, found at the following website:
http://www.ibm.com/support/fixcentral/
Any hardware failures should be resolved before proceeding with the firmware updates to help ensure that the system will not be running degraded after the updates.
The process of updating firmware on the OpenBMC managed servers is documented below.
The sequence of events that must happen is the following:
•Power off the Host
•Update and Activate BMC
•Update and Activate PNOR
•Reboot the BMC (applies new BMC image)
•Power on the Host (applies new PNOR image)
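The ordering above can be captured as a small checklist so that an update procedure or script does not skip a step. The following Python sketch is illustrative only. The step identifiers are hypothetical labels for the five events listed above and do not correspond to openbmctool commands.

```python
# Hypothetical labels for the five events, in the required order.
REQUIRED_ORDER = [
    "power_off_host",
    "update_activate_bmc",
    "update_activate_pnor",
    "reboot_bmc",
    "power_on_host",
]


def next_step(completed):
    """Return the next required step, or None when all steps are done.

    Raises ValueError if `completed` is not a prefix of REQUIRED_ORDER,
    that is, if a step was skipped or done out of order.
    """
    completed = list(completed)
    if completed != REQUIRED_ORDER[:len(completed)]:
        raise ValueError("steps were skipped or performed out of order")
    if len(completed) == len(REQUIRED_ORDER):
        return None
    return REQUIRED_ORDER[len(completed)]
```

Calling next_step with the steps completed so far returns the step to perform next, and raises an error if, for example, the PNOR update is attempted before the BMC update.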
The OpenBMC firmware updates (BMC and PNOR) for the LC 8335 servers can be managed via the command line with the openbmctool.
The openbmctool is obtained using the IBM Support Portal.
1.Go to the IBM Support Portal.
2.In the search field, enter your machine type and model. Then click the correct product support entry for your system.
3.From the Downloads list, click the openbmctool for your machine type and model.
4.Follow the instructions to install and run the openbmctool. You will need to provide the file locations of the BMC firmware image tar and PNOR firmware image tar that must be downloaded from Fix Central for the update level needed.
Information on the openbmctool and the firmware update process can be found in the IBM Knowledge Center:
https://www.ibm.com/support/knowledgecenter/POWER9/p9ei8/p9ei8_update_firmware_openbmctool.htm .
The service processor, or baseboard management controller (BMC), provides a hypervisor and operating system-independent layer that uses the robust error detection and self-healing functions that are built into the POWER processor and memory buffer modules. Open Power Abstraction Layer (OPAL) is the system firmware in the stack of POWER processor-based Linux-only servers.
The service processor, or baseboard management controller (BMC), is the primary control for autonomous sensor monitoring and event logging features on the LC server.
The BMC supports the Intelligent Platform Management Interface (IPMI) for system monitoring and management. The BMC monitors the operation of the firmware during the boot process and also monitors the OPAL hypervisor for termination.
The Open Power Abstraction Layer (OPAL) provides hardware abstraction and run time services to the running host Operating System.
For the 8335 servers, only the OPAL bare-metal installs can be used.
Find out more about OPAL skiboot here:
https://github.com/open-power/skiboot
The Intelligent Platform Management Interface (IPMI) is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independent of the main CPU, BIOS, and OS. The LC 8335 servers provide one 10M/100M baseT IPMI port.
The ipmitool is a utility for managing and configuring devices that support IPMI. It provides a simple command-line interface to the service processor. You can install the ipmitool from the Linux distribution packages in your workstation, sourceforge.net, or another server (preferably on the same network as the installed server).
For installing ipmitool from sourceforge, please see section 1.1 "Minimum ipmitool Code Level".
Several good references are available for ipmitool commands:
The man page
The built-in command line help, which provides a list of ipmitool commands:
# ipmitool help
You can also get help for many specific ipmitool commands by adding the word help after the command:
# ipmitool channel help
For a list of common ipmitool commands and help on each, you may use the following link:
www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabpcommonipmi.htm
To connect to your host system with IPMI, you need to know the IP address of the server and have a valid password. To power on the server with the ipmitool, follow these steps:
1. Open a terminal program.
2. Power on your server with the ipmitool:
ipmitool -I lanplus -H bmc_ip_address -P ipmi_password power on
3. Activate your IPMI console:
ipmitool -I lanplus -H bmc_ip_address -P ipmi_password sol activate
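When these commands are run from a script, the argument list can be built once and reused. The following Python sketch only assembles the same two command lines shown above and does not contact a BMC. The function names are hypothetical, and the address and password in the usage note are placeholders.

```python
import shlex


def ipmitool_cmd(bmc_ip_address, ipmi_password, *subcommand):
    """Build the ipmitool argument list for a lanplus session."""
    return ["ipmitool", "-I", "lanplus",
            "-H", bmc_ip_address,
            "-P", ipmi_password,
            *subcommand]


def power_on_and_console(bmc_ip_address, ipmi_password):
    """Return the two commands from the steps above, in order."""
    return [
        ipmitool_cmd(bmc_ip_address, ipmi_password, "power", "on"),
        ipmitool_cmd(bmc_ip_address, ipmi_password, "sol", "activate"),
    ]


def as_shell_line(cmd):
    """Render an argument list as a single shell-quoted line for display."""
    return shlex.join(cmd)
```

For example, as_shell_line(ipmitool_cmd("192.0.2.10", "secret", "power", "on")) produces the power on command line for a BMC at the placeholder address 192.0.2.10. Each argument list can be passed to subprocess.run to execute it.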
Petitboot is a kexec-based bootloader used by IBM POWER9 systems for bare-metal installs on the 8335 servers.
After the POWER9 system powers on, the petitboot bootloader scans local boot devices and network interfaces and returns a list of the boot options that are available to the system. If you are using a static IP or if you did not provide boot arguments in your network boot server, you must provide the details to petitboot. You can configure petitboot to find your boot options with the following instructions:
https://www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabppetitbootadvanced.htm
You can edit petitboot configuration options, change the amount of time before Petitboot automatically boots, etc. with these instructions:
https://www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabppetitbootconfig.htm
After you select to boot the ISO media for the Linux distribution of your choice, the installer wizard for that Linux distribution walks you through the steps to set up disk options, your root password, time zones, and so on.
You can read more about the petitboot bootloader program here:
https://www.kernel.org/pub/linux/kernel/people/geoff/petitboot/petitboot.html
This guide helps you install Linux on a Power Systems server.
Overview
Use the information found in http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwkickoff.htm to install Linux on a non-virtualized (bare metal) IBM Power LC server.
Date | Description |
08/15/2022 | Updated for OP910.70 |
12/01/2021 | Updated for OP910.60 |
09/28/2021 | Updated for OP910.51 |
01/25/2021 | Updated for OP910.50 |
05/13/2020 | Updated for OP910.40 |
12/04/2019 | Added warning for need to update to OP910.22 first before updating to the new level |
04/05/2019 | Updated for OP910.31 |
02/13/2019 | Updated for OP910.30 |
09/25/2018 | Updated for OP910.25 |
08/14/2018 | Updated for OP910.24 |
06/22/2018 | Updated for OP910.22 |
05/18/2018 | Updated for OP910.21, added driver level for Tesla CUDA driver |
04/18/2018 | Updated for OP910.20 |
03/22/2018 | Corrections for OP910.10 |
01/18/2018 | Updated for AC922 only for OP910.10 |
12/22/2017 | New for LC server OP910.00 release |