Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 5.0.x applies for all supported platforms.
Problems fixed in IBM Spectrum Scale 5.0.3.3 [September 12, 2019]
- Item: IJ18076
- Problem description: A race between the thread handling dm_create_session and an mmchmgr command caused the new DMAPI session id not to be sent to the new GPFS configuration manager node. When dm_destroy_session was later called to destroy that session id, it failed with EINVAL because the new configuration manager node did not know about that session id.
- Work around: None
- Problem trigger: Running mmchmgr when dm_create_session is in progress.
- Symptom: dm_destroy_session fails with EINVAL error.
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS file systems
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18605
- Problem description: On a RHEL 7.6 node, with GPFS versions 4.2.3.13 or higher or 5.0.2.2 or higher, when the kernel is upgraded to version 3.10.0-957.14.1 or higher, the node may encounter an I/O error when accessing a renamed directory. For example, on the RHEL 7.6 node: cd dir1; on the other cluster node, rename the directory to the new name: mv dir1 dir2. Then, dir2 cannot be accessed on the RHEL 7.6 node.
- Work around: On the affected node, exit from the directory with the old name (for example, with "cd .."), then access the directory again with "ls -ld"; after that, the directory can be accessed under its new name (see the example after this item).
- Problem trigger: This issue affects customers running IBM Spectrum Scale V4.2.3.13 or higher or 5.0.2.2 or higher in the following scenario: the RHEL 7.6 node kernel is upgraded to version 3.10.0-957.14.1 or higher; a directory is accessed on the RHEL 7.6 node (for example, cd dir1); the directory is renamed from another cluster node (for example, mv dir1 dir2); an I/O error then occurs when the renamed directory is accessed on the RHEL 7.6 node.
- Symptom: I/O error
- Platforms affected: All RHEL 7.6 OS environments with kernel version 3.10.0-957.14.1 or higher
- Functional Area affected: All Scale Users
- Customer Impact: High
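- Example: An illustrative shell sequence for the workaround above, using the directory names from the description (dir1 renamed to dir2 from another node):
      cd ..          # leave the directory that was entered under its old name
      ls -ld dir2    # look the directory up under its new name
      cd dir2        # the renamed directory can now be accessed again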
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18078
- Problem description: A race between a thread handling node recovery and a thread trying to generate a DMAPI event caused an assert because of the status change of the DMAPI event.
- Work around: None
- Problem trigger: Node failure and threads accessing migrated files.
- Symptom: GPFS daemon failure
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS filesystem
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18100
- Problem description: When a filesystem is being quiesced (for example create or delete of a snapshot), a type of rpc labelled "RecLockRetry" can become hung and create a deadlock between nodes. Messages similar to these can be found in the output of the "mmdiag --waiters" command - RemoteRetryThread: on ThCond, reason 'RPC wait' for RecLockRetry - Msg handler RecLockRetry: for In function "RecLockMessageHandler(RecLockRetry)", call to "kxSendFlock"
- Work around: None
- Problem trigger: Applications running on different nodes in the cluster are contending for conflicting advisory (fcntl) locks on the same file and one of these is releasing its lock at a time when the filesystem is being quiesced (for example, create or delete of a snapshot).
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18102
- Problem description: Recovery from previous failure crashes in the kernel due to the same memory being deallocated twice. The linux log reporting the BUG will have these 3 characteristic calls in the stack: [317959.895506] [<001fffff80085cb2>] cxiFreePinned+0x72/0xc0 [mmfslinux] [317959.895515] [<001fffff80087f9c>] cxiFcntlLock+0x54c/0x718 [mmfslinux] [317959.895709] [<001fffff812416ac>] _Z17RecLocModuleResetj+0x1c0/0x358 [mmfs26]
- Work around: None
- Problem trigger: A previous abnormal GPFS daemon shutdown (crash)
- Symptom: Abend/crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18315
- Problem description: audit.log is showing a deny regarding logrotate for the following logs: cesdr-log:/var/adm/ras/mmcesdr.log fileaudit-log:/var/adm/ras/mmaudit.log mmprotocoltrace:/var/adm/ras/mmprotocoltrace.log mmwatch-log:/var/adm/ras/mmwatch.log mmwfclient-log:/var/adm/ras/mmwfclient.log msgqueue-log:/var/adm/ras/mmmsgqueue.log tswatchmonitor-log:/var/adm/ras/tswatchmonitor.log watchfolders-log:/var/adm/ras/mmwf.log
- Work around: Manually fix SELinux security file context to allow logrotate to access the above log files.
- Problem trigger: logrotate activity
- Symptom: Error output/message
- Platforms affected: Linux with SELinux enabled in enforcing mode
- Functional Area affected: All Scale Users with components that use the Linux logrotation utility.
- Customer Impact: Suggested: has little or no impact on customer operations
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18105
- Problem description: LOGASSERTFAILED: CP->GET_STCP() != 0 SHHASHS.C:1689
- Work around: None
- Problem trigger: Node failure and file deletions
- Symptom: GPFS daemon failure
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18309
- Problem description: The disk size (number of sectors) was uninitialized after the disk was reopened. This caused an I/O error when writing the disk descriptor.
- Work around: None
- Problem trigger: Stop and start the disk.
- Symptom: IO Error
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18109
- Problem description: The ACL file contains all the Access Control Lists for the file system. In order to maximize parallelism among nodes as they access ACLs, the blocks of this file can be cached on each node. If, during the process of reclaiming unused ACL space, deletions are broadcast to nodes, these messages may include an updated count of the number of ACLs that exist, and the nodes receiving this message need to update the header that resides in block 0 of the ACL file. A problem exists where nodes produce mismatched replicas for this block (inode 4, block 0), resulting in messages like "Error in inode 4 snap 0: Record block 0 has mismatched replicas". This problem may also cause ACL garbage collection to run too frequently.
- Work around: None
- Problem trigger: As unique ACLs are created in the file system, the ACL file grows, and as the number doubles, garbage collection (reclaiming space no longer used) runs automatically (so adding ACLs can be one trigger). Other conditions that contribute to making this issue visible include having some metadata disks down at the time that the collection happens to be running. (This can result in subsequent mmfsck runs finding mismatched replicas for block 0 of the ACL file, which is inode 4.)
- Symptom: Error output/message
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18110
- Problem description: In a file system, a directory is being deleted on a node; from another node in the cluster, some operation generates an asynchronous attempt to obtain a conflicting lock on the same directory. This may cause a kernel crash in the pathname lookup procedure on the first node. This is a timing issue and difficult to hit.
- Work around: None
- Problem trigger: A large workload of recursive directory deletion.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18144
- Problem description: Poor readdir performance on a shared directory after workload caused local node to have lock on the entire directory.
- Work around: Avoid multiple "ls" on a shared directory.
- Problem trigger: Performing repeated readdir and lookup on a shared directory.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18165
- Problem description: If the clock jumps forward more than 160 seconds, then the internal command "tsctl nQstatus -Y" will return a status of "unresponsive". This will trigger the gpfs_unresponsive event and cause CES IPs to fail over to other nodes.
- Work around: None. The transient "unresponsive" state is self corrected within a short period of time (10 seconds).
- Problem trigger: System clock jumps backward or forward by more than 160 seconds.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments
- Functional Area affected: GPFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18136
- Problem description: Deadlock on SGManagementMgrDataMutex could occur during buffer steal.
- Work around: None
- Problem trigger: Buffer steal triggered due to running low on free buffers or a change in token manager assignment.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18137
- Problem description: GPFS daemon assert: Assert exp(i < nServers). This could happen when the number of manager nodes in the cluster is more than the maxTokenServers configuration setting which defaults to 128.
- Work around: Either reduce the number of manager nodes or increase the maxTokenServers setting (see the example after this item).
- Problem trigger: The number of manager nodes exceeds the maxTokenServers setting.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
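- Example: A minimal sketch of the second workaround option, assuming the setting is changed with mmchconfig; the value 256 is only a placeholder and should be at least the number of manager nodes:
      mmchconfig maxTokenServers=256
  A restart of the GPFS daemon may be needed before the new value takes effect.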
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18171
- Problem description: The mmdf command hangs with a long waiter waiting for free space recovery, then in turn blocking subsequent conflicted commands.
- Work around: None
- Problem trigger: Run mmdf command on FPO file system while there are I/O workloads in progress.
- Symptom: mmdf hang.
- Platforms affected: ALL Operating System environments except AIX and Windows
- Functional Area affected: FPO
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18238
- Problem description: tslspool34 is called by the GPFSPool zimon sensor, which may cause cluster-wide token contention on the root directory of a file system
- Work around: None
- Problem trigger: tslspool34 is called by the GPFSPool zimon sensor, which runs once every 5 minutes. In 5.0.3, the zimon sensor runs on all nodes, which causes significant token contention. Since 5.0.3.1, the zimon sensor runs only on restricted nodes.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: perfmon (Zimon)
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18138
- Problem description: FSSTRUCT error could be issued during file creation if file system runs out of disk space for metadata.
- Work around: None
- Problem trigger: Running out of metadata disk space
- Symptom: Error output/message
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18316
- Problem description: When mmfsck is run on a file system having corruption in the inode allocation map the file system manager node of the file system being scanned can assert with - logAssertFailed: !"SeverityNone for FSTATUS_UNFIXED"
- Work around: Disable the fsck patch queue feature on all nodes with the command "mmdsh -N all mmfsadm test fsck usePatchQueue 0", then rerun the mmfsck command (see the example after this item).
- Problem trigger: This issue affects customers running mmfsck on IBM Spectrum Scale V4.2.3 or higher where the inode allocation map contains a corruption in which a reserved inode is marked as free.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK
- Customer Impact: Suggested
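- Example: An illustrative sequence for the workaround; "fs1" is a placeholder file system name:
      mmdsh -N all mmfsadm test fsck usePatchQueue 0
      mmfsck fs1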
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18174
- Problem description: Long waiters waiting for "RDMA read/write completion fast" because, in some cases, RDMA requests pending in an internal GPFS list may not be processed
- Work around: None
- Problem trigger: On a heavily loaded NSD server or GSS/ESS server with verbsRdma enabled, RDMA requests may be queued in a list if the current in-flight RDMA request count of a connection exceeds verbsRdmasPerConnection. Under mutex contention, these queued requests may not be processed when the RDMA connection is closed or reconnected, resulting in long waiters.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18175
- Problem description: When UID remapping is enabled IO performance is reduced due to the incorrect caching of the supplementary gids.
- Work around: None
- Problem trigger: Remote cluster mount with UID remapping enabled
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Operating System environments
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18139
- Problem description: Currently AFM has a default 8-second wait time before it times out a mount request for the remote export that is being mounted. In some cases there might be a real network delay for which the customer might want to increase this 8-second value; hence the need for a separate configuration option.
- Work around: Currently there is a daemon-level tunable - "mmfsadm afm mountWaitTimeout" - which needs to be enabled on the affected AFM gateway node (see the example after this item). But there is no way to know its configured value or the nodes that have the parameter defined, so a global tunable is being introduced, which makes it easier to control the parameter and the nodes on which the value needs to be tuned.
- Problem trigger: Perform the first IO to an AFM fileset with a large delay between the cache/primary gateway node and the home/secondary NFS serving node.
- Symptom: Network Performance.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM caching and AFM DR
- Customer Impact: Suggested.
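- Example: An illustrative invocation of the existing daemon-level tunable on an affected AFM gateway node; the value 30 (seconds) is only a placeholder and the exact argument syntax may vary by release:
      mmfsadm afm mountWaitTimeout 30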
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18177
- Problem description: A file was mapped with shmat() and an attempt was made to read past end of file (EOF). This causes a page fault. The GPFS page fault handler generated a kernel panic when it found that the kernel buffer it was trying to transfer to a user buffer was not valid. The kernel crashes with the following error: kernel panic, assert !lcl._wasMapped
- Work around: None
- Problem trigger: Reading last block of a file mapped with shmat()
- Symptom: Kernel panic
- Platforms affected: AIX
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18178
- Problem description: A hard-coded path to a tool does not match this Linux distribution.
- Work around: Create a symbolic link from /sbin/ibportstate to /usr/sbin/ibportstate (see the example after this item)
- Problem trigger: SuSE or Debian Linux and IB networking
- Symptom: Unexpected Results/Behavior
- Platforms affected: SuSE Linux
- Functional Area affected: System Health
- Customer Impact: Suggested: has little or no impact on customer operation
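- Example: A minimal sketch of the workaround, assuming the tool is installed at /usr/sbin/ibportstate and GPFS expects it at /sbin/ibportstate:
      ln -s /usr/sbin/ibportstate /sbin/ibportstate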
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18180
- Problem description: Kernel crashed due to checking for the wrong lock status of a mapped file.
- Work around: None
- Problem trigger: Reading mapped file.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18193
- Problem description: When client nodes are at a level containing the optimization for file creation in an empty or small directory while the token managers are not, there is a chance that certain token revokes will enter into infinite retries
- Work around: In a mixed cluster or multicluster environment, only let nodes with the optimization code play the manager role.
- Problem trigger: The client nodes have the optimization for file creation in an empty or small directory while the token managers do not, and nodes from multiple client clusters try to access the same empty or small directory in which files are being created.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18321
- Problem description: When offline mmfsck in read-only mode is run on a file system with a down NSD, it will output an incorrect error message suggesting that the user restart the file system check in read-only mode again
- Work around: Ignore the message and bring the down disks back online; if the disks cannot be brought back online, they will have to be deleted in order to run the file system check.
- Problem trigger: This issue will affect customers running mmfsck on IBM Spectrum Scale V5.0.2 or higher when running mmfsck on a file system having a down NSD.
- Symptom: Incorrect error message
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: Suggested.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18182
- Problem description: mmdeldisk or mmdf commands hang and wait on free space recovery. This happens when a node doesn't relinquish all the block allocation regions it owned during the process of unmounting a file system.
- Work around: Restart Spectrum Scale service on file system manager node.
- Problem trigger: Unmount the file system from a node.
- Symptom: File system or file operations hang.
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18140
- Problem description: Deadlock while changing the gateway node attribute using the mmchnode --gateway/--nogateway command.
- Work around: None
- Problem trigger: Gateway node change using the mmchnode command.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ17825
- Problem description: Because of a race between the handling of memory mapping and normal reading of the same file a read from the last block of that mapped file returned wrong data.
- Work around: None
- Problem trigger: Multiple processes reading last block of a file and memory mapping of that same file.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18185
- Problem description: GPFS asserts during mmcheckquota command when it encounters invalid fileset ids in the quota file.
- Work around: None
- Problem trigger: Invalid fileset ids, likely originating from deleted files, were erroneously inserted into the quota file causing an assertion in the mmcheckquota command.
- Symptom: GPFS terminates during mmcheckquota command due to assertion.
- Platforms affected: ALL
- Functional Area affected: Quota
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18142
- Problem description: A kernel bugcheck with code PAGE_FAULT_IN_NONPAGED_AREA (50) can occur during mmmount on Windows 10 or Windows Server 2016.
- Work around: None
- Problem trigger: Running mmmount on Windows 10 or Windows Server 2016.
- Symptom: Abend/Crash
- Platforms affected: Windows/x86_64 only, specifically Windows 10 and Windows Server 2016.
- Functional Area affected: Windows mount.
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18147
- Problem description: There is a minor performance degradation in queueing of AFM IO request from the application node to the gateway node due to an inefficient algorithm for identifying the correct AFM gateway node.
- Work around: There is no serious impact without the fix, only slower AFM IO performance.
- Problem trigger: Performing IO to a AFM fileset from an application.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux and AIX OS environments (AIX as application nodes only)
- Functional Area affected: AFM caching and AFM DR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18216
- Problem description: In a large cluster, file creation times can take as much as 15 seconds to complete on some nodes. This is because of the high default value of maxActiveIallocSegs which causes some nodes to use more inode allocation segments leading to starvation in other nodes.
- Work around: Reduce the maxActiveIallocSegs configuration parameter value (see the example after this item).
- Problem trigger: nNodes(local + remote)*maxActiveIallocSegs > available inode allocation records in a fileset (ie. nRegions)
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High Importance
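- Example: A minimal sketch of the workaround, assuming the parameter is lowered with mmchconfig; the value 1 is only a placeholder:
      mmchconfig maxActiveIallocSegs=1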
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18155
- Problem description: If multiple file systems are granted access to a remote cluster, the mmauth show -Y output for that cluster appears on multiple lines, like the regular output.
- Work around: None
- Problem trigger: A remote cluster with access to multiple file systems. The mmauth show -Y output is not in standard format.
- Symptom: Output format
- Platforms affected: ALL
- Functional Area affected: GUI/System Health
- Customer Impact: Suggested: has little or no impact on customer operation
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18191
- Problem description: When debugging is enabled for mmbackup, tsqosnice is called to query QOS and then tsqosnice may terminate with a stack smashing error.
- Work around: Do not use mmbackup debugging or remove the call to tsqosnice from the mmbackup script.
- Problem trigger: See problem description.
- Symptom: Stack smashing error message
- Platforms affected: Linux
- Functional Area affected: QOS, mmbackup
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18333
- Problem description: logAssertFailed: ofP->inodeLk.get_lock_state() != 0 || ofP->mnodeLk.get_lock_state() != 0 || ofP->metadata.mnodeFast.fastpathIsEnabled(0x04000000) && ofP->metadata.mnodeFast.fastpathGetCount() > 0
- Work around: Disable this logAssert with 'mmchconfig disableAssert' on releases that have the 'disableAssert' configuration option
- Problem trigger: Mnode token revoke while GPFS is in the fast path of file read/write
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18217
- Problem description: A program running against a GPFS file system with TCT installed can fail with an ENOENT error.
- Work around: None
- Problem trigger: On a TCT enabled file system, if a file is unlinked and later an attempt is made to access the file, it can result in an ENOENT error.
- Symptom: ENOENT error
- Platforms affected: Linux Only
- Functional Area affected: TCT
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18192
- Problem description: GPFS has a specified but unenforced limit of 256 CPUs. It has an internal table limit at 1024 CPUs. If a large system presents to LINUX a possible configuration of more than 1024 CPUs then GPFS will generate a LOGASSERT for the unexpected configuration.
- Work around: The workarounds are to revert the GPFS version or to configure the system firmware to present fewer CPUs to LINUX. Note that 'lscpu' and related commands won't show the possible limit that LINUX and GPFS must be ready for. The easiest way to view and vet what the system is presenting to LINUX and GPFS is to examine 'cat /sys/devices/system/cpu/possible' and verify the value is < 1024 without this fix and < 1536 with this fix (see the example after this item). It is very system specific how to partition and/or re-configure a system to present a lower limit. Consult your system documentation.
- Problem trigger: This triggers if 'cat /sys/devices/system/cpu/possible' > 1024 without this fix or > 1536 with this fix. The code was introduced in 5.0 . A system that violated this limit before release 5.0 would not have encountered this limit.
- Symptom: The system will LOGASSERT with the following message: "GPFS daemon crash logAssertFailed: ucNumCpus <= 1024"
- Platforms affected: Platforms affected are any large system, such as E980, that can be provisioned for more than 1024 CPUS
- Functional Area affected: Per-cpu I/O counters; All users with this feature / functional area.
- Customer Impact: Low - infrequent encounter of triggering system
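- Example: An illustrative check of the value described in the workaround (the thresholds come from the description above):
      cat /sys/devices/system/cpu/possible
  Output such as "0-1535" shows the CPU range LINUX may present; the upper bound should stay below 1024 without this fix and below 1536 with it.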
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18220
- Problem description: AFM gateway node gets remote assert while reading the data from the AFM home cluster as it does not block the filesystem quiesce
- Work around: None
- Problem trigger: Read operation of the uncached files on the AFM caching filesets.
- Symptom: Abend/Crash
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18218
- Problem description: The ganesha NFS server can fail an NFS operation with EPERM when TCT is installed on the file system.
- Work around: None
- Problem trigger: During an NFS operation on a fd, if the connected dentry to a file cannot be obtained, the operation fails.
- Symptom: EPERM error
- Platforms affected: Linux Only
- Functional Area affected: NFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18219
- Problem description: There were two CES nodes, each a member of two CES groups. There were also four CES IPs, two of them assigned to each of the two CES groups. The expectation was that each node would get one IP from each group (two IPs per node, one per group). That was not always the case; sometimes one node got one IP and the other got three.
- Work around: IPs can be moved manually at any time to a node using "mmces address move --ces-ip xxx --ces-node yyy" (see the example after this item)
- Problem trigger: The even-coverage IP balancing is done on a "per-group" basis among all nodes assigned to the same group. Additionally, the number of already hosted IPs (members of the current or other groups) is considered in order to assign new IPs to the nodes with the fewest assigned IPs. That logic did not always work as intended because, during the startup phase, the nodes may become healthy at different points in time. This impacts the IP movements because they start as soon as the first node becomes healthy. If some nodes become healthy at a later time, IP rebalancing is done to give them IP addresses as well. That logic did not work under all circumstances, so that sometimes a misbalanced but stable state remained.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux OS environments
- Functional Area affected: CES
- Customer Impact: little impact on customer operation
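- Example: An illustrative manual IP move using the command from the workaround; the IP address and node name are placeholders:
      mmces address move --ces-ip 192.0.2.10 --ces-node cesnode2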
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18221
- Problem description: On a file system without replication, it is possible for file system to panic with error 218 without additional information to help identify the disk that caused the error.
- Work around: None
- Problem trigger: Disk IO error
- Symptom: Cluster/File System Outage
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18075
- Problem description: When the DMAPI function dm_getall_disp() is called with bufLen = INT_MAX, GPFS returns a buffer of smaller size due to integer overflow. This may cause memory corruption when moving data from this small buffer to a user buffer which is larger than the buffer allocated.
- Work around: None
- Problem trigger: DMAPI api calls that provide a buffer length of INT_MAX
- Symptom: Node hangs or crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS filesystem
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18475
- Problem description: Unexpected GPFS daemon assert during file, directory, or symlink create operation.
- Work around: None
- Problem trigger: File system configuration changes such as enabling/disabling encryption
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18525
- Problem description: The file system which is used for dumping data is monitored; if it fills up, the GPFS component in mmhealth shows Failed, thus triggering CES failover
- Work around: Have DataStructureDump point to a path / file system with enough free space (see the example after this item)
- Problem trigger: The DataStructureDump path points to an almost full file system
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: System Health CES
- Customer Impact: High
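- Example: A minimal sketch of the workaround, assuming the dump location is changed with the dataStructureDump configuration option; the path is only a placeholder:
      mmchconfig dataStructureDump=/largefs/gpfsdumps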
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18535
- Problem description: The mmdiag --afm command was executed to fill in the data for the AFM fileset status; the home exported path was NULL because the fileset was in deleted status, and this caused an assert.
- Work around: None
- Problem trigger: Deleting the fileset.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18541
- Problem description: If you were using Spectrum Scale 5.0.2.x or older and had temporary connectivity issues in the past, which happened during a call home upload, this could break all further uploads via the ECC Client, even if you upgrade to a newer Spectrum Scale version.
- Work around: On the affected call home node, remove the stale lock file with "rm -f /var/mmfs/callhome/log/ecc/rsENCallECCLock.dat" (see the example after this item)
- Problem trigger: You were using Spectrum Scale 5.0.2.x or older and had temporary connectivity issues in the past which happened during a call home upload.
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: callhome
- Customer Impact: Suggested
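- Example: An illustrative cleanup of the stale lock file; run it (directly or via ssh) on the affected call home node:
      rm -f /var/mmfs/callhome/log/ecc/rsENCallECCLock.dat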
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18566
- Problem description: To calculate the available space in an SMB share, Samba queries the quotas for the uid of the user and the gid of the user's primary group. The assumption here is that new files will be created with the user as the owning user and the user's primary group as the owning group. This assumption is not correct for directories with the set-group-ID bit set. In that case, the owning group from the directory will be applied to newly created files.
- Work around: None
- Problem trigger: Directories with the set-group-ID bit set will trigger this issue.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments supporting CES.
- Functional Area affected: SMB
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18476
- Problem description: The tsspectrumdiscover application does not send events to external kafka sink due to unsuccessful processing of GPFS Kafka messages
- Work around: None
- Problem trigger: Running tsspectrumdiscover application post Spectrum Scale 5.0.3.0
- Symptom: tsspectrumdiscover exit with error
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Watch Folder
- Customer Impact: Low
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18477
- Problem description: Security hardening for the 'ts' commands in /usr/lpp/mmfs/bin/ .
- Work around: Remove the setuid from the files in the /usr/lpp/mmfs/bin directory.
- Problem trigger: Executing commands with certain undocumented input.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: admin commands
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18502
- Problem description: In rare cases, unmounting a GPFS file system may cause "kernel BUG at dcache.c:966 - dentry still in use (-128)" on linux-3.12. The race happens between shrink_dcache_for_umount() and token revoke (or gpfsSwapd)
- Work around: None
- Problem trigger: Unmounting a GPFS file system; the kernel panic is a very rare case.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments linux-3.12
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18548
- Problem description: The existing auto-recovery code does not handle descriptor only disks correctly and treats them as disks saving user data or metadata.
- Work around: Disable auto-recovery.
- Problem trigger: If auto-recovery is enabled and descOnly disks are configured in the cluster.
- Symptom: If auto-recovery is enabled and descOnly disks are configured in the cluster, when a node fails, auto-recovery will treat descOnly disks as data disks and might cause data replication downgrade.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: FPO
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18549
- Problem description: Restriping compressed files can hit an assert on "wa" lock mode. This problem can only happen during restriping of compressed files while those files are being truncated.
- Work around: Rerun the restripe command
- Problem trigger: Truncating compressed files while restripe is in progress.
- Symptom: Abend on restripe process
- Platforms affected: All
- Functional Area affected: File compression and restripe
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18677
- Problem description: The following error message appears in mmfs.log: "Unexpected data in message. Header dump: XXXXXXXX XXXX", and the daemon may crash because LOGSHUTDOWN is called
- Work around: None
- Problem trigger: Bad network and reconnect is attempted
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18680
- Problem description: Hit the following assert after changing cipherList to AUTHONLY without restarting daemon: logAssertFailed: secSendCoalBuf != __null && secSendCoalBufLen > 0
- Work around: None
- Problem trigger: cipherList is changed from a supported algorithm to AUTHONLY without restarting daemon and reconnect is attempted
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18590
- Problem description: Advisory locks are recorded in the Linux kernel on the local node via file_lock structures, and GPFS maintains an additional structure to accomplish locking across nodes. There are times when a blocked lock waiter is reset by GPFS during the daemon cleanup process; the inode object is not freed and is left in the slab cache. Later, GPFS may access the stale inode structure data, which causes a kernel crash.
- Work around: None
- Problem trigger: A large fcntl locking workload and daemon cleanup process.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18552
- Problem description: Running "/usr/lpp/mmfs/bin/mmfsadm vfsstats" hits the reported segmentation fault due to NULL pointer dereference.
- Work around: None
- Problem trigger: Running "/usr/lpp/mmfs/bin/mmfsadm vfsstats"
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18591
- Problem description: The NFS monitor checks the health state of a running NFS instance periodically. Sometimes the NFS service does not react to some "alive" check commands, and that is interpreted as a potential "hung" state. Based on the configuration in the mmsysmonitor.conf file, either a failover or just a warning is then triggered.
- Work around: The behavior for a detected potential "hung" state can be customized with the flag 'failoverunresponsivenfs' in the mmsysmonitor.conf file, section [nfs] (see the example after this item). The meaning of the flag values is: "true" = set an ERROR event (nfs_not_active) if NFS does not respond to NULL requests and has no measurable NFS operation activity; "false" = set a DEGRADED event (nfs_unresponsive) if NFS does not respond to NULL requests and has no measurable NFS operation activity
- Problem trigger: In some cases, high I/O load led to the situation that NFS v3 and/or v4 NULL requests failed, and a following internal statistics check reported no activity with respect to the number of internal NFS operations. These checks are done within a time span of several seconds to a minute. In fact, the system might still be functional, and the internally detected "unresponsive" state might be only temporary, so that a failover would not be advised in this case. The monitor interprets the "unresponsiveness" as a potential "hung" state and triggers either a failover or a warning, depending on the configuration settings.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: High
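- Example: An illustrative mmsysmonitor.conf fragment for the flag described in the workaround; the exact key/value syntax is an assumption, and the chosen value depends on the desired failover behavior:
      [nfs]
      failoverunresponsivenfs = false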
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18678
- Problem description: Accessing the .snapshots snaplink directory generates an I/O error, while creating or deleting snapshots for the same file system or fileset.
- Work around: Stop the process accessing the .snapshots directory after getting I/O error, then retry the access to it again.
- Problem trigger: This problem could be triggered by snapshot create and deletion operations.
- Symptom: I/O error
- Platforms affected: All Linux OS environments with kernel versions between 3.10.0-957.21.2 and 4.x.
- Functional Area affected: Snapshots
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ17825 IJ18075 IJ18076 IJ18078 IJ18100 IJ18102 IJ18105 IJ18109 IJ18110 IJ18136 IJ18137 IJ18138 IJ18139 IJ18140 IJ18142 IJ18144 IJ18147 IJ18155 IJ18165 IJ18171 IJ18174 IJ18175 IJ18177 IJ18178 IJ18180 IJ18182 IJ18185 IJ18191 IJ18192 IJ18193 IJ18216 IJ18217 IJ18218 IJ18219 IJ18220 IJ18221 IJ18238 IJ18309 IJ18315 IJ18316 IJ18321 IJ18333 IJ18475 IJ18476 IJ18477 IJ18502 IJ18525 IJ18535 IJ18541 IJ18548 IJ18549 IJ18552 IJ18566 IJ18590 IJ18591 IJ18605 IJ18677 IJ18678 IJ18680.
Problems fixed in IBM Spectrum Scale 5.0.3.2 [July 18, 2019]
- Problem description: There will be a long waiter like below: Waiting 8349.1305 sec since 00:03:05, monitored, thread 133060 AcquireBRTHandlerThread: on ThCond 0x3FFE74012E78 (MsgRecordCondvar), reason 'RPC wait' for tmMsgBRRevoke on node 192.168.117.82
- Work around: No
- Problem trigger: A race condition between handling an inbound connection and a node joining
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High IJ17133
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlslicense command with the -Y option displays product edition information for all nodes in the list based on the local node's information. This is incorrect. It should display the edition for the local node only and "-" for all other nodes. All of the other options on this command display only the local edition information as well.
- Work around: Ignore the edition information for any node that is not the local node.
- Problem trigger: Just running the command on a cluster with 2 or more nodes
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: Suggested IJ17136
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlsfileset command with the "-i -d" options could run into an infinite loop when there is not enough free memory and there are not enough indirect block descriptors in the system. In addition, a similar loop issue could happen during the mmrestripefs, snapshot deletion, and ACL garbage collection processes.
- Work around: Increase maxFilesToCache to allow more indirect block descriptors in the cache (see the example after this item). Also make sure there is enough free physical memory in the system.
- Problem trigger: Run the mmlsfileset -i -d, snapshot delete, or mmrestripefs commands, or enable ACLs, when there is not enough free physical memory in the system and the maxFilesToCache parameter has a default or low setting.
- Symptom: The mmlsfileset, snapshot delete, and mmrestripefs commands hang, and other mm* commands cannot proceed either. The background ACL garbage collection thread runs in a loop if ACL is enabled.
- Platforms affected: All
- Functional Area affected: mmlsfileset, mmrestripefs, snapshot delete commands and ACL garbage collection process.
- Customer Impact: Critical IJ16674
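- Example: A minimal sketch of the workaround, assuming maxFilesToCache is raised with mmchconfig; the value 100000 is only a placeholder and must fit within the available memory:
      mmchconfig maxFilesToCache=100000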
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfs.log file may contain an entry like this: "[E] sdrServ: Communication error on socket /var/mmfs/mmsysmon/mmsysmonitor.socket, [err 79] Can not access a needed shared library"
- Work around: N/A. The reported error code "79" is internally used, and means "connection refused".
- Problem trigger: No recreate procedure is available for the reported issue. The underlying issue was that GPFS internal error codes were not mapped to Linux system codes, which gave the wrong message text when printing the corresponding system message text for such a code.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: System Health
- Customer Impact: has little or no impact on customer operation IJ16707
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: File system unmounted when an application overwrites data blocks
- Work around: None
- Problem trigger: Overwriting data block followed by disk down in the file system.
- Symptom: unmounted
- Platforms affected: All
- Functional Area affected: gpfs core
- Customer Impact: High Importance IJ16712
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: On a RHEL 7.6 node, with supported GPFS versions 4.2.3.13 or higher and 5.0.2.2 or higher, when the kernel is upgraded to version 3.10.0-957.19.1 or 3.10.0-957.21.2 (after applying RHBA-2019:1337) or higher, the node may encounter a kernel crash while running I/O operations.
- Work around: Disable SELinux
- Problem trigger: An inconsistency between the GPFS kernel portability layer and the kernel level
- Symptom: Abend/Crash
- Platforms affected: RHEL7.6 with kernel 3.10.0-957.19.1 or higher
- Functional Area affected: All
- Customer Impact: High Importance IJ16783
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A user may create a file system with an unhealthy number of allocated inodes in the root fileset. This can cause the inode allocation map to become suboptimal when creating further independent filesets that do not have as many allocated inodes. The only way to reformat the inode allocation map is to recreate the file system.
- Work around: Recreate the file system with favorable inode allocation map parameters (see the example after this item).
- Problem trigger: Create file system with very large NumInodesToPreallocate.
- Symptom: Performance Impact/Degradation
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ16716
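- Example: A minimal sketch of the workaround, assuming the file system is recreated with mmcrfs using the --inode-limit MaxNumInodes[:NumInodesToPreallocate] option; the device name, stanza file, and values are placeholders:
      mmcrfs fs1 -F nsd.stanza --inode-limit 4000000:500000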
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Raising the fsstruct_fixed event as stated in the documentation will not work and returns an error in version 5.0.2-x instead.
- Work around: Include the file system name two times as arguments of mmsysmonc to raise fsstruct_fixed (see the example after this item)
- Problem trigger: Spectrum Scale Version 5.0.2-x is installed
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: System Health
- Customer Impact: Suggested: has little or no impact on customer operation IJ16782
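- Example: An illustrative invocation of the workaround; "gpfs0" is a placeholder file system name and the exact mmsysmonc argument order is an assumption:
      mmsysmonc event filesystem fsstruct_fixed gpfs0 gpfs0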
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmlslicense --capacity fails to report the correct disk size
- Work around: Manually get the disk size from the blockdev command (see the example after this item).
- Problem trigger: Underlying device names are not found on all NSD servers
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Admin Commands
- Customer Impact: Suggested: has little or no impact on customer operation IJ16678
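- Example: An illustrative manual query of a disk's size with the blockdev command; the device name is a placeholder:
      blockdev --getsize64 /dev/sdX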
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: There are 3 problems: 1. if an upload of a file >2GB crashes, this blocks all further call home uploads that are not service ticket-related forever; 2. the call home feature of resending failed scheduled uploads does not work; 3. if any of the call home group members crashed during the data collection, mmsysmonitor.log on the group master will have a persistent repeating error entry in its log
- Work around: For the 3 aforementioned issues: 1. in LOCKINFO (/var/mmfs/callhome/log/ecc/rsENCallECCLock.dat) change FILE_SIZE to a value which is less than 2G; 2. none; 3. on the call home master node, delete the contents of /callhome/incomingFTDC2CallHome
- Problem trigger: 1. upload of files >2GB which are not service ticket-related; 2. unstable connection to ECuRep; 3. call home group members crashing during the call home scheduled data collection
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Callhome
- Customer Impact: Suggested IJ17147
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A node in the home cluster hit the following assertion when a remote node joins the cluster: 2019-04-16_14:55:37.346+0200: [X] logAssertFailed: (nodesPP[nidx] == NULL || nodesPP[nidx] == niP)
- Work around: No
- Problem trigger: remote node joins and leaves the cluster
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ16676
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Trying to clear the READONLY attribute of an immutable file through SMB succeeded within the retention period.
- Work around: No
- Problem trigger: A Windows SMB client is trying to clear the READONLY attribute on an immutable file that has not expired.
- Symptom: Error output/message
- Platforms affected: Windows Only
- Functional Area affected: SMB/Immutability
- Customer Impact: High Importance IJ17524
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When an encryption policy references a key identifier that is longer than 64 characters, policy application fails.
- Work around: No
- Problem trigger: Create an encryption policy that references a key identifier which is longer than 64 characters and attempt to apply the policy
- Symptom: Policy application fails.
- Platforms affected: All
- Functional Area affected: encryption
- Customer Impact: Low IJ17569
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory leak when the gateway node joins the cluster. Reply data is not freed after obtaining the lead gateway node. Lead gateway functionality is no longer used.
- Work around: No
- Problem trigger: Gateway node joining the cluster.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17534
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory leak when the gateway node is not yet ready to handle the requests when the node designation is changed
- Work around: No
- Problem trigger: Gateway node joining the cluster.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17537
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Advisory locks are recorded in the Linux kernel on the local node via file_lock structures, and GPFS maintains an additional structure to accomplish locking across nodes. There are times when the inode object was freed while a blocked lock waiter is resumed by GPFS; GPFS will then try to free the file_lock along with the GPFS structure and access the obsolete inode structure data, which causes a kernel crash.
- Work around: No
- Problem trigger: A large fcntl locking workload and lock contention.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: High Importance IJ17471
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Today AFM goes through all the nodes in the cluster (including remote cluster nodes that mount the local file system) to find the single gateway node for the fileset to which the generated application IO request needs to be queued. On clusters with a huge number of remote cluster mounted nodes, this causes considerable application performance degradation.
- Work around: No
- Problem trigger: Have a large number of remote cluster nodes mounting the file system from the owning cluster (the customer has about 9000 such nodes mounting the FS). Every time an application node sends a request to the gateway node, it needs to go through the entire list of 9K nodes to find this single gateway node. In a similar fashion, the gateway node also needs to confirm that it is indeed the serving gateway node for the request sent, again verifying against the 9K node list. This takes up a considerable amount of time in the application IO path to queue the request from the application node to the gateway and to return the ack from the gateway back to the application node in order to complete the application IO request.
- Symptom: Silent performance degradation for the applications performing IO to the AFM fileset.
- Platforms affected: ALL Linux OS environments (AFM Gateway nodes). All Linux and AIX environments (Application nodes running IO to the AFM fileset).
- Functional Area affected: AFM - NFS and GPFS backend filesets. with afmHashVersions 2 and 5. With afmFastHashVersion tunable turned on.
- Customer Impact: High Importance IJ17170
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: getxattr for the 'security' namespace is not properly blocked during quiesce, which may cause the assert "SGNotQuiesced"
- Work around: No
- Problem trigger: When the file system is quiesced (for example when running mmcrsnapshot/mmdelsnapshot), all vfs operations should be blocked. If there are applications accessing a file's 'security' namespace extended attributes (for example the 'getcap' command), that getxattr vfs operation is not properly blocked and may cause the assert "SGNotQuiesced"
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Snapshots
- Customer Impact: High Importance IJ17112
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When an RDMA connection is in a bad state, new NSD requests will go to the remaining RDMA connections, but the in-flight NSD requests will fall back to the TCP socket even though other RDMA connections remain available.
- Work around: No
- Problem trigger: A port or link error on a node which has multiple IB ports
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: Suggested IJ17172
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM prefetch does not work if the files have 64-bit inode numbers assigned to them. When checking the file for the cached bit, a 32-bit inode number is used, and integer overflow might cause the file's cached state to be returned as true.
- Work around: No
- Problem trigger: AFM prefetch
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17557
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The primary fileset might run out of inode space if a large number of files are created/deleted.
- Work around: No
- Problem trigger: Inode space might be exhausted.
- Symptom: Abend/Crash
- Platforms affected: Linux Only
- Functional Area affected: AFM DR
- Customer Impact: IJ17175
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When UID remapping is enabled, daemon asserts or kernel crashes occur on the nodes in the client cluster. This happens when the remapping scripts do not remap any credentials or enableStatUIDremap is not enabled.
- Work around: 1. For the daemon assert, correct the remap scripts to remap the credentials 2. For the kernel crash, enable enableStatUIDremap config option
- Problem trigger: UID remapping with incorrect mmname2uid script and file metadata modification when enableStatUIDremap is not enabled.
- Symptom: Abend/crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High Importance IJ17114
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM prefetch on small files has a performance issue because the file is flushed to disk without closing the open instance. This causes the file not to be shrunk to fit into subblocks, and a full block of data is transferred to the NSD server.
- Work around: No
- Problem trigger: AFM prefetch
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17576
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A data write operation is performed during role reversal even if the file was already synced but is migrated at the secondary. If files are migrated, the write operation should be skipped during role reversal and only the attributes should be set.
- Work around: None
- Problem trigger: Migrated files are present during role reversal
- Symptom: A write operation happens on a migrated file.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Suggested IJ17570
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: RPC message was reported as lost, like below: Message ID 735239 was lost by node ip_address node_name wasLost 1
- Work around: None
- Problem trigger: The network is unstable, which leads to reconnect happening several times
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ17538
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: FSSTRUCT error: FSErrValidate could be generated in the system log after adding a new disk to a file system.
- Work around: None
- Problem trigger: Add a new disk to a file system while running GPFS 5.0.1.0 through 5.0.3.1
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ17554
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: After reboot of a node, the systemhealth NFS monitoring was started, but not the SMB component and its monitoring. AD authentication was configured for NFS, which depends on a running SMB component. This situation yielded a "winbind-down" event, but gave no hint about the root cause
- Work around: mmshutdown followed by mmstartup might help, since the entire stack (including SMB/NFS and their monitors) is restarted. The log level could be increased during the startup and check phase (mmces log level 3) to get more details in the mmfs.log file; for production, this log level should be lowered again (to 0 or 1). See the sketch after this entry.
- Problem trigger: The circumstances which may lead to the detected mismatch were not repeatable. This seems to be a rare race situation, and was not reported before.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: High Importance IJ17559
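A minimal sketch of this workaround, assuming an affected CES node named ces1 (hypothetical placeholder):
  # Restart the whole stack (GPFS plus SMB/NFS and their monitors) on the affected node
  mmshutdown -N ces1
  mmstartup -N ces1
  # Temporarily raise the CES log level while checking mmfs.log, then lower it again
  mmces log level 3
  mmces log level 1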
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmexpelnode fails when the network cable of the cluster manager and file system manager node is pulled in a CCR-enabled cluster with tiebreaker disks configured. GPFS file systems get unmounted on other hosts.
- Work around: None
- Problem trigger: mmexpelnode executed in a CCR enabled cluster with tiebreaker disks configured.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: Admin Commands (mmexpelnode)
- Customer Impact: High Importance IJ17580
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When accessing gpfs zlib compressed file by mmap (or execute a gpfs zlib compressed executable file), kernel may crash with oops message "unable to handle kernel paging request" at IoDone routine
- Work around: None
- Problem trigger: accessing gpfs zlib compressed file by mmap (or execute a zlib compressed executable file)
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: GPFS Native Compression
- Customer Impact: High Importance IJ17593
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Deadlock when AFM filesets are accessed using the remote mounted file system due to the mismatch in the gateway node configuration between client (remote) and storage (home) clusters. It is unclear how the configuration mismatch happens.
- Work around: None
- Problem trigger:
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical IJ17581
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A stripe group / file system manager panic occurs while another node (non-SGmgr) is accessing files in a snapshot. These accesses can be part of the snapshot deletion itself, or another maintenance command (such as mmdeldisk or mmrestripefs), or even ordinary user accesses from the kernel. The diagnostic error reported in the log on the stripe group (SG) manager node looks like this, though the line number may vary: 2019-05-06_23:23:22.122-0300: [X] File System fs1 unmounted by the system with return code 2 reason code 0, at line 4646 in /afs/apd.pok.ibm.com/u/gpfsbld/buildh/ttn423ptf13/src/avs/fs/mmfs/ts/fs/llio.C The "unmount in llio.C" message is usually followed by a message mentioning "Reason: SGPanic", but this does not always occur, and a SGPanic can be caused by other unrelated problems. The error is triggered by a snapshot listed as DeleteRequired by mmlssnapshot. The snapshot access that causes the error, however, will be to an earlier snapshot (with smaller snapId); though it may be difficult to determine which access or which node caused the panic. Further, at least one snapshot must be a fileset snapshot (file systems with only global snapshots, are not affected). The specific enabling factors, however, are complicated and quite rare for most customers, so this is not a common problem.
- Work around: The work-around is to remove DeleteRequired snapshots with an mmdelsnapshot command with an explicit -N argument listing only the SG manager node (see the sketch after this entry).
- Problem trigger: The error is triggered by a snapshot listed as DeleteRequired by mmlssnapshot. The snapshot access that causes the error, however, will be to an earlier snapshot (with smaller snapId), though it may be difficult to determine which access or which node caused the panic. Further, at least one snapshot must be a fileset snapshot (file systems with only global snapshots are not affected). The specific enabling factors, however, are complicated and quite rare for most customers, so this is not a common problem.
- Symptom: Cluster/File System Outage
- Platforms affected: All OS environments
- Functional Area affected: Snapshots
- Customer Impact: Suggested: has little or no impact on customer operation IJ17595
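A minimal sketch of the workaround, assuming a file system fs1 and a DeleteRequired snapshot snap1 (both hypothetical placeholders):
  # Identify DeleteRequired snapshots and the current file system (SG) manager node
  mmlssnapshot fs1
  mmlsmgr fs1
  # Delete the DeleteRequired snapshot, restricting the work to the SG manager node
  mmdelsnapshot fs1 snap1 -N <sg-manager-node>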
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: For AFM migration, provide an option to revalidate with home only once after the cutover to the new system, to improve performance during fileset access.
- Work around: None
- Problem trigger: AFM migration
- Symptom: Performance Impact/Degradation
- Platforms affected: All OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17582
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: gnrhealthcheck is not catching the case where an ESS system is set up without having verified that both servers see the enclosures/drives.
- Work around: None
- Problem trigger: This problem is caused by an invalid ESS deployment.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested IJ17583
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When running I/O with NFS, unexpected failovers occurred without an obvious reason. NFS is reported as 'not active', even though it is still working.
- Work around: No workaround available. There is a manual way to temporarily modify the event declaration for the observed "nfs_not_active" event by modifying the event action in the event configuration file (ask L2 for support).
- Problem trigger: In the reported cases some high I/O load led to the situation that NFS v3 and/or v4 (whatever is configured) NULL requests failed, and that a following internal statistics check reported no activity regarding the number of internal NFS operations. The monitor interpreted this as a "hung" state and triggered a failover. In fact, the system might still be functional, and the internally detected "unresponsive" state might be only temporary, so that a failover is not advised in this case. However, at the time of monitoring there was no further indication available.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: High Importance IJ17598
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: QOS may deadlock on the file system manager node, particularly if there are many (hundreds of) nodes mounting the file system and the manager node is heavily CPU or network loaded.
- Work around: 1) mmchqos FS stat-slot-time 15000 stat-poll-interval 60, or if that is not sufficient... 2) Disable QOS until the fix is available. See the sketch after this entry.
- Problem trigger: See problem description.
- Symptom: Hang or Deadlock
- Platforms affected: ALL
- Functional Area affected: QOS
- Customer Impact: High Importance, especially for customers using QOS with hundreds of nodes mounting the file system. IJ17584
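A hedged sketch of the workaround, assuming a file system fs1 (hypothetical); the stat-slot-time/stat-poll-interval option spelling follows the workaround text above and should be verified against the mmchqos documentation for your level:
  # Option 1: lengthen the QOS statistics collection intervals
  mmchqos fs1 --stat-slot-time 15000 --stat-poll-interval 60
  # Option 2: if that is not sufficient, disable QOS until the fix is applied
  mmchqos fs1 --disable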
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A filesystem containing a dot in the name was declared to be ignored by declaring a file /var/mmfs/etc/ignoreAnyMount.filesystemWith.dot. However, the systemhealth monitor treated it as a missing filesystem.
- Work around: No work around available. Filesystems could be named with an underscore instead of a dot, if a separator is wanted.
- Problem trigger: A filename like /var/mmfs/etc/ignoreAnyMount.filesystemWith.dot is split internally by dots, so that it results in three items (which is not wanted): /var/mmfs/etc/ignoreAnyMount, filesystemWith, dot.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: little impact on customer operation IJ17600
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Customer cannot create an SMB export under specific conditions.
- Work around: Choose names of GPFS file systems such that no file system name is a substring of another (for example, avoid having both fs1 and fs10).
- Problem trigger:
- Symptom: The customer is limited to a specific naming scheme for their GPFS file systems
- Platforms affected: ALL Linux OS environments
- Functional Area affected: SMB
- Customer Impact: Suggested IJ17585
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If a relative pathname is provided in an export definition, the mmnfs command will allow it which will cause the Ganesha NFS server to fail.
- Work around: None
- Problem trigger: A relative pathname passed to the --pseudo option of the mmnfs command.
- Symptom: Unexpected results.
- Platforms affected: Linux
- Functional Area affected: Protocols
- Customer Impact: Suggested IJ17607
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM is unable to prefetch the data if the file metadata has changed. For example, if the user changes the metadata (e.g., chmod) on an uncached file, prefetch skips reading the file.
- Work around: Read the file manually instead of using prefetch (see the sketch after this entry).
- Problem trigger: AFM prefetch
- Symptom: Unexpected Results/Behavior
- Platforms affected: All OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17601
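A minimal sketch of the workaround, assuming an uncached file at a hypothetical path inside the AFM cache fileset:
  # Reading the file through the cache fileset pulls the data from home without prefetch
  cat /gpfs/fs1/cacheFileset/dir1/file1 > /dev/null
  # or
  dd if=/gpfs/fs1/cacheFileset/dir1/file1 of=/dev/null bs=1M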
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: "mmhealth node show" might show degraded status for CLOUDGATEWAY even though "mmcloudgateway service status -N tctServers" shows all OK
- Work around: None
- Problem trigger: If Cloudgateway was in a degraded state and changed to "only_ensures_cloud_container_exists" status it did not trigger mmhealth to go to a "healthy" state.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux
- Functional Area affected: System Health TCT
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ17665
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ16674 IJ16676 IJ16678 IJ16707 IJ16712 IJ16716 IJ16782 IJ16783 IJ17112 IJ17114 IJ17133 IJ17136 IJ17147 IJ17170 IJ17172 IJ17175 IJ17471 IJ17524 IJ17534 IJ17537 IJ17538 IJ17554 IJ17557 IJ17559 IJ17569 IJ17570 IJ17576 IJ17580 IJ17581 IJ17582 IJ17583 IJ17584 IJ17585 IJ17593 IJ17595 IJ17598 IJ17600 IJ17601 IJ17607 IJ17665.
Problems fixed in Spectrum Scale 5.0.3.1 [May 31, 2019]
- Problem description: When creating a DMAPI session there is a small window where memory gets corrupted, causing a GPFS daemon crash with signal 11.
- Work Around: None
- Problem trigger: Creating lots of DMAPI sessions with heavy workload
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ15859
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: RPCs sent via RDMA are pending forever and remain in the 'sending' state. Long waiters with Verbs RDMA like: Waiting 2273.0813 sec since 11:05:04, monitored, thread 113229 BackgroundSyncThread: for RDMA send completion fast on node 192.168.1.1
- Work Around: None
- Problem trigger: Reply lost on RDMA network
- Symptom: Hang
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High IJ15892
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If GPFS is shut down on a node, it is possible that CES IPs are assigned to this node two minutes after shutdown. These CES IPs are not usable by the customer.
- Work Around: Suspend the node before GPFS shutdown (see the sketch after this entry).
- Problem trigger: The node still has a valid GPFS lease two minutes after shutdown.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: CES
- Customer Impact: High IJ15912
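A minimal sketch of the workaround, assuming a CES node named ces1 (hypothetical placeholder):
  # Suspend the CES node first so its CES IPs are moved away, then shut down GPFS
  mmces node suspend -N ces1
  mmshutdown -N ces1
  # After the node is started again, resume it to make it eligible for CES IPs
  mmces node resume -N ces1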
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A race condition may cause mmperfmon sensor updates to fail with the following message: "fput failed: Invalid version on put (err 807)". Other commands may fail with the above message as well.
- Work Around: Rerun the failed command.
- Problem trigger: The problem is hit more often when using the installation toolkit (spectrumscale command) to install.
- Symptom: Error output/message "fput failed: Invalid version on put (err 807)" Upgrade/Install failure
- Platforms affected: ALL Operating System environments, but more often on Linux nodes in a CCR environment.
- Functional Area affected: Admin Commands
- Customer Impact: Suggested IJ16079
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: "mmuserauth service create" command failed due to TCP port 445 being blocked. However, error message indicated incorrect credentials which was not the correct reason for failure.
- Work Around: None
- Problem trigger: The issue is seen at the time of configuring authentication, in those setups where TCP port 445 is blocked. The command internally tries to connect to the specified DC via that port. Due to the blocked port, it fails to connect with a timeout. However, the error message shown currently indicates incorrect credentials, which is not the case.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Authentication
- Customer Impact: Suggested IJ16084
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: FSErrInodeCorrupted FSSTRUCT error could be written to system log as result of stale buffer for directory block.
- Work Around: None
- Problem trigger: Change in token manager list as result of either node failure or change in number of manager nodes.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: Suggested IJ16085
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The output of mmlscluster --ces shows multiple entries for the same IP address. The cesiplist file (stored in CCR) did contain these multiple entries, so mmlscluster just displayed them. This was obviously a misconfiguration.
- Work Around: A reassignment of IPs (moves, failover, suspend/resume) triggers a rewrite of the cesiplist file, which cleans up these inconsistencies. It is necessary that the affected node is involved in the IP movement (see the sketch after this entry).
- Problem trigger: The circumstances which may lead to multiple entries of the same IP for a node are not known. This seems to happen occasionally, but very rarely.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: has little or no impact on customer operation IJ16091
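A minimal sketch of the workaround, assuming a duplicated CES IP 10.0.0.10 and CES nodes ces1/ces2 (all hypothetical placeholders); the affected node must take part in the movement:
  # Move the CES IP away from the affected node and back, so the cesiplist file is rewritten
  mmces address move --ces-ip 10.0.0.10 --ces-node ces2
  mmces address move --ces-ip 10.0.0.10 --ces-node ces1
  # Alternatively, suspending and resuming the affected node also triggers a reassignment
  mmces node suspend -N ces1
  mmces node resume -N ces1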
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Unexpected wndb down after SMB startup without a known reason at log level 0.
- Work Around: Start wndb manually.
- Problem trigger: Unknown
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: CES
- Customer Impact: Medium Importance IJ16093
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Trying to delete an immutable file through SMB fails after the retention period expires. The problem is that Samba as SMB server denies deletion when the READONLY flag is set.
- Work Around: None
- Problem trigger: A Windows SMB client is trying to delete an immutable file after the retention period expires.
- Symptom: Error output/message
- Platforms affected: Windows Only
- Functional Area affected: SMB/Immutability
- Customer Impact: High Importance IJ16094
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If too many pdisks are unreadable (not missing), so that we are not able to write to a vtrack, it is possible that stale strip information is committed to the metadata log. When the scrubber tries to scrub the vtrack, it will examine this stale strip data and declare data loss.
- Work Around: None
- Problem trigger: Unavailability of pdisks when writing a vtrack.
- Symptom: IO error.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: Critical IJ16095
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An FSErrCheckHeaderFailed error could be incorrectly issued and logged in the system log.
- Work Around: None
- Problem trigger: A user application moves files out of a directory before deleting the directory.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ15910
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The GPFS daemon will hit a signal 11 or a log assert with "offset < ddbP->mappedLen" when a user application, log recovery, or the tsdbfs or mmfsck command accesses a corrupted directory (the directory's file size is smaller than 32 bytes, the size of the directory block header structure).
- Work Around: None
- Problem trigger: This kind of corrupted directory could be caused by a previous code bug.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15909
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Some operations on the IW fileset will take longer than expected because an unintended dependency is created on previous operations performed on the fileset, which the currently performed operation then attempts to replicate to the remote/home side.
- Work Around: None
- Problem trigger: Users running 5.0.3 with workloads on AFM IW mode filesets may see some elongated operations (performance impact) on the filesets, owing to dependent operations performed on the same file/fileset earlier that are still waiting to be asynchronously pushed to the home/remote site.
- Symptom: Some operations on the IW fileset might take longer than expected, since they treat other asynchronous operations to the remote site as their dependents. A few waiters might linger for a few extra seconds; once the dependencies are resolved, the waiters should vanish.
- Platforms affected: ALL Operating System environments (AFM application and Gateway nodes).
- Functional Area affected: AFM - and Specifically users on AFM IW mode filesets only.
- Customer Impact: High Importance. IJ16110
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Enable AFM prefetch for a single fileset to run from multiple gateway nodes to improve migration performance (see the sketch after this entry).
- Work Around: None
- Problem trigger: AFM prefetch, slow performance
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested IJ16112
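For context, a minimal sketch of invoking AFM prefetch on a cache fileset (file system, fileset, and list-file names are hypothetical placeholders); with this enhancement the prefetch work for a single fileset can be spread across multiple gateway nodes:
  # Queue prefetch for the files named in the list file
  mmafmctl fs1 prefetch -j cacheFileset --list-file /tmp/prefetch.list
  # Check prefetch statistics for the fileset
  mmafmctl fs1 prefetch -j cacheFileset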
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS daemon crash when an application is writing data into the file system
- Work Around: None
- Problem trigger: A memory failure of newBuffer in a busy system.
- Symptom: Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15993
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Enable AFM prefetch for a single fileset to run from multiple gateway nodes to improve migration performance. This enhancement also handles the scenario where the same file is being read from multiple gateway nodes.
- Work Around: None
- Problem trigger: AFM prefetch, slow performance
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested IJ16113
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: On a system without the ifup/ifdown commands installed, nearly any call to an mm-command shows messages like which: no ifup in (/bin:/usr/bin:/sbin:/usr/sbin:/usr/lpp/mmfs/bin) which: no ifdown in (/bin:/usr/bin:/sbin:/usr/sbin:/usr/lpp/mmfs/bin) and terminates the called mm-program
- Work Around: Not available. An install of ifup/ifdown would resolve the issue, but might lead to other issues
- Problem trigger: Any mm-command may run into this issue if the ifup/ifdown commands are not installed on the system
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: High Importance IJ16114
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfsadm dump command could run into an infinite loop when dumping the token objects.
- Work Around: Avoid running the mmfsadm dump command.
- Problem trigger: Running the mmfsadm dump command while workloads are running in the cluster.
- Symptom: mmfsadm dump command hang.
- Platforms affected: ALL Operating System environments except Windows
- Functional Area affected: mmfsadm dump command
- Customer Impact: Suggested IJ15996
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If the file system was formatted with narrow disk address (2.2 version or older), and the gpfs version is 4.2.3 or 5.0.x version, GPFS daemon assert would happen randomly.
- Work Around: None
- Problem trigger: Application I/O into a narrow disk address file system by using 4.2.3 or 5.0.x GPFS versions.
- Symptom: Crash, like assert subblocksPerFileBlock==(1<<(tinodeP->getFblockSize()))
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ16116
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmrepquota command usage for the -q and -t options is ambiguous. Options -q and -t should not be used when combined with Device:Fileset because they are file system attributes (see the sketch after this entry).
- Work Around: None
- Problem trigger: The current mmrepquota command usage allows invoking -q option as follows: mmrepquota -q Device:fileset
- Symptom: mmrepquota -q Device:fileset gives file system default quota information and not perfileset-quota.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: Suggested IJ15914
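A minimal sketch of the distinction, assuming a file system fs1 and fileset fset1 (hypothetical placeholders):
  # -q and -t report file system level quota attributes; invoke them per file system
  mmrepquota -q fs1
  mmrepquota -t fs1
  # To report quota usage for a specific fileset, use the Device:Fileset form without -q/-t
  mmrepquota -u fs1:fset1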
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: For file systems created with large NumNodes and large NumInodesToPreallocate arguments, the inode allocation map ends up with a large value for nRegions and nBitsPerSubsegment. For subsequent independent filesets created with orders of magnitude less NumInodesToPreallocate, this can leave most of the inode map segments as unusable/surplus. During inode lookup as part of inode allocation, these surplus segments may be read from disk many times causing performance degradation.
- Work Around: Increase the number of allocated inodes in the problem fileset (see the sketch after this entry).
- Problem trigger: File systems created with large NumNodes and large NumInodesToPreallocate arguments. Then independent filesets are created with orders of magnitude less NumInodesToPreallocate.
- Symptom: Performance Impact/Degradation
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15991
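A minimal sketch of the workaround, assuming file system fs1 and independent fileset fset1 (hypothetical placeholders); the new limits are illustrative only:
  # Check the current inode usage and limits for the fileset
  mmlsfileset fs1 fset1 -L -i
  # Raise the maximum and preallocated inode counts for the problem fileset
  mmchfileset fs1 fset1 --inode-limit 1000000:500000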
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A fileset might get stuck and prevent file system quiesce when an AFM DR fileset finds that an inode does not have remote attributes and tries to build the remote attributes using the tsfindinode command after blocking the file system quiesce. Remote attributes are used to find the remote file using the file handle for replication.
- Work Around: None
- Problem trigger: AFM DR with renames to the deleted directories
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM DR
- Customer Impact: Critical IJ16024
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: FSErrInodeCorrupted FSSTRUCT error could be issued incorrectly during lookup when both directory and its parent directory are being deleted.
- Work Around: None
- Problem trigger: Perform lookup on '..' entry of a directory that is being deleted.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ15916
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: During file manager takeover, the new manager will broadcast to all mount nodes to invalidate their cached low level file metadata. If, at the same time, a low level file is being opened on a mount node, the two can race and cause logAssertFailed "ibdP->llfileP == this" or logAssertFailed "inode.indirectionLevel >= 1"
- Work Around: One of our customers reported hitting this problem while running mmdelsnapshot. For the mmdelsnapshot scenario, deleting the oldest snapshot first will greatly reduce the risk (see the sketch after this entry).
- Problem trigger: The race existing between file manager take over and low level file opening (the latter one can happen for many reasons - including but not limited to mmdelsnapshot)
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15961
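A minimal sketch of the mmdelsnapshot part of the workaround, assuming file system fs1 (hypothetical placeholder):
  # List snapshots with their creation times to identify the oldest one
  mmlssnapshot fs1
  # Delete the oldest snapshot before deleting newer ones
  mmdelsnapshot fs1 <oldest-snapshot-name>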
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS admin commands may cause high CPU usage. This is because remote GPFS command execution calls the find command to clean up temporary files on systems with a large number of subdirectories and files under /var/mmfs/tmp.
- Work Around: Manually clean up to reduce the number of subdirectories and files under /var/mmfs/tmp. Kill running find processes that were invoked from /usr/lpp/mmfs/mmremote processes (see the sketch after this entry).
- Problem trigger: Nodes with a large number of subdirectories and files under /var/mmfs/tmp are most likely affected.
- Symptom: Performance Impact/Degradation, hang
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: High Importance an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ15858
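A minimal sketch of the manual cleanup; the age threshold is illustrative and should be adjusted to your environment:
  # Remove old temporary files under /var/mmfs/tmp (here: older than 7 days)
  find /var/mmfs/tmp -type f -mtime +7 -delete
  # Kill long-running find processes that are scanning /var/mmfs/tmp
  pkill -f 'find /var/mmfs/tmp'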
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The session info length is not checked when creating a DMAPI session; it is supposed to be less than or equal to 256 bytes. As per the DMAPI standard, GPFS needs to return the E2BIG errno. Instead, GPFS truncates the length to 256 bytes and proceeds with the session creation.
- Work Around: None
- Problem trigger: Creating DMAPI session with very long session info string
- Symptom: None
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ16117
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The arping command is used by the NFS failover mechanism, but was not found on the system. It was installed, but the log files show a No such file or directory message, which indicates that the arping command was not found in the expected path.
- Work Around: Setting a symbolic link from the arping command to "/usr/bin/arping" (the default path used if the distro cannot be properly detected) would probably help. However, using links is generally not advised, since they could be a security issue.
- Problem trigger: The circumstances which led to the issue are not fully understood. Most likely the OS detection using the /etc/redhat-release file did not work, so the wrong distro was assumed, which led to a wrong expected path name for the arping command location, and so it was not found. This older CentOS version does not yet have the /etc/os-release file provided by newer distros, which we now use as well.
- Symptom: Error output/message
- Platforms affected: All CentOS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: has little or no impact on customer operation IJ15998
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Deadlock during the AFM fileset recovery due to lock ordering issue when rename operations are being executed
- Work Around: None
- Problem trigger: AFM fileset recovery with renames to newly created directories.
- Symptom: Long Waiters/Deadlock
- Platforms affected: All Linux OS
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical IJ15963
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The GPFS systemd service (gpfs.service) may report a failed status after shutdown
- Work Around: The fail systemd status is not an error condition of GPFS shutdown. The systemd fail status can be ignored.
- Problem trigger: When shutting down GPFS, if the main systemd process (runmmfs) does not exit quickly, a kill signal is sent to the main process either by the shutdown subroutine or by systemd manager itself.
- Symptom: Error output/message Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments with systemd version >= 219
- Functional Area affected: Admin Commands/systemd
- Customer Impact: has little or no impact on customer operation IJ15962
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Using the mmchcluster command to enable CCR may fail. While the mmchcluster command is working to enable CCR, any other mm command can remove the authorized_ccr_keys file, which is needed in the final step of enabling CCR. This problem occurs more often when the first quorum node in the list is on a GPFS-supported systemd system. If the mmchcluster command is running on a quorum node, it treats that node as the first quorum node in the list.
- Work Around: Run mmchcluster on a quorum node that does not support GPFS systemd, or temporarily disable system health: chmod 000 /usr/lpp/mmfs/bin/mmsysmon* (see the sketch after this entry).
- Problem trigger: While the mmchcluster command is working to enable CCR, any other mm command can remove the authorized_ccr_keys file, which is needed in the final step of enabling CCR.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments with systemd version >= 219
- Functional Area affected: CCR Admin Commands
- Customer Impact: High Importance to customers that want to enable CCR. IJ15915
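A minimal sketch of the second workaround; the mode used to restore the monitor is illustrative, so record the original permissions first:
  # Record the current permissions, then temporarily disable the system health monitor binaries
  ls -l /usr/lpp/mmfs/bin/mmsysmon*
  chmod 000 /usr/lpp/mmfs/bin/mmsysmon*
  # Enable CCR
  mmchcluster --ccr-enable
  # Restore the recorded permissions afterwards (mode shown is illustrative)
  chmod 555 /usr/lpp/mmfs/bin/mmsysmon*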
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Some GPFS commands don't work correctly if the cluster name contains special characters.
- Work Around: Change the name of the cluster so that it does not contain any special characters (see the sketch after this entry).
- Problem trigger: A cluster name with a special character such as the ampersand "&" causes commands like mmauth show . to fail
- Symptom: GPFS admin commands error. Error output/message Unexpected Results/Behavior
- Platforms affected: all
- Functional Area affected: admin commands
- Customer Impact: Low Importance IJ15908
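A minimal sketch of the workaround; the new cluster name is a hypothetical placeholder:
  # Rename the cluster to a name without special characters, then verify
  mmchcluster -C cluster1.example.com
  mmlscluster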
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM does not keep directory mtime in sync while reading the directory contents from the home. This may be a problem for some users during the migration
- Work Around: None
- Problem trigger: AFM migration/prefetch or cache readdir/lookup
- Symptom: Unexpected results
- Platforms affected: All Linux OS
- Functional Area affected: AFM
- Customer Impact: Critical IJ15990
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The NFS/Ganesha service did not process I/O, and the systemhealth monitor showed that the NFS NULL checks for protocol versions 3 and 4 failed. The Ganesha process was shown in the process list, and also logging and replies to requests via Dbus worked. There was no failover.
- Work Around: Manual restart of NFS/Ganesha: mmces service stop nfs (or kill the gpfs.ganesha process), then mmces service start nfs (see the sketch after this entry)
- Problem trigger: The reason why NFS/Ganesha hung was not evaluated. The main issue was that the Ganesha process was not entirely "dead": the process was running, and it replied to remote requests via Dbus and also wrote log entries. It was "dead" regarding I/O handling, but the systemhealth monitor did not notice this properly.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES Nodes running NFS)
- Functional Area affected: CES
- Customer Impact: High Importance IJ16036
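A minimal sketch of the manual restart on the affected CES node:
  # Restart the NFS (Ganesha) service on this node
  mmces service stop nfs
  mmces service start nfs
  # Verify the service and health state afterwards
  mmces service list
  mmhealth node show NFS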
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Assert exp(totalLen <= extensionLen) in line 16424 of file /project/sprelttn423/build/rttn423s008a/src/avs/fs/mmfs/ts/nsd/nsdServer.C
- Work Around: None
- Problem trigger: This issue affects customers running IBM Spectrum Scale 4.2.3 and later if the following conditions are true 1) mixed-endianness cluster, or mixed-endianness remote clusters. 2) RDMA enabled (and an NSD client may send NSD requests to an NSD server which has a different endianness) 3) NSD client or NSD server is IBM Spectrum Scale 4.2.3 It is a rare-case assert which may happen when the client sends the first NSD request to an NSD server with a different endianness.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: Suggested IJ16020
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The dm_getall_disp DMAPI call does not fail with err 22 (EINVAL) when called with a bad sessionId
- Work Around: None
- Problem trigger: When dm_getall_disp is called with a bad sessionId
- Symptom: Error output/message
- Platforms affected: ALL
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ16064
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfsck man page provides an instruction to clear the fsstruct error from the mmhealth command: "mmsysmonc event filesystem fsstruct_fixed". But this is not correct. As a result, the documented command will fail with a syntax error.
- Work Around: None
- Problem trigger: Executing command as instructed in man page
- Symptom: Error output/message due to documentation problem
- Platforms affected: ALL Operating System environments
- Functional Area affected: System Health
- Customer Impact: High Importance IJ16329
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A recent performance change in GPFS 5.0.3 makes GPFS commands more sensitive to network congestion. This causes commands like mmgetstate -a to report unknown status, or other GPFS commands to report nodes as unreachable.
- Work Around: Commands like mmgetstate -a can be issued again to get the status.
- Problem trigger: This affects only nodes running GPFS 5.0.3. It affects all GPFS admin commands that need to execute commands remotely.
- Symptom: Error message like the below: "The following nodes could not be reached:" mmgetstate -N or -a reports "unknown" state.
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ16395
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ15858 IJ15859 IJ15892 IJ15908 IJ15909 IJ15910 IJ15912 IJ15914 IJ15915 IJ15916 IJ15961 IJ15962 IJ15963 IJ15990 IJ15991 IJ15993 IJ15996 IJ15998 IJ16020 IJ16024 IJ16036 IJ16064 IJ16079 IJ16084 IJ16085 IJ16091 IJ16093 IJ16094 IJ16095 IJ16110 IJ16112 IJ16113 IJ16114 IJ16116 IJ16117 IJ16329 IJ16395.
Problems fixed in Spectrum Scale 5.0.3.3 for Protocols include the following:
- smb: Add missing newline in ctdb_states_proxy error message
- smb: Add additional owner entry when mapping to NFS4 ACL with IDMAP_TYPE_BOTH and implement special case for denying owner access to ACL
- smb: Add gpfs.smb 4.9.8_gpfs_21-4
Problems fixed in Spectrum Scale 5.0.3.2 for Protocols include the following:
- smb: Return share name in correct case from net rpc conf showshare
- smb: Add gpfs.smb 4.9.8_gpfs_21-1
Problems fixed in Spectrum Scale 5.0.3.1 for Protocols include the following:
- gui: AD names should allow dots
- gui: Better handling on warning message for remote mounted file systems
- gui: Filesets - The "Type" and "AFM Role" displayed in the export correction
- gui: Updates required to accurately show GNR User Condition definitions
- gui: The CAPACITY_LICENSE task fails when there are no NSDs
- gui: Edit quota dialog not displayed for user,group,fileset quotas
- gui: No longer display a warning or error icon on SSD endurance percentage
- gui: Hourly call to mmaudit list should not occur
- toolkit: Fixed WCE parsing for some SAS cards
- smb: Version 4.9.7_gpfs_20-1
- smb: Change the memory check to cover the total of main memory and swap space
- smb: Stabilize gencache after gencache flush
- smb: Fill gencache with domain info returned from domain controller
- smb: Enable logging for early startup failures
- smb: Properly track the size of talloc objects
- smb: Remove implementations of SaveKey/RestoreKey
- smb: Pass back what we have in _wbint_Sids2UnixIDs().
- callhome: Updated to 5.0.3-1 nomenclature
- kafka: Updated to 5.0.3-1 nomenclature
Problems fixed in Spectrum Scale Protocols Packages 5.0.3-0 [Apr 19, 2019]
- Please see the "What's New" page in the IBM Knowledge Center