Unless specifically noted otherwise, this history of problems fixed for IBM Spectrum Scale 5.0.x applies for all supported platforms.
Problems fixed in IBM Spectrum Scale 5.0.3.3 [September 12, 2019]
- Item: IJ18076
- Problem description: A race between the thread handling dm_create_session and an mmchmgr command caused the new DMAPI session id not to be sent to the new GPFS configuration manager node. When dm_destroy_session was later called to destroy that session id, it failed with EINVAL because the new configuration manager node did not know about that session id.
- Work around: None
- Problem trigger: Running mmchmgr when dm_create_session is in progress.
- Symptom: dm_destroy_session fails with EINVAL error.
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS file systems
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18605
- Problem description: On a RHEL 7.6 node, with GPFS versions 4.2.3.13 or higher or 5.0.2.2 or higher, when the kernel is upgraded to version 3.10.0-957.14.1 or higher, the node may encounter an I/O error when accessing a renamed directory. For example, on the RHEL 7.6 node: cd dir1; on the other cluster node, rename the directory to the new name: mv dir1 dir2. Then, dir2 cannot be accessed on the RHEL 7.6 node.
- Work around: On the affected node, exit from the directory with the old name (for example, with "cd .."), then access the directory again with "ls -ld"; after that, the directory can be accessed under its new name (see the example after this item).
- Problem trigger: This issue affects customers running IBM Spectrum Scale V4.2.3.13 or higher or 5.0.2.2 or higher in the following scenario: the RHEL 7.6 node kernel is upgraded to version 3.10.0-957.14.1 or higher; a directory is accessed on the RHEL 7.6 node (for example, cd dir1); the directory is renamed from another cluster node (for example, mv dir1 dir2); an I/O error then occurs when the renamed directory is accessed on the RHEL 7.6 node.
- Symptom: I/O error
- Platforms affected: All RHEL 7.6 OS environments with kernel version 3.10.0-957.14.1 or higher
- Functional Area affected: All Scale Users
- Customer Impact: High
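- Example: An illustrative shell sequence for the workaround above, using the directory names from the description (dir1 renamed to dir2 from another node):
      cd ..          # leave the directory that was entered under its old name
      ls -ld dir2    # look the directory up under its new name
      cd dir2        # the renamed directory can now be accessed again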
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18078
- Problem description: A race between a thread handling node recovery and a thread trying to generate a DMAPI event caused an assert because of the status change of the DMAPI event.
- Work around: None
- Problem trigger: Node failure and threads accessing migrated files.
- Symptom: GPFS daemon failure
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS filesystem
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18100
- Problem description: When a filesystem is being quiesced (for example create or delete of a snapshot), a type of rpc labelled "RecLockRetry" can become hung and create a deadlock between nodes. Messages similar to these can be found in the output of the "mmdiag --waiters" command - RemoteRetryThread: on ThCond, reason 'RPC wait' for RecLockRetry - Msg handler RecLockRetry: for In function "RecLockMessageHandler(RecLockRetry)", call to "kxSendFlock"
- Work around: None
- Problem trigger: Applications running on different nodes in the cluster are contending for conflicting advisory (fcntl) locks on the same file and one of these is releasing its lock at a time when the filesystem is being quiesced (for example, create or delete of a snapshot).
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18102
- Problem description: Recovery from previous failure crashes in the kernel due to the same memory being deallocated twice. The linux log reporting the BUG will have these 3 characteristic calls in the stack: [317959.895506] [<001fffff80085cb2>] cxiFreePinned+0x72/0xc0 [mmfslinux] [317959.895515] [<001fffff80087f9c>] cxiFcntlLock+0x54c/0x718 [mmfslinux] [317959.895709] [<001fffff812416ac>] _Z17RecLocModuleResetj+0x1c0/0x358 [mmfs26]
- Work around: None
- Problem trigger: A previous abnormal GPFS daemon shutdown (crash)
- Symptom: Abend/crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18315
- Problem description: audit.log is showing a deny regarding logrotate for the following logs: cesdr-log:/var/adm/ras/mmcesdr.log fileaudit-log:/var/adm/ras/mmaudit.log mmprotocoltrace:/var/adm/ras/mmprotocoltrace.log mmwatch-log:/var/adm/ras/mmwatch.log mmwfclient-log:/var/adm/ras/mmwfclient.log msgqueue-log:/var/adm/ras/mmmsgqueue.log tswatchmonitor-log:/var/adm/ras/tswatchmonitor.log watchfolders-log:/var/adm/ras/mmwf.log
- Work around: Manually fix SELinux security file context to allow logrotate to access the above log files.
- Problem trigger: logrotate activity
- Symptom: Error output/message
- Platforms affected: Linux with SELinux enabled in enforcing mode
- Functional Area affected: All Scale Users with components that use the Linux logrotation utility.
- Customer Impact: Suggested: has little or no impact on customer operations
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18105
- Problem description: LOGASSERTFAILED: CP->GET_STCP() != 0 SHHASHS.C:1689
- Work around: None
- Problem trigger: Node failure and file deletions
- Symptom: GPFS daemon failure
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18309
- Problem description: The disk size (number of sectors) was uninitialized after the disk was reopened. This caused an I/O error when writing the disk descriptor.
- Work around: None
- Problem trigger: Stop and start the disk.
- Symptom: IO Error
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18109
- Problem description: The ACL file contains all the Access Control Lists for the file system. In order to maximize parallelism among nodes as they access ACLs, the blocks of this file can be cached on each node. If, during the process of reclaiming unused ACL space, deletions are broadcast to nodes, these messages may include an updated count of the number of ACLs that exist, and the nodes receiving this message need to update the header that resides in block 0 of the ACL file. A problem exists where nodes produce mismatched replicas for this block (inode 4, block 0), resulting in messages like "Error in inode 4 snap 0: Record block 0 has mismatched replicas". This problem may also cause ACL garbage collection to run too frequently.
- Work around: None
- Problem trigger: As unique ACLs are created in the file system, the ACL file grows, and as the number doubles, garbage collection (reclaiming space no longer used) runs automatically (so adding ACLs can be one trigger). Other conditions that contribute to making this issue visible include having some metadata disks down at the time that the collection happens to be running. (This can result in subsequent mmfsck runs finding mismatched replicas for block 0 of the ACL file, which is inode 4.)
- Symptom: Error output/message
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18110
- Problem description: In a file system, a directory is being deleted on a node; from another node in the cluster, some operation generates an asynchronous attempt to obtain a conflicting lock on the same directory. This may cause a kernel crash in the pathname lookup procedure on the first node. This is a timing issue and difficult to hit.
- Work around: None
- Problem trigger: A large workload of recursive directory deletion.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18144
- Problem description: Poor readdir performance on a shared directory after workload caused local node to have lock on the entire directory.
- Work around: Avoid multiple "ls" on a shared directory.
- Problem trigger: Performing repeated readdir and lookup on a shared directory.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18165
- Problem description: If the clock jumps forward more than 160 seconds, then the internal command "tsctl nQstatus -Y" will return a status of "unresponsive". This will trigger the gpfs_unresponsive event and cause CES IPs to fail over to other nodes.
- Work around: None. The transient "unresponsive" state is self corrected within a short period of time (10 seconds).
- Problem trigger: System clock jumps backward or forward by more than 160 seconds.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments
- Functional Area affected: GPFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18136
- Problem description: Deadlock on SGManagementMgrDataMutex could occur during buffer steal.
- Work around: None
- Problem trigger: Buffer steal triggered due to running low on free buffers or a change in token manager assignment.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18137
- Problem description: GPFS daemon assert: Assert exp(i < nServers). This could happen when the number of manager nodes in the cluster is more than the maxTokenServers configuration setting which defaults to 128.
- Work around: Either reduce the number of manager nodes or increase the maxTokenServers setting (see the example after this item).
- Problem trigger: The number of manager nodes exceeds the maxTokenServers setting.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: High
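- Example: A minimal sketch of the second workaround option, assuming the setting is changed with mmchconfig; the value 256 is only a placeholder and should be at least the number of manager nodes:
      mmchconfig maxTokenServers=256
  A restart of the GPFS daemon may be needed before the new value takes effect.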
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18171
- Problem description: The mmdf command hangs with a long waiter waiting for free space recovery, then in turn blocking subsequent conflicted commands.
- Work around: None
- Problem trigger: Run mmdf command on FPO file system while there are I/O workloads in progress.
- Symptom: mmdf hang.
- Platforms affected: ALL Operating System environments except AIX and Windows
- Functional Area affected: FPO
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18238
- Problem description: tslspool34 is called by the GPFSPool zimon sensor, which may cause cluster-wide token contention on the root directory of a file system
- Work around: None
- Problem trigger: tslspool34 is called by the GPFSPool zimon sensor, which runs once every 5 minutes. In 5.0.3, the zimon sensor runs on all nodes, which causes significant token contention. Since 5.0.3.1, the zimon sensor runs only on restricted nodes.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: perfmon (Zimon)
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18138
- Problem description: FSSTRUCT error could be issued during file creation if file system runs out of disk space for metadata.
- Work around: None
- Problem trigger: Running out of metadata disk space
- Symptom: Error output/message
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18316
- Problem description: When mmfsck is run on a file system having corruption in the inode allocation map the file system manager node of the file system being scanned can assert with - logAssertFailed: !"SeverityNone for FSTATUS_UNFIXED"
- Work around: Disable the fsck patch queue feature on all nodes with the command "mmdsh -N all mmfsadm test fsck usePatchQueue 0", then rerun the mmfsck command (see the example after this item).
- Problem trigger: This issue affects customers running mmfsck on IBM Spectrum Scale V4.2.3 or higher where the inode allocation map contains a corruption in which a reserved inode is marked as free.
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: FSCK
- Customer Impact: Suggested
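- Example: An illustrative sequence for the workaround; "fs1" is a placeholder file system name:
      mmdsh -N all mmfsadm test fsck usePatchQueue 0
      mmfsck fs1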
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18174
- Problem description: Long waiters waiting for "RDMA read/write completion fast" because, in some cases, RDMA requests pending in an internal GPFS list may not be processed
- Work around: None
- Problem trigger: On a heavily loaded NSD server or GSS/ESS server with verbsRdma enabled, RDMA requests may be queued in a list if the current in-flight RDMA request count of a connection exceeds verbsRdmasPerConnection. Under mutex contention, these queued requests may not be processed when the RDMA connection is closed or reconnected, resulting in long waiters.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18175
- Problem description: When UID remapping is enabled IO performance is reduced due to the incorrect caching of the supplementary gids.
- Work around: None
- Problem trigger: Remote cluster mount with UID remapping enabled
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Operating System environments
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18139
- Problem description: Currently AFM has a default 8-second wait time before it times out a mount request for the remote export that is being mounted. In some cases there might be a real network delay for which the customer might want to increase this 8-second value; hence the need for a separate configuration option.
- Work around: Currently there is a daemon-level tunable - "mmfsadm afm mountWaitTimeout" - which needs to be enabled on the affected AFM gateway node (see the example after this item). But there is no way to know its configured value or the nodes that have the parameter defined, so a global tunable is being introduced, which makes it easier to control the parameter and the nodes on which the value needs to be tuned.
- Problem trigger: Perform the first IO to an AFM fileset with a large delay between the cache/primary gateway node and the home/secondary NFS serving node.
- Symptom: Network Performance.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM caching and AFM DR
- Customer Impact: Suggested.
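- Example: An illustrative invocation of the existing daemon-level tunable on an affected AFM gateway node; the value 30 (seconds) is only a placeholder and the exact argument syntax may vary by release:
      mmfsadm afm mountWaitTimeout 30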
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18177
- Problem description: A file was mapped with shmat() and an attempt was made to read past end of file (EOF). This causes a page fault. The GPFS page fault handler generated a kernel panic when it found that the kernel buffer it was trying to transfer to a user buffer was not valid. The kernel crashes with the following error: kernel panic, assert !lcl._wasMapped
- Work around: None
- Problem trigger: Reading last block of a file mapped with shmat()
- Symptom: Kernel panic
- Platforms affected: AIX
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18178
- Problem description: A hard-coded path to a tool does not match this Linux distribution.
- Work around: Create a symbolic link from /sbin/ibportstate to /usr/sbin/ibportstate (see the example after this item)
- Problem trigger: SuSE or Debian Linux and IB networking
- Symptom: Unexpected Results/Behavior
- Platforms affected: SuSE Linux
- Functional Area affected: System Health
- Customer Impact: Suggested: has little or no impact on customer operation
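- Example: A minimal sketch of the workaround, assuming the tool is installed at /usr/sbin/ibportstate and GPFS expects it at /sbin/ibportstate:
      ln -s /usr/sbin/ibportstate /sbin/ibportstate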
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18180
- Problem description: Kernel crashed due to checking for the wrong lock status of a mapped file.
- Work around: None
- Problem trigger: Reading mapped file.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18193
- Problem description: When client nodes are at a level containing the optimization for file creation in an empty or small directory while the token managers are not, there is a chance that certain token revokes will enter into infinite retries
- Work around: In a mixed cluster or multicluster environment, only let nodes with the optimization code play the manager role.
- Problem trigger: The client nodes have the optimization for file creation in an empty or small directory while the token managers do not, and nodes from multiple client clusters try to access the same empty or small directory in which files are being created.
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18321
- Problem description: When offline mmfsck in read-only mode is run on a file system with a down NSD, it will output an incorrect error message suggesting that the user restart the file system check in read-only mode again
- Work around: Ignore the message and bring the down disks back online; if the disks cannot be brought back online, they will have to be deleted in order to run the file system check.
- Problem trigger: This issue will affect customers running mmfsck on IBM Spectrum Scale V5.0.2 or higher when running mmfsck on a file system having a down NSD.
- Symptom: Incorrect error message
- Platforms affected: ALL
- Functional Area affected: FSCK
- Customer Impact: Suggested.
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18182
- Problem description: mmdeldisk or mmdf commands hang and wait on free space recovery. This happens when a node doesn't relinquish all the block allocation regions it owned during the process of unmounting a file system.
- Work around: Restart Spectrum Scale service on file system manager node.
- Problem trigger: Unmount the file system from a node.
- Symptom: File system or file operations hang.
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18140
- Problem description: Deadlock while changing the gateway node attribute using the mmchnode --gateway/--nogateway command.
- Work around: None
- Problem trigger: Gateway node change using the mmchnode command.
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ17825
- Problem description: Because of a race between the handling of memory mapping and normal reading of the same file a read from the last block of that mapped file returned wrong data.
- Work around: None
- Problem trigger: Multiple processes reading last block of a file and memory mapping of that same file.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: All Scale Users
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18185
- Problem description: GPFS asserts during mmcheckquota command when it encounters invalid fileset ids in the quota file.
- Work around: None
- Problem trigger: Invalid fileset ids, likely originating from deleted files, were erroneously inserted into the quota file causing an assertion in the mmcheckquota command.
- Symptom: GPFS terminates during mmcheckquota command due to assertion.
- Platforms affected: ALL
- Functional Area affected: Quota
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18142
- Problem description: A kernel bugcheck with code PAGE_FAULT_IN_NONPAGED_AREA (50) can occur during mmmount on Windows 10 or Windows Server 2016.
- Work around: None
- Problem trigger: Running mmmount on Windows 10 or Windows Server 2016.
- Symptom: Abend/Crash
- Platforms affected: Windows/x86_64 only, specifically Windows 10 and Windows Server 2016.
- Functional Area affected: Windows mount.
- Customer Impact: High Importance
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18147
- Problem description: There is a minor performance degradation in queueing of AFM IO request from the application node to the gateway node due to an inefficient algorithm for identifying the correct AFM gateway node.
- Work around: There is no serious impact without the fix, only slower AFM IO performance.
- Problem trigger: Performing IO to a AFM fileset from an application.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux and AIX OS environments (AIX as application nodes only)
- Functional Area affected: AFM caching and AFM DR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18216
- Problem description: In a large cluster, file creation times can take as much as 15 seconds to complete on some nodes. This is because of the high default value of maxActiveIallocSegs which causes some nodes to use more inode allocation segments leading to starvation in other nodes.
- Work around: Reduce the maxActiveIallocSegs configuration parameter value (see the example after this item).
- Problem trigger: nNodes(local + remote)*maxActiveIallocSegs > available inode allocation records in a fileset (ie. nRegions)
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: High Importance
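- Example: A minimal sketch of the workaround, assuming the parameter is lowered with mmchconfig; the value 1 is only a placeholder:
      mmchconfig maxActiveIallocSegs=1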
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18155
- Problem description: If multiple file systems are granted access to a remote cluster, the mmauth show -Y output for that cluster appears on multiple lines, like the regular output.
- Work around: None
- Problem trigger: A remote cluster with access to multiple file systems. The mmauth show -Y output is not in standard format.
- Symptom: Output format
- Platforms affected: ALL
- Functional Area affected: GUI/System Health
- Customer Impact: Suggested: has little or no impact on customer operation
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18191
- Problem description: When debugging is enabled for mmbackup, tsqosnice is called to query QOS and then tsqosnice may terminate with a stack smashing error.
- Work around: Do not use mmbackup debugging or remove the call to tsqosnice from the mmbackup script.
- Problem trigger: See problem description.
- Symptom: Stack smashing error message
- Platforms affected: Linux
- Functional Area affected: QOS, mmbackup
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18333
- Problem description: logAssertFailed: ofP->inodeLk.get_lock_state() != 0 || ofP->mnodeLk.get_lock_state() != 0 || ofP->metadata.mnodeFast.fastpathIsEnabled(0x04000000) && ofP->metadata.mnodeFast.fastpathGetCount() > 0
- Work around: Disable this logAssert with 'mmchconfig disableAssert' on releases that have the 'disableAssert' configuration option
- Problem trigger: Mnode token revoke while GPFS is in the fast path of file read/write
- Symptom: Abend/Crash
- Platforms affected: ALL
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18217
- Problem description: A program running against a GPFS file system with TCT installed can fail with an ENOENT error.
- Work around: None
- Problem trigger: On a TCT enabled file system, if a file is unlinked and later an attempt is made to access the file, it can result in an ENOENT error.
- Symptom: ENOENT error
- Platforms affected: Linux Only
- Functional Area affected: TCT
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18192
- Problem description: GPFS has a specified but unenforced limit of 256 CPUs. It has an internal table limit at 1024 CPUs. If a large system presents to LINUX a possible configuration of more than 1024 CPUs then GPFS will generate a LOGASSERT for the unexpected configuration.
- Work around: The workarounds are to revert the GPFS version or to configure the system firmware to present fewer CPUs to LINUX. Note that 'lscpu' and related commands won't show the possible limit that LINUX and GPFS must be ready for. The easiest way to view and vet what the system is presenting to LINUX and GPFS is to examine 'cat /sys/devices/system/cpu/possible' and verify the value is < 1024 without this fix and < 1536 with this fix (see the example after this item). It is very system specific how to partition and/or re-configure a system to present a lower limit. Consult your system documentation.
- Problem trigger: This triggers if 'cat /sys/devices/system/cpu/possible' > 1024 without this fix or > 1536 with this fix. The code was introduced in 5.0 . A system that violated this limit before release 5.0 would not have encountered this limit.
- Symptom: The system will LOGASSERT with the following message: "GPFS daemon crash logAssertFailed: ucNumCpus <= 1024"
- Platforms affected: Platforms affected are any large system, such as E980, that can be provisioned for more than 1024 CPUS
- Functional Area affected: Per-cpu I/O counters; All users with this feature / functional area.
- Customer Impact: Low - infrequent encounter of triggering system
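- Example: An illustrative check of the value described in the workaround (the thresholds come from the description above):
      cat /sys/devices/system/cpu/possible
  Output such as "0-1535" shows the CPU range LINUX may present; the upper bound should stay below 1024 without this fix and below 1536 with it.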
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18220
- Problem description: AFM gateway node gets remote assert while reading the data from the AFM home cluster as it does not block the filesystem quiesce
- Work around: None
- Problem trigger: Read operation of the uncached files on the AFM caching filesets.
- Symptom: Abend/Crash
- Platforms affected: All Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18218
- Problem description: The ganesha NFS server can fail an NFS operation with EPERM when TCT is installed on the file system.
- Work around: None
- Problem trigger: During an NFS operation on a fd, if the connected dentry to a file cannot be obtained, the operation fails.
- Symptom: EPERM error
- Platforms affected: Linux Only
- Functional Area affected: NFS
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18219
- Problem description: There were two CES nodes, each a member of two CES groups. There were also four CES IPs, two of them assigned to each of the two CES groups. The expectation was that each node would get one IP from each group (two IPs per node, one per group). That was not always the case; sometimes one node got one IP and the other got three.
- Work around: IPs can be moved manually at any time to a node using "mmces address move --ces-ip xxx --ces-node yyy" (see the example after this item)
- Problem trigger: The even-coverage IP balancing is done on a "per-group" basis among all nodes assigned to the same group. Additionally, the number of already hosted IPs (members of the current or other groups) is considered in order to assign new IPs to the nodes with the fewest assigned IPs. That logic did not always work as intended because, during the startup phase, the nodes may become healthy at different points in time. This impacts the IP movements because they start as soon as the first node becomes healthy. If some nodes become healthy at a later time, IP rebalancing is done to give them IP addresses as well. That logic did not work under all circumstances, so that sometimes a misbalanced but stable state remained.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All Linux OS environments
- Functional Area affected: CES
- Customer Impact: little impact on customer operation
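- Example: An illustrative manual IP move using the command from the workaround; the IP address and node name are placeholders:
      mmces address move --ces-ip 192.0.2.10 --ces-node cesnode2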
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18221
- Problem description: On a file system without replication, it is possible for file system to panic with error 218 without additional information to help identify the disk that caused the error.
- Work around: None
- Problem trigger: Disk IO error
- Symptom: Cluster/File System Outage
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18075
- Problem description: When the DMAPI function dm_getall_disp() is called with bufLen = INT_MAX, GPFS returns a buffer of smaller size due to integer overflow. This may cause memory corruption when moving data from this small buffer to a user buffer which is larger than the buffer allocated.
- Work around: None
- Problem trigger: DMAPI api calls that provide a buffer length of INT_MAX
- Symptom: Node hangs or crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All Scale Users with DMAPI enabled GPFS filesystem
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18475
- Problem description: Unexpected GPFS daemon assert during file, directory, or symlink create operation.
- Work around: None
- Problem trigger: File system configuration changes such as enabling/disabling encryption
- Symptom: Abend/Crash
- Platforms affected: ALL Operating System environments
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18525
- Problem description: The file system which is used for dumping data is monitored; if it fills up, the GPFS component in mmhealth shows Failed, thus triggering CES failover
- Work around: Have DataStructureDump point to a path / file system with enough free space (see the example after this item)
- Problem trigger: The DataStructureDump path points to an almost full file system
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: System Health CES
- Customer Impact: High
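- Example: A minimal sketch of the workaround, assuming the dump location is changed with the dataStructureDump configuration option; the path is only a placeholder:
      mmchconfig dataStructureDump=/largefs/gpfsdumps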
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18535
- Problem description: The mmdiag --afm command was executed to fill in the data for the AFM fileset status; the home exported path was NULL because the fileset was in deleted status, and this caused an assert.
- Work around: None
- Problem trigger: Deleting the fileset.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18541
- Problem description: If you were using Spectrum Scale 5.0.2.x or older and had temporary connectivity issues in the past, which happened during a call home upload, this could break all further uploads via the ECC Client, even if you upgrade to a newer Spectrum Scale version.
- Work around: On the affected call home node, remove the stale lock file with "rm -f /var/mmfs/callhome/log/ecc/rsENCallECCLock.dat" (see the example after this item)
- Problem trigger: You were using Spectrum Scale 5.0.2.x or older and had temporary connectivity issues in the past which happened during a call home upload.
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: callhome
- Customer Impact: Suggested
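- Example: An illustrative cleanup of the stale lock file; run it (directly or via ssh) on the affected call home node:
      rm -f /var/mmfs/callhome/log/ecc/rsENCallECCLock.dat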
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18566
- Problem description: To calculate the available space in an SMB share, Samba queries the quotas for the uid of the user and the gid of the user's primary group. The assumption here is that new files will be created with the user as the owning user and the user's primary group as the owning group. This assumption is not correct for directories with the set-group-ID bit set. In that case, the owning group from the directory will be applied to newly created files.
- Work around: None
- Problem trigger: Directories with the set-group-ID bit set will trigger this issue.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Operating System environments supporting CES.
- Functional Area affected: SMB
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18476
- Problem description: The tsspectrumdiscover application does not send events to external kafka sink due to unsuccessful processing of GPFS Kafka messages
- Work around: None
- Problem trigger: Running tsspectrumdiscover application post Spectrum Scale 5.0.3.0
- Symptom: tsspectrumdiscover exit with error
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Watch Folder
- Customer Impact: Low
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18477
- Problem description: Security hardening for the 'ts' commands in /usr/lpp/mmfs/bin/ .
- Work around: Remove the setuid from the files in the /usr/lpp/mmfs/bin directory.
- Problem trigger: Executing commands with certain undocumented input.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: admin commands
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18502
- Problem description: In rare cases, unmounting a GPFS file system may cause "kernel BUG at dcache.c:966 - dentry still in use (-128)" on linux-3.12. The race happens between shrink_dcache_for_umount() and token revoke (or gpfsSwapd)
- Work around: None
- Problem trigger: Unmounting a GPFS file system; the kernel panic is a very rare case.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments linux-3.12
- Functional Area affected: All
- Customer Impact: Suggested
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18548
- Problem description: The existing auto-recovery code does not handle descriptor only disks correctly and treats them as disks saving user data or metadata.
- Work around: Disable auto-recovery.
- Problem trigger: If auto-recovery is enabled and descOnly disks are configured in the cluster.
- Symptom: If auto-recovery is enabled and descOnly disks are configured in the cluster, when a node fails, auto-recovery will treat descOnly disks as data disks and might cause data replication downgrade.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: FPO
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18549
- Problem description: Restriping compressed files can hit an assert on "wa" lock mode. This problem can only happen during restriping of compressed files while those files are being truncated.
- Work around: Rerun the restripe command
- Problem trigger: Truncating compressed files while restripe is in progress.
- Symptom: Abend on restripe process
- Platforms affected: All
- Functional Area affected: File compression and restripe
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18677
- Problem description: The following error message appears in mmfs.log: "Unexpected data in message. Header dump: XXXXXXXX XXXX", and the daemon may crash because LOGSHUTDOWN is called
- Work around: None
- Problem trigger: Bad network and reconnect is attempted
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18680
- Problem description: Hit the following assert after changing cipherList to AUTHONLY without restarting daemon: logAssertFailed: secSendCoalBuf != __null && secSendCoalBufLen > 0
- Work around: None
- Problem trigger: cipherList is changed from a supported algorithm to AUTHONLY without restarting daemon and reconnect is attempted
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18590
- Problem description: Advisory locks are recorded in the Linux kernel on the local node via file_lock structures, and GPFS maintains an additional structure to accomplish locking across nodes. There are times when a blocked lock waiter is reset by GPFS during the daemon cleanup process; the inode object is not freed and is left in the slab cache. Later, GPFS may access the stale inode structure data, which causes a kernel crash.
- Work around: None
- Problem trigger: A large fcntl locking workload and daemon cleanup process.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18552
- Problem description: Running "/usr/lpp/mmfs/bin/mmfsadm vfsstats" hits the reported segmentation fault due to NULL pointer dereference.
- Work around: None
- Problem trigger: Running "/usr/lpp/mmfs/bin/mmfsadm vfsstats"
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18591
- Problem description: The NFS monitor checks the health state of a running NFS instance periodically. Sometimes the NFS service does not react to some "alive" check commands, and that is interpreted as a potential "hung" state. Based on the configuration in the mmsysmonitor.conf file, either a failover or just a warning is then triggered.
- Work around: The behavior for a detected potential "hung" state can be customized with the flag 'failoverunresponsivenfs' in the mmsysmonitor.conf file, section [nfs] (see the example after this item). The meaning of the flag values is: "true" = set an ERROR event (nfs_not_active) if NFS does not respond to NULL requests and has no measurable NFS operation activity; "false" = set a DEGRADED event (nfs_unresponsive) if NFS does not respond to NULL requests and has no measurable NFS operation activity
- Problem trigger: In some cases, high I/O load led to the situation that NFS v3 and/or v4 NULL requests failed, and a following internal statistics check reported no activity with respect to the number of internal NFS operations. These checks are done within a time span of several seconds to a minute. In fact, the system might still be functional, and the internally detected "unresponsive" state might be only temporary, so that a failover would not be advised in this case. The monitor interprets the "unresponsiveness" as a potential "hung" state and triggers either a failover or a warning, depending on the configuration settings.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: High
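- Example: An illustrative mmsysmonitor.conf fragment for the flag described in the workaround; the exact key/value syntax is an assumption, and the chosen value depends on the desired failover behavior:
      [nfs]
      failoverunresponsivenfs = false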
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Item: IJ18678
- Problem description: Accessing the .snapshots snaplink directory generates an I/O error, while creating or deleting snapshots for the same file system or fileset.
- Work around: Stop the process accessing the .snapshots directory after getting I/O error, then retry the access to it again.
- Problem trigger: This problem could be triggered by snapshot create and deletion operations.
- Symptom: I/O error
- Platforms affected: All Linux OS environments with kernel versions between 3.10.0-957.21.2 and 4.x.
- Functional Area affected: Snapshots
- Customer Impact: Critical
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ17825 IJ18075 IJ18076 IJ18078 IJ18100 IJ18102 IJ18105 IJ18109 IJ18110 IJ18136 IJ18137 IJ18138 IJ18139 IJ18140 IJ18142 IJ18144 IJ18147 IJ18155 IJ18165 IJ18171 IJ18174 IJ18175 IJ18177 IJ18178 IJ18180 IJ18182 IJ18185 IJ18191 IJ18192 IJ18193 IJ18216 IJ18217 IJ18218 IJ18219 IJ18220 IJ18221 IJ18238 IJ18309 IJ18315 IJ18316 IJ18321 IJ18333 IJ18475 IJ18476 IJ18477 IJ18502 IJ18525 IJ18535 IJ18541 IJ18548 IJ18549 IJ18552 IJ18566 IJ18590 IJ18591 IJ18605 IJ18677 IJ18678 IJ18680.
Problems fixed in IBM Spectrum Scale 5.0.3.2 [July 18, 2019]
- Problem description: There will be a long waiter like below: Waiting 8349.1305 sec since 00:03:05, monitored, thread 133060 AcquireBRTHandlerThread: on ThCond 0x3FFE74012E78 (MsgRecordCondvar), reason 'RPC wait' for tmMsgBRRevoke on node 192.168.117.82
- Work around: No
- Problem trigger: A race condition between handling an inbound connection and a node joining
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High IJ17133
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlslicense command with the -Y option displays product edition information for all nodes in the list based on the local node's information. This is incorrect. It should display the edition for the local node only and "-" for all other nodes. All of the other options on this command display only the local edition information as well.
- Work around: Ignore the edition information for any node that is not the local node.
- Problem trigger: Just running the command on a cluster with 2 or more nodes
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: Suggested IJ17136
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmlsfileset command with the "-i -d" options could run into an infinite loop when there is not enough free memory and there are not enough indirect block descriptors in the system. In addition, a similar loop issue could happen during the mmrestripefs, snapshot deletion, and ACL garbage collection processes.
- Work around: Increase maxFilesToCache to allow more indirect block descriptors in the cache (see the example after this item). Also make sure there is enough free physical memory in the system.
- Problem trigger: Run the mmlsfileset -i -d, snapshot delete, or mmrestripefs commands, or enable ACLs, when there is not enough free physical memory in the system and the maxFilesToCache parameter has a default or low setting.
- Symptom: The mmlsfileset, snapshot delete, and mmrestripefs commands hang, and other mm* commands cannot proceed either. The background ACL garbage collection thread runs in a loop if ACL is enabled.
- Platforms affected: All
- Functional Area affected: mmlsfileset, mmrestripefs, snapshot delete commands and ACL garbage collection process.
- Customer Impact: Critical IJ16674
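- Example: A minimal sketch of the workaround, assuming maxFilesToCache is raised with mmchconfig; the value 100000 is only a placeholder and must fit within the available memory:
      mmchconfig maxFilesToCache=100000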
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfs.log file may contain an entry like this: "[E] sdrServ: Communication error on socket /var/mmfs/mmsysmon/mmsysmonitor.socket, [err 79] Can not access a needed shared library"
- Work around: N/A. The reported error code "79" is internally used, and means "connection refused".
- Problem trigger: No recreate procedure is available for the reported issue. The underlying issue was that GPFS internal error codes were not mapped to Linux system codes, which gave the wrong message text when printing the corresponding system message text for such a code.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: System Health
- Customer Impact: has little or no impact on customer operation IJ16707
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: File system unmounted when an application overwrites data blocks
- Work around: None
- Problem trigger: Overwriting data block followed by disk down in the file system.
- Symptom: unmounted
- Platforms affected: All
- Functional Area affected: gpfs core
- Customer Impact: High Importance IJ16712
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: On a RHEL 7.6 node, with supported GPFS versions 4.2.3.13 or higher and 5.0.2.2 or higher, when the kernel is upgraded to version 3.10.0-957.19.1 or 3.10.0-957.21.2 (after applying RHBA-2019:1337) or higher, the node may encounter a kernel crash while running I/O operations.
- Work around: Disable SELinux
- Problem trigger: An inconsistency between the GPFS kernel portability layer and the kernel level
- Symptom: Abend/Crash
- Platforms affected: RHEL7.6 with kernel 3.10.0-957.19.1 or higher
- Functional Area affected: All
- Customer Impact: High Importance IJ16783
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A user may create a file system with an unhealthy number of allocated inodes in the root fileset. This can cause the inode allocation map to become suboptimal when creating further independent filesets that do not have as many allocated inodes. The only way to reformat the inode allocation map is to recreate the file system.
- Work around: Recreate the file system with favorable inode allocation map parameters (see the example after this item).
- Problem trigger: Create file system with very large NumInodesToPreallocate.
- Symptom: Performance Impact/Degradation
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ16716
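- Example: A minimal sketch of the workaround, assuming the file system is recreated with mmcrfs using the --inode-limit MaxNumInodes[:NumInodesToPreallocate] option; the device name, stanza file, and values are placeholders:
      mmcrfs fs1 -F nsd.stanza --inode-limit 4000000:500000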
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Raising the fsstruct_fixed event as stated in the documentation will not work and returns an error in version 5.0.2-x instead.
- Work around: Include the file system name two times as arguments of mmsysmonc to raise fsstruct_fixed (see the example after this item)
- Problem trigger: Spectrum Scale Version 5.0.2-x is installed
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: System Health
- Customer Impact: Suggested: has little or no impact on customer operation IJ16782
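- Example: An illustrative invocation of the workaround; "gpfs0" is a placeholder file system name and the exact mmsysmonc argument order is an assumption:
      mmsysmonc event filesystem fsstruct_fixed gpfs0 gpfs0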
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmlslicense --capacity fails to report the correct disk size
- Work around: Manually get the disk size from the blockdev command (see the example after this item).
- Problem trigger: Underlying device names are not found on all NSD servers
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Admin Commands
- Customer Impact: Suggested: has little or no impact on customer operation IJ16678
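- Example: An illustrative manual query of a disk's size with the blockdev command; the device name is a placeholder:
      blockdev --getsize64 /dev/sdX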
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: There are 3 problems: 1. if an upload of a file >2GB crashes, this blocks all further call home uploads that are not service ticket-related forever; 2. the call home feature of resending failed scheduled uploads does not work; 3. if any of the call home group members crashed during the data collection, mmsysmonitor.log on the group master will have a persistent repeating error entry in its log
- Work around: For the 3 aforementioned issues: 1. in LOCKINFO (/var/mmfs/callhome/log/ecc/rsENCallECCLock.dat) change FILE_SIZE to a value which is less than 2G; 2. none; 3. on the call home master node, delete the contents of /callhome/incomingFTDC2CallHome
- Problem trigger: 1. upload of files >2GB which are not service ticket-related; 2. unstable connection to ECuRep; 3. call home group members crashing during the call home scheduled data collection
- Symptom: Component Level Outage
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Callhome
- Customer Impact: Suggested IJ17147
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A node in the home cluster hit the following assertion when a remote node joins the cluster: 2019-04-16_14:55:37.346+0200: [X] logAssertFailed: (nodesPP[nidx] == NULL || nodesPP[nidx] == niP)
- Work around: No
- Problem trigger: remote node joins and leaves the cluster
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ16676
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Trying to clear the READONLY attribute of an immutable file through SMB succeeded within the retention period.
- Work around: No
- Problem trigger: A Windows SMB client is trying to clear the READONLY attribute on an immutable file that has not expired.
- Symptom: Error output/message
- Platforms affected: Windows Only
- Functional Area affected: SMB/Immutability
- Customer Impact: High Importance IJ17524
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When an encryption policy references a key identifier that is longer than 64 characters, policy application fails.
- Work around: No
- Problem trigger: Create an encryption policy that references a key identifier which is longer than 64 characters and attempt to apply the policy
- Symptom: Policy application fails.
- Platforms affected: All
- Functional Area affected: encryption
- Customer Impact: Low IJ17569
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory leak when the gateway node joins the cluster. Reply data is not freed after obtaining the lead gateway node. Lead gateway functionality is no longer used.
- Work around: No
- Problem trigger: Gateway node joining the cluster.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17534
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Memory leak when the gateway node is not yet ready to handle the requests when the node designation is changed
- Work around: No
- Problem trigger: Gateway node joining the cluster.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17537
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Advisory locks are recorded in the Linux kernel on the local node via file_lock structures, and GPFS maintains an additional structure to accomplish locking across nodes. There are times when the inode object was freed while a blocked lock waiter is resumed by GPFS; GPFS will then try to free the file_lock along with the GPFS structure and access the obsolete inode structure data, which causes a kernel crash.
- Work around: No
- Problem trigger: A large fcntl locking workload and lock contention.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: High Importance IJ17471
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Today AFM goes through all the nodes in the cluster (including remote cluster nodes that mount the local file system) to find the single gateway node for the fileset to which the generated application IO request needs to be queued. On clusters with a huge number of remote cluster mounted nodes, this causes considerable application performance degradation.
- Work around: No
- Problem trigger: Have a large number of remote cluster nodes mounting the file system from the owning cluster (the customer has about 9000 such nodes mounting the FS). Every time an application node sends a request to the gateway node, it needs to go through the entire list of 9K nodes to find this single gateway node. In a similar fashion, the gateway node also needs to confirm that it is indeed the serving gateway node for the request sent, again verifying against the 9K node list. This takes up a considerable amount of time in the application IO path to queue the request from the application node to the gateway and to return the ack from the gateway back to the application node in order to complete the application IO request.
- Symptom: Silent performance degradation for the applications performing IO to the AFM fileset.
- Platforms affected: ALL Linux OS environments (AFM Gateway nodes). All Linux and AIX environments (Application nodes running IO to the AFM fileset).
- Functional Area affected: AFM - NFS and GPFS backend filesets. with afmHashVersions 2 and 5. With afmFastHashVersion tunable turned on.
- Customer Impact: High Importance IJ17170
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: getxattr for the 'security' namespace is not properly blocked during quiesce, which may cause the assert "SGNotQuiesced"
- Work around: No
- Problem trigger: When the file system is quiesced (for example when running mmcrsnapshot/mmdelsnapshot), all vfs operations should be blocked. If there are applications accessing a file's 'security' namespace extended attributes (for example the 'getcap' command), that getxattr vfs operation is not properly blocked and may cause the assert "SGNotQuiesced"
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: Snapshots
- Customer Impact: High Importance IJ17112
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When an RDMA connection is in a bad state, new NSD requests will go to the remaining RDMA connections, but the in-flight NSD requests will fall back to the TCP socket even though other RDMA connections remain available.
- Work around: No
- Problem trigger: A port or link error on a node which has multiple IB ports
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: Suggested IJ17172
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM prefetch does not work if the files have 64-bit inode numbers assigned to them. When checking the file for the cached bit, a 32-bit inode number is used, and integer overflow might cause the file's cached state to be returned as true.
- Work around: No
- Problem trigger: AFM prefetch
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17557
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The primary fileset might run out of inode space if a large number of files are created/deleted.
- Work around: No
- Problem trigger: Inode space might be exhausted.
- Symptom: Abend/Crash
- Platforms affected: Linux Only
- Functional Area affected: AFM DR
- Customer Impact: IJ17175
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When UID remapping is enabled, daemon asserts or kernel crashes occur on the nodes in the client cluster. This happens when the remapping scripts do not remap any credentials or enableStatUIDremap is not enabled.
- Work around: 1. For the daemon assert, correct the remap scripts to remap the credentials 2. For the kernel crash, enable enableStatUIDremap config option
- Problem trigger: UID remapping with incorrect mmname2uid script and file metadata modification when enableStatUIDremap is not enabled.
- Symptom: Abend/crash
- Platforms affected: All
- Functional Area affected: Remote cluster mount/UID remapping
- Customer Impact: High Importance IJ17114
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM prefetch on small files has a performance issue because the file is flushed to disk without closing the open instance. This causes the file not to be shrunk to fit into subblocks, and a full block of data is transferred to the NSD server.
- Work around: No
- Problem trigger: AFM prefetch
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17576
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A data write operation is performed during role reversal even if the file was already synced but is migrated at the secondary. If files are migrated, the write operation should be skipped during role reversal and only the attributes should be set.
- Work around: None
- Problem trigger: Migrated files are present during role reversal
- Symptom: A write operation happens on a migrated file.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Suggested IJ17570
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: RPC message was reported as lost, like below: Message ID 735239 was lost by node ip_address node_name wasLost 1
- Work around: None
- Problem trigger: The network is unstable, which leads to reconnect happening several times
- Symptom: Node expel/Lost Membership
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ17538
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: FSSTRUCT error: FSErrValidate could be generated in the system log after adding a new disk to a file system.
- Work around: None
- Problem trigger: Add a new disk to a file system while running GPFS 5.0.1.0 through 5.0.3.1
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ17554
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: After reboot of a node, the systemhealth NFS monitoring was started, but not the SMB component and its monitoring. AD authentication was configured for NFS, which depends on a running SMB component. This situation yielded a "winbind-down" event, but gave no hint about the root cause
- Work around: mmshutdown followed by mmstartup might help, since the entire stack (including SMB/NFS and their monitors) is restarted. The log level could be increased during the startup and check phase (mmces log level 3) to get more details in the mmfs.log file; for production, this log level should be lowered again (to 0 or 1). See the sketch after this entry.
- Problem trigger: The circumstances which may lead to the detected mismatch were not repeatable. This seems to be a rare race situation, and was not reported before.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: High Importance IJ17559
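A minimal sketch of this workaround, assuming an affected CES node named ces1 (hypothetical placeholder):
  # Restart the whole stack (GPFS plus SMB/NFS and their monitors) on the affected node
  mmshutdown -N ces1
  mmstartup -N ces1
  # Temporarily raise the CES log level while checking mmfs.log, then lower it again
  mmces log level 3
  mmces log level 1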
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: mmexpelnode fails when the network cable of the cluster manager and file system manager node is pulled in a CCR-enabled cluster with tiebreaker disks configured. GPFS file systems get unmounted on other hosts.
- Work around: None
- Problem trigger: mmexpelnode executed in a CCR enabled cluster with tiebreaker disks configured.
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: Admin Commands (mmexpelnode)
- Customer Impact: High Importance IJ17580
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When accessing gpfs zlib compressed file by mmap (or execute a gpfs zlib compressed executable file), kernel may crash with oops message "unable to handle kernel paging request" at IoDone routine
- Work around: None
- Problem trigger: accessing gpfs zlib compressed file by mmap (or execute a zlib compressed executable file)
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: GPFS Native Compression
- Customer Impact: High Importance IJ17593
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Deadlock when AFM filesets are accessed using the remote mounted file system due to the mismatch in the gateway node configuration between client (remote) and storage (home) clusters. It is unclear how the configuration mismatch happens.
- Work around: None
- Problem trigger:
- Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
- Platforms affected: All OS environments
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical IJ17581
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A stripe group / file system manager panic occurs while another node (non-SGmgr) is accessing files in a snapshot. These accesses can be part of the snapshot deletion itself, or another maintenance command (such as mmdeldisk or mmrestripefs), or even ordinary user accesses from the kernel. The diagnostic error reported in the log on the stripe group (SG) manager node looks like this, though the line number may vary: 2019-05-06_23:23:22.122-0300: [X] File System fs1 unmounted by the system with return code 2 reason code 0, at line 4646 in /afs/apd.pok.ibm.com/u/gpfsbld/buildh/ttn423ptf13/src/avs/fs/mmfs/ts/fs/llio.C The "unmount in llio.C" message is usually followed by a message mentioning "Reason: SGPanic", but this does not always occur, and a SGPanic can be caused by other unrelated problems. The error is triggered by a snapshot listed as DeleteRequired by mmlssnapshot. The snapshot access that causes the error, however, will be to an earlier snapshot (with smaller snapId); though it may be difficult to determine which access or which node caused the panic. Further, at least one snapshot must be a fileset snapshot (file systems with only global snapshots, are not affected). The specific enabling factors, however, are complicated and quite rare for most customers, so this is not a common problem.
- Work around: The work-around is to remove DeleteRequired snapshots with an mmdelsnapshot command with an explicit -N argument listing only the SG manager node (see the sketch after this entry).
- Problem trigger: The error is triggered by a snapshot listed as DeleteRequired by mmlssnapshot. The snapshot access that causes the error, however, will be to an earlier snapshot (with smaller snapId), though it may be difficult to determine which access or which node caused the panic. Further, at least one snapshot must be a fileset snapshot (file systems with only global snapshots are not affected). The specific enabling factors, however, are complicated and quite rare for most customers, so this is not a common problem.
- Symptom: Cluster/File System Outage
- Platforms affected: All OS environments
- Functional Area affected: Snapshots
- Customer Impact: Suggested: has little or no impact on customer operation IJ17595
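A minimal sketch of the workaround, assuming a file system fs1 and a DeleteRequired snapshot snap1 (both hypothetical placeholders):
  # Identify DeleteRequired snapshots and the current file system (SG) manager node
  mmlssnapshot fs1
  mmlsmgr fs1
  # Delete the DeleteRequired snapshot, restricting the work to the SG manager node
  mmdelsnapshot fs1 snap1 -N <sg-manager-node>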
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: For AFM migration, provide an option to revalidate with home only once after the cutover to the new system, to improve performance during fileset access.
- Work around: None
- Problem trigger: AFM migration
- Symptom: Performance Impact/Degradation
- Platforms affected: All OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17582
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: gnrhealthcheck is not catching the case where an ESS system is set up without having verified that both servers see the enclosures/drives.
- Work around: None
- Problem trigger: This problem is caused by an invalid ESS deployment.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: Suggested IJ17583
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: When running I/O with NFS, unexpected failovers occurred without an obvious reason. NFS is reported as 'not active', even though it is still working.
- Work around: No workaround available. There is a manual way to temporarily modify the event declaration for the observed "nfs_not_active" event by modifying the event action in the event configuration file (ask L2 for support).
- Problem trigger: In the reported cases some high I/O load led to the situation that NFS v3 and/or v4 (whatever is configured) NULL requests failed, and that a following internal statistics check reported no activity regarding the number of internal NFS operations. The monitor interpreted this as a "hung" state and triggered a failover. In fact, the system might still be functional, and the internally detected "unresponsive" state might be only temporary, so that a failover is not advised in this case. However, at the time of monitoring there was no further indication available.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: High Importance IJ17598
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: QOS may deadlock on the file system manager node, particularly if there are many (hundreds of) nodes mounting the file system and the manager node is heavily CPU or network loaded.
- Work around: 1) mmchqos FS stat-slot-time 15000 stat-poll-interval 60, or if that is not sufficient... 2) Disable QOS until the fix is available. See the sketch after this entry.
- Problem trigger: See problem description.
- Symptom: Hang or Deadlock
- Platforms affected: ALL
- Functional Area affected: QOS
- Customer Impact: High Importance, especially for customers using QOS with hundreds of nodes mounting the file system. IJ17584
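A hedged sketch of the workaround, assuming a file system fs1 (hypothetical); the stat-slot-time/stat-poll-interval option spelling follows the workaround text above and should be verified against the mmchqos documentation for your level:
  # Option 1: lengthen the QOS statistics collection intervals
  mmchqos fs1 --stat-slot-time 15000 --stat-poll-interval 60
  # Option 2: if that is not sufficient, disable QOS until the fix is applied
  mmchqos fs1 --disable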
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A filesystem containing a dot in the name was declared to be ignored by declaring a file /var/mmfs/etc/ignoreAnyMount.filesystemWith.dot. However, the systemhealth monitor treated it as a missing filesystem.
- Work around: No work around available. Filesystems could be named with an underscore instead of a dot, if a separator is wanted.
- Problem trigger: A filename like /var/mmfs/etc/ignoreAnyMount.filesystemWith.dot is split internally by dots, so that it results in three items (which is not wanted): /var/mmfs/etc/ignoreAnyMount, filesystemWith, dot.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: Systemhealth
- Customer Impact: little impact on customer operation IJ17600
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Customer cannot create an SMB export under specific conditions.
- Work around: Choose names of GPFS file systems such that no file system name is a substring of another (for example, avoid having both fs1 and fs10).
- Problem trigger:
- Symptom: The customer is limited to a specific naming scheme for their GPFS file systems
- Platforms affected: ALL Linux OS environments
- Functional Area affected: SMB
- Customer Impact: Suggested IJ17585
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If a relative pathname is provided in an export definition, the mmnfs command will allow it which will cause the Ganesha NFS server to fail.
- Work around: None
- Problem trigger: A relative pathname passed to the --pseudo option of the mmnfs command.
- Symptom: Unexpected results.
- Platforms affected: Linux
- Functional Area affected: Protocols
- Customer Impact: Suggested IJ17607
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM is unable to prefetch the data if the file metadata has changed. For example, if the user changes the metadata (e.g., chmod) on an uncached file, prefetch skips reading the file.
- Work around: Read the file manually instead of using prefetch (see the sketch after this entry).
- Problem trigger: AFM prefetch
- Symptom: Unexpected Results/Behavior
- Platforms affected: All OS environments
- Functional Area affected: AFM
- Customer Impact: High Importance IJ17601
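A minimal sketch of the workaround, assuming an uncached file at a hypothetical path inside the AFM cache fileset:
  # Reading the file through the cache fileset pulls the data from home without prefetch
  cat /gpfs/fs1/cacheFileset/dir1/file1 > /dev/null
  # or
  dd if=/gpfs/fs1/cacheFileset/dir1/file1 of=/dev/null bs=1M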
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: "mmhealth node show" might show degraded status for CLOUDGATEWAY even though "mmcloudgateway service status -N tctServers" shows all OK
- Work around: None
- Problem trigger: If Cloudgateway was in a degraded state and changed to "only_ensures_cloud_container_exists" status it did not trigger mmhealth to go to a "healthy" state.
- Symptom: Unexpected Results/Behavior
- Platforms affected: Linux
- Functional Area affected: System Health TCT
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ17665
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ16674 IJ16676 IJ16678 IJ16707 IJ16712 IJ16716 IJ16782 IJ16783 IJ17112 IJ17114 IJ17133 IJ17136 IJ17147 IJ17170 IJ17172 IJ17175 IJ17471 IJ17524 IJ17534 IJ17537 IJ17538 IJ17554 IJ17557 IJ17559 IJ17569 IJ17570 IJ17576 IJ17580 IJ17581 IJ17582 IJ17583 IJ17584 IJ17585 IJ17593 IJ17595 IJ17598 IJ17600 IJ17601 IJ17607 IJ17665.
Problems fixed in Spectrum Scale 5.0.3.1 [May 31, 2019]
- Problem description: When creating a DMAPI session there is a small window where memory gets corrupted, causing a GPFS daemon crash with signal 11.
- Work Around: None
- Problem trigger: Creating lots of DMAPI sessions with heavy workload
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ15859
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: RPCs sent via RDMA are pending forever and remain in the 'sending' state. Long waiters with Verbs RDMA like: Waiting 2273.0813 sec since 11:05:04, monitored, thread 113229 BackgroundSyncThread: for RDMA send completion fast on node 192.168.1.1
- Work Around: None
- Problem trigger: Reply lost on RDMA network
- Symptom: Hang
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: High IJ15892
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If GPFS is shut down on a node, it is possible that CES IPs are assigned to this node two minutes after shutdown. These CES IPs are not usable by the customer.
- Work Around: Suspend the node before GPFS shutdown (see the sketch after this entry).
- Problem trigger: The node still has a valid GPFS lease two minutes after shutdown.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL
- Functional Area affected: CES
- Customer Impact: High IJ15912
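A minimal sketch of the workaround, assuming a CES node named ces1 (hypothetical placeholder):
  # Suspend the CES node first so its CES IPs are moved away, then shut down GPFS
  mmces node suspend -N ces1
  mmshutdown -N ces1
  # After the node is started again, resume it to make it eligible for CES IPs
  mmces node resume -N ces1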
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A race condition may cause mmperfmon sensor updates to fail with the following message: "fput failed: Invalid version on put (err 807)". Other commands may fail with the above message as well.
- Work Around: Rerun the failed command.
- Problem trigger: The problem is hit more often when using the installation toolkit (spectrumscale command) to install.
- Symptom: Error output/message "fput failed: Invalid version on put (err 807)" Upgrade/Install failure
- Platforms affected: ALL Operating System environments, but more often on Linux nodes in a CCR environment.
- Functional Area affected: Admin Commands
- Customer Impact: Suggested IJ16079
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: "mmuserauth service create" command failed due to TCP port 445 being blocked. However, error message indicated incorrect credentials which was not the correct reason for failure.
- Work Around: None
- Problem trigger: The issue is seen at the time of configuring authentication, in those setups where TCP port 445 is blocked. The command internally tries to connect to the specified DC via that port. Due to the blocked port, it fails to connect with a timeout. However, the error message shown currently indicates incorrect credentials, which is not the case.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: Authentication
- Customer Impact: Suggested IJ16084
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: FSErrInodeCorrupted FSSTRUCT error could be written to system log as result of stale buffer for directory block.
- Work Around: None
- Problem trigger: Change in token manager list as result of either node failure or change in number of manager nodes.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: All
- Customer Impact: Suggested IJ16085
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The output of mmlscluster --ces shows multiple entries for the same IP address. The cesiplist file (stored in CCR) did contain these multiple entries, so mmlscluster just displayed them. This was obviously a misconfiguration.
- Work Around: A reassignment of IPs (moves, failover, suspend/resume) triggers a rewrite of the cesiplist file, which cleans up these inconsistencies. It is necessary that the affected node is involved in the IP movement (see the sketch after this entry).
- Problem trigger: The circumstances which may lead to multiple entries of the same IP for a node are not known. This seems to happen occasionally, but very rarely.
- Symptom: Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: has little or no impact on customer operation IJ16091
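A minimal sketch of the workaround, assuming a duplicated CES IP 10.0.0.10 and CES nodes ces1/ces2 (all hypothetical placeholders); the affected node must take part in the movement:
  # Move the CES IP away from the affected node and back, so the cesiplist file is rewritten
  mmces address move --ces-ip 10.0.0.10 --ces-node ces2
  mmces address move --ces-ip 10.0.0.10 --ces-node ces1
  # Alternatively, suspending and resuming the affected node also triggers a reassignment
  mmces node suspend -N ces1
  mmces node resume -N ces1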
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Unexpected wndb down after SMB startup without a known reason at log level 0.
- Work Around: Start wndb manually.
- Problem trigger: Unknown
- Symptom: Unexpected Results/Behavior
- Platforms affected: All
- Functional Area affected: CES
- Customer Impact: Medium Importance IJ16093
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Trying to delete an immutable file through SMB fails after the retention period expires. The problem is that Samba as SMB server denies deletion when the READONLY flag is set.
- Work Around: None
- Problem trigger: A Windows SMB client is trying to delete an immutable file after the retention period expires.
- Symptom: Error output/message
- Platforms affected: Windows Only
- Functional Area affected: SMB/Immutability
- Customer Impact: High Importance IJ16094
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If too many pdisks are unreadable (not missing), so that we are not able to write to a vtrack, it is possible that stale strip information is committed to the metadata log. When the scrubber tries to scrub the vtrack, it will examine this stale strip data and declare data loss.
- Work Around: None
- Problem trigger: Unavailability of pdisks when writing a vtrack.
- Symptom: IO error.
- Platforms affected: ALL Linux OS environments
- Functional Area affected: ESS/GNR
- Customer Impact: Critical IJ16095
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: An FSErrCheckHeaderFailed error could be incorrectly issued and logged in the system log.
- Work Around: None
- Problem trigger: A user application moves files out of a directory before deleting the directory.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ15910
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The GPFS daemon will hit a signal 11 or a log assert with "offset < ddbP->mappedLen" when a user application, log recovery, or the tsdbfs or mmfsck command accesses a corrupted directory (the directory's file size is smaller than 32 bytes, the size of the directory block header structure).
- Work Around: None
- Problem trigger: This kind of corrupted directory could be caused by a previous code bug.
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15909
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Some operations on the IW fileset will take longer than expected because an unintended dependency is created on previous operations performed on the fileset, which the currently performed operation then attempts to replicate to the remote/home side.
- Work Around: None
- Problem trigger: Users running 5.0.3 with workloads on AFM IW mode filesets may see some elongated operations (performance impact) on the filesets, owing to dependent operations performed on the same file/fileset earlier that are still waiting to be asynchronously pushed to the home/remote site.
- Symptom: Some operations on the IW fileset might take longer than expected, since they treat other asynchronous operations to the remote site as their dependents. A few waiters might linger for a few extra seconds; once the dependencies are resolved, the waiters should vanish.
- Platforms affected: ALL Operating System environments (AFM application and Gateway nodes).
- Functional Area affected: AFM - and Specifically users on AFM IW mode filesets only.
- Customer Impact: High Importance. IJ16110
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Enable AFM prefetch for a single fileset to run from multiple gateway nodes to improve migration performance (see the sketch after this entry).
- Work Around: None
- Problem trigger: AFM prefetch, slow performance
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested IJ16112
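For context, a minimal sketch of invoking AFM prefetch on a cache fileset (file system, fileset, and list-file names are hypothetical placeholders); with this enhancement the prefetch work for a single fileset can be spread across multiple gateway nodes:
  # Queue prefetch for the files named in the list file
  mmafmctl fs1 prefetch -j cacheFileset --list-file /tmp/prefetch.list
  # Check prefetch statistics for the fileset
  mmafmctl fs1 prefetch -j cacheFileset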
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS daemon crash when an application is writing data into the file system
- Work Around: None
- Problem trigger: A memory failure of newBuffer in a busy system.
- Symptom: Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15993
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Enable AFM prefetch for a single fileset to run from multiple gateway nodes to improve migration performance. This enhancement also handles the scenario where the same file is being read from multiple gateway nodes.
- Work Around: None
- Problem trigger: AFM prefetch, slow performance
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM
- Customer Impact: Suggested IJ16113
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: On a system without the ifup/ifdown commands installed, nearly any call to an mm-command shows messages like which: no ifup in (/bin:/usr/bin:/sbin:/usr/sbin:/usr/lpp/mmfs/bin) which: no ifdown in (/bin:/usr/bin:/sbin:/usr/sbin:/usr/lpp/mmfs/bin) and terminates the called mm-program
- Work Around: Not available. An install of ifup/ifdown would resolve the issue, but might lead to other issues
- Problem trigger: Any mm-command may run into this issue if the ifup/ifdown commands are not installed on the system
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments
- Functional Area affected: CES
- Customer Impact: High Importance IJ16114
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfsadm dump command could run into an infinite loop when dumping the token objects.
- Work Around: Avoid running the mmfsadm dump command.
- Problem trigger: Running the mmfsadm dump command while workloads are running in the cluster.
- Symptom: mmfsadm dump command hang.
- Platforms affected: ALL Operating System environments except Windows
- Functional Area affected: mmfsadm dump command
- Customer Impact: Suggested IJ15996
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: If the file system was formatted with narrow disk address (2.2 version or older), and the gpfs version is 4.2.3 or 5.0.x version, GPFS daemon assert would happen randomly.
- Work Around: None
- Problem trigger: Application I/O into a narrow disk address file system by using 4.2.3 or 5.0.x GPFS versions.
- Symptom: Crash, like assert subblocksPerFileBlock==(1<<(tinodeP->getFblockSize()))
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ16116
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmrepquota command usage for the -q and -t options is ambiguous. Options -q and -t should not be used when combined with Device:Fileset because they are file system attributes (see the sketch after this entry).
- Work Around: None
- Problem trigger: The current mmrepquota command usage allows invoking -q option as follows: mmrepquota -q Device:fileset
- Symptom: mmrepquota -q Device:fileset gives file system default quota information and not perfileset-quota.
- Platforms affected: All
- Functional Area affected: Quotas
- Customer Impact: Suggested IJ15914
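A minimal sketch of the distinction, assuming a file system fs1 and fileset fset1 (hypothetical placeholders):
  # -q and -t report file system level quota attributes; invoke them per file system
  mmrepquota -q fs1
  mmrepquota -t fs1
  # To report quota usage for a specific fileset, use the Device:Fileset form without -q/-t
  mmrepquota -u fs1:fset1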
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: For file systems created with large NumNodes and large NumInodesToPreallocate arguments, the inode allocation map ends up with a large value for nRegions and nBitsPerSubsegment. For subsequent independent filesets created with orders of magnitude less NumInodesToPreallocate, this can leave most of the inode map segments as unusable/surplus. During inode lookup as part of inode allocation, these surplus segments may be read from disk many times causing performance degradation.
- Work Around: Increase the number of allocated inodes in the problem fileset (see the sketch after this entry).
- Problem trigger: File systems created with large NumNodes and large NumInodesToPreallocate arguments. Then independent filesets are created with orders of magnitude less NumInodesToPreallocate.
- Symptom: Performance Impact/Degradation
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15991
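A minimal sketch of the workaround, assuming file system fs1 and independent fileset fset1 (hypothetical placeholders); the new limits are illustrative only:
  # Check the current inode usage and limits for the fileset
  mmlsfileset fs1 fset1 -L -i
  # Raise the maximum and preallocated inode counts for the problem fileset
  mmchfileset fs1 fset1 --inode-limit 1000000:500000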
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A fileset might get stuck and prevent file system quiesce when an AFM DR fileset finds that an inode does not have remote attributes and tries to build the remote attributes using the tsfindinode command after blocking the file system quiesce. Remote attributes are used to find the remote file using the file handle for replication.
- Work Around: None
- Problem trigger: AFM DR with renames to the deleted directories
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments
- Functional Area affected: AFM DR
- Customer Impact: Critical IJ16024
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: FSErrInodeCorrupted FSSTRUCT error could be issued incorrectly during lookup when both directory and its parent directory are being deleted.
- Work Around: None
- Problem trigger: Perform lookup on '..' entry of a directory that is being deleted.
- Symptom: Error output/message
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: Suggested IJ15916
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: During file manager takeover, the new manager will broadcast to all mount nodes to invalidate their cached low level file metadata. If, at the same time, a low level file is being opened on a mount node, the two can race and cause logAssertFailed "ibdP->llfileP == this" or logAssertFailed "inode.indirectionLevel >= 1"
- Work Around: One of our customers reported hitting this problem while running mmdelsnapshot. For the mmdelsnapshot scenario, deleting the oldest snapshot first will greatly reduce the risk (see the sketch after this entry).
- Problem trigger: The race existing between file manager take over and low level file opening (the latter one can happen for many reasons - including but not limited to mmdelsnapshot)
- Symptom: Abend/Crash
- Platforms affected: All
- Functional Area affected: All
- Customer Impact: High Importance IJ15961
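A minimal sketch of the mmdelsnapshot part of the workaround, assuming file system fs1 (hypothetical placeholder):
  # List snapshots with their creation times to identify the oldest one
  mmlssnapshot fs1
  # Delete the oldest snapshot before deleting newer ones
  mmdelsnapshot fs1 <oldest-snapshot-name>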
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: GPFS admin commands may cause high CPU usage. This is because remote GPFS command execution calls the find command to clean up temporary files on systems with a large number of subdirectories and files under /var/mmfs/tmp.
- Work Around: Manually clean up to reduce the number of subdirectories and files under /var/mmfs/tmp. Kill running find processes that were invoked from /usr/lpp/mmfs/mmremote processes (see the sketch after this entry).
- Problem trigger: Nodes with a large number of subdirectories and files under /var/mmfs/tmp are most likely affected.
- Symptom: Performance Impact/Degradation, hang
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: High Importance an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ15858
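A minimal sketch of the manual cleanup; the age threshold is illustrative and should be adjusted to your environment:
  # Remove old temporary files under /var/mmfs/tmp (here: older than 7 days)
  find /var/mmfs/tmp -type f -mtime +7 -delete
  # Kill long-running find processes that are scanning /var/mmfs/tmp
  pkill -f 'find /var/mmfs/tmp'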
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The session info length is not checked when creating a DMAPI session; it is supposed to be less than or equal to 256 bytes. As per the DMAPI standard, GPFS needs to return the E2BIG errno. Instead, GPFS truncates the length to 256 bytes and proceeds with the session creation.
- Work Around: None
- Problem trigger: Creating DMAPI session with very long session info string
- Symptom: None
- Platforms affected: All
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ16117
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The arping command is used by the NFS failover mechanism, but was not found on the system. It was installed, but the log files show a No such file or directory message, which indicates that the arping command was not found in the expected path.
- Work Around: Setting a symbolic link from the arping command to "/usr/bin/arping" (the default path used if the distro cannot be properly detected) would probably help. However, using links is generally not advised, since they could be a security issue.
- Problem trigger: The circumstances which led to the issue are not fully understood. Most likely the OS detection using the /etc/redhat-release file did not work, so the wrong distro was assumed, which led to a wrong expected path name for the arping command location, and so it was not found. This older CentOS version does not yet have the /etc/os-release file provided by newer distros, which we now use as well.
- Symptom: Error output/message
- Platforms affected: All CentOS environments (CES nodes)
- Functional Area affected: CES
- Customer Impact: has little or no impact on customer operation IJ15998
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Deadlock during the AFM fileset recovery due to lock ordering issue when rename operations are being executed
- Work Around: None
- Problem trigger: AFM fileset recovery with renames to newly created directories.
- Symptom: Long Waiters/Deadlock
- Platforms affected: All Linux OS
- Functional Area affected: AFM and AFM DR
- Customer Impact: Critical IJ15963
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The GPFS systemd service (gpfs.service) may report a failed status after shutdown
- Work Around: The fail systemd status is not an error condition of GPFS shutdown. The systemd fail status can be ignored.
- Problem trigger: When shutting down GPFS, if the main systemd process (runmmfs) does not exit quickly, a kill signal is sent to the main process either by the shutdown subroutine or by systemd manager itself.
- Symptom: Error output/message Unexpected Results/Behavior
- Platforms affected: ALL Linux OS environments with systemd version >= 219
- Functional Area affected: Admin Commands/systemd
- Customer Impact: has little or no impact on customer operation IJ15962
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Using the mmchcluster command to enable CCR may fail. While the mmchcluster command is working to enable CCR, any other mm command can remove the authorized_ccr_keys file, which is needed in the final step of enabling CCR. This problem occurs more often when the first quorum node in the list is on a GPFS-supported systemd system. If the mmchcluster command is running on a quorum node, it treats that node as the first quorum node in the list.
- Work Around: Run mmchcluster on a quorum node that does not support GPFS systemd, or temporarily disable system health: chmod 000 /usr/lpp/mmfs/bin/mmsysmon* (see the sketch after this entry).
- Problem trigger: While the mmchcluster command is working to enable CCR, any other mm command can remove the authorized_ccr_keys file, which is needed in the final step of enabling CCR.
- Symptom: Error output/message
- Platforms affected: ALL Linux OS environments with systemd version >= 219
- Functional Area affected: CCR Admin Commands
- Customer Impact: High Importance to customers that want to enable CCR. IJ15915
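A minimal sketch of the second workaround; the mode used to restore the monitor is illustrative, so record the original permissions first:
  # Record the current permissions, then temporarily disable the system health monitor binaries
  ls -l /usr/lpp/mmfs/bin/mmsysmon*
  chmod 000 /usr/lpp/mmfs/bin/mmsysmon*
  # Enable CCR
  mmchcluster --ccr-enable
  # Restore the recorded permissions afterwards (mode shown is illustrative)
  chmod 555 /usr/lpp/mmfs/bin/mmsysmon*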
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Some GPFS commands don't work correctly if the cluster name contains special characters.
- Work Around: Change the name of the cluster so that it does not contain any special characters (see the sketch after this entry).
- Problem trigger: A cluster name with a special character such as the ampersand "&" causes commands like mmauth show . to fail
- Symptom: GPFS admin commands error. Error output/message Unexpected Results/Behavior
- Platforms affected: all
- Functional Area affected: admin commands
- Customer Impact: Low Importance IJ15908
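A minimal sketch of the workaround; the new cluster name is a hypothetical placeholder:
  # Rename the cluster to a name without special characters, then verify
  mmchcluster -C cluster1.example.com
  mmlscluster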
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: AFM does not keep directory mtime in sync while reading the directory contents from the home. This may be a problem for some users during the migration
- Work Around: None
- Problem trigger: AFM migration/prefetch or cache readdir/lookup
- Symptom: Unexpected results
- Platforms affected: All Linux OS
- Functional Area affected: AFM
- Customer Impact: Critical IJ15990
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The NFS/Ganesha service did not process I/O, and the systemhealth monitor showed that the NFS NULL checks for protocol versions 3 and 4 failed. The Ganesha process was shown in the process list, and also logging and replies to requests via Dbus worked. There was no failover.
- Work Around: Manual restart of NFS/Ganesha: mmces service stop nfs (or kill the gpfs.ganesha process), then mmces service start nfs (see the sketch after this entry)
- Problem trigger: The reason why NFS/Ganesha hung was not evaluated. The main issue was that the Ganesha process was not entirely "dead": the process was running, and it replied to remote requests via Dbus and also wrote log entries. It was "dead" regarding I/O handling, but the systemhealth monitor did not notice this properly.
- Symptom: Performance Impact/Degradation
- Platforms affected: ALL Linux OS environments (CES Nodes running NFS)
- Functional Area affected: CES
- Customer Impact: High Importance IJ16036
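A minimal sketch of the manual restart on the affected CES node:
  # Restart the NFS (Ganesha) service on this node
  mmces service stop nfs
  mmces service start nfs
  # Verify the service and health state afterwards
  mmces service list
  mmhealth node show NFS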
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: Assert exp(totalLen <= extensionLen) in line 16424 of file /project/sprelttn423/build/rttn423s008a/src/avs/fs/mmfs/ts/nsd/nsdServer.C
- Work Around: None
- Problem trigger: This issue affects customers running IBM Spectrum Scale 4.2.3 and later if the following conditions are true 1) mixed-endianness cluster, or mixed-endianness remote clusters. 2) RDMA enabled (and an NSD client may send NSD requests to an NSD server which has a different endianness) 3) NSD client or NSD server is IBM Spectrum Scale 4.2.3 It is a rare-case assert which may happen when the client sends the first NSD request to an NSD server with a different endianness.
- Symptom: Abend/Crash
- Platforms affected: ALL Linux OS environments
- Functional Area affected: RDMA
- Customer Impact: Suggested IJ16020
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The dm_getall_disp DMAPI call does not fail with err 22 (EINVAL) when called with a bad sessionId
- Work Around: None
- Problem trigger: When dm_getall_disp is called with a bad sessionId
- Symptom: Error output/message
- Platforms affected: ALL
- Functional Area affected: DMAPI
- Customer Impact: Suggested IJ16064
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: The mmfsck man page provides an instruction to clear the fsstruct error from the mmhealth command: "mmsysmonc event filesystem fsstruct_fixed". But this is not correct. As a result, the documented command will fail with a syntax error.
- Work Around: None
- Problem trigger: Executing command as instructed in man page
- Symptom: Error output/message due to documentation problem
- Platforms affected: ALL Operating System environments
- Functional Area affected: System Health
- Customer Impact: High Importance IJ16329
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- Problem description: A recent performance change in GPFS 5.0.3 makes GPFS commands more sensitive to network congestion. This causes commands like mmgetstate -a to report unknown status, or other GPFS commands to report nodes as unreachable.
- Work Around: Commands like mmgetstate -a can be issued again to get the status.
- Problem trigger: This affects only nodes running GPFS 5.0.3. It affects all GPFS admin commands that need to execute commands remotely.
- Symptom: Error message like the below: "The following nodes could not be reached:" mmgetstate -N or -a reports "unknown" state.
- Platforms affected: All
- Functional Area affected: Admin Commands
- Customer Impact: High Importance: an issue which will cause a degradation of the system in some manner, or loss of a less central capability IJ16395
- IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
- This update addresses the following APARs: IJ15858 IJ15859 IJ15892 IJ15908 IJ15909 IJ15910 IJ15912 IJ15914 IJ15915 IJ15916 IJ15961 IJ15962 IJ15963 IJ15990 IJ15991 IJ15993 IJ15996 IJ15998 IJ16020 IJ16024 IJ16036 IJ16064 IJ16079 IJ16084 IJ16085 IJ16091 IJ16093 IJ16094 IJ16095 IJ16110 IJ16112 IJ16113 IJ16114 IJ16116 IJ16117 IJ16329 IJ16395.
Problems fixed in Spectrum Scale 5.0.3.3 for Protocols include the following:
- smb: Add missing newline in ctdb_states_proxy error message
- smb: Add additional owner entry when mapping to NFS4 ACL with IDMAP_TYPE_BOTH and implement special case for denying owner access to ACL
- smb: Add gpfs.smb 4.9.8_gpfs_21-4
Problems fixed in Spectrum Scale 5.0.3.2 for Protocols include the following:
- smb: Return share name in correct case from net rpc conf showshare
- smb: Add gpfs.smb 4.9.8_gpfs_21-1
Problems fixed in Spectrum Scale 5.0.3.1 for Protocols include the following:
- gui: AD names should allow dots
- gui: Better handling on warning message for remote mounted file systems
- gui: Filesets - The "Type" and "AFM Role" displayed in the export correction
- gui: Updates required to accurately show GNR User Condition definitions
- gui: The CAPACITY_LICENSE task fails when there are no NSDs
- gui: Edit quota dialog not displayed for user,group,fileset quotas
- gui: No longer display a warning or error icon on SSD endurance percentage
- gui: Hourly call to mmaudit list should not occur
- toolkit: Fixed WCE parsing for some SAS cards
- smb: Version 4.9.7_gpfs_20-1
- smb: Change the memory check to cover the total of main memory and swap space
- smb: Stabilize gencache after gencache flush
- smb: Fill gencache with domain info returned from domain controller
- smb: Enable logging for early startup failures
- smb: Properly track the size of talloc objects
- smb: Remove implementations of SaveKey/RestoreKey
- smb: Pass back what we have in _wbint_Sids2UnixIDs().
- callhome: Updated to 5.0.3-1 nomenclature
- kafka: Updated to 5.0.3-1 nomenclature
Problems fixed in Spectrum Scale Protocols Packages 5.0.3-0 [Apr 19, 2019]
- Please see the "What's New" page in the IBM Knowledge Center