
file system issue causing CESM to fail on Hopper using /scratch

aliceb

Member
Affected Releases: CESM1.0.z, CESM1.1.z, CESM1.2.z


NERSC Incident Report ID: INC0053630


A problem has been found with /scratch on Hopper, though the root cause has not yet been identified. As a workaround, change the following references from /scratch to /scratch2, after which the setup and build complete successfully.



Code:
cd $CASEROOT
./xmlchange EXEROOT=/scratch2/scratchdirs/$CCSMUSER/$CASE/bld
./xmlchange RUNDIR=/scratch2/scratchdirs/$CCSMUSER/$CASE/run
./xmlchange DOUT_S_ROOT=/scratch2/scratchdirs/$CCSMUSER/archive/$CASE
 
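To confirm the change took effect, you can grep the env XML files in the case directory; this is just a quick sanity check, assuming the three variables live in env_build.xml and env_run.xml as in stock CESM1.x cases:

Code:
cd $CASEROOT
# all three paths should now point at /scratch2
grep -E 'EXEROOT|RUNDIR|DOUT_S_ROOT' env_build.xml env_run.xml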

jedwards

CSEG and Liaisons
Staff member
This problem is now resolved; here is the reply from NERSC:

 
The underlying cause of this problem is now understood. An MSS node (IO server management node) failover occurred on Hopper /scratch and did not preserve the custom default stripe count of 2; files created after the failover instead received a stripe count of 1. This was corrected on Aug 5.
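For reference, standard Lustre tools can show and reset a directory's default striping; the path below is illustrative, not from the original report:

% lfs getstripe -d $SCRATCH/mycase/run     # print the directory's default layout (check the stripe count)
% lfs setstripe -c 2 $SCRATCH/mycase/run   # restore a default stripe count of 2 for new files

Note that changing a directory's striping only affects files created afterwards; existing files keep their old layout, which is why the copy workaround below is needed for old cases.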
 
The default CESM setup script uses /scratch as defined in the env_run.xml and env_build.xml files.
 
With a stripe count of 1 on some files in the run directory of the CESM test cases, "cp -p" fails in an infinite loop of ioctl() system calls. The problem is fixed in newer Lustre versions, and a "cp" from a very old CLE version does not expose it either; we will go with the Lustre upgrade route.
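If you want to confirm a copy is hitting this failure mode, tracing it shows the repeating ioctl() calls; the file name is illustrative, and this assumes strace is available on the node:

% strace -e trace=ioctl cp -p stuck_file.nc /tmp/stuck_file.nc

A healthy copy exits almost immediately; an affected one keeps printing ioctl() lines until you interrupt it.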
 
So if you create a new case now, the default stripe count of all new files will be 2, and cesm_setup will complete smoothly.
 
For old test cases created between 7/22 and 8/5 that had problems, you can copy the old directory to a new one; the copy creates fresh files, which pick up the corrected default stripe count. For example (I did the following in Pat's account):
 
The original directory has stripe count of 1:
% cd $SCRATCH/b1850c5_acme2_ne30g16_hopper3/run
% lfs getstripe *

 
% cd $SCRATCH
% mv b1850c5_acme2_ne30g16_hopper3 b1850c5_acme2_ne30g16_hopper3.orig
% cp -r b1850c5_acme2_ne30g16_hopper3.orig b1850c5_acme2_ne30g16_hopper3
 
Now the new directory will have stripe count of 2:
% cd $SCRATCH/b1850c5_acme2_ne30g16_hopper3/run
% lfs getstripe *

 
You can then run "cesm_setup" successfully from wherever the script lives (it may be in your $HOME, $SCRATCH, or /project).
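For example, in a CESM1.2-style case where cesm_setup sits in the case root ($CASEROOT here stands for wherever you created the case):

% cd $CASEROOT       # may be under $HOME, $SCRATCH, or /project
% ./cesm_setup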
 
