File system issue causing CESM to fail on Hopper using /scratch

aliceb

Administrator
Staff member
Affected Releases: CESM1.0.z, CESM1.1.z, CESM1.2.z


NERSC Incident Report ID: INC0053630


A problem has been found with /scratch on Hopper, though the root cause has not yet been identified. If you change the following references from /scratch to /scratch2, the setup and build complete successfully.



Code:
cd $CASEROOT
./xmlchange EXEROOT=/scratch2/scratchdirs/$CCSMUSER/$CASE/bld
./xmlchange RUNDIR=/scratch2/scratchdirs/$CCSMUSER/$CASE/run
./xmlchange DOUT_S_ROOT=/scratch2/scratchdirs/$CCSMUSER/archive/$CASE
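
Note: depending on the release, the xmlchange script may not accept the VARIABLE=value shorthand and instead expects the explicit -file/-id/-val form. A sketch of the equivalent calls under that assumption (the assignment of EXEROOT to env_build.xml and of RUNDIR/DOUT_S_ROOT to env_run.xml is a guess; check which env_*.xml file defines each variable in your release):

Code:
cd $CASEROOT
# explicit form used by older CESM1.x scripts; file assignments assumed, verify in your case
./xmlchange -file env_build.xml -id EXEROOT -val /scratch2/scratchdirs/$CCSMUSER/$CASE/bld
./xmlchange -file env_run.xml -id RUNDIR -val /scratch2/scratchdirs/$CCSMUSER/$CASE/run
./xmlchange -file env_run.xml -id DOUT_S_ROOT -val /scratch2/scratchdirs/$CCSMUSER/archive/$CASE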
 

jedwards

CSEG and Liaisons
Staff member
This problem is now resolved; here is the reply from NERSC:

 
The underlying cause of this problem is now understood. An MSS node (I/O server management node) failover happened on Hopper /scratch, and it did not preserve the custom default stripe count of 2; new files were instead created with a stripe count of 1. This was corrected on Aug 5.
 
The default CESM setup script uses /scratch, as defined in the env_run.xml and env_build.xml files.
 
With a stripe count of 1 on some files in the run directory of the CESM test cases, "cp -p" fails in an infinite loop of ioctl() system calls. This problem is fixed in newer Lustre versions (and a "cp" from a very old CLE release does not expose it); we will go with the Lustre upgrade route.
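 
As an aside (this is not part of the NERSC reply): lfs can print just the stripe count of a single file, which makes it easy to spot files created during the affected window. A minimal sketch, assuming the executable in the run directory is named cesm.exe (older releases build ccsm.exe):

% cd $SCRATCH/b1850c5_acme2_ne30g16_hopper3/run
% lfs getstripe -c cesm.exe    # prints the stripe count: 1 = affected, 2 = correct default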
 
So if you create a new case now, the default stripe count of all new files will be 2, and cesm_setup will complete smoothly.
 
For your old test cases created between 7/22 and 8/5 that had problems, you can copy the old directory to a new one, for example as follows (I did this in Pat's account):
 
The original directory has a stripe count of 1:
% cd $SCRATCH/b1850c5_acme2_ne30g16_hopper3/run
% lfs getstripe *

 
% cd $SCRATCH
% mv b1850c5_acme2_ne30g16_hopper3 b1850c5_acme2_ne30g16_hopper3.orig
% cp -r b1850c5_acme2_ne30g16_hopper3.orig b1850c5_acme2_ne30g16_hopper3
 
Now the new directory will have a stripe count of 2:
% cd $SCRATCH/b1850c5_acme2_ne30g16_hopper3/run
% lfs getstripe *

 
Then you can run "cesm_setup" successfully from wherever the script lives (it may be in your $HOME, $SCRATCH, or /project).
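 
One more note on why the copy is necessary: a Lustre file's stripe count is fixed when the file is created, so existing files cannot be restriped in place; only the directory's default for future files can be changed. A minimal sketch of inspecting and resetting that default (same case directory as above; the mv/cp -r sequence remains the fix for files that already exist):

% lfs getstripe -d $SCRATCH/b1850c5_acme2_ne30g16_hopper3/run   # show the directory's default striping
% lfs setstripe -c 2 $SCRATCH/b1850c5_acme2_ne30g16_hopper3/run # applies only to files created afterwards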
 
