Built successfully, failed to submit, empty cesm.log

Josefine

SHI Jiaqi
New Member
Hi,
I am trying to port cesm2.1.3-rc.01-0-g0596a97 to a new machine. The module is intel+mkl+impi/2018u4.
I changed config_compilers.xml, config_machines.xml, and config_batch.xml (attached below). This seems to work: cesm.exe builds successfully with B1850 and f09_g17.

I tried to submit in two ways (both failed).

(1) Using the PBS file provided by the administrator of our cluster (cesm.pbs):

Code:
#!/bin/bash -x
#PBS -l nodes=1:ppn=80
#PBS -j oe
#PBS -q test1
#PBS -N mytest

BIN_DIR=/data/home/yanghj/CESM2/scratch/f19_g17.B1850.ctrl/bld

# Setup the OpenMPI topology
n_proc=$(cat $PBS_NODEFILE | wc -l)
cd $PBS_O_WORKDIR

/usr/bin/awk '{a[$1]++}b[$1,$2]!=1{b[$1]++}{b[$1,$2]=1}END{for (i in a)print i":"a[i]}' $PBS_NODEFILE > mpd.hosts

ulimit -s unlimited

module load intel+mkl+impi/2018u4

qsub -q test2 -l walltime=3000:00:00 -A nhpcc -v ARGS_FOR_SCRIPT='--resubmit' .case.run
qsub -q test2 -l nodes=1:ppn=80 -A nhpcc -v ARGS_FOR_SCRIPT='--resubmit' .case.run

mpirun -np $n_proc $BIN_DIR/cesm.exe >> cesm.log 2>&1

exit 0
and submitting it with
Code:
qsub cesm.pbs -l nodes=10:ppn=80 -q test2 -N test
the job seems to be submitted successfully; when checked with bjobs it shows status "R". It appeared to run for hours whenever I checked with bjobs. However, there were no archive files. I canceled the job and found that cesm.log is empty, and test.log (complete file attached below) contains:
+ qsub -q test2 -l walltime=3000:00:00 -A nhpcc -v ARGS_FOR_SCRIPT=--resubmit .case.run
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating yanghj/yanghj from c130)
+ qsub -q test2 -l nodes=1:ppn=80 -A nhpcc -v ARGS_FOR_SCRIPT=--resubmit .case.run
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating yanghj/yanghj from c130)

(2) I also tried ./case.submit directly.
I ran into this error:
ERROR: Couldn't match jobid_pattern '^(\S+) within submit output:
I tried changing jobid_pattern to (\d+).
preview_run then shows:
CASE INFO:
nodes: 9
total tasks: 360
tasks per node: 40
thread count: 1

BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
qsub -q batch -l walltime=36:00:00 -v ARGS_FOR_SCRIPT='--resubmit' .case.run

MPIRUN (job=case.run):
mpirun -n 360 /data/home/yanghj/CESM2/scratch/f09_g17.B1850.ctrl/bld/cesm.exe >> cesm.log.$LID 2>&1

FOR JOB: case.st_archive
ENV:
Setting Environment OMP_NUM_THREADS=1

SUBMIT CMD:
qsub -q batch -l walltime=0:20:00 -W depend=afterok:0 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
Then ./case.submit (I noticed that I wasn't even asked for a password this way, which is normally required to submit jobs on our cluster) shows:
submit_jobs case.run
Submit job case.run
Submitting job script qsub -q batch -l walltime=36:00:00 -v ARGS_FOR_SCRIPT='--resubmit' .case.run
Submitted job id is 4
Submit job case.st_archive
Submitting job script qsub -q batch -l walltime=0:20:00 -W depend=afterok:4 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
Submitted job id is 4
Submitted job case.run with id 4
Submitted job case.st_archive with id 4
But when I check with bjobs there is nothing.
It seems that I have made things really bad...
 

Attachments

  • files.zip
    17.9 KB · Views: 12

jedwards

CSEG and Liaisons
Staff member
You should use case.submit to submit jobs. I think that you are on the right track; the jobid pattern is not being parsed correctly.
In the latest case.submit example you show, all of the jobs seem to have the same job id (4). What should the job id have been in this case?

Before you run a B1850 case you really should step through the entire porting process, starting with scripts_regression_tests.py. This is documented in "6. Porting and validating CIME on a new platform" in the CIME cime5.6 documentation.
 

Josefine

SHI Jiaqi
New Member
Thanks a lot for your advice! I am not sure where the job id (4) comes from. It seems a little strange.
I am going to redo the port from the beginning and see if I can fix it.
 

hyf412694462

HE Yanfeng
New Member
Hi,

Have you solved this problem? When I invoke case.submit, I also get the following error message:
ERROR: Couldn't match jobid_pattern '^(\S+) within submit output:

I would like to know what ^(\S+) and \d+ mean, and how to specify jobid_pattern in config_batch.xml.

I would appreciate any information you could provide.
 

jedwards

CSEG and Liaisons
Staff member
That is a regular expression meant to recover the jobid from the return value of qsub.
^(\S+) matches all contiguous non-whitespace characters from the beginning of the line.
\d+ matches a contiguous set of digits.
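
As a minimal sketch of how these two patterns behave, the following uses Python's re module on a made-up submit output string. The sample string and the helper name first_group are assumptions for illustration only; this is not CIME's own code.

```python
import re

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Hypothetical qsub output: a job id followed by trailing text.
sample = "12345.server01 submitted"

print(first_group(r"^(\S+)", sample))  # leading run of non-whitespace: 12345.server01
print(first_group(r"(\d+)", sample))   # first run of digits: 12345
```

The parentheses matter in both cases: they mark the part of the match that is returned as the id.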
 

hyf412694462

HE Yanfeng
New Member
Hello jedwards,

Thank you very much for your rapid reply, which is very helpful.
I checked the format of the jobid (request id) of my batch system; it looks like 1206262.scmngm.
So how should I specify jobid_pattern in config_batch.xml?
Would ^(\w+) work?
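
One way to check a candidate pattern before editing config_batch.xml is to try it on the id string in Python. This is only a sketch: it assumes the submit output is exactly the id shown above, and whether the batch system needs the numeric part or the full id is not something this test can decide.

```python
import re

job_output = "1206262.scmngm"  # job id format quoted in the post above

# Map each candidate pattern to what its first capture group returns.
results = {}
for pattern in (r"^(\w+)", r"^(\S+)"):
    match = re.search(pattern, job_output)
    results[pattern] = match.group(1) if match else None

for pattern, captured in results.items():
    print(f"{pattern!r:12} -> {captured!r}")
```

Because \w covers only letters, digits, and the underscore, ^(\w+) stops at the dot and yields 1206262, while ^(\S+) yields the full 1206262.scmngm.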
 

hyf412694462

HE Yanfeng
New Member
Hello jedwards,

Thank you very much for your rapid reply, which is very helpful.
I checked the format of the jobid (==request id?) of my batch system, it looks like 1206262.scmngm.
Therefore, I think specifying .scmngm$ should be fine?
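
A quick Python check of that idea, under the assumption (suggested by the parenthesised patterns elsewhere in this thread) that the job id is taken from the first capture group of jobid_pattern. The helper name extract is hypothetical.

```python
import re

job_output = "1206262.scmngm"

def extract(pattern, text):
    """Return group 1 of the first match; None if no match or no capture group."""
    match = re.search(pattern, text)
    if match is None or match.re.groups < 1:
        return None
    return match.group(1)

print(extract(r".scmngm$", job_output))         # matches, but captures nothing
print(extract(r"^(\d+)\.scmngm$", job_output))  # captures the numeric part
print(extract(r"^(\S+\.scmngm)$", job_output))  # captures the full id
```

Under that assumption, .scmngm$ on its own would match the line but give no id back, because it contains no parentheses.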
 
Hi all,
I'm having a similar error on Juno with LSF.
The jobid_pattern is defined as

<jobid_pattern>&lt;(\d+)&gt;</jobid_pattern>

But I get the error

ERROR: Couldn't match jobid_pattern '<(\d+)>' within submit output:
'684882'

I'm getting this issue after updating cime and ccs_config to version cime6.0.236_httpsbranch01 and ccs_config_cesm0.0.85, respectively.
Do these two versions have some incompatibility?
 

jedwards

CSEG and Liaisons
Staff member
It looks like maybe there was a change on your system where you went from a jobid like <684882> to just 684882. Try this jobid_pattern instead:
<jobid_pattern>(\d+)</jobid_pattern>
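
A small Python sketch of the difference between the two patterns. The bracketed sample line is an assumed example of the older output style, not text taken from Juno; the bare number is the submit output quoted in the error above.

```python
import re

old_style = "Job <684882> is submitted to queue <normal>."  # assumed older wording
new_style = "684882"                                         # output quoted in the error

# Try both patterns against both output styles.
results = {}
for pattern in (r"<(\d+)>", r"(\d+)"):
    for text in (old_style, new_style):
        match = re.search(pattern, text)
        results[(pattern, text)] = match.group(1) if match else None

for (pattern, text), job_id in results.items():
    print(f"{pattern!r:10} on {text!r}: {job_id!r}")
```

Only the combination of <(\d+)> with the bare number fails to match. Note that an unanchored (\d+) captures the first run of digits anywhere in the output, so it is worth checking what the submit command actually prints.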
 