Hi,
I am trying to port cesm2.1.3-rc.01-0-g0596a97 to a new machine. The module I load is intel+mkl+impi/2018u4.
I changed config_compilers.xml, config_machines.xml, and config_batch.xml (attached below). The port seems to work: cesm.exe builds successfully with B1850 and f09_g17.
I tried to submit in two ways (both failed):
(1) Using the PBS script provided by our cluster administrator (cesm.pbs), shown below together with the qsub command I used to submit it.
The job seemed to be submitted successfully: bjobs showed status "R", and it appeared to run fine for hours. However, there were no archive files, so I canceled the job. I then found that cesm.log was empty and that test.log (complete file attached below) contained the qsub errors shown after the script.
Code:
#!/bin/bash -x
#PBS -l nodes=1:ppn=80
#PBS -j oe
#PBS -q test1
#PBS -N mytest
BIN_DIR=/data/home/yanghj/CESM2/scratch/f19_g17.B1850.ctrl/bld
# Setup the OpenMPI topology
n_proc=$(cat $PBS_NODEFILE | wc -l)
cd $PBS_O_WORKDIR
/usr/bin/awk '{a[$1]++}b[$1,$2]!=1{b[$1]++}{b[$1,$2]=1}END{for (i in a)print i":"a[i]}' $PBS_NODEFILE > mpd.hosts
ulimit -s unlimited
module load intel+mkl+impi/2018u4
qsub -q test2 -l walltime=3000:00:00 -A nhpcc -v ARGS_FOR_SCRIPT='--resubmit' .case.run
qsub -q test2 -l nodes=1:ppn=80 -A nhpcc -v ARGS_FOR_SCRIPT='--resubmit' .case.run
mpirun -np $n_proc $BIN_DIR/cesm.exe >> cesm.log 2>&1
exit 0
Submit command:
Code:
qsub cesm.pbs -l nodes=10:ppn=80 -q test2 -N test
Relevant lines from test.log:
Code:
+ qsub -q test2 -l walltime=3000:00:00 -A nhpcc -v ARGS_FOR_SCRIPT=--resubmit .case.run
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating yanghj/yanghj from c130)
+ qsub -q test2 -l nodes=1:ppn=80 -A nhpcc -v ARGS_FOR_SCRIPT=--resubmit .case.run
qsub: submit error (Bad UID for job execution MSG=ruserok failed validating yanghj/yanghj from c130)
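A note on the trace above: the `+ qsub ...` lines are the `bash -x` echo of the two qsub commands inside cesm.pbs itself, so they were executed from within the running job (on c130) and not from the login node. On Torque/PBS systems a "ruserok failed" message often means the host is not accepted as a submit host, but that is an assumption about this cluster. A minimal sketch for listing such nested submit commands in a job script follows; the helper name `list_nested_qsub` and the stand-in file are illustrative only and are not part of CESM or CIME.

```shell
# List lines of a job script that invoke qsub directly.
# Comment lines and #PBS directives do not match because they start with '#'.
list_nested_qsub() {
    local script="$1"
    # -n prefixes each hit with its line number in the script
    grep -nE '^[[:space:]]*qsub([[:space:]]|$)' "$script"
}

# Example with a small stand-in for cesm.pbs (the heredoc is only written to a file, never executed)
sample=$(mktemp)
cat > "$sample" <<'EOF'
#!/bin/bash -x
#PBS -q test1
module load intel+mkl+impi/2018u4
qsub -q test2 -l walltime=3000:00:00 -A nhpcc -v ARGS_FOR_SCRIPT='--resubmit' .case.run
mpirun -np 80 ./cesm.exe >> cesm.log 2>&1
EOF
list_nested_qsub "$sample"
rm -f "$sample"
```

Any line this reports is a submission attempted from inside the job, which is where the two errors in test.log come from.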
(2) I also tried ./case.submit directly and ran into this problem:
Code:
ERROR: Couldn't match jobid_pattern '^(\S+) within submit output:
I then changed jobid_pattern to (\d+). After that, preview_run shows:
Code:
CASE INFO:
nodes: 9
total tasks: 360
tasks per node: 40
thread count: 1
BATCH INFO:
FOR JOB: case.run
ENV:
Setting Environment OMP_NUM_THREADS=1
SUBMIT CMD:
qsub -q batch -l walltime=36:00:00 -v ARGS_FOR_SCRIPT='--resubmit' .case.run
MPIRUN (job=case.run):
mpirun -n 360 /data/home/yanghj/CESM2/scratch/f09_g17.B1850.ctrl/bld/cesm.exe >> cesm.log.$LID 2>&1
FOR JOB: case.st_archive
ENV:
Setting Environment OMP_NUM_THREADS=1
SUBMIT CMD:
qsub -q batch -l walltime=0:20:00 -W depend=afterok:0 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
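The CASE INFO numbers are consistent with each other: 360 tasks at 40 tasks per node need 9 nodes. The administrator's script requests ppn=80, though, so 40 tasks per node may or may not be what is intended on this machine. A small sketch of the arithmetic follows; the function name `nodes_needed` is illustrative, and this is not how CIME itself computes or reports the layout.

```shell
# Nodes needed for a given MPI task count, rounding up (integer ceiling division).
nodes_needed() {
    local total_tasks="$1" tasks_per_node="$2"
    echo $(( (total_tasks + tasks_per_node - 1) / tasks_per_node ))
}

nodes_needed 360 40   # values reported by preview_run
nodes_needed 360 80   # same task count at 80 tasks per node, as in ppn=80
```

The first call reproduces the 9 nodes reported by preview_run; the second shows how the node count would change if all 80 cores per node were used.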
./case.submit shows the output below, but when I check with bjobs afterwards there is nothing. (I also noticed that I was not even asked for a password this way, although one is needed to submit jobs on our cluster.)
Code:
submit_jobs case.run
Submit job case.run
Submitting job script qsub -q batch -l walltime=36:00:00 -v ARGS_FOR_SCRIPT='--resubmit' .case.run
Submitted job id is 4
Submit job case.st_archive
Submitting job script qsub -q batch -l walltime=0:20:00 -W depend=afterok:4 -v ARGS_FOR_SCRIPT='--resubmit' case.st_archive
Submitted job id is 4
Submitted job case.run with id 4
Submitted job case.st_archive with id 4
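Both jobs reporting id 4 looks suspicious. A pattern as loose as (\d+) matches the first run of digits anywhere in whatever qsub printed, including an error or prompt message, so the reported "job id" may not be a real job id at all. A minimal sketch of that effect follows. The sample strings are invented for illustration and are not actual output from this cluster, `first_digits` is a hypothetical helper, and `[0-9]+` stands in for `\d+` because `grep -E` does not support `\d`.

```shell
# Print the first run of digits found in a string, mimicking a search for (\d+).
# Prints nothing if the string contains no digits.
first_digits() {
    printf '%s\n' "$1" | grep -oE '[0-9]+' | head -n 1
}

first_digits "12345.headnode"                  # shaped like a Torque job id (hypothetical)
first_digits "qsub: error code 4, try again"   # an invented error message
```

With the invented error message, the helper still returns a number, which is how a failed submission could end up being reported as "Submitted job id is 4".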
It seems that I have really messed things up.