jmnicklas@gmail_com
New Member
To those who may be able to help me,
I am currently trying to get CESM1.2.2 to run on Brown University's Oscar computer cluster. The system is SLURM-based, and I currently have only an exploratory account, which limits a single job to 16 CPUs for up to 24 hours. I have successfully compiled many cases, but so far I have only succeeded at running the compset X test case. Every attempt I have made to run compset B in the past month has failed: the f19g16 and f45g37 resolutions time out, and the T31_gx3v7 resolution ends with a cryptic MPI COMMUNICATOR 9 CREATE FROM 0 error. The former cases should not be timing out: I got them to run in a fraction of the 24 hours on my 4-core, dual-processor laptop. I just want to get something, anything, to work at this point.
I have attached everything I think would be of relevance: configure files, and logs from one of the resolutions that timed out. The log from the failed compset B run at the T31_gx3v7 resolution is copied below. I am happy to search for whatever else you might need to solve these problems.
Again, any help would be greatly appreciated.
Sincerely,
John Nicklas

...In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 9)
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 10)
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 11)
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 12)
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 13)
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 14)
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 15)
96.0000000000000 and that obtained from surface dataset
100.000000000000 at g= 404
ENDRUN: called without a message string
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 3)
slurmstepd: *** STEP 14873889.0 ON node454 CANCELLED AT 2017-02-17T01:58:31 ***
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
srun: error: node454: tasks 0-5,8-10,12-14: Killed
srun: error: node454: tasks 6-7,11,15: Exited with exit code 1
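
Since most of the log above is per-task MPI and SLURM teardown output, here is a minimal Python sketch of one way to filter those lines out so that only the remaining messages are shown. The function name `model_messages` and the noise patterns are illustrative assumptions based on this one log; they are not part of CESM or SLURM, and the sketch makes no claim about the cause of the failure.

```python
import re

# Lines printed once per task by the MPI layer or by SLURM during teardown.
# These patterns are an assumption drawn from the log excerpt in this post.
NOISE = re.compile(r"^\s*(\.\.\.)?(In: PMI_Abort\(|slurmstepd:|srun:)")

def model_messages(log_text):
    """Return the non-empty log lines that are not MPI/SLURM teardown noise."""
    return [
        line.strip()
        for line in log_text.splitlines()
        if line.strip() and not NOISE.match(line)
    ]

# Excerpt copied from the log in this post.
SAMPLE = """\
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 15)
96.0000000000000 and that obtained from surface dataset
100.000000000000 at g= 404
ENDRUN: called without a message string
In: PMI_Abort(1, application called MPI_Abort(comm=0xC4000008, 1) - process 3)
slurmstepd: *** STEP 14873889.0 ON node454 CANCELLED AT 2017-02-17T01:58:31 ***
srun: error: node454: tasks 6-7,11,15: Exited with exit code 1
"""

if __name__ == "__main__":
    for message in model_messages(SAMPLE):
        print(message)
```

Run against the full log file rather than the excerpt, the same filter should also keep any lines that precede the truncated "..." at the start of the paste.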