Dear all:
My job is killed less than two minutes after I submit the <case>.run script. I am trying to run an oxygen-isotope-enabled startup PI (pre-industrial) case on our server, following the iCESM1.2 instructions on GitHub. The compset is 'B1850C5' and the resolution is T31_g37. Confusingly, the same case ran successfully on another server (I tried it before), but it fails on ours. I don't think the problem lies in CAM, POP, or the other component modules, because there are no errors in their *.log.* files. In cesm.log, after lines like 'calcsize j,iq,jac, lsfrm,lstoo ............', there are many 'QNEG3 from ...... mixing ratio violated at ...... points' messages, followed by some 'BalanceCheck: soil balance error' and 'ERROR: Isotopic deep-conv precip error' messages. At the end, the log reports 'BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES', 'slurmstepd: error: Detected 1 oom_kill event in StepId = ......', and 'srun: error: l06c41n2: task 0: Out Of Memory'.
Later I changed 'Debug' in env_build.xml to 'True'. At the same location in cesm.log, after 'calcsize .......', the node I used reports 'Caught signal 8 (Floating point exception: floating-point invalid operation)', followed by some backtrace information and 'forrtl: error: floating point exception'. By the way, many 'NetCDF: Invalid dimension ID or name' and 'NetCDF: Variable or Attribute not found' messages appear before all of these. Could something be wrong with the netcdf module? It worked well when running standard CESM1.2, though.
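To see which of these messages appears first and how often each one occurs, cesm.log could be scanned with a short script. The following is only a minimal sketch: the function name `summarize_log`, the pattern labels, and the regular expressions are my own assumptions built from the message fragments quoted above, not anything provided by CESM or iCESM, and the exact log formatting may differ.

```python
import re
import sys

# Message fragments quoted in this post. The regexes are assumptions about
# how these messages are formatted in cesm.log and may need adjusting.
PATTERNS = {
    "netcdf": r"NetCDF: (Invalid dimension ID or name|Variable or Attribute not found)",
    "qneg3": r"QNEG3 from",
    "soil_balance": r"BalanceCheck: soil balance error",
    "isotope_precip": r"Isotopic deep-conv precip error",
    "fpe": r"[Ff]loating point exception",
    "oom": r"oom_kill|Out Of Memory",
}


def summarize_log(lines):
    """Return {label: (first_line_number, count)} for every pattern that matches."""
    summary = {}
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if re.search(pattern, line):
                first, count = summary.get(label, (lineno, 0))
                summary[label] = (first, count + 1)
    return summary


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python scan_log.py cesm.log
    with open(sys.argv[1], errors="replace") as fh:
        result = summarize_log(fh)
    # Print the message types in the order they first appear in the log.
    for label, (first, count) in sorted(result.items(), key=lambda kv: kv[1][0]):
        print(f"{label}: first at line {first}, {count} occurrence(s)")
```

Sorting by first occurrence helps separate the earliest symptom (for example the QNEG3 warnings) from later consequences such as the out-of-memory kill.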
Below are some screenshots from the debug cesm.log. I have also attached both the debug and the normal cesm.log files.
I would be grateful for any suggestions or solutions to this problem.