case_run: ERROR: ice: Input nprocs not same as system request

ycliu

New Member
Dear helpers, when I run the case f.e20.F2000climo.f09_f09_mg17.test in CESM 2.2.0, everything is fine in ./case.setup and ./case.build, but there is an error in ./case.submit. The system shows:

Code:
run command is mpiexec -n 48 -genv LD_LIBRARY_PATH /home/liuyc/usr/local/netcdf4.7.4-Intel/lib:/home/liuyc/usr/local/netcdf4.7.4-Intel/lib/pkgconfig:$LD_LIBRARY_PATH /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/bld/cesm.exe >> cesm.log.$LID 2>&1
Exception from case_run: ERROR: RUN FAIL: Command 'mpiexec -n 48 -genv LD_LIBRARY_PATH /home/liuyc/usr/local/netcdf4.7.4-Intel/lib:/home/liuyc/usr/local/netcdf4.7.4-Intel/lib/pkgconfig:$LD_LIBRARY_PATH /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/run/cesm.log.201015-094559
Submit job case.st_archive
Starting job script case.st_archive
st_archive starting
moving /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/run/cesm.log.201015-094559 to /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/archive/logs/cesm.log.201015-094559
Cannot find a f.e20.F2000climo.f09_f09_mg17.test.cpl*.r.*.nc file in directory /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/run
Archiving history files for cam (atm)
Archiving history files for clm (lnd)
Archiving history files for cice (ice)
Archiving history files for docn (ocn)
Archiving history files for mosart (rof)
Archiving history files for cism (glc)
Archiving history files for drv (cpl)
Archiving history files for dart (esp)
st_archive completed
Submitted job case.run with id None
Submitted job case.st_archive with id None

Then I ran cat on the cesm.log. I think the hint "ice: Input nprocs not same as system request" may be the root of the problem. Can anyone tell me what happened and how to fix it? Thanks.

My CPU is an Intel(R) Xeon(R) Platinum 8164 @ 2.00GHz with 104 threads, and the compiler is Intel (ifort, icc). Attached are some logs from the run: env_mach_pes.xml, cesm.log.201015-095129, and cpl.log.201015-095129.

Attachments

  • cesm.log.201015-095129.txt (85.8 KB)
  • cpl.log.201015-095129.txt (45 KB)
  • env_mach_pes.xml.txt (7.2 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

I think the issue you are running into is that in env_mach_pes.xml, you requested 32 processors, but case.submit is asking for 48 (hence the -n 48 argument). What is the result if you go into your case and do ./preview_run? If you get -n 48 again, can you run ./case.setup --reset and then run ./preview_run again to see if the value changes to 32?

If it does, then it likely means that env_mach_pes.xml was modified after case.setup, which means the scripts and model build were out-of-sync in terms of processor number. However, once the setup has been "reset" then you should be good to go (although it probably wouldn't hurt to do a re-build, just in case).
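For reference, the whole check-and-reset sequence from your case directory would look roughly like this (a sketch; the placeholder path is yours to fill in):

Bash:
cd <your_case_directory>   # e.g. the f.e20.F2000climo.f09_f09_mg17.test case root
./preview_run              # inspect the mpiexec command CIME will generate
./case.setup --reset       # re-sync the run scripts with env_mach_pes.xml
./preview_run              # confirm -n now shows the 32 tasks you requested
./case.build               # optional re-build, just in case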

Good luck, and have a great day!

Jesse

ycliu

New Member
Thank you for your reply. Following your advice, I changed -n 48 to -n 32 in config_machines.xml (this value was set manually), created the same case again as a new case, ran ./case.setup, and then ran ./preview_run. It shows:
Code:
CASE INFO:
  nodes: 4
  total tasks: 32
  tasks per node: 8
  thread count: 1
BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None
    MPIRUN (job=case.run):
      mpiexec -n 32 -genv LD_LIBRARY_PATH /home/liuyc/usr/local/netcdf4.7.4-Intel/lib:/home/liuyc/usr/local/netcdf4.7.4-Intel/lib/pkgconfig:$LD_LIBRARY_PATH /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/bld/cesm.exe >> cesm.log.$LID 2>&1
  FOR JOB: case.st_archive
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None
Then I ran ./case.build and ./case.submit, but it shows the same error again, always in the ice component. Here are the logs from the modified run: config_machines.xml, env_mach_pes.xml, cesm.log.201016-102346, and cpl.log.201016-102346. By the way, our Linux server does not have job-management software; could that affect this error? Anyway, thanks for your advice.

Attachments

  • cesm.log.201016-102346.txt (85.8 KB)
  • config_machines.xml.txt (4.1 KB)
  • cpl.log.201016-102346.txt (45 KB)
  • env_mach_pes.xml.txt (7.2 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

To start with, I don't believe you should be specifying the exact number of processors in config_machines.xml. Instead, replace the number with the line:

{{ total_tasks }}

in the config file. That way CIME can determine the appropriate number of tasks using the information you provide in env_mach_pes.xml, which will help avoid conflicts in the future.
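For example, a quick way to confirm the placeholder is in place (a sketch; the path assumes a standard CESM 2.2 checkout, and the exact <arg> name may differ in your machine entry):

Bash:
# Show the <mpirun> block for your machine; you want to see something like
#   <arg name="ntasks">-n {{ total_tasks }}</arg>
# rather than a hard-coded "-n 48".
grep -A 6 "<mpirun" my_cesm_sandbox/cime/config/cesm/machines/config_machines.xml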

Also, after digging around a little, it looks like the error you are getting is being caused by the fact that you are missing the -DCESMCOUPLED CPP flag for the compiler. This should be there by default, so did you do anything to modify the CPP flags you are sending the compiler, and if so, can you make sure -DCESMCOUPLED is present, at least when building the CICE model?
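If it helps, one rough way to check whether the flag actually reached the compiler (a sketch; the bldlog name and location vary by case setup):

Bash:
# From the case bld directory: an empty result means -DCESMCOUPLED
# was never passed when CICE was compiled.
grep -o -- "-DCESMCOUPLED" ice.bldlog.* | sort -u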

If you aren't sure what I am talking about, or need help modifying the compiler flags, please let me know.

Thanks, and have a good weekend!

Jesse

ycliu

New Member
Thank you for the practical advice. I will replace the number with {{ total_tasks }} in later experiments.

On the other hand, I don't understand the meaning of the '-DCESMCOUPLED CPP flag' you mentioned, and I can't find any information about it on Google either. All I know is that I added the netCDF library path to config_compilers.xml and changed nothing else, as shown in the attachment. Could you please tell me how to check or modify the compiler flags, or where to find a guide?

Have a good weekend, too!

Attachments

  • config_compilers.xml.txt (4.5 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

It looks like your config_compilers.xml file is missing these lines:

XML:
<!-- Define default values that can be overridden by specific
     compilers -->
<compiler>
  <CPPDEFS>
    <!-- This should be removed AFTER MOM6 cap is fully unified -->
    <append> -DCESMCOUPLED </append>
    <append MODEL="pop"> -D_USE_FLOW_CONTROL </append>
    <append MODEL="ufsatm"> -DSPMD </append>
  </CPPDEFS>

  <INCLDIR>
        <append MODEL="ufsatm"> -I$(EXEROOT)/atm/obj/FMS </append>
  </INCLDIR>
  <FFLAGS>
    <append MODEL="ufsatm"> $(FC_AUTO_R8) </append>
    <append MODEL="mom"> $(FC_AUTO_R8) -Duse_LARGEFILE</append>
  </FFLAGS>
  <SUPPORTS_CXX>FALSE</SUPPORTS_CXX>
</compiler>

Try adding those lines to your config_compilers file and then re-building your case.

Also, another possible option is to add the -DCESMCOUPLED line directly to the CICE_CPPDEFS variable in env_build.xml. However, the disadvantage of that method is that you will need to do that again for every new case you run.
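For that second option, something like this from the case directory should work (a sketch using CIME's xmlchange; --append keeps whatever flags are already set):

Bash:
./xmlchange --append CICE_CPPDEFS="-DCESMCOUPLED"
./case.build --clean-all    # clean first so CICE is recompiled with the new flag
./case.build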

Anyways, I hope that helps, and of course if that still doesn't work please let me know.

Thanks, and good luck with the re-build!

Jesse

ycliu

New Member

Amazing! That error did not show up again after I took your advice, and the initialization of all model components seems normal. Thank you very much!

But a new error appears during the run (after ./case.submit):
Bash:
SHR_REPROSUM_CALC: Input contains  0.92160E+04 NaNs and  0.00000E+00 INFs on process      31
ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
Image              PC                Routine            Line        Source      
cesm.exe           0000000002C1709A  Unknown               Unknown  Unknown
cesm.exe           000000000274285E  shr_abort_mod_mp_         114  shr_abort_mod.F90
cesm.exe           000000000287199D  shr_reprosum_mod_         480  shr_reprosum_mod.F90
cesm.exe           000000000064A2CD  par_xsum_                  72  par_xsum.F90
cesm.exe           0000000000F57A3F  te_map_mod_mp_te_         463  te_map.F90
cesm.exe           0000000000599CC5  dyn_comp_mp_dyn_r        2643  dyn_comp.F90
cesm.exe           0000000000F13714  stepon_mp_stepon_         315  stepon.F90
cesm.exe           0000000000501F0B  cam_comp_mp_cam_r         244  cam_comp.F90
cesm.exe           00000000004F3EB7  atm_comp_mct_mp_a         521  atm_comp_mct.F90
cesm.exe           0000000000435CFE  component_mod_mp_         737  component_mod.F90
cesm.exe           000000000041913C  cime_comp_mod_mp_        2823  cime_comp_mod.F90
cesm.exe           0000000000435987  MAIN__                    133  cime_driver.F90
cesm.exe           0000000000416B42  Unknown               Unknown  Unknown
libc-2.27.so       0000146FAE068B97  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000416A2A  Unknown               Unknown  Unknown
[cli_31]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 31
mapz error - k0found i j k (kk,pe1,pe2) =   -1    1   181
   1    1   5.41842637234723   5.41842637234723
   2  NaN   6.22092646824357
   3  NaN   6.92342658297715
   4  NaN   7.52592656868989
   5  NaN   7.99748407922068
   6  NaN   8.27570747777444
   7  NaN   8.45775097219363
   8  NaN   8.63480730648853
   9  NaN   8.80688071369288
  10  NaN   8.99592633509962
  11  NaN   9.15842627823409
  12  NaN   9.32092611044732
  13  NaN   9.48342596191556
  14  NaN   9.64592598145827
  15  NaN   9.80842578148580
  16  NaN   NaN
  17  NaN   NaN
  18  NaN   NaN
  19  NaN   NaN
  20  NaN   NaN
  21  NaN   NaN
  22  NaN   NaN
  23  NaN   NaN
  24  NaN   NaN
  25  NaN   NaN
  26  NaN   NaN
  27  NaN   NaN
  28  NaN   NaN
  29  NaN   NaN
  30  NaN   NaN
  31  NaN   NaN
  32  NaN   NaN
  33  NaN   NaN
ERROR: MAPZ_MODULE
It looks as if the input data literally have some problems. But what puzzles me is that these input data were downloaded automatically by ./check_input_data --download. On the other hand, maybe my env_run.xml settings have some problem? (I use entirely default settings: STOP_OPTION=ndays, STOP_N=5.) I have put the run log files cesm.log.201018-145403 and cpl.log.201018-145403, and my env_run.xml, in the attachments.

Anyway, thanks! Have a nice day!

Attachments

  • cesm.log.201018-145403.txt (93 KB)
  • cpl.log.201018-145403.txt (77.9 KB)
  • env_run.xml.txt (69.3 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

It looks like this new error is because somewhere in your horizontal wind field (u and v), temperature field, and/or pressure field there are NaNs. Does this occur in the first time step (e.g., what's the largest value of nstep in atm.log.XXX)? If it is occurring on the first or second time step then it is likely an initialization issue.

For downloading input data, try running check_input_data again, except this time also include the --chksum flag, which will make the script check if something is different between your downloaded copies of the input data files and the versions on the NCAR server.
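Concretely, from the case directory that would be (a sketch):

Bash:
./check_input_data --download --chksum   # verify local input files against the server checksums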

If the checksum passes, then it could indicate a problem with the model code or the runtime environment. To test this, I would try running an "X" case, as it is the simplest case CESM has; then, if that runs successfully, an "A" case; and then, if that runs OK, a lower-resolution F2000 case (say, on an f45_f45_mg37 grid). If all of those work, it would help narrow down the real problem. A sketch of the first step is below.
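In case it is useful, creating and running the "X" test would look roughly like this (a sketch; the case path and machine name are placeholders):

Bash:
cd my_cesm_sandbox/cime/scripts
./create_newcase --case ~/cases/x_test --compset X --res f19_g17 \
                 --machine <your_machine> --run-unsupported
cd ~/cases/x_test
./case.setup
./case.build
./case.submit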

Finally, this error does not appear to be caused by your env_run.xml modifications, at least as far as I can tell.

Anyways, good luck with the tests, and have a great day!

Jesse

ycliu

New Member
Hello,

Thank you for replying again. I checked the atm.log.XXX and found that the largest value of nstep is indeed 1, so it does seem likely to be an initialization issue, as you say.
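(For reference, I checked it with something along these lines:)

Bash:
grep -i "nstep" atm.log.201018-145403 | tail -n 5   # largest nstep reached before the crash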

Then I ran check_input_data --chksum as you suggested. All input data passed the check.

At the same time, I ran an "X" case on the f19_g17 grid. ./case.build is fine, but it breaks again at ./case.submit. The cesm.log shows:
Code:
....
(seq_comm_printcomms)     1     0     8     1  GLOBAL:
(seq_comm_printcomms)     2     0     8     1  CPL:
(seq_comm_printcomms)     3     0     8     1  ALLATMID:
(seq_comm_printcomms)     4     0     8     1  CPLALLATMID:
(seq_comm_printcomms)     5     0     8     1  ATM:
(seq_comm_printcomms)     6     0     8     1  CPLATM:
(seq_comm_printcomms)     7     0     8     1  ALLLNDID:
(seq_comm_printcomms)     8     0     8     1  CPLALLLNDID:
(seq_comm_printcomms)     9     0     8     1  LND:
(seq_comm_printcomms)    10     0     8     1  CPLLND:
(seq_comm_printcomms)    11     0     8     1  ALLICEID:
(seq_comm_printcomms)    12     0     8     1  CPLALLICEID:
(seq_comm_printcomms)    13     0     8     1  ICE:
(seq_comm_printcomms)    14     0     8     1  CPLICE:
(seq_comm_printcomms)    15     0     8     1  ALLOCNID:
(seq_comm_printcomms)    16     0     8     1  CPLALLOCNID:
(seq_comm_printcomms)    17     0     8     1  OCN:
(seq_comm_printcomms)    18     0     8     1  CPLOCN:
(seq_comm_printcomms)    19     0     8     1  ALLROFID:
(seq_comm_printcomms)    20     0     8     1  CPLALLROFID:
(seq_comm_printcomms)    21     0     8     1  ROF:
(seq_comm_printcomms)    22     0     8     1  CPLROF:
(seq_comm_printcomms)    23     0     8     1  ALLGLCID:
(seq_comm_printcomms)    24     0     8     1  CPLALLGLCID:
(seq_comm_printcomms)    25     0     8     1  GLC:
(seq_comm_printcomms)    26     0     8     1  CPLGLC:
(seq_comm_printcomms)    27     0     8     1  ALLWAVID:
(seq_comm_printcomms)    28     0     8     1  CPLALLWAVID:
(seq_comm_printcomms)    29     0     8     1  WAV:
(seq_comm_printcomms)    30     0     8     1  CPLWAV:
(seq_comm_printcomms)    31     0     8     1  ALLESPID:
(seq_comm_printcomms)    32     0     8     1  CPLALLESPID:
(seq_comm_printcomms)    33     0     8     1  ESP:
(seq_comm_printcomms)    34     0     8     1  CPLESP:
(seq_comm_printcomms)    35     0     8     1  ALLIACID:
(seq_comm_printcomms)    36     0     8     1  CPLALLIACID:
(seq_comm_printcomms)    37     0     8     1  IAC:
(seq_comm_printcomms)    38     0     8     1  CPLIAC:
 (t_initf) Read in prof_inparm namelist from: drv_in
 (t_initf) Using profile_disable=          F
 (t_initf)       profile_timer=                      4
 (t_initf)       profile_depth_limit=                4
 (t_initf)       profile_detail_limit=               2
 (t_initf)       profile_barrier=          F
 (t_initf)       profile_outpe_num=                  1
 (t_initf)       profile_outpe_stride=               0
 (t_initf)       profile_single_file=      F
 (t_initf)       profile_global_stats=     T
 (t_initf)       profile_ovhd_measurement= F
 (t_initf)       profile_add_detail=       F
 (t_initf)       profile_papi_enable=      F
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
000.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 0
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
001.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 1
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
002.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_2]: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
004.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
005.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_5]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 5
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
006.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_6]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 6
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
007.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_7]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 7
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 2
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
003.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 3
[cli_4]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 4
Are the two errors (the X and F2000 cases) caused by the same thing? It seems that both cases generate NaNs or incorrect values during calculations.

Many thanks, and have a nice day!

Attachments

  • atm.log.201018-145403.txt (378.4 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

I can't say for sure whether the two errors are related, but in general, if an "X" case doesn't run, then neither will any sort of "F" case. Sadly, if an X case fails, it likely means that either the model was ported incorrectly or one of the required libraries on your machine (e.g., MPI) is not installed properly. Given that it looks like someone stripped a lot of the XML information out of CIME's config files, I would recommend re-downloading CESM 2.2 and re-trying the port, this time just appending your machine to the list without removing everything else, or using the config_machines_template.xml file instead. I might also recommend following the porting instructions here:


In particular, I would make sure to try out the MPI example test to make sure your version of MPI is working properly.
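If you want a quick standalone check, a minimal hello-world along these lines would do it (a sketch, independent of CESM; with Intel MPI the wrappers may be mpiifort/mpirun instead):

Bash:
# Compile and run a tiny MPI program to verify the MPI stack itself.
cat > hello_mpi.f90 <<'EOF'
program hello
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  print *, 'Hello from rank', rank, 'of', nprocs
  call MPI_Finalize(ierr)
end program hello
EOF
mpif90 hello_mpi.f90 -o hello_mpi
mpiexec -n 8 ./hello_mpi    # all 8 ranks should report in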

Finally, I am not really an expert when it comes to porting CESM to new machines, so I have moved this thread to the infrastructure forum, where people who are more knowledgeable than I am might be able to provide more useful answers.

Good luck, and have a great day!

Jesse

ycliu

New Member
Hello,

Thank you for offering so much information. Next I'm going to try re-porting the model, or switching to a different compiler, and hopefully it will work.

Have a good day!