case_run: ERROR: ice: Input nprocs not same as system request

ycliu

New Member
Dear helpers, when I run the case f.e20.F2000climo.f09_f09_mg17.test in CESM 2.2.0, everything is fine in ./case.setup and ./case.build, but there is an error in ./case.submit. The system shows:

Code:
run command is mpiexec -n 48 -genv LD_LIBRARY_PATH /home/liuyc/usr/local/netcdf4.7.4-Intel/lib:/home/liuyc/usr/local/netcdf4.7.4-Intel/lib/pkgconfig:$LD_LIBRARY_PATH /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/bld/cesm.exe >> cesm.log.$LID 2>&1
Exception from case_run: ERROR: RUN FAIL: Command 'mpiexec -n 48 -genv LD_LIBRARY_PATH /home/liuyc/usr/local/netcdf4.7.4-Intel/lib:/home/liuyc/usr/local/netcdf4.7.4-Intel/lib/pkgconfig:$LD_LIBRARY_PATH /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/bld/cesm.exe >> cesm.log.$LID 2>&1 ' failed
See log file for details: /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/run/cesm.log.201015-094559
Submit job case.st_archive
Starting job script case.st_archive
st_archive starting
moving /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/run/cesm.log.201015-094559 to /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/archive/logs/cesm.log.201015-094559
Cannot find a f.e20.F2000climo.f09_f09_mg17.test.cpl*.r.*.nc file in directory /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/run
Archiving history files for cam (atm)
Archiving history files for clm (lnd)
Archiving history files for cice (ice)
Archiving history files for docn (ocn)
Archiving history files for mosart (rof)
Archiving history files for cism (glc)
Archiving history files for drv (cpl)
Archiving history files for dart (esp)
st_archive completed
Submitted job case.run with id None
Submitted job case.st_archive with id None

Then I ran cat on the cesm.log. I think the hint "ice: Input nprocs not same as system request" may be the root of the problem. Can anyone tell me what happened and how to fix it? Thanks.

My CPU is an Intel(R) Xeon(R) Platinum 8164 @ 2.00GHz with 104 threads, and the compiler is Intel (ifort, icc). Attached are some logs from the run: env_mach_pes.xml, cesm.log.201015-095129, and cpl.log.201015-095129.

Attachments

  • cesm.log.201015-095129.txt (85.8 KB)
  • cpl.log.201015-095129.txt (45 KB)
  • env_mach_pes.xml.txt (7.2 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

I think the issue you are running into is that in env_mach_pes.xml, you requested 32 processors, but case.submit is asking for 48 (hence the -n 48 argument). What is the result if you go into your case and do ./preview_run? If you get -n 48 again, can you run ./case.setup --reset and then run ./preview_run again to see if the value changes to 32?

If it does, then it likely means that env_mach_pes.xml was modified after case.setup, which means the scripts and model build were out-of-sync in terms of processor number. However, once the setup has been "reset" then you should be good to go (although it probably wouldn't hurt to do a re-build, just in case).
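For reference, the whole check-and-reset sequence from your case directory would look roughly like this (a sketch; the placeholder path is yours to fill in):

Bash:
cd <your_case_directory>   # e.g. the f.e20.F2000climo.f09_f09_mg17.test case root
./preview_run              # inspect the mpiexec command CIME will generate
./case.setup --reset       # re-sync the run scripts with env_mach_pes.xml
./preview_run              # confirm -n now shows the 32 tasks you requested
./case.build               # optional re-build, just in case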

Good luck, and have a great day!

Jesse

ycliu

New Member
Thank you for your reply. Following your advice, I changed -n 48 to -n 32 in config_machines.xml (this value was set manually), created the same case again as a new case, ran ./case.setup, and then ran ./preview_run. It shows:
Code:
CASE INFO:
  nodes: 4
  total tasks: 32
  tasks per node: 8
  thread count: 1
BATCH INFO:
  FOR JOB: case.run
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None
    MPIRUN (job=case.run):
      mpiexec -n 32 -genv LD_LIBRARY_PATH /home/liuyc/usr/local/netcdf4.7.4-Intel/lib:/home/liuyc/usr/local/netcdf4.7.4-Intel/lib/pkgconfig:$LD_LIBRARY_PATH /home/liuyc/usr/models/CESM2/2.2.0/my_cesm_sandbox/case/f.e20.F2000climo.f09_f09_mg17.test/bld/cesm.exe >> cesm.log.$LID 2>&1
  FOR JOB: case.st_archive
    ENV:
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment OMP_STACKSIZE=256M
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment NETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment PNETCDF_PATH=/home/liuyc/usr/local/netcdf4.7.4-Intel
      Setting Environment OMP_NUM_THREADS=1
    SUBMIT CMD:
      None
Then I ran ./case.build and ./case.submit, but it shows the same error again, always in the ice component. Here are the logs from the modified run: config_machines.xml, env_mach_pes.xml, cesm.log.201016-102346, and cpl.log.201016-102346. By the way, our Linux server does not have job-management software; could that affect this error? Anyway, thanks for your advice.

Attachments

  • cesm.log.201016-102346.txt (85.8 KB)
  • config_machines.xml.txt (4.1 KB)
  • cpl.log.201016-102346.txt (45 KB)
  • env_mach_pes.xml.txt (7.2 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

To start with, I don't believe you should be specifying the exact number of processors in config_machines.xml. Instead, replace the number with the line:

{{ total_tasks }}

in the config file. That way CIME can determine the appropriate number of tasks using the information you provide in env_mach_pes.xml, which will help avoid conflicts in the future.
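For example, a quick way to confirm the placeholder is in place (a sketch; the path assumes a standard CESM 2.2 checkout, and the exact <arg> name may differ in your machine entry):

Bash:
# Show the <mpirun> block for your machine; you want to see something like
#   <arg name="ntasks">-n {{ total_tasks }}</arg>
# rather than a hard-coded "-n 48".
grep -A 6 "<mpirun" my_cesm_sandbox/cime/config/cesm/machines/config_machines.xml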

Also, after digging around a little, it looks like the error you are getting is being caused by the fact that you are missing the -DCESMCOUPLED CPP flag for the compiler. This should be there by default, so did you do anything to modify the CPP flags you are sending the compiler, and if so, can you make sure -DCESMCOUPLED is present, at least when building the CICE model?
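If it helps, one rough way to check whether the flag actually reached the compiler (a sketch; the bldlog name and location vary by case setup):

Bash:
# From the case bld directory: an empty result means -DCESMCOUPLED
# was never passed when CICE was compiled.
grep -o -- "-DCESMCOUPLED" ice.bldlog.* | sort -u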

If you aren't sure what I am talking about, or need help modifying the compiler flags, please let me know.

Thanks, and have a good weekend!

Jesse

ycliu

New Member
Thank you for the practical advice. I will replace the number with {{ total_tasks }} in later experiments.

On the other hand, I don't understand the meaning of the '-DCESMCOUPLED CPP flag' you mentioned, and I can't find any information about it on Google either. All I know is that I added the netCDF library path to config_compilers.xml and changed nothing else, as shown in the attachment. Could you please tell me how to check or modify the compiler flags, or where to find a guide?

Have a good weekend, too!

Attachments

  • config_compilers.xml.txt (4.5 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

It looks like your config_compilers.xml file is missing these lines:

XML:
<!-- Define default values that can be overridden by specific
     compilers -->
<compiler>
  <CPPDEFS>
    <!-- This should be removed AFTER MOM6 cap is fully unified -->
    <append> -DCESMCOUPLED </append>
    <append MODEL="pop"> -D_USE_FLOW_CONTROL </append>
    <append MODEL="ufsatm"> -DSPMD </append>
  </CPPDEFS>

  <INCLDIR>
        <append MODEL="ufsatm"> -I$(EXEROOT)/atm/obj/FMS </append>
  </INCLDIR>
  <FFLAGS>
    <append MODEL="ufsatm"> $(FC_AUTO_R8) </append>
    <append MODEL="mom"> $(FC_AUTO_R8) -Duse_LARGEFILE</append>
  </FFLAGS>
  <SUPPORTS_CXX>FALSE</SUPPORTS_CXX>
</compiler>

Try adding those lines to your config_compilers file and then re-building your case.

Also, another possible option is to add the -DCESMCOUPLED line directly to the CICE_CPPDEFS variable in env_build.xml. However, the disadvantage of that method is that you will need to do that again for every new case you run.
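For that second option, something like this from the case directory should work (a sketch using CIME's xmlchange; --append keeps whatever flags are already set):

Bash:
./xmlchange --append CICE_CPPDEFS="-DCESMCOUPLED"
./case.build --clean-all    # clean first so CICE is recompiled with the new flag
./case.build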

Anyways, I hope that helps, and of course if that still doesn't work please let me know.

Thanks, and good luck with the re-build!

Jesse

ycliu

New Member

Amazing! That error did not show up again after I took your advice, and the initialization of all model components seems normal. Thank you very much!

But a new error appears during the run (after ./case.submit):
Bash:
SHR_REPROSUM_CALC: Input contains  0.92160E+04 NaNs and  0.00000E+00 INFs on process      31
ERROR: shr_reprosum_calc ERROR: NaNs or INFs in input
Image              PC                Routine            Line        Source      
cesm.exe           0000000002C1709A  Unknown               Unknown  Unknown
cesm.exe           000000000274285E  shr_abort_mod_mp_         114  shr_abort_mod.F90
cesm.exe           000000000287199D  shr_reprosum_mod_         480  shr_reprosum_mod.F90
cesm.exe           000000000064A2CD  par_xsum_                  72  par_xsum.F90
cesm.exe           0000000000F57A3F  te_map_mod_mp_te_         463  te_map.F90
cesm.exe           0000000000599CC5  dyn_comp_mp_dyn_r        2643  dyn_comp.F90
cesm.exe           0000000000F13714  stepon_mp_stepon_         315  stepon.F90
cesm.exe           0000000000501F0B  cam_comp_mp_cam_r         244  cam_comp.F90
cesm.exe           00000000004F3EB7  atm_comp_mct_mp_a         521  atm_comp_mct.F90
cesm.exe           0000000000435CFE  component_mod_mp_         737  component_mod.F90
cesm.exe           000000000041913C  cime_comp_mod_mp_        2823  cime_comp_mod.F90
cesm.exe           0000000000435987  MAIN__                    133  cime_driver.F90
cesm.exe           0000000000416B42  Unknown               Unknown  Unknown
libc-2.27.so       0000146FAE068B97  __libc_start_main     Unknown  Unknown
cesm.exe           0000000000416A2A  Unknown               Unknown  Unknown
[cli_31]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 31
mapz error - k0found i j k (kk,pe1,pe2) =   -1    1   181
   1    1   5.41842637234723   5.41842637234723
   2  NaN   6.22092646824357
   3  NaN   6.92342658297715
   4  NaN   7.52592656868989
   5  NaN   7.99748407922068
   6  NaN   8.27570747777444
   7  NaN   8.45775097219363
   8  NaN   8.63480730648853
   9  NaN   8.80688071369288
  10  NaN   8.99592633509962
  11  NaN   9.15842627823409
  12  NaN   9.32092611044732
  13  NaN   9.48342596191556
  14  NaN   9.64592598145827
  15  NaN   9.80842578148580
  16  NaN   NaN
  17  NaN   NaN
  18  NaN   NaN
  19  NaN   NaN
  20  NaN   NaN
  21  NaN   NaN
  22  NaN   NaN
  23  NaN   NaN
  24  NaN   NaN
  25  NaN   NaN
  26  NaN   NaN
  27  NaN   NaN
  28  NaN   NaN
  29  NaN   NaN
  30  NaN   NaN
  31  NaN   NaN
  32  NaN   NaN
  33  NaN   NaN
ERROR: MAPZ_MODULE
It looks as if the input data literally have some problems. But what puzzles me is that these input data were downloaded automatically by ./check_input_data --download. On the other hand, maybe my env_run.xml settings have some problem? (I use entirely default settings: STOP_OPTION=ndays, STOP_N=5.) I have put the run log files cesm.log.201018-145403 and cpl.log.201018-145403, and my env_run.xml, in the attachments.

Anyway, thanks! Have a nice day!

Attachments

  • cesm.log.201018-145403.txt (93 KB)
  • cpl.log.201018-145403.txt (77.9 KB)
  • env_run.xml.txt (69.3 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

It looks like this new error is because somewhere in your horizontal wind field (u and v), temperature field, and/or pressure field there are NaNs. Does this occur in the first time step (e.g., what's the largest value of nstep in atm.log.XXX)? If it is occurring on the first or second time step then it is likely an initialization issue.

For downloading input data, try running check_input_data again, except this time also include the --chksum flag, which will make the script check if something is different between your downloaded copies of the input data files and the versions on the NCAR server.
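Concretely, from the case directory that would be (a sketch):

Bash:
./check_input_data --download --chksum   # verify local input files against the server checksums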

If the checksum passes, then it could indicate a problem with the model code or the runtime environment. To test this, I would try running an "X" case, as it is the simplest case CESM has; then, if that runs successfully, an "A" case; and then, if that runs OK, a lower-resolution F2000 case (say, on an f45_f45_mg37 grid). If all of those work, it would help narrow down the real problem. A sketch of the first step is below.
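In case it is useful, creating and running the "X" test would look roughly like this (a sketch; the case path and machine name are placeholders):

Bash:
cd my_cesm_sandbox/cime/scripts
./create_newcase --case ~/cases/x_test --compset X --res f19_g17 \
                 --machine <your_machine> --run-unsupported
cd ~/cases/x_test
./case.setup
./case.build
./case.submit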

Finally, this error does not appear to be caused by your env_run.xml modifications, at least as far as I can tell.

Anyways, good luck with the tests, and have a great day!

Jesse

ycliu

New Member
Hello,

Thank you for replying again. I checked the atm.log.XXX and found that the largest value of nstep is indeed 1, so it does seem likely to be an initialization issue, as you say.
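(For reference, I checked it with something along these lines:)

Bash:
grep -i "nstep" atm.log.201018-145403 | tail -n 5   # largest nstep reached before the crash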

Then I ran check_input_data --chksum as you suggested. All input data passed the check.

At the same time, I ran an "X" case on the f19_g17 grid. ./case.build is fine, but it breaks again at ./case.submit. The cesm.log shows:
Code:
....
(seq_comm_printcomms)     1     0     8     1  GLOBAL:
(seq_comm_printcomms)     2     0     8     1  CPL:
(seq_comm_printcomms)     3     0     8     1  ALLATMID:
(seq_comm_printcomms)     4     0     8     1  CPLALLATMID:
(seq_comm_printcomms)     5     0     8     1  ATM:
(seq_comm_printcomms)     6     0     8     1  CPLATM:
(seq_comm_printcomms)     7     0     8     1  ALLLNDID:
(seq_comm_printcomms)     8     0     8     1  CPLALLLNDID:
(seq_comm_printcomms)     9     0     8     1  LND:
(seq_comm_printcomms)    10     0     8     1  CPLLND:
(seq_comm_printcomms)    11     0     8     1  ALLICEID:
(seq_comm_printcomms)    12     0     8     1  CPLALLICEID:
(seq_comm_printcomms)    13     0     8     1  ICE:
(seq_comm_printcomms)    14     0     8     1  CPLICE:
(seq_comm_printcomms)    15     0     8     1  ALLOCNID:
(seq_comm_printcomms)    16     0     8     1  CPLALLOCNID:
(seq_comm_printcomms)    17     0     8     1  OCN:
(seq_comm_printcomms)    18     0     8     1  CPLOCN:
(seq_comm_printcomms)    19     0     8     1  ALLROFID:
(seq_comm_printcomms)    20     0     8     1  CPLALLROFID:
(seq_comm_printcomms)    21     0     8     1  ROF:
(seq_comm_printcomms)    22     0     8     1  CPLROF:
(seq_comm_printcomms)    23     0     8     1  ALLGLCID:
(seq_comm_printcomms)    24     0     8     1  CPLALLGLCID:
(seq_comm_printcomms)    25     0     8     1  GLC:
(seq_comm_printcomms)    26     0     8     1  CPLGLC:
(seq_comm_printcomms)    27     0     8     1  ALLWAVID:
(seq_comm_printcomms)    28     0     8     1  CPLALLWAVID:
(seq_comm_printcomms)    29     0     8     1  WAV:
(seq_comm_printcomms)    30     0     8     1  CPLWAV:
(seq_comm_printcomms)    31     0     8     1  ALLESPID:
(seq_comm_printcomms)    32     0     8     1  CPLALLESPID:
(seq_comm_printcomms)    33     0     8     1  ESP:
(seq_comm_printcomms)    34     0     8     1  CPLESP:
(seq_comm_printcomms)    35     0     8     1  ALLIACID:
(seq_comm_printcomms)    36     0     8     1  CPLALLIACID:
(seq_comm_printcomms)    37     0     8     1  IAC:
(seq_comm_printcomms)    38     0     8     1  CPLIAC:
 (t_initf) Read in prof_inparm namelist from: drv_in
 (t_initf) Using profile_disable=          F
 (t_initf)       profile_timer=                      4
 (t_initf)       profile_depth_limit=                4
 (t_initf)       profile_detail_limit=               2
 (t_initf)       profile_barrier=          F
 (t_initf)       profile_outpe_num=                  1
 (t_initf)       profile_outpe_stride=               0
 (t_initf)       profile_single_file=      F
 (t_initf)       profile_global_stats=     T
 (t_initf)       profile_ovhd_measurement= F
 (t_initf)       profile_add_detail=       F
 (t_initf)       profile_papi_enable=      F
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
000.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_0]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 0
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
001.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 1
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
002.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_2]: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
004.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
005.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_5]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 5
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
006.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_6]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 6
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
007.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_7]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 7
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 2
m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
003.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
[cli_3]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 3
[cli_4]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 2) - process 4
Are the two errors (the X and F2000 cases) caused by the same thing? It seems that both cases generate NaNs or incorrect values during calculations.

Many thanks, and have a nice day!

Attachments

  • atm.log.201018-145403.txt (378.4 KB)

nusbaume

Jesse Nusbaumer
CSEG and Liaisons
Staff member
Hello,

I can't say for sure whether the two errors are related, but in general, if an "X" case doesn't run, then neither will any sort of "F" case. Sadly, if an X case fails, it likely means that either the model was ported incorrectly or one of the required libraries on your machine (e.g., MPI) is not installed properly. Given that it looks like someone stripped a lot of the XML information out of CIME's config files, I would recommend re-downloading CESM 2.2 and re-trying the port, this time just appending your machine to the list without removing everything else, or using the config_machines_template.xml file instead. I might also recommend following the porting instructions here:


In particular, I would make sure to try out the MPI example test to make sure your version of MPI is working properly.
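If you want a quick standalone check, a minimal hello-world along these lines would do it (a sketch, independent of CESM; with Intel MPI the wrappers may be mpiifort/mpirun instead):

Bash:
# Compile and run a tiny MPI program to verify the MPI stack itself.
cat > hello_mpi.f90 <<'EOF'
program hello
  use mpi
  implicit none
  integer :: ierr, rank, nprocs
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  print *, 'Hello from rank', rank, 'of', nprocs
  call MPI_Finalize(ierr)
end program hello
EOF
mpif90 hello_mpi.f90 -o hello_mpi
mpiexec -n 8 ./hello_mpi    # all 8 ranks should report in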

Finally, I am not really an expert when it comes to porting CESM to new machines, so I have moved this thread to the infrastructure forum, where people who are more knowledgeable than I am might be able to provide more useful answers.

Good luck, and have a great day!

Jesse

ycliu

New Member
Hello,

Thank you for offering so much information. Next I'm going to try re-porting the model, or switching to a different compiler, and hopefully it will work.

Have a good day!