what is the status of my run?

thibaut_lurton@cnrs-orleans_fr · Oct 10, 2014

Hi to all,

First post here, and pretty new to CESM. We are porting v1.2.2 to a new machine.

I've been running into the exact same problem as mentioned above: I'm creating and running the test cases as detailed on page 55 of the user's guide, in the order specified. The first five tests passed OK (provided the _rx1 extension is added to numbers 2 and 5), and I'm now encountering very long computation times for number 6 (namely ERI.f19_g16.B1850CN).

My latest attempt was with a wall-time limit of 4 hours, on 80 processes across 4 nodes, and the job was killed before ending.

I don't think there is any particular error message in the logs, but I can provide them if needed.

Test 3, also in B1850CN mode, was the longest of all so far, running in approximately 2 hours, versus a few minutes for each of the other four tests. So my question is whether B1850CN-type tests are expected to take this long, in which case I should try pushing the time limit a bit further, or whether something is wrong elsewhere...

Thanks for any suggestion/help.

santos · Oct 13, 2014

For the tests listed in the user's guide, the A and X compsets are especially cheap (being all "dead" and prescribed data models), whereas B1850CN is closer to a commonly run coupled case (having active atmosphere, land, and ocean). So we definitely expect that a B test will take many times longer to run than A and X tests of the same run length.

An ERI test also sets up and runs multiple cases (I admit that I forget the details of these), so it takes longer than an ERS test. So if the third test took 2 hours, I would try allowing a wall time of 6-8 hours for the ERI test (it may not be quite that long, but just in case).

The purpose of including A and X in the porting tests is simply to ensure that the coupler and data models work correctly, whereas the purpose of including a B1850CN case is to have a test case that more closely resembles what would be used in a production run.

Edit: I would simply allow more time for the ERI test, but if it absolutely must run in a shorter time, I think that you can use "ERI_Ld3.f19_g16.B1850CN" instead, which will run a similar test for only 3 simulated days. If you are worried that the run is not actually making progress at all, you can try checking the ends of the cpl or atm logs, or attach a full copy of those logs here.
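
For readers following along, the test names in this thread can be read as test type, optional run-length modifier, grid, and compset, separated by dots. The sketch below is only an illustration of that reading, not part of the CESM scripts: the `ParsedTest` type and `parse_test_name` function are hypothetical names, and the code assumes exactly three dot-separated fields and treats `_Ld<N>` as "N simulated days", as described above.

```python
import re
from typing import NamedTuple, Optional


class ParsedTest(NamedTuple):
    """Hypothetical container for the pieces of a test name."""
    test_type: str            # e.g. "ERI"
    run_days: Optional[int]   # N from an _Ld<N> modifier, if present
    grid: str                 # e.g. "f19_g16"
    compset: str              # e.g. "B1850CN"


def parse_test_name(name: str) -> ParsedTest:
    """Split '<type>[_Ld<N>].<grid>.<compset>' into its parts.

    Assumes the name has exactly three dot-separated fields.
    """
    head, grid, compset = name.split(".")
    test_type, *modifiers = head.split("_")
    run_days = None
    for modifier in modifiers:
        match = re.fullmatch(r"Ld(\d+)", modifier)
        if match:
            run_days = int(match.group(1))
    return ParsedTest(test_type, run_days, grid, compset)


print(parse_test_name("ERI_Ld3.f19_g16.B1850CN"))
print(parse_test_name("ERI.f19_g16.B1850CN"))
```

A name without an `_Ld<N>` modifier yields `run_days=None`, meaning the test uses its default length.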

thibaut_lurton@cnrs-orleans_fr · Oct 16, 2014

Thanks Sean for your quick answer, clear explanation and advice.

Our ERI test was re-run with an 8-hour wall-time limit, but it still exceeded that duration, which is a bit puzzling.

As for the logs, I could only find bldlog’s, not the expected cpl.log.* or atm.log.*. Maybe that says something about the job not progressing at all?

Meanwhile, we also tried “ERI_Ld3.f19_g16.B1850CN”, but it failed almost immediately with “run length too short”.

Cheers,
Thibaut.

santos · Oct 16, 2014

If your run does not complete, it won't have the chance to copy anything to your case's "logs" directory. In that case, all of the logs (cpl.log, atm.log, cesm.log) will be in your run directory.

If the run aborts (and assuming that you've set up your mkbatch similarly to ours), cesm.log always has the abort message (though there are also many warnings in that log that are *not* errors; model abort messages always start with the string "ERROR"). Normally an abort should kill the job rather than letting it run, but every once in a while you'll find an exception.

If the run is slow or hangs, the quickest way to check its progress is by looking at the end of cpl.log, which prints a message every model day.

I don't remember what the minimum ERI test length is for CESM 1.2.2, but I guess three days is too short. You could instead try a week: “ERI_Ld7.f19_g16.B1850CN”
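
The two checks described above (abort messages in cesm.log, latest lines of cpl.log) could be sketched roughly as follows. This is a minimal illustration, not a CESM tool: the function names are made up for the example, the `cesm.log*` and `cpl.log*` file patterns are an assumption about how the logs are named in the run directory, and "starts with ERROR" is applied after stripping leading whitespace.

```python
from collections import deque
from pathlib import Path
from typing import List, Optional


def abort_messages(cesm_log_text: str) -> List[str]:
    """Return the lines that begin with "ERROR" (ignoring leading whitespace)."""
    return [line.strip() for line in cesm_log_text.splitlines()
            if line.lstrip().startswith("ERROR")]


def tail(path: Path, n: int = 5) -> List[str]:
    """Return the last n lines of a text file."""
    with path.open(errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def newest(run_dir: Path, pattern: str) -> Optional[Path]:
    """Return the most recently modified file matching pattern, or None."""
    candidates = sorted(run_dir.glob(pattern), key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def report(run_dir: Path) -> None:
    """Print abort messages from cesm.log and the end of cpl.log, if found."""
    cesm_log = newest(run_dir, "cesm.log*")  # assumed file naming
    if cesm_log is not None:
        for line in abort_messages(cesm_log.read_text(errors="replace")):
            print("abort message:", line)
    cpl_log = newest(run_dir, "cpl.log*")    # assumed file naming
    if cpl_log is not None:
        print(f"last lines of {cpl_log.name}:")
        for line in tail(cpl_log):
            print("  " + line)
```

Calling `report()` with the path of the run directory twice, a few minutes apart, gives a quick indication of whether the end of cpl.log is advancing.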

thibaut_lurton@cnrs-orleans_fr · Oct 20, 2014

Thanks again Sean for your answer.
Just for information, apparently _Ld7 is still too short; I tried _Ld12 instead, which started running OK, but again never terminated.

Back to the regular ERI.f19_g16.B1850CN test: I re-ran it with a 1-hour wall-time limit, just to check the behaviour of the logs. From what I can read in cpl.log, I reckon that there is no real progress in the run, but I'm asking for confirmation; I'm attaching the atm.log, cpl.log and cesm.log files generated by this run. Note that these logs are taken from the ERI.****.ref1/run/ directory, not the main /run/ one; as a matter of fact, that's the only location where I can find proper run logs (i.e. not bldlog's).

Thanks again for any help, it is much appreciated.

jedwards · Oct 20, 2014

According to the cesm log, you are crashing with an out-of-memory error. This is happening in the initialization step; if the model isn't exiting immediately, it's because your MPI layer is not properly handling errors. Check your environment limits and make sure they are set to use all available memory. When you next run, watch the cesm log: if you see a message like "cesm.exe:5978 terminated with signal 11 at PC=2aaaadca24c3 SP=7fffffff2a70. Backtrace: ", the model has failed and no further progress will be made.
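
Spotting that kind of message does not have to be done by eye. The sketch below is a rough illustration that searches log text for lines shaped like the one quoted above; the regular expression is modelled only on that single example, and other MPI layers or runtimes may word the message differently. The function name is hypothetical.

```python
import re
from typing import List, Tuple

# Pattern modelled on the single example line quoted in this thread;
# the exact wording is an assumption and may differ on other systems.
SIGNAL_LINE = re.compile(r"(\S+):(\d+) terminated with signal (\d+)")


def find_signal_terminations(log_text: str) -> List[Tuple[str, int, int]]:
    """Return (executable, pid, signal number) for each matching message."""
    return [(m.group(1), int(m.group(2)), int(m.group(3)))
            for m in SIGNAL_LINE.finditer(log_text)]


sample = ("cesm.exe:5978 terminated with signal 11 at "
          "PC=2aaaadca24c3 SP=7fffffff2a70. Backtrace: ")
print(find_signal_terminations(sample))  # [('cesm.exe', 5978, 11)]
```

A non-empty result for the current cesm log means the job can be cancelled rather than left to run into the wall-time limit.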

thibaut_lurton@cnrs-orleans_fr · Nov 3, 2014

Thanks Jim.

Now that is odd, because first (after having checked) the global memory variables on our machine are all set to "unlimited", and second, according to our local cluster engineer, very little memory was ever used during the period corresponding to the beginning of the run...

I'm trying to identify and sort out the problem by running a _Ld9 version of the test. (9 days seems to be long enough to run properly.)

I reckon some kind of initialisation problem occurs, but not necessarily an out-of-memory issue... I shall keep you posted when I find out.
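
One way to double-check the limits is to print them from inside a batch job, since the values a job sees may differ from those of a login shell. The sketch below is only a suggestion: it assumes that Python is available on the compute nodes and that the limits in question are the POSIX per-process resource limits (stack, data segment, address space), which the thread does not confirm.

```python
import resource
from typing import Dict, Tuple

# Per-process limits most often relevant to memory problems (an assumption;
# the thread does not say which limits were checked).
LIMITS = {
    "stack": resource.RLIMIT_STACK,
    "data": resource.RLIMIT_DATA,
    "address space": resource.RLIMIT_AS,
}


def describe(value: int) -> str:
    """Render a limit in bytes, or 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits() -> Dict[str, Tuple[str, str]]:
    """Return {label: (soft limit, hard limit)} for the current process."""
    result = {}
    for label, res in LIMITS.items():
        soft, hard = resource.getrlimit(res)
        result[label] = (describe(soft), describe(hard))
    return result


for label, (soft, hard) in current_limits().items():
    print(f"{label}: soft={soft} hard={hard}")
```

Running this as part of the batch script, just before the model is launched, shows the limits the job itself starts with.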
