jnjohnson@lbl_gov
New Member
Hi everyone,
The CASCADE project has constructed some small test problems that we are using to understand issues that have arisen on Edison since its hardware configuration changed. One of them is a case configured with the F_AMIP_CAM5 compset on 192 processes that dies with a segmentation fault somewhere in the domain decomposition stage for CAM, just a few minutes after starting up. I have been trying without success to get more information about what causes the crash, and am wondering if you guys can reproduce this issue using NERSC's copy of CESM 1.2.2 and the following materials.
To create the case, run the following on Edison in the /global/homes/j/johnson/CASCADE_test_cases directory:
csh CASCADE_test_case1.csh
This generates and submits the case. If you are game to try this, you might want to cancel that submission and change the .run script to use the debug queue, as Edison's and Cori's queues are even more backed up than usual these days. The run will die with a segmentation fault as mentioned.
I have successfully run the DDT debugger in offline mode by replacing the line
srun --label --ntasks=192 --cpu_bind=sockets --cpu_bind=verbose --kill-on-bad-exit $EXEROOT/cesm.exe >&! cesm.log.$LID
in the .run script with
ddt --offline=output.html -np 192 $EXEROOT/cesm.exe >&! cesm.log.$LID
which writes a file output.html to the run/ directory. That file can be opened in a browser and shows a stack trace. Unfortunately I don't have an example output.html handy to show you.
Can anyone reproduce this crash, and does it ring any bells for anyone? This is holding us up. We have engaged people at NERSC, but it's been slow going, and we would greatly appreciate any insight. Please let me know if this isn't a good enough description of the case or problem, or if you need any more materials to check it out.
Best,
Jeffrey Johnson
Lawrence Berkeley Laboratory