Large restart file output error in a high-resolution configuration for the CAM5-EUL dynamical core

eaton

CSEG and Liaisons
I would try doubling the nodes by requesting 68 nodes and assigning only 6 tasks per node.  This gives each task twice as much memory as in the successful configuration used for the 30-level grid.  I would also try setting the namelist variable atm_pio_stride=6.  This will put just one PIO task on each node, which should minimize the overhead incurred when writing the restart file.
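As a rough sketch of that layout (assuming a PBS/Torque-style batch system, and that atm_pio_stride is picked up through the usual CAM/driver namelist mechanism for your CESM version; the exact resource syntax and the file that carries the PIO settings depend on your machine port):

  # request 68 nodes with 6 MPI tasks per node (408 MPI tasks total)
  #PBS -l nodes=68:ppn=6

and, in the namelist that carries the PIO settings:

  ! one PIO task per node: 408 tasks / stride 6 = 68 I/O tasks
  atm_pio_stride = 6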
 
I've tested it this way, but it still fails.  I use 400 CPUs on 80 nodes, so 5 CPUs/node, and also set pio_stride = 5, so pio_numtasks = 80.  The model still hangs at the same place.  Is it possible for me to split the pbuf variables according to their first dimension (ldim)?  It is straightforward to do this in restart_dynamics, but I don't know how to do it in restart_physics.
 

eaton

CSEG and Liaisons
The fact that the run hangs in the same place after you've provided much more memory per task indicates that this isn't a memory problem.  Still, it would be worthwhile to use the performance tools on your system to determine the memory high-water marks of the tasks in the different configurations and make sure that everything makes sense.

If it isn't a memory problem, then presumably it's a problem with the I/O library, which is the path being pursued in the other thread on this topic.  At the very least I'd make sure I was using the latest release of the netCDF library.  That should allow you to write netCDF files of the required size.

My feeling right now is that this is a system problem.  I've looked at the physics buffer fields in cesm1_0_5 and don't see any that are candidates for being split up, i.e., there aren't any fields with a pcnst dimension.  We've already addressed those kinds of problems as a result of doing much higher resolution runs than what you're attempting here.
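If you don't have a full performance tool handy, a minimal Fortran sketch like the one below can report a per-task memory high-water mark.  It is not a CESM routine, just an ad hoc, Linux-only diagnostic (it reads the VmHWM line from /proc/self/status) that each MPI task could call, e.g. just before and just after the restart write, to see whether any task's peak footprint jumps there:

  subroutine print_mem_highwater(label)
     ! Print this process's peak resident memory (the VmHWM line reported
     ! by the Linux kernel in /proc/self/status).  Ad hoc diagnostic only.
     character(len=*), intent(in) :: label
     character(len=256) :: line
     integer :: unit, ios
     open(newunit=unit, file='/proc/self/status', status='old', &
          action='read', iostat=ios)
     if (ios /= 0) return
     do
        read(unit, '(a)', iostat=ios) line
        if (ios /= 0) exit
        if (line(1:6) == 'VmHWM:') then
           write(*,'(a)') trim(label)//'  '//trim(line)
           exit
        end if
     end do
     close(unit)
  end subroutine print_mem_highwater

Comparing the reported values between the 80-node x 5-task and 68-node x 6-task layouts would confirm (or rule out) memory as the culprit more directly than watching where the job hangs.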
 
