
box_rearrange compute_dest error question

heavens

Member
Can anyone explain the fundamentals of an error like "box_rearrange::compute_dest:: ERROR: no destination found for compdof=7183393"?

I've been running a pre-industrial control simulation on 180 Pleiades Ivy Bridge processors with:

1850_CAM5_CLM45_CICE_POP2_RTM_SGLC_SWAV -res 1.9x2.5_gx1v6 -mach pleiades-ivy

The only thing I change is some things about RRTMG. The code compiles and runs for three or four years, but when I try to put it into production, I get errors like the one above before the integrations start. However, if I set:

./xmlchange -file env_run.xml -id DOUT_L_MS -val "FALSE"

to turn off automatic long-term archiving, the simulation can be submitted again and will complete as normal. I can run long-term archiving outside the CESM run script without any issues and with no negative impact on the job.

My situation is relatively unique and not easily reproducible, so I don't expect any help with resolving the error per se, but I have no idea why I am triggering this particular error in PIO.

Best regards,

Nicholas Heavens
Research Assistant Professor of Planetary Science
Hampton University
 

jedwards

CSEG and Liaisons
Staff member
The PIO error means that a compdof=7183393 was requested in the code but has no home on the file. The fv 1.9 grid has 96x144 horizontal points, so most 3D fields that you would try to write would never have a compdof exceeding 96x144x32=442368. (The compdof is the offset from the beginning of the field at a given time to the location where this point should be written.) That value would therefore seem to indicate some kind of memory corruption in the model. Is it repeatable? If so, can you figure out which component and variable are involved?
This has absolutely nothing to do with long-term archiving - long-term archiving does not affect the Fortran executable in any way. It's possible that restarting the model resolves whatever memory corruption may have occurred so that you can advance.
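For reference, a rough sanity check of that bound (simple shell arithmetic; this assumes the compdof is a 1-based linear index over a 96x144x32 lat/lon/lev field, as described above):

echo $(( 96 * 144 * 32 ))          # 442368 -- roughly the largest compdof expected for a 32-level atm field
echo $(( 7183393 > 96 * 144 * 32 ))  # 1 (true): the requested compdof is far out of range for the atm grid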
 

heavens

Member
Thanks! The repeatability is an issue. But this should be a helpful clue should I find a clear pattern. Nick
 

heavens

Member
Hi. This morning, this case crashed in the middle of writing an ocean file, without writing an error message. The ocean grid is 320*384*60=7372800, which is pretty close to 7183393. Except for the modifications to CAM, the ocean is unmodified and running on the same processor layout I've used for this exact configuration for about 180 model years, so I'm wondering if a software update in the intervening time is causing some weird behavior. Is there a useful debugging strategy for finding possible array mismatches?

At this point, I'm going to run a clean copy without any code modifications and see if I encounter similar errors.

Nick
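As a rough check (assuming the compdof is a 1-based linear index over the 320x384x60 POP grid with the horizontal index varying fastest; the actual PIO decomposition ordering may differ), the offending value does fall inside a 3D ocean field:

ocn=$(( 320 * 384 * 60 ))                      # 7372800 points in a 3D gx1v6 field
echo $(( 7183393 <= ocn ))                     # 1 (true): compdof fits within the ocean grid
echo $(( (7183393 - 1) / (320 * 384) + 1 ))    # 59 -- under the assumed ordering, level 59 of 60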
 

jedwards

CSEG and Liaisons
Staff member
I don't think you've mentioned what version of the model you are using. There are a couple of parameters that control the I/O strategy. Try changing OCN_PIO_STRIDE: set it to use 1 task per node or 1 task per 2 nodes. I think on your system that should be 20 or 40.
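Something like the following, mirroring the xmlchange call you showed above (assuming OCN_PIO_STRIDE is set in env_run.xml for your model version):

./xmlchange -file env_run.xml -id OCN_PIO_STRIDE -val 20    # 1 PIO task per 20-core Ivy Bridge node
./xmlchange -file env_run.xml -id OCN_PIO_STRIDE -val 40    # or 1 PIO task per 2 nodes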
 

heavens

Member
I'm using CESM 1.2.2. Thanks for the advice! On Pleiades Ivy Bridge, the default is 9 nodes total (180 processors). Nick
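So a stride of 20 would leave one PIO task on each node (assuming the 180 MPI tasks are spread evenly over the 9 nodes):

echo $(( 180 / 20 ))    # 9 I/O tasks, one per node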
 