
box_rearrange compute_dest error question

heavens

Member
Can anyone explain the fundamentals of an error like "box_rearrange::compute_dest:: ERROR: no destination found for compdof=7183393"?

I've been running a pre-industrial control simulation on 180 Pleiades Ivy Bridge processors with:

1850_CAM5_CLM45_CICE_POP2_RTM_SGLC_SWAV -res 1.9x2.5_gx1v6 -mach pleiades-ivy

The only thing I change is some things about RRTMG. The code compiles and runs for three or four years, but when I try to put it into production, I get errors like the one above before the integrations start. However, if I set:

./xmlchange -file env_run.xml -id DOUT_L_MS -val "FALSE"

to turn off automatic long-term archiving, the simulation can be submitted again and will complete as normal. I can run long-term archiving outside the CESM run script without any issues and with no negative impact on the job.

My situation is relatively unique and not easily reproducible, so I don't expect any help with resolving the error per se, but I have no idea why I am triggering this particular error in PIO.

Best regards,

Nicholas Heavens
Research Assistant Professor of Planetary Science
Hampton University
 

jedwards

CSEG and Liaisons
Staff member
The PIO error means that a compdof=7183393 was requested in the code but has no home on the file. The fv 1.9 grid has 96x144 horizontal points, so most 3D fields that you would try to write would never have a compdof exceeding 96x144x32=442368. (The compdof is the offset from the beginning of the field at a given time to the location where this point should be written.) That value would therefore seem to indicate some kind of memory corruption in the model. Is it repeatable? If so, can you figure out which component and variable are involved?
This has absolutely nothing to do with long-term archiving - long-term archiving does not affect the Fortran executable in any way. It's possible that restarting the model resolves whatever memory corruption may have occurred so that you can advance.
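For reference, a rough sanity check of that bound (simple shell arithmetic; this assumes the compdof is a 1-based linear index over a 96x144x32 lat/lon/lev field, as described above):

echo $(( 96 * 144 * 32 ))          # 442368 -- roughly the largest compdof expected for a 32-level atm field
echo $(( 7183393 > 96 * 144 * 32 ))  # 1 (true): the requested compdof is far out of range for the atm grid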
 

heavens

Member
Thanks! The repeatability is an issue. But this should be a helpful clue should I find a clear pattern. Nick
 

heavens

Member
Hi. This morning, this case crashed in the middle of writing an ocean file, without writing an error message. The ocean grid is 320*384*60=7372800, which is pretty close to 7183393. Except for the modifications to CAM, the ocean is unmodified and running on the same processor layout I've used for this exact configuration for about 180 model years, so I'm wondering if a software update in the intervening time is causing some weird behavior. Is there a useful debugging strategy for finding possible array mismatches?

At this point, I'm going to run a clean copy without any code modifications and see if I encounter similar errors.

Nick
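As a rough check (assuming the compdof is a 1-based linear index over the 320x384x60 POP grid with the horizontal index varying fastest; the actual PIO decomposition ordering may differ), the offending value does fall inside a 3D ocean field:

ocn=$(( 320 * 384 * 60 ))                      # 7372800 points in a 3D gx1v6 field
echo $(( 7183393 <= ocn ))                     # 1 (true): compdof fits within the ocean grid
echo $(( (7183393 - 1) / (320 * 384) + 1 ))    # 59 -- under the assumed ordering, level 59 of 60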
 

jedwards

CSEG and Liaisons
Staff member
I don't think you've mentioned what version of the model you are using. There are a couple of parameters that control the I/O strategy. Try changing OCN_PIO_STRIDE: set it to use 1 task per node or 1 task per 2 nodes. I think on your system that should be 20 or 40.
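Something like the following, mirroring the xmlchange call you showed above (assuming OCN_PIO_STRIDE is set in env_run.xml for your model version):

./xmlchange -file env_run.xml -id OCN_PIO_STRIDE -val 20    # 1 PIO task per 20-core Ivy Bridge node
./xmlchange -file env_run.xml -id OCN_PIO_STRIDE -val 40    # or 1 PIO task per 2 nodes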
 

heavens

Member
I'm using CESM 1.2.2. Thanks for the advice! On Pleiades Ivy Bridge, the default is 9 nodes total (180 processors). Nick
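So a stride of 20 would leave one PIO task on each node (assuming the 180 MPI tasks are spread evenly over the 9 nodes):

echo $(( 180 / 20 ))    # 9 I/O tasks, one per node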
 