
Using nthread>1 gives "forrtl: severe (174): SIGSEGV, segmentation fault occurred"

I was trying to set nthread for some modules to 4 instead of 1, while at the same time reducing ntasks to 1/4 of its original value. This is the only change between this run and an earlier one that succeeded. I would appreciate it if anyone has any idea what is going on and how to tune it...
 

santos

Member
Please answer the following questions:
  • What version of CESM are you using?
  • What compiler version are you using?
  • What compset and resolution are you using?
  • Are you running on an NCAR-supported machine or one that you have ported the model to? What kind of machine is it?
Please also attach the cesm.log or ccsm.log where you found this error message. 
 
Dear Sean,

Thank you for your message. This time I am using cesm1.0.5 with the intel compiler, f19_f19 resolution and the F_2000_WACCM compset, run in parallel with mpirun. I ran it on our group's server, which has a structure somewhat like yellowstone's, except that we have 12 cpu per core.

In fact, I have already succeeded in running this model with nthread set to 1 for all modules. I tried changing this parameter in the hope of making the model run faster... (WACCM was incredibly slow: I used 192 cpus (16 cores), and it took 5 days to get through 1 model year... That is abnormal, right?) Do you think changing this can help accelerate the model, or will it not work?

The ccsm log file follows... Thank you so much!!!
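To put the reported speed in the units usually used for model throughput, the figures in this post (192 cpus, 5 wallclock days for 1 model year) can be converted with a short Python sketch. The function names are illustrative only and are not part of CESM or its timing tools.

```python
def simulated_years_per_day(model_years: float, wallclock_days: float) -> float:
    """Throughput: model years completed per wallclock day."""
    return model_years / wallclock_days


def core_hours_per_model_year(cores: int, model_years: float, wallclock_days: float) -> float:
    """Cost: core-hours spent for each simulated year."""
    return cores * wallclock_days * 24.0 / model_years


# Figures quoted in the post above: 192 cpus, 5 days per model year.
sypd = simulated_years_per_day(model_years=1.0, wallclock_days=5.0)
cost = core_hours_per_model_year(cores=192, model_years=1.0, wallclock_days=5.0)

print(f"{sypd:.2f} simulated years/day, {cost:.0f} core-hours per model year")
```

With these inputs the run achieves 0.2 simulated years per day at 23040 core-hours per model year.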
 

santos

Member
I think that you probably mean that you have 12 cpus per "node". The speed that you listed is definitely abnormally slow, since on yellowstone we can run this configuration with 180 cpus, with only one thread per task, and still get multiple years per day throughput.

How many MPI tasks do you have per node? If you turn threading off, it sounds like you should have 12 per node.

To debug a segfault like this, I would normally suggest running with "xmlchange DEBUG=TRUE" to find the problem, but if your runs are that slow, it may be that there's a problem with your pe layout that needs to be fixed first. I would also suggest using CESM 1.0.6 if you can, since I believe there are a few minor bug fixes, and it's possible that one of them is relevant.
 
Dear Santos,

I have 12 cpus per node. I turned off the threading and have 12 tasks per node (192 tasks in total). Each module (CAM, CICE, CLD, etc.) is divided into 192 tasks * 1 thread. I checked the timing output and found that CAM consumed almost all of the computer time. Do you think this is normal?

Best regards,
Wanying
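The layout arithmetic discussed in this thread can be checked with a small Python sketch. It compares the pure-MPI layout described here (192 tasks, 1 thread each, 12 cpus per node) with the threaded layout from the first post, assumed to be 48 tasks with 4 threads each, since that post only says ntasks was cut to 1/4. The `Layout` class is hypothetical and is not a CESM tool; it only does the bookkeeping.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class Layout:
    """A hybrid MPI/OpenMP layout: ntasks MPI tasks, each with nthreads threads."""
    ntasks: int
    nthreads: int

    @property
    def cores(self) -> int:
        # Each thread of each task occupies one core.
        return self.ntasks * self.nthreads

    def tasks_per_node(self, cores_per_node: int) -> int:
        # All threads of one task must fit on a single node.
        if cores_per_node % self.nthreads != 0:
            raise ValueError("nthreads does not divide the cores on a node")
        return cores_per_node // self.nthreads

    def nodes(self, cores_per_node: int) -> int:
        return ceil(self.cores / cores_per_node)


CORES_PER_NODE = 12  # from the post above

pure_mpi = Layout(ntasks=192, nthreads=1)  # layout that ran successfully
threaded = Layout(ntasks=48, nthreads=4)   # assumed from "1/4 of original value"

for name, layout in (("pure MPI", pure_mpi), ("threaded", threaded)):
    print(name, layout.cores, "cores,",
          layout.tasks_per_node(CORES_PER_NODE), "tasks/node,",
          layout.nodes(CORES_PER_NODE), "nodes")
```

Both layouts occupy the same 192 cores on 16 nodes; only the number of MPI tasks placed on each node changes (12 versus 3), so the batch or mpirun settings have to change along with the thread count.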
 

santos

Member
"I checked the timing output and found that CAM comsumed almost all of the computer time. Do you think this is normal?"Yes, for F_2000_WACCM, almost all of the time will be spent on CAM.I am not sure why the model is so slow. It may have to do with your cluster's hardware or configuration (e.g. system daemons running on each nodes, or slow communication or I/O). As I mentioned on the other topic, running specified chemistry should be faster.
 
Dear santos,

I want to run a simulation of the past ten thousand years with CESM1_0_4, using a B compset. Because the run would take so long, I would like to do an accelerated simulation, but I don't know how to set the model up for it. Can you give me some advice? Thank you.

wanlingfeng
 