using nthread>1 and get "forrtl: severe (174): SIGSEGV, segmentation fault occurred"

kangwanying1992@gmail_com · Feb 6, 2015

I was trying to set nthread of some modules to 4 instead of 1, while at the same time reducing ntasks to 1/4 of its original value. This is the only difference between this run and one that already succeeded. I would appreciate it if anyone has any idea what is going on and how to tune it...

santos · Feb 6, 2015

Please answer the following questions:

What version of CESM are you using?
What compiler version are you using?
What compset and resolution are you using?
Are you running on an NCAR-supported machine or one that you have ported the model to? What kind of machine is it?

Please also attach the cesm.log or ccsm.log where you found this error message.

kangwanying1992@gmail_com · Feb 6, 2015

Dear Sean, Thank you for your message. I am using cesm1.0.5 this time, with the Intel compiler, f19_f19 resolution, the F_2000_WACCM compset, and mpirun for parallel runs. I ran it on our group's server, which is structured somewhat like yellowstone except that we have 12 cpus per core. In fact, I have succeeded in running this model with nthread of all modules set to one. I tried changing this parameter hoping to make the model run faster... (WACCM was incredibly slow: I used 192 cpus (16 cores), and it took 5 days to get through 1 model year... That is abnormal, right?) Do you think changing this can help accelerate the model, or won't it work? The ccsm log file follows... Thank you so much!!!

santos · Feb 9, 2015

I think that you probably mean that you have 12 cpus per "node". The speed that you listed is definitely abnormally slow, since on yellowstone we can run this configuration with 180 cpus, with only one thread per task, and still get multiple years per day throughput.

How many MPI tasks do you have per node? If you turn threading off, it sounds like you should have 12 per node.

To debug a segfault like this, I would normally suggest running with "xmlchange DEBUG=TRUE" to find the problem, but if your runs are that slow, it may be that there's a problem with your pe layout that needs to be fixed first. I would also suggest using CESM 1.0.6 if you can, since I believe there are a few minor bug fixes, and it's possible that one of them is relevant.
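For reference, the throughput figures in this exchange can be put on a common footing. The sketch below is only arithmetic on the numbers quoted in the thread (1 model year in 5 wallclock days); the function name is hypothetical, and the value of 2 years per day standing in for "multiple years per day" is an assumed lower bound, not a measured yellowstone figure.

```python
def simulated_years_per_day(model_years: float, wallclock_days: float) -> float:
    """Throughput in simulated years per wallclock day (hypothetical helper)."""
    if wallclock_days <= 0:
        raise ValueError("wallclock_days must be positive")
    return model_years / wallclock_days


# Reported in the thread: 1 model year took 5 days on 192 cpus.
reported = simulated_years_per_day(model_years=1.0, wallclock_days=5.0)

# ASSUMPTION: "multiple years per day" is taken as at least 2 years/day.
assumed_reference = 2.0

# Lower bound on how much slower the reported run is than the reference.
slowdown_at_least = assumed_reference / reported

print(f"reported throughput: {reported:.2f} simulated years/day")
print(f"slower than the assumed reference by a factor of at least {slowdown_at_least:.0f}")
```

Under that assumption the reported run is at least an order of magnitude slower than expected, which is why the pe layout and the cluster configuration are worth checking before chasing the segfault itself.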

kangwanying1992@gmail_com · Feb 9, 2015

Dear Santos, I have 12 cpus per node. I turned off threading and have 12 tasks per node (192 tasks in total). Each module (CAM, CICE, CLD, etc.) is divided into 192 tasks * 1 thread. I checked the timing output and found that CAM consumed almost all of the computer time. Do you think this is normal? Best regards, Wanying
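The layout described here, and the threaded variant from the first post (4 threads per task with a quarter of the tasks), can be compared with a short calculation. This is a minimal sketch of the arithmetic only: `pe_layout` is a hypothetical helper, not a CESM tool, and it assumes one core per thread and a thread count that divides the cores per node evenly.

```python
def pe_layout(ntasks: int, nthreads: int, cores_per_node: int) -> dict:
    """Summarize a hybrid MPI/OpenMP layout (hypothetical helper, not part of CESM).

    ASSUMPTION: each thread occupies one core and nthreads divides cores_per_node.
    """
    if cores_per_node % nthreads != 0:
        raise ValueError("nthreads must divide cores_per_node evenly")
    tasks_per_node = cores_per_node // nthreads
    nodes = -(-ntasks // tasks_per_node)  # ceiling division
    return {
        "total_cores": ntasks * nthreads,
        "tasks_per_node": tasks_per_node,
        "nodes": nodes,
    }


# Layout from this post: 192 tasks, 1 thread each, 12 cpus per node.
mpi_only = pe_layout(ntasks=192, nthreads=1, cores_per_node=12)

# Threaded variant from the first post: a quarter of the tasks, 4 threads each.
hybrid = pe_layout(ntasks=192 // 4, nthreads=4, cores_per_node=12)

print("MPI only:", mpi_only)
print("hybrid:  ", hybrid)
```

Both layouts occupy the same 192 cores on 16 nodes; the threaded one simply places 3 tasks on each node instead of 12, so threading by itself does not add computing resources.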

santos · Feb 9, 2015

"I checked the timing output and found that CAM consumed almost all of the computer time. Do you think this is normal?"

Yes, for F_2000_WACCM, almost all of the time will be spent on CAM.

I am not sure why the model is so slow. It may have to do with your cluster's hardware or configuration (e.g. system daemons running on each node, or slow communication or I/O). As I mentioned on the other topic, running specified chemistry should be faster.

kangwanying1992@gmail_com · Feb 9, 2015

Dear Santos, Thank you so much for your patient explanation! I will try using the yellowstone server instead, to see if it makes a difference. Wanying

wanlingfeng_123@163_com · Sep 1, 2016

Dear santos, I want to run a simulation of the past ten thousand years with CESM1_0_4, using a B compset. Because the run is so long, I want to do an accelerated simulation, but I don't know how to set up the model for it. Can you give me some advice? Thank you, wanlingfeng
