Input boundary SST dataset

I tried again, and the run crashed again.
7605:ERROR 1 from file /project/sprelcot/build/rcots007a/src/ppe/lapi/Sam.cpp line 879
7605:Sam::CheckTimeout TIMEOUT happened
7605:Error received in error handler.
 

santos

Member
I think that both cases are a system issue. I had a bit of trouble on yellowstone yesterday; for instance, one of my CESM jobs completed, but for some reason the mpirun command never exited, so eventually it ran over the time limit.
 

 

hannay

Cecile Hannay
AMWG Liaison
Staff member
The Sam.cpp error is usually a yellowstone issue, not a CESM issue. It is usually due to a defective node. You are more likely to hit these at high resolution (as you are using more nodes).

You should report the problem to CISL Support: cislhelp@ucar.edu. Include the job number; they will need it to find which node is defective.
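
When putting together a report like that, it may help to list which log prefixes hit the timeout. The following is only a sketch: it assumes the log lines look like the ones quoted earlier in this thread ("<number>:Sam::CheckTimeout TIMEOUT happened"), and it makes no claim about what the leading number means (a task identifier is a guess). The function name `timeout_prefixes` is made up for this example.

```python
import re
from collections import Counter

# Assumed line shape, based on the messages quoted in this thread:
#   "<number>:Sam::CheckTimeout TIMEOUT happened"
TIMEOUT_RE = re.compile(r"(\d+):Sam::CheckTimeout TIMEOUT happened")


def timeout_prefixes(lines):
    """Count how often each numeric prefix appears on a Sam timeout line."""
    counts = Counter()
    for line in lines:
        for match in TIMEOUT_RE.finditer(line):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the error reported above.
    sample = [
        "7605:ERROR 1 from file /project/sprelcot/build/rcots007a/src/ppe/lapi/Sam.cpp line 879",
        "7605:Sam::CheckTimeout TIMEOUT happened",
        "7605:Error received in error handler.",
    ]
    for prefix, n in timeout_prefixes(sample).items():
        print(prefix, n)
```

To scan a real log, pass an open file object instead of the sample list; the path to that log is yours to fill in.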
 

 

The run is still unsuccessful. I have reported it to CISL, and they are looking into it.

The logfile also contains many lines like this:
2133:Client is created in Unreliable HW mode

Does it have anything to do with the failure?

BTW, the case directory is /glade/u/home/yingli/cesm/runs/f.F2000C5.ne120_ne120.test.011
 

 
Could you please let me know the reason for changing the number of tasks to a multiple of 15? I have other similar runs that completed successfully, but their total number of tasks is not a multiple of 15. Anyway, I did change those numbers as suggested, but it failed again.
 
 

jedwards

CSEG and Liaisons
Staff member
We run CESM using 15 tasks per node on yellowstone. If your pe count is not a multiple of 15, you leave a node partially idle, and there is some evidence that this causes problems. I don't see any explanation for your latest run failure; please resubmit.
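
The arithmetic behind the multiple-of-15 advice can be sketched as below. This is an illustration only: the 15 tasks per node comes from the reply above, while the function name `node_layout` and the task counts 1800 and 1804 are made up for the example and are not taken from the case in question.

```python
TASKS_PER_NODE = 15  # value quoted in the reply above for yellowstone


def node_layout(ntasks, tasks_per_node=TASKS_PER_NODE):
    """Return (nodes, idle_slots, next_multiple) for a requested task count."""
    if ntasks <= 0:
        raise ValueError("ntasks must be positive")
    # Integer ceiling division: number of nodes needed to hold all tasks.
    nodes = (ntasks + tasks_per_node - 1) // tasks_per_node
    next_multiple = nodes * tasks_per_node
    # Slots left unused on the last, partially filled node.
    idle_slots = next_multiple - ntasks
    return nodes, idle_slots, next_multiple


if __name__ == "__main__":
    for ntasks in (1800, 1804):  # hypothetical task counts
        nodes, idle, rounded = node_layout(ntasks)
        print(f"{ntasks} tasks: {nodes} nodes, {idle} idle slots, next multiple {rounded}")
```

A task count with zero idle slots fills every node; any other count leaves the last node partially idle, which is the situation described above.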
 

 

hannay

Cecile Hannay
AMWG Liaison
Staff member
It seems you have the error:
4783:ERROR 1 from file /project/sprelcot/build/rcots007a/src/ppe/lapi/Sam.cpp line 879
4783:Sam::CheckTimeout TIMEOUT happened

As far as I know, the Sam timeouts are a bug in yellowstone. What did CISL say about it?
 

 
I have reported it to CISL, and they are still looking into it. They haven't found anything related to a defective node so far, so I am wondering whether there are any possibilities other than a system issue.
 
 

jedwards

CSEG and Liaisons
Staff member
I haven't found anything in your logs. You could try running with DEBUG=TRUE in env_build.xml to see if you get more information (this requires rebuilding).
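
Before rebuilding, it can be worth confirming that the setting actually changed. The sketch below reads a value out of XML text. It assumes the file stores settings as `<entry id="..." value="..." />` elements, which is not confirmed anywhere in this thread; the `SAMPLE` text and the function name `get_entry` are made up for illustration, and the case's own tools remain the normal way to change the setting.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real env_build.xml layout is an assumption here.
SAMPLE = """<config_definition>
<entry id="DEBUG" value="TRUE" />
</config_definition>"""


def get_entry(xml_text, entry_id):
    """Return the value attribute of the first <entry> with the given id."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        if entry.get("id") == entry_id:
            return entry.get("value")
    raise KeyError(entry_id)


if __name__ == "__main__":
    print("DEBUG =", get_entry(SAMPLE, "DEBUG"))
```

To check a real file, read its contents into a string and pass that in place of `SAMPLE`. The function only reads, so it cannot alter the case configuration.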
 