
PIO issue when reading in a 4-dim variable from netCDF

Hi all!
I am running into an issue with some code mods I have created. I am using CESM1.2.2. My code crashes when I try to read in a particular variable from a netCDF file using PIO; in this case the variable has 4 dimensions. I am not sure why it is crashing. I have copied the relevant code below:

real(r8), allocatable, dimension(:,:,:,:) :: epp_meped_flux
integer :: lev_did, lat_did, mlt_did, time_did, flux_vid, levels_vid, maglat_vid, mlt_vid, time_vid
...
! ... dimension inquiries happen here, without issue ...
...
allocate(epp_meped_flux(epp_meped_nMLTs, epp_meped_nLats, epp_meped_nLevs, epp_meped_nDates))
...
! ... several 1-dim variables are read here, without issue ...
...
! Look up and read the 4-dim flux variable; the crash happens somewhere in here.
ierr = pio_inq_varid(fid, 'flux', flux_vid)
ierr = pio_get_var(fid, flux_vid, epp_meped_flux)
if (masterproc) then
   write(iulog,*) 'epp_meped_param_init(): flux retrieved.'
endif


The program crashes somewhere between the start of the pio_inq_varid line and the write statement. I have similar blocks of code for all of my other variables (all 1-dimensional), and they all retrieve their data without issue. This makes me think it has something to do with the variable being 4-dim. Is there a limitation in PIO on the number of dimensions a variable can have?

The error reported in the cesm log is:
ERROR: 0031-161 EOF on socket connection with node ys1809-ib
INFO: 0031-639 Exit status from pm_respond = -1
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory
INFO: 0031-029 Caught signal 2 (Interrupt), sending to tasks...
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory

There is no error reported in the atm log. I am attaching all of them to this post.
Thanks for any advice!

-Ethan
 

santos

Member
That is usually the message given when there is a system error on Yellowstone. Additionally, in your CESM log, it looks like it is trying to open a file with a nonsense name. Can you (a) confirm that this is reproducible (i.e., if you run the exact same case twice, you get the same result), and (b) provide your namelist (or, if you have hard-coded the name of the file being read, the code that sets the file name and opens the file)?
 
I have attached my user_nl_cam file. I am not sure about the file issue, since I know it reads the other variables in the file. My folders should be readable on Yellowstone, so feel free to check out my SourceMods in the following directory as well: ~epeck/case/peck_mee_test03/SourceMods/src.cam/. Thanks for any assistance you can provide!

-Ethan
 
So something odd is definitely happening, since my code now crashed in a different location. However, I still get the nonsense filename in the log:

1: Opened existing file
1: /glade/p/cesmdata/cseg/inputdata/atm/waccm/solar/spes_1963-2012_c130307.nc
1: 655360
1: Opened existing file
1: ^@^@^@^@^@^@^@^@�g^G�^?^@^@�&@^Q^@^@^@^@ 9@^Q^@^@^@^@�g^G�^?^@^@H^A^Q^@^@^@^@^Q^Q^@^@^@^@pB^Q^@^@^@^@^A^@^@^@^@^@^@^@�"��E+^@
1: ^@^Q^Q^@^@^@^@^D^@^@^@^@^@^@^@��H^F^@^@^@^@0h^G�^?^@^@^@^@
1:^@^@^@^@^@~%��E+^@^@ 9@^Q^@^@^@^@| �^Q^@^@^@^@^E^@^@^@^@^@^@^@0h^G�^?
1: ^@^@^Q^Q^@^@^@^@^@^@^@^@^@^@^@^@^Q^Q^@^@^@^@^E^@^@^@^@^@^@^@^Q^Q^@^@^@^@^D~O^E^@^@^@^@��H^F^@^@^@^@^A^@^@^@^@^@^@^@^A^@^@^@^@^@^@^@��H^F^@
1: ^@^@^@^^^@^@^@^@^@^@^@^G^@^@^@^@^@^@^@ 720896
120:mlx4: local QP operation err (QPN 052396, WQE index 1ee0000, vendor syndrome 6f, opcode = 5e)
60:mlx4: local QP operation err (QPN 042ed6, WQE index 1ee0000, vendor syndrome 6f, opcode = 5e)
150:mlx4: local QP operation err (QPN 02eddf, WQE index 2670000, vendor syndrome 6f, opcode = 5e)
135:mlx4: local QP operation err (QPN 01edca, WQE index 2670000, vendor syndrome 6f, opcode = 5e)
165:mlx4: local QP operation err (QPN 023eb2, WQE index 2670000, vendor syndrome 6f, opcode = 5e)
30:mlx4: local QP operation err (QPN 049487, WQE index 1ee0000, vendor syndrome 6f, opcode = 5e)
90:mlx4: local QP operation err (QPN 0212ac, WQE index 1ec0000, vendor syndrome 6f, opcode = 5e)
45:mlx4: local QP operation err (QPN 067d83, WQE index 2610000, vendor syndrome 6f, opcode = 5e)
15:mlx4: local QP operation err (QPN 04e754, WQE index 2630000, vendor syndrome 6f, opcode = 5e)
75:mlx4: local QP operation err (QPN 045ac5, WQE index 2610000, vendor syndrome 6f, opcode = 5e)
105:mlx4: local QP operation err (QPN 0380d4, WQE index 26f0000, vendor syndrome 6f, opcode = 5e)
ERROR: 0031-161 EOF on socket connection with node ys1315-ib
INFO: 0031-639 Exit status from pm_respond = -1
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory
INFO: 0031-029 Caught signal 2 (Interrupt), sending to tasks...
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory


Not sure how to deal with that.

-Ethan
 
Just a quick update: the cesm.log has the crazy filename, but the atm.log does not have this issue, suggesting the masterproc is doing just fine. I have reproduced the attached logs a few times.

-Ethan
 

santos

Member
Ah, I think I know what the issue is. You are only using getfil on masterproc (MPI rank 0). PIO doesn't typically use the masterproc as the main I/O task, even for serial operations. Get rid of your masterproc conditionals and see what happens.
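
To illustrate, here is a minimal sketch of the broken vs. fixed pattern (the getfil/pio_openfile calls and names like locfn are illustrative of typical CAM usage, not copied from the actual SourceMods):

! Broken: only rank 0 resolves the path, so every other task passes an
! uninitialized locfn into the collective pio_openfile call, which is
! why a non-zero rank prints a garbage filename in the cesm log.
if (masterproc) then
   call getfil(epp_meped_filename, locfn)
endif
ierr = pio_openfile(pio_subsystem, fid, pio_iotype_netcdf, locfn, pio_nowrite)

! Fixed: every task resolves the path before the collective open.
call getfil(epp_meped_filename, locfn)
ierr = pio_openfile(pio_subsystem, fid, pio_iotype_netcdf, locfn, pio_nowrite)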
 
Well, that fixed the crazy name issue, but I still get the same error message:

1: Opened existing file
1: /glade/u/home/epeck/POES_MEPED_Peck_maps_20030101_20050101.nc 720896
120:mlx4: local QP operation err (QPN 041dad, WQE index 1ee0000, vendor syndrome 6f, opcode = 5e)
60:mlx4: local QP operation err (QPN 023dbf, WQE index 1ee0000, vendor syndrome 6f, opcode = 5e)
150:mlx4: local QP operation err (QPN 05cf82, WQE index 2670000, vendor syndrome 6f, opcode = 5e)
165:mlx4: local QP operation err (QPN 050d94, WQE index 2670000, vendor syndrome 6f, opcode = 5e)
135:mlx4: local QP operation err (QPN 015d54, WQE index 2670000, vendor syndrome 6f, opcode = 5e)
30:mlx4: local QP operation err (QPN 012433, WQE index 1ee0000, vendor syndrome 6f, opcode = 5e)
90:mlx4: local QP operation err (QPN 060455, WQE index 1ec0000, vendor syndrome 6f, opcode = 5e)
45:mlx4: local QP operation err (QPN 0418ef, WQE index 2610000, vendor syndrome 6f, opcode = 5e)
15:mlx4: local QP operation err (QPN 01a237, WQE index 2630000, vendor syndrome 6f, opcode = 5e)
105:mlx4: local QP operation err (QPN 03ecab, WQE index 26f0000, vendor syndrome 6f, opcode = 5e)
75:mlx4: local QP operation err (QPN 063147, WQE index 2610000, vendor syndrome 6f, opcode = 5e)
ERROR: 0031-161 EOF on socket connection with node ys0520-ib
INFO: 0031-639 Exit status from pm_respond = -1
INFO: 0031-029 Caught signal 2 (Interrupt), sending to tasks...
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory
ERROR: 0031-028 pm_mgr_handle; can't send a signal message to remote nodes
ERROR: 0031-619 No such file or directory

-Ethan
 

santos

Member
This looks like a system issue on Yellowstone, maybe to do with the interconnect. Would you open a ticket with CISL (by emailing cislhelp@ucar.edu) and mention the "mlx4" messages? I'm not sure why you're the only one seeing this. Are you reading a very large amount of data all at once? Keep in mind that this data is being read into every single task.
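
(For a sense of scale: a standard Yellowstone node has roughly 32 GB of memory shared by 16 cores, so any array larger than about 2 GB that gets replicated into every MPI task on a node will exhaust node memory by itself.)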
 
It is not a small amount of data (2.4 GB). I could try breaking it into daily files (instead of 2 years of data in one file), like the MERRA data, and see if that works. I will try that first, and if it does not work, I will open a ticket.
Thanks for all your help with this!

-Ethan
 

jedwards

CSEG and Liaisons
Staff member
You aren't going to be able to read 2.4 GB into each task. You need to subset your data, probably by date. It doesn't need to be in separate files; just read in the subset of dates you need, once per month for example. Can you use the generic tracer_data routines in chemistry/utils/tracer_data.F90?
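
A minimal sketch of that kind of subset read, assuming the start/count form of pio_get_var (which mirrors nf90_get_var); the month-index variables here are illustrative:

integer :: start(4), cnt(4)
real(r8), allocatable :: flux_month(:,:,:,:)

! Read only the current month's dates along the record dimension,
! keeping the dimension order from the allocate in the original post.
allocate(flux_month(epp_meped_nMLTs, epp_meped_nLats, epp_meped_nLevs, ndays_in_month))
start = (/ 1, 1, 1, first_day_index /)
cnt   = (/ epp_meped_nMLTs, epp_meped_nLats, epp_meped_nLevs, ndays_in_month /)
ierr = pio_inq_varid(fid, 'flux', flux_vid)
ierr = pio_get_var(fid, flux_vid, start, cnt, flux_month)

At 2.4 GB for the full two-year record, a one-month slice works out to roughly 100 MB per task, which is far more manageable.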
 
By the time I read this, I had already split the data into daily files and written code to advance the date when necessary. I have never used tracer_data.F90 before. I have now run into a slightly different issue, but I will make that a new post once I have finished trying to figure it out myself. Thanks for all the help!

-Ethan
 