[Aces-support] Odd mpi errors relating to MNC package

Daniel Enderton enderton at MIT.EDU
Thu Dec 2 11:34:47 EST 2004


I started three jobs last night around 1am, all within about 20
minutes of each other.  They all came back with MPI errors (in the
PBS error files) relating to NetCDF and MNC that read something like:


ABNORMAL END: package MNC
forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
Image              PC        Routine            Line        Source
mitgcmuv.O1        081F75F8  Unknown               Unknown  Unknown

Stack trace terminated abnormally.
     p4_error: latest msg from perror: Bad file descriptor


In the PBS standard output file, the problematic part looked like this:


  NetCDF ERROR: No such file or directory
  MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
p4_31316:  p4_error: net_recv read:  probable EOF on socket: 1
p5_27711:  p4_error: net_recv read:  probable EOF on socket: 1
p7_16293:  p4_error: net_recv read:  probable EOF on socket: 1
p3_647:  p4_error: net_recv read:  probable EOF on socket: 1
p2_12796:  p4_error: net_recv read:  probable EOF on socket: 1
rm_l_1_1848:  p4_error: listener select: -1
p6_21651:  p4_error: net_recv read:  probable EOF on socket: 1
P4 procgroup file is pr_group.


All the STDERR files are there but are of zero size.  The STDOUT files
have nothing of note at the end (just the usual sea ice monitor
statistics for one of the packages that I am using).  Something else
odd: they all seemed to break down at almost exactly the same time
(even though I did not start them all this close together):


[enderton@itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
-rw-------  1 enderton aces 10409734 Dec  2 04:04 AquaC3O10/AqC3O10_C.e51362
-rw-------  1 enderton aces 55986184 Dec  2 04:04 AquaC3O10/AqC3O10_C.o51362
[enderton@itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
-rw-------  1 enderton aces 10104455 Dec  2 04:03 AquaC3O5/AqC3O5_C.e51361
-rw-------  1 enderton aces 54339284 Dec  2 04:03 AquaC3O5/AqC3O5_C.o51361
[enderton@itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
-rw-------  1 enderton aces  9799175 Dec  2 04:04 AquaC3O20/AqC3O20_C.e51363
-rw-------  1 enderton aces 52693228 Dec  2 04:04 AquaC3O20/AqC3O20_C.o51363
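
The same comparison can be made without reading ls output by eye.
What follows is only a minimal Python sketch: the helper names
collect_mtimes and mtime_spread are made up for illustration, and the
glob patterns assume the script is run from the directory used in the
listing above.  It prints each PBS file with its last-modified time
and the number of seconds between the earliest and the latest one.

```python
import glob
import os
from datetime import datetime


def collect_mtimes(patterns):
    """Return {path: mtime in seconds} for every file matching any glob pattern."""
    mtimes = {}
    for pattern in patterns:
        for path in glob.glob(pattern):
            mtimes[path] = os.path.getmtime(path)
    return mtimes


def mtime_spread(mtimes):
    """Seconds between the earliest and latest modification time (0.0 if empty)."""
    if not mtimes:
        return 0.0
    values = list(mtimes.values())
    return max(values) - min(values)


if __name__ == "__main__":
    # Patterns copied from the ls commands above (relative paths are an assumption).
    patterns = [
        "AquaC3O5/AqC3O5_C.*",
        "AquaC3O10/AqC3O10_C.*",
        "AquaC3O20/AqC3O20_C.*",
    ]
    found = collect_mtimes(patterns)
    # List files from oldest to newest so near-simultaneous failures line up.
    for path, t in sorted(found.items(), key=lambda kv: kv[1]):
        stamp = datetime.fromtimestamp(t).isoformat(sep=" ", timespec="seconds")
        print(stamp, path)
    print("spread (s):", mtime_spread(found))
```

A spread of a minute or so across jobs that were started 20 minutes
apart would be consistent with a single outside event hitting all
three runs, though it does not prove one.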


I double-checked all the paths in my PBS scripts to make sure that
the runs were not trying to call the same executables or otherwise
crossing paths, but everything looked fine.

Any ideas?

The running directories are:
/net/itrda/scratch-4/enderton/AquaC3O5
/net/itrda/scratch-4/enderton/AquaC3O10
/net/itrda/scratch-4/enderton/AquaC3O20


Cheers,
Daniel



