[Aces-support] Odd mpi errors relating to MNC package

Fri Dec 3 18:27:26 EST 2004

Hey Ed [and others],

The issue with my jobs cutting out at 4am happened again last night. 
Has this happened with anyone else's ITRDA jobs?  Should I continue 
to expect this, or is this just an issue with getting ITRDA fully online? 
Is it something about how my model trials are configured?  If so, 
what can be done?

Cheers,
Daniel

>On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
>>  I started three jobs last night around 1am, within about 20 minutes
>>  of each other.  They all came back with MPI errors (in the
>>  PBS error files) relating to NetCDF and MNC that read something like:
>>
>>
>>  ABNORMAL END: package MNC
>>  forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
>>  Image              PC        Routine            Line        Source
>>  mitgcmuv.O1        081F75F8  Unknown               Unknown  Unknown
>>
>>  Stack trace terminated abnormally.
>>       p4_error: latest msg from perror: Bad file descriptor
>>
>>
>>  In the PBS standard output file, the problematic part looked like this:
>>
>>
>>    NetCDF ERROR: No such file or directory
>>    MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
>>  p4_31316:  p4_error: net_recv read:  probable EOF on socket: 1
>>  p5_27711:  p4_error: net_recv read:  probable EOF on socket: 1
>>  p7_16293:  p4_error: net_recv read:  probable EOF on socket: 1
>>  p3_647:  p4_error: net_recv read:  probable EOF on socket: 1
>>  p2_12796:  p4_error: net_recv read:  probable EOF on socket: 1
>>  rm_l_1_1848:  p4_error: listener select: -1
>>  p6_21651:  p4_error: net_recv read:  probable EOF on socket: 1
>>  P4 procgroup file is pr_group.
>>
>>
>>  All the STDERR files are there but of zero size.  The STDOUT files
>>  have nothing of note at the end (just the usual sea ice
>>  monitor statistics for one of the packages that I am using).
>>  Something else odd: they all seemed to break down at almost exactly
>>  the same time (even though I did not start them all this close
>>  together):
>>
>>
>>  [enderton@itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
>>  -rw-------  1 enderton aces 10409734 Dec  2 04:04 AquaC3O10/AqC3O10_C.e51362
>>  -rw-------  1 enderton aces 55986184 Dec  2 04:04 AquaC3O10/AqC3O10_C.o51362
>>  [enderton@itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
>>  -rw-------  1 enderton aces 10104455 Dec  2 04:03 AquaC3O5/AqC3O5_C.e51361
>>  -rw-------  1 enderton aces 54339284 Dec  2 04:03 AquaC3O5/AqC3O5_C.o51361
>>  [enderton@itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
>>  -rw-------  1 enderton aces  9799175 Dec  2 04:04 AquaC3O20/AqC3O20_C.e51363
>>  -rw-------  1 enderton aces 52693228 Dec  2 04:04 AquaC3O20/AqC3O20_C.o51363
>
>
>Hi Daniel,
>
>*Good* bug report!
>
>It looks like the kernel ran out of file descriptors.  It does not look
>like a problem with MITgcm itself [and I'm not just saying that to pass
>the blame off as the MNC author ;-)]
>
>The 4:04am time frame is very suspicious.  It's *right* after the system
>usually kicks off some cron jobs that update the locate database, update
>whereis, do the pre-linking, etc.  At these times the system can be very
>heavily loaded, and it seems that it ran out of file descriptors ("file
>handles").
>
>Here's a relevant link from the magic of Google:
>
>http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html
>
>Ed
>
>--
>Edward H. Hill III, PhD
>office:  MIT Dept. of EAPS;  Rm 54-1424;  77 Massachusetts Ave.
>              Cambridge, MA 02139-4307
>emails:  eh3 at mit.edu                ed at eh3.com
>URLs:    http://web.mit.edu/eh3/    http://eh3.com/
>phone:   617-253-0098
>fax:     617-253-4464
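
One way to check the diagnosis quoted above (the node running out of file descriptors while the 4am cron jobs run) might be to log descriptor usage around that time. The following is only a minimal sketch, assuming a Linux node where /proc/sys/fs/file-nr is readable; the names parse_file_nr and fd_report are hypothetical helpers for illustration and are not part of MITgcm, MNC, or MPICH.

```python
import resource
from pathlib import Path


def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr.

    The fields are: allocated file handles, allocated-but-unused handles,
    and the system-wide maximum number of file handles.
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def fd_report(proc_path="/proc/sys/fs/file-nr"):
    """Collect system-wide and per-process file descriptor limits."""
    report = {}
    path = Path(proc_path)
    # The system-wide counters exist only on Linux; skip them otherwise.
    if path.exists():
        report["system"] = parse_file_nr(path.read_text())
    # Per-process limit on open files (what `ulimit -n` reports).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    report["process"] = {"soft": soft, "hard": hard}
    return report


if __name__ == "__main__":
    print(fd_report())
```

Running something like this from cron a few minutes before and after 4am would show whether the "allocated" count approaches "max" while the nightly jobs are active. If it does not, the cause may lie elsewhere.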