[Aces-support] Odd mpi errors relating to MNC package

chris hill cnh at mit.edu
Fri Dec 3 19:24:18 EST 2004


Daniel,

 Can you send a copy of your job script?

Thanks,

Chris
On Fri, 2004-12-03 at 18:27, Daniel Enderton wrote:
> Hey Ed [and others],
> 
> The issue with my jobs cutting out at 4am happened again last night. 
> Has this happened with anyone else's ITRDA jobs?  Should I continue 
> to expect this, or is this just an issue with getting ITRDA fully online? 
> Is it something about how my model trials are configured?  If so, 
> what can be done?
> 
> Cheers,
> Daniel
> 
> 
> >On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> >>  I started three jobs last night around 1am within a range of about 20
> >>  minutes of each other.  They all came back with mpi errors (in the
> >>  pbs error files) relating to netcdf and mnc that read something like:
> >>
> >>
> >>  ABNORMAL END: package MNC
> >>  forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> >>  Image              PC        Routine            Line        Source
> >>  mitgcmuv.O1        081F75F8  Unknown               Unknown  Unknown
> >>
> >>  Stack trace terminated abnormally.
> >>       p4_error: latest msg from perror: Bad file descriptor
> >>
> >>
> >>  In the pbs standard out file, the problematic part looked like this:
> >>
> >>
> >>    NetCDF ERROR: No such file or directory
> >>    MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> >>  p4_31316:  p4_error: net_recv read:  probable EOF on socket: 1
> >>  p5_27711:  p4_error: net_recv read:  probable EOF on socket: 1
> >>  p7_16293:  p4_error: net_recv read:  probable EOF on socket: 1
> >>  p3_647:  p4_error: net_recv read:  probable EOF on socket: 1
> >>  p2_12796:  p4_error: net_recv read:  probable EOF on socket: 1
> >>  rm_l_1_1848:  p4_error: listener select: -1
> >>  p6_21651:  p4_error: net_recv read:  probable EOF on socket: 1
> >>  P4 procgroup file is pr_group.
> >>
> >>
> >>  All the STDERR files are there but of zero size.  The STDOUT files
> >>  have nothing in them of note at the end (just the usual sea ice
> >>  monitor statistic for one of the packages that I am using).
> >>  Something else odd; they all seemed to break down at almost the exact
> >>  same time (even though I did not start them all this close together
> >>  in time):
> >>
> >>
> >>  [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> >>  -rw-------  1 enderton aces 10409734 Dec  2 04:04 AquaC3O10/AqC3O10_C.e51362
> >>  -rw-------  1 enderton aces 55986184 Dec  2 04:04 AquaC3O10/AqC3O10_C.o51362
> >>  [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> >>  -rw-------  1 enderton aces 10104455 Dec  2 04:03 AquaC3O5/AqC3O5_C.e51361
> >>  -rw-------  1 enderton aces 54339284 Dec  2 04:03 AquaC3O5/AqC3O5_C.o51361
> >>  [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> >>  -rw-------  1 enderton aces  9799175 Dec  2 04:04 AquaC3O20/AqC3O20_C.e51363
> >>  -rw-------  1 enderton aces 52693228 Dec  2 04:04 AquaC3O20/AqC3O20_C.o51363
> >
> >
> >Hi Daniel,
> >
> >*Good* bug report!
> >
> >It looks like the kernel ran out of file descriptors.  It does not look
> >like a problem with MITgcm itself [and I'm not just saying that to pass
> >the blame off as the MNC author ;-)]
> >
> >The 4:04am time frame is very suspicious.  It's *right* after the system
> >usually kicks off some cron jobs that update the locate database, update
> >whereis, do the pre-linking, etc.  At these times the system can be very
> >heavily loaded, and it seems it ran out of file descriptors ("file
> >handles").
> >
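> >[Editorial aside: one quick way to check whether a Linux system is
> >running low on file descriptors is sketched below; the /proc paths are
> >standard on Linux, though the exact limits vary by system and the
> >cluster above may be tuned differently.]

```shell
#!/bin/sh
# Sketch: report per-process and system-wide file descriptor limits on Linux.
# /proc/sys/fs/file-nr holds three fields: allocated, free, maximum.
echo "per-process soft limit: $(ulimit -n)"
if [ -r /proc/sys/fs/file-nr ]; then
    read allocated free maximum < /proc/sys/fs/file-nr
    echo "system-wide: ${allocated} allocated of ${maximum} maximum"
fi
```

If the allocated count approaches the maximum while the nightly cron jobs run, that would be consistent with the "Bad file descriptor" failures above.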
> >Here's a relevant link from the magic of Google:
> >
> >http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html
> >
> >Ed
> >
> >--
> >Edward H. Hill III, PhD
> >office:  MIT Dept. of EAPS;  Rm 54-1424;  77 Massachusetts Ave.
> >              Cambridge, MA 02139-4307
> >emails:  eh3 at mit.edu                ed at eh3.com
> >URLs:    http://web.mit.edu/eh3/    http://eh3.com/
> >phone:   617-253-0098
> >fax:     617-253-4464
> >
> >_______________________________________________
> >Aces-support mailing list
> >Aces-support at acesgrid.org
> >http://acesgrid.org/mailman/listinfo/aces-support
> 
