[Aces-support] Odd mpi errors relating to MNC package
Chinnawat Surussavadee
surusc at MIT.EDU
Sat Dec 4 12:03:13 EST 2004
Hello,
I'm just wondering if anyone knows when itrda will be up again. I had my
jobs running last night and want to get the results out.
Thanks,
Chinnawat
Quoting Daniel Enderton <enderton at MIT.EDU>:
> Chris [Ed, and others] --
>
> ITRDA seems to be down right now (not getting anything from ping),
> but the three jobs currently running that are cutting out at 4am
> (I can't see if it happened last night since I can't get in) are
> in the following directories:
>
> /net/itrda/scratch-4/enderton/AquaC3O[5,10,20]/
>
> In each of the above three directories there is a runCpl (running the
> coupled component) and a runOcn (running the ocean-only component)
> script. These are the two scripts that are run (each calling the
> other when finished), and both encounter the problem. The coupled
> script runs in the rank_[0-7] directories and runOcn runs in the
> ocn_only directory.
>
> Cheers,
> Daniel
>
> >Daniel,
> >
> > Can you send a copy of your job script.
> >
> >Thanks,
> >
> >Chris
> >On Fri, 2004-12-03 at 18:27, Daniel Enderton wrote:
> >> Hey Ed [and others],
> >>
> >> The issue with my jobs cutting out at 4am happened again last night.
> >> Has this happened with anyone else's ITRDA jobs? Should I continue
> >> to expect this, or is this just an issue with getting ITRDA fully online?
> >> Is it something about how my model trials are configured? If so,
> >> what can be done?
> >>
> >> Cheers,
> >> Daniel
> >>
> >>
> >> >On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> >> >> I started three jobs last night around 1am, within about 20
> >> >> minutes of each other. They all came back with MPI errors (in the
> >> >> PBS error files) relating to netCDF and MNC that read something
> >> >> like:
> >> >>
> >> >>
> >> >> ABNORMAL END: package MNC
> >> >> forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> >> >> Image PC Routine Line Source
> >> >> mitgcmuv.O1 081F75F8 Unknown Unknown Unknown
> >> >>
> >> >> Stack trace terminated abnormally.
> >> >> p4_error: latest msg from perror: Bad file descriptor
> >> >>
> >> >>
> >> >> In the pbs standard out file, the problematic part looked like this:
> >> >>
> >> >>
> >> >> NetCDF ERROR: No such file or directory
> >> >> MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> >> >> p4_31316: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p5_27711: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p7_16293: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p3_647: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p2_12796: p4_error: net_recv read: probable EOF on socket: 1
> >> >> rm_l_1_1848: p4_error: listener select: -1
> >> >> p6_21651: p4_error: net_recv read: probable EOF on socket: 1
> >> >> P4 procgroup file is pr_group.
> >> >>
> >> >>
> >> >> All the STDERR files are there but of zero size. The STDOUT files
> >> >> have nothing in them of note at the end (just the usual sea ice
> >> >> monitor statistic for one of the packages that I am using).
> >> >> Something else odd: they all seemed to break down at almost the
> >> >> exact same time (even though I did not start them all this close
> >> >> together):
> >> >>
> >> >>
> >> >> [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> >> >> -rw------- 1 enderton aces 10409734 Dec 2 04:04 AquaC3O10/AqC3O10_C.e51362
> >> >> -rw------- 1 enderton aces 55986184 Dec 2 04:04 AquaC3O10/AqC3O10_C.o51362
> >> >> [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> >> >> -rw------- 1 enderton aces 10104455 Dec 2 04:03 AquaC3O5/AqC3O5_C.e51361
> >> >> -rw------- 1 enderton aces 54339284 Dec 2 04:03 AquaC3O5/AqC3O5_C.o51361
> >> >> [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> >> >> -rw------- 1 enderton aces 9799175 Dec 2 04:04 AquaC3O20/AqC3O20_C.e51363
> >> >> -rw------- 1 enderton aces 52693228 Dec 2 04:04 AquaC3O20/AqC3O20_C.o51363
> >> >
> >> >
> >> >Hi Daniel,
> >> >
> >> >*Good* bug report!
> >> >
> >> >It looks like the kernel ran out of file descriptors. It does not
> >> >look like a problem with MITgcm itself [and I'm not just saying that
> >> >to pass the blame off as the MNC author ;-)]
> >> >
> >> >The 4:04am time frame is very suspicious. It's *right* after the
> >> >system usually kicks off some cron jobs that update the locate
> >> >database, update whereis, do the pre-linking, etc. At these times
> >> >the system can be very heavily loaded, and it seems that it ran out
> >> >of file descriptors ("file handles").
> >> >
> >> >Here's a relevant link from the magic of Google:
> >> >
> >> >http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html
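[Editor's note: the diagnosis above, descriptor exhaustion under heavy load, can be reproduced in miniature. The sketch below is illustrative and not part of the original thread; it lowers the process's own RLIMIT_NOFILE and opens temporary files until the kernel refuses a new descriptor with EMFILE ("Too many open files"). Assumes a Linux/Unix system.]

```python
import errno
import resource
import tempfile

# Lower the soft per-process fd limit so exhaustion is easy and
# harmless to trigger (hypothetical value of 64 for illustration).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

open_files = []
try:
    while True:
        # Each temporary file consumes one file descriptor.
        open_files.append(tempfile.TemporaryFile())
except OSError as exc:
    # EMFILE means the kernel refused to hand out another descriptor,
    # the same condition Ed suspects hit the 4am jobs.
    assert exc.errno == errno.EMFILE
    print(f"descriptor limit hit after {len(open_files)} extra files")
finally:
    for f in open_files:
        f.close()
```

When a system-wide limit (rather than the per-process one) is exhausted, unrelated processes fail the same way at the same moment, which would explain all three jobs dying within a minute of each other.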
> >> >
> >> >Ed
> >> >
> >> >--
> >> >Edward H. Hill III, PhD
> >> >office: MIT Dept. of EAPS; Rm 54-1424; 77 Massachusetts Ave.
> >> > Cambridge, MA 02139-4307
> >> >emails: eh3 at mit.edu ed at eh3.com
> >> >URLs: http://web.mit.edu/eh3/ http://eh3.com/
> >> >phone: 617-253-0098
> >> >fax: 617-253-4464
> >> >
> >> >_______________________________________________
> >> >Aces-support mailing list
> >> >Aces-support at acesgrid.org
> >> >http://acesgrid.org/mailman/listinfo/aces-support