[Aces-support] Odd mpi errors relating to MNC package

chris hill cnh at mit.edu
Sat Dec 4 12:14:08 EST 2004


Hi Chinnawat,

 For some reason itrda shut down at around 10AM today.
 We are currently trying to get a response from the systems team on when
they will be able to get to this. 

 Sorry for the inconvenience.

Chris
On Sat, 2004-12-04 at 12:03, Chinnawat Surussavadee wrote:
> Hello,
> 
> I'm just wondering if anyone knows when itrda will be up again. I had jobs
> running last night and want to get the results out.
> 
> Thanks,
> 
> Chinnawat
> 
> Quoting Daniel Enderton <enderton at MIT.EDU>:
> 
> > Chris [Ed, and others] --
> > 
> > ITRDA seems to be down right now (not getting anything from ping), 
> > but the three jobs currently running that have been cutting out at 
> > 4am (I can't check whether it happened last night since I can't get 
> > in) are in the following directories:
> > 
> > /net/itrda/scratch-4/enderton/AquaC3O[5,10,20]/
> > 
> > In each of the above three directories there is a runCpl (running the 
> > coupled component) and a runOcn (running the ocean-only component) 
> > script.  These two scripts are run in turn (each calling the other 
> > when it finishes), and both encounter the problem.  The coupled 
> > scripts run in the rank_[0-7] directories and runOcn runs in the 
> > ocn_only directory.
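> > 
> > In case it helps, the chaining boils down to something like the 
> > sketch below (a rough outline only; the node count and mpirun 
> > arguments here are placeholders, not the actual scripts):
> > 
> >   #!/bin/sh
> >   #PBS -l nodes=8
> >   # runCpl (sketch): run the coupled component, then hand off to the
> >   # ocean-only leg.  Placeholder names, not the real scripts.
> >   cd $PBS_O_WORKDIR
> >   mpirun -np 8 ./mitgcmuv.O1
> >   # when the coupled segment finishes, submit the ocean-only script
> >   qsub runOcn
> > 
> > runOcn does the same in reverse, resubmitting runCpl when the 
> > ocean-only segment completes.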
> > 
> > Cheers,
> > Daniel
> > 
> > >Daniel,
> > >
> > >  Can you send a copy of your job script?
> > >
> > >Thanks,
> > >
> > >Chris
> > >On Fri, 2004-12-03 at 18:27, Daniel Enderton wrote:
> > >>  Hey Ed [and others],
> > >>
> > >>  The issue with my jobs cutting out at 4am happened again last night.
> > >>  Has this happened with anyone else's ITRDA jobs?  Should I continue
> > >>  to expect this, or is this just an issue with getting ITRDA fully
> > >>  online?  Is it something about how my model trials are configured?
> > >>  If so, what can be done?
> > >>
> > >>  Cheers,
> > >>  Daniel
> > >>
> > >>
> > >>  >On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> > >>  >>  I started three jobs last night around 1am, within a range of
> > >>  >>  about 20 minutes of each other.  They all came back with MPI
> > >>  >>  errors (in the PBS error files) relating to NetCDF and MNC that
> > >>  >>  read something like:
> > >>  >>
> > >>  >>
> > >>  >>  ABNORMAL END: package MNC
> > >>  >>  forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> > >>  >>  Image              PC        Routine            Line        Source
> > >>  >>  mitgcmuv.O1        081F75F8  Unknown               Unknown  Unknown
> > >>  >>
> > >>  >>  Stack trace terminated abnormally.
> > >>  >>       p4_error: latest msg from perror: Bad file descriptor
> > >>  >>
> > >>  >>
> > >>  >>  In the PBS standard-out file, the problematic part looked like this:
> > >>  >>
> > >>  >>
> > >>  >>    NetCDF ERROR: No such file or directory
> > >>  >>    MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> > >>  >>  p4_31316:  p4_error: net_recv read:  probable EOF on socket: 1
> > >>  >>  p5_27711:  p4_error: net_recv read:  probable EOF on socket: 1
> > >>  >>  p7_16293:  p4_error: net_recv read:  probable EOF on socket: 1
> > >>  >>  p3_647:  p4_error: net_recv read:  probable EOF on socket: 1
> > >>  >>  p2_12796:  p4_error: net_recv read:  probable EOF on socket: 1
> > >>  >>  rm_l_1_1848:  p4_error: listener select: -1
> > >>  >>  p6_21651:  p4_error: net_recv read:  probable EOF on socket: 1
> > >>  >>  P4 procgroup file is pr_group.
> > >>  >>
> > >>  >>
> > >>  >>  All the STDERR files are there but of zero size.  The STDOUT
> > >>  >>  files have nothing of note at the end (just the usual sea-ice
> > >>  >>  monitor statistics for one of the packages that I am using).
> > >>  >>  Something else odd: they all seemed to break down at almost
> > >>  >>  exactly the same time (even though I did not start them
> > >>  >>  anywhere near that close together):
> > >>  >>
> > >>  >>
> > >>  >>  [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> > >>  >>  -rw-------  1 enderton aces 10409734 Dec  2 04:04 AquaC3O10/AqC3O10_C.e51362
> > >>  >>  -rw-------  1 enderton aces 55986184 Dec  2 04:04 AquaC3O10/AqC3O10_C.o51362
> > >>  >>  [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> > >>  >>  -rw-------  1 enderton aces 10104455 Dec  2 04:03 AquaC3O5/AqC3O5_C.e51361
> > >>  >>  -rw-------  1 enderton aces 54339284 Dec  2 04:03 AquaC3O5/AqC3O5_C.o51361
> > >>  >>  [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> > >>  >>  -rw-------  1 enderton aces  9799175 Dec  2 04:04 AquaC3O20/AqC3O20_C.e51363
> > >>  >>  -rw-------  1 enderton aces 52693228 Dec  2 04:04 AquaC3O20/AqC3O20_C.o51363
> > >>  >
> > >>  >
> > >>  >Hi Daniel,
> > >>  >
> > >>  >*Good* bug report!
> > >>  >
> > >>  >It looks like the kernel ran out of file descriptors.  It does not look
> > >>  >like a problem with MITgcm itself [and I'm not just saying that to pass
> > >>  >the blame off as the MNC author ;-)]
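> > >>  >
> > >>  >If it happens again, you can check the descriptor situation
> > >>  >directly; something along these lines (a rough sketch, and the
> > >>  >/proc paths assume a reasonably recent Linux kernel):
> > >>  >
> > >>  >  # per-process limit on open file descriptors
> > >>  >  ulimit -n
> > >>  >  # system-wide file handles: allocated, free, maximum
> > >>  >  cat /proc/sys/fs/file-nr
> > >>  >  # system-wide ceiling
> > >>  >  cat /proc/sys/fs/file-max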
> > >>  >
> > >>  >The 4:04am time frame is very suspicious.  It's *right* after the
> > >>  >system usually kicks off some cron jobs that update the locate
> > >>  >database, update whereis, do the pre-linking, etc.  At these times
> > >>  >the system can be very heavily loaded, and it seems that it ran
> > >>  >out of file descriptors ("file handles").
> > >>  >
> > >>  >Here's a relevant link from the magic of Google:
> > >>  >
> > >>  >http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html
> > >>  >
> > >>  >Ed
> > >>  >
> > >>  >--
> > >>  >Edward H. Hill III, PhD
> > >>  >office:  MIT Dept. of EAPS;  Rm 54-1424;  77 Massachusetts Ave.
> > >>  >              Cambridge, MA 02139-4307
> > >>  >emails:  eh3 at mit.edu                ed at eh3.com
> > >>  >URLs:    http://web.mit.edu/eh3/    http://eh3.com/
> > >>  >phone:   617-253-0098
> > >>  >fax:     617-253-4464
> 
> 
> 
> _______________________________________________
> Aces-support mailing list
> Aces-support at acesgrid.org
> http://acesgrid.org/mailman/listinfo/aces-support
