[Aces-support] Odd mpi errors relating to MNC package
chris hill
cnh at mit.edu
Sat Dec 4 12:14:08 EST 2004
Hi Chinnawat,
For some reason itrda shut down at around 10AM today.
We are currently trying to get a response from the systems team on when
they will be able to get to this.
Sorry for the inconvenience.
Chris
On Sat, 2004-12-04 at 12:03, Chinnawat Surussavadee wrote:
> Hello,
>
> I'm just wondering if anyone knows when itrda will be up again. I had my jobs
> running last night and want to get the results out.
>
> Thanks,
>
> Chinnawat
>
> Quoting Daniel Enderton <enderton at MIT.EDU>:
>
> > Chris [Ed, and others] --
> >
> > ITRDA seems to be down right now (not getting anything from ping),
> > but the three jobs currently running that have been cutting out at
> > 4am (I can't see if it happened last night since I can't get in) are
> > in the following directories:
> >
> > /net/itrda/scratch-4/enderton/AquaC3O[5,10,20]/
> >
> > In each of the above three directories there is a runCpl script
> > (running the coupled component) and a runOcn script (running the
> > ocean-only component). Those are the two scripts that get run, each
> > calling the other when it finishes, and both encounter the problem.
> > The coupled script runs in the rank_[0-7] directories and runOcn runs
> > in the ocn_only directory.
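> >
> > Roughly, the chaining in runCpl looks like the sketch below (an
> > outline only, not the exact script; the mpirun line and the final
> > hand-off are shorthand for what the real scripts do):
> >
> >   #!/bin/sh
> >   # runCpl (outline): run the coupled job via the MPICH/p4 procgroup
> >   # file pr_group, then hand off to the ocean-only run when done.
> >   mpirun -p4pg pr_group ./mitgcmuv.O1
> >   ./runOcn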
> >
> > Cheers,
> > Daniel
> >
> > >Daniel,
> > >
> > > Can you send a copy of your job script?
> > >
> > >Thanks,
> > >
> > >Chris
> > >On Fri, 2004-12-03 at 18:27, Daniel Enderton wrote:
> > >> Hey Ed [and others],
> > >>
> > >> The issue with my jobs cutting out at 4am happened again last night.
> > >> Has this happened with anyone else's ITRDA jobs? Should I continue
> > >> to expect this, or is this just an issue with getting ITRDA fully online?
> > >> Is it something about how my model trials are configured? If so,
> > >> what can be done?
> > >>
> > >> Cheers,
> > >> Daniel
> > >>
> > >>
> > >> >On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> > >> >> I started three jobs last night around 1am within a range of about
> > >> >> 20 minutes of each other. They all came back with MPI errors (in
> > >> >> the PBS error files) relating to NetCDF and MNC that read something
> > >> >> like:
> > >> >>
> > >> >>
> > >> >> ABNORMAL END: package MNC
> > >> >> forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> > >> >> Image PC Routine Line Source
> > >> >> mitgcmuv.O1 081F75F8 Unknown Unknown Unknown
> > >> >>
> > >> >> Stack trace terminated abnormally.
> > >> >> p4_error: latest msg from perror: Bad file descriptor
> > >> >>
> > >> >>
> > >> >> In the PBS standard output file, the problematic part looked like this:
> > >> >>
> > >> >>
> > >> >> NetCDF ERROR: No such file or directory
> > >> >> MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> > >> >> p4_31316: p4_error: net_recv read: probable EOF on socket: 1
> > >> >> p5_27711: p4_error: net_recv read: probable EOF on socket: 1
> > >> >> p7_16293: p4_error: net_recv read: probable EOF on socket: 1
> > >> >> p3_647: p4_error: net_recv read: probable EOF on socket: 1
> > >> >> p2_12796: p4_error: net_recv read: probable EOF on socket: 1
> > >> >> rm_l_1_1848: p4_error: listener select: -1
> > >> >> p6_21651: p4_error: net_recv read: probable EOF on socket: 1
> > >> >> P4 procgroup file is pr_group.
> > >> >>
> > >> >>
> > >> >> All the STDERR files are there but of zero size. The STDOUT files
> > >> >> have nothing of note in them at the end (just the usual sea ice
> > >> >> monitor statistics for one of the packages that I am using).
> > >> >> Something else odd: they all seemed to break down at almost the
> > >> >> exact same time (even though I did not start them all that close
> > >> >> together in time):
> > >> >>
> > >> >>
> > >> >> [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> > >> >> -rw------- 1 enderton aces 10409734 Dec 2 04:04 AquaC3O10/AqC3O10_C.e51362
> > >> >> -rw------- 1 enderton aces 55986184 Dec 2 04:04 AquaC3O10/AqC3O10_C.o51362
> > >> >> [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> > >> >> -rw------- 1 enderton aces 10104455 Dec 2 04:03 AquaC3O5/AqC3O5_C.e51361
> > >> >> -rw------- 1 enderton aces 54339284 Dec 2 04:03 AquaC3O5/AqC3O5_C.o51361
> > >> >> [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> > >> >> -rw------- 1 enderton aces 9799175 Dec 2 04:04 AquaC3O20/AqC3O20_C.e51363
> > >> >> -rw------- 1 enderton aces 52693228 Dec 2 04:04 AquaC3O20/AqC3O20_C.o51363
> > >> >
> > >> >
> > >> >Hi Daniel,
> > >> >
> > >> >*Good* bug report!
> > >> >
> > >> >It looks like the kernel ran out of file descriptors. It does not look
> > >> >like a problem with MITgcm itself [and I'm not just saying that to pass
> > >> >the blame off as the MNC author ;-)]
> > >> >
> > >> >The 4:04am time frame is very suspicious. It's *right* after the
> > >> >system usually kicks off some cron jobs that update the locate
> > >> >database, update whereis, do the pre-linking, etc. At these times
> > >> >the system can be very heavily loaded, and it seems that it ran out
> > >> >of file descriptors ("file handles").
> > >> >
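> > >> >If you want to check, something along these lines (standard Linux
> > >> >knobs; paths and exact numbers on itrda may differ) shows the
> > >> >per-process limit, the system-wide file-handle usage, and when the
> > >> >daily cron jobs kick off:
> > >> >
> > >> >  ulimit -n                    # per-process open-file limit
> > >> >  cat /proc/sys/fs/file-nr     # handles allocated / free / max, system-wide
> > >> >  cat /proc/sys/fs/file-max    # kernel-wide maximum
> > >> >  grep run-parts /etc/crontab  # when cron.daily (updatedb, prelink, ...) runs
> > >> >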
> > >> >Here's a relevant link from the magic of Google:
> > >> >
> > >> >http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html
> > >> >
> > >> >Ed
> > >> >
> > >> >--
> > >> >Edward H. Hill III, PhD
> > >> >office: MIT Dept. of EAPS; Rm 54-1424; 77 Massachusetts Ave.
> > >> > Cambridge, MA 02139-4307
> > >> >emails: eh3 at mit.edu ed at eh3.com
> > >> >URLs: http://web.mit.edu/eh3/ http://eh3.com/
> > >> >phone: 617-253-0098
> > >> >fax: 617-253-4464
> > >> >