[Aces-support] Odd mpi errors relating to MNC package
Chinnawat Surussavadee
surusc at MIT.EDU
Sat Dec 4 12:03:13 EST 2004
Hello,
I'm just wondering if anyone knows when itrda will be up again. I had my
jobs running last night and want to get the results out.
Thanks,
Chinnawat
Quoting Daniel Enderton <enderton at MIT.EDU>:
> Chris [Ed, and others] --
>
> ITRDA seems to be down right now (not getting anything from ping),
> but the three jobs currently running that are cutting out at 4am
> (I can't see if it happened last night since I can't get in) are
> in the following directories:
>
> /net/itrda/scratch-4/enderton/AquaC3O[5,10,20]/
>
> In each of the above three directories there is a runCpl (running the
> coupled component) and a runOcn (running the ocean-only component)
> script. These are the two scripts that are run (each calling the
> other when finished), and both encounter the problem. The coupled
> script runs in the rank_[0-7] directories and runOcn runs in the
> ocn_only directory.
>
> Cheers,
> Daniel
>
> >Daniel,
> >
> > Can you send a copy of your job script.
> >
> >Thanks,
> >
> >Chris
> >On Fri, 2004-12-03 at 18:27, Daniel Enderton wrote:
> >> Hey Ed [and others],
> >>
> >> The issue with my jobs cutting out at 4am happened again last night.
> >> Has this happened with anyone else's ITRDA jobs? Should I continue
> >> to expect this, or is this just an issue with getting ITRDA fully online?
> >> Is it something about how my model trials are configured? If so,
> >> what can be done?
> >>
> >> Cheers,
> >> Daniel
> >>
> >>
> >> >On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> >> >> I started three jobs last night around 1am, within about 20
> >> >> minutes of each other. They all came back with MPI errors (in the
> >> >> PBS error files) relating to netCDF and MNC that read something
> >> >> like:
> >> >>
> >> >>
> >> >> ABNORMAL END: package MNC
> >> >> forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> >> >> Image PC Routine Line Source
> >> >> mitgcmuv.O1 081F75F8 Unknown Unknown Unknown
> >> >>
> >> >> Stack trace terminated abnormally.
> >> >> p4_error: latest msg from perror: Bad file descriptor
> >> >>
> >> >>
> >> >> In the pbs standard out file, the problematic part looked like this:
> >> >>
> >> >>
> >> >> NetCDF ERROR: No such file or directory
> >> >> MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> >> >> p4_31316: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p5_27711: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p7_16293: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p3_647: p4_error: net_recv read: probable EOF on socket: 1
> >> >> p2_12796: p4_error: net_recv read: probable EOF on socket: 1
> >> >> rm_l_1_1848: p4_error: listener select: -1
> >> >> p6_21651: p4_error: net_recv read: probable EOF on socket: 1
> >> >> P4 procgroup file is pr_group.
> >> >>
> >> >>
> >> >> All the STDERR files are there but of zero size. The STDOUT files
> >> >> have nothing in them of note at the end (just the usual sea ice
> >> >> monitor statistic for one of the packages that I am using).
> >> >> Something else odd: they all seemed to break down at almost the
> >> >> exact same time (even though I did not start them all this close
> >> >> together):
> >> >>
> >> >>
> >> >> [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> >> >> -rw------- 1 enderton aces 10409734 Dec 2 04:04 AquaC3O10/AqC3O10_C.e51362
> >> >> -rw------- 1 enderton aces 55986184 Dec 2 04:04 AquaC3O10/AqC3O10_C.o51362
> >> >> [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> >> >> -rw------- 1 enderton aces 10104455 Dec 2 04:03 AquaC3O5/AqC3O5_C.e51361
> >> >> -rw------- 1 enderton aces 54339284 Dec 2 04:03 AquaC3O5/AqC3O5_C.o51361
> >> >> [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> >> >> -rw------- 1 enderton aces 9799175 Dec 2 04:04 AquaC3O20/AqC3O20_C.e51363
> >> >> -rw------- 1 enderton aces 52693228 Dec 2 04:04 AquaC3O20/AqC3O20_C.o51363
> >> >
> >> >
> >> >Hi Daniel,
> >> >
> >> >*Good* bug report!
> >> >
> >> >It looks like the kernel ran out of file descriptors. It does not
> >> >look like a problem with MITgcm itself [and I'm not just saying that
> >> >to pass the blame off as the MNC author ;-)]
> >> >
> >> >The 4:04am time frame is very suspicious. It's *right* after the
> >> >system usually kicks off some cron jobs that update the locate
> >> >database, update whereis, do the pre-linking, etc. At these times
> >> >the system can be very heavily loaded, and it seems that it ran out
> >> >of file descriptors ("file handles").
> >> >
> >> >Here's a relevant link from the magic of Google:
> >> >
> >> >http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html
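[Editor's note: the diagnosis above, descriptor exhaustion under heavy load, can be reproduced in miniature. The sketch below is illustrative and not part of the original thread; it lowers the process's own RLIMIT_NOFILE and opens temporary files until the kernel refuses a new descriptor with EMFILE ("Too many open files"). Assumes a Linux/Unix system.]

```python
import errno
import resource
import tempfile

# Lower the soft per-process fd limit so exhaustion is easy and
# harmless to trigger (hypothetical value of 64 for illustration).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

open_files = []
try:
    while True:
        # Each temporary file consumes one file descriptor.
        open_files.append(tempfile.TemporaryFile())
except OSError as exc:
    # EMFILE means the kernel refused to hand out another descriptor,
    # the same condition Ed suspects hit the 4am jobs.
    assert exc.errno == errno.EMFILE
    print(f"descriptor limit hit after {len(open_files)} extra files")
finally:
    for f in open_files:
        f.close()
```

When a system-wide limit (rather than the per-process one) is exhausted, unrelated processes fail the same way at the same moment, which would explain all three jobs dying within a minute of each other.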
> >> >
> >> >Ed
> >> >
> >> >--
> >> >Edward H. Hill III, PhD
> >> >office: MIT Dept. of EAPS; Rm 54-1424; 77 Massachusetts Ave.
> >> > Cambridge, MA 02139-4307
> >> >emails: eh3 at mit.edu ed at eh3.com
> >> >URLs: http://web.mit.edu/eh3/ http://eh3.com/
> >> >phone: 617-253-0098
> >> >fax: 617-253-4464
> >> >
> >> >_______________________________________________
> >> >Aces-support mailing list
> >> >Aces-support at acesgrid.org
> >> >http://acesgrid.org/mailman/listinfo/aces-support