[Aces-support] Odd mpi errors relating to MNC package
Ed Hill
ed at eh3.com
Thu Dec 2 11:59:23 EST 2004
On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> I started three jobs last night around 1am within a range of about 20
> minutes of each other. They all came back with mpi errors (in the
> pbs error files) relating to netcdf and mnc that read something like:
>
>
> ABNORMAL END: package MNC
> forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> Image PC Routine Line Source
> mitgcmuv.O1 081F75F8 Unknown Unknown Unknown
>
> Stack trace terminated abnormally.
> p4_error: latest msg from perror: Bad file descriptor
>
>
> In the pbs standard out file, the problematic part looked like this:
>
>
> NetCDF ERROR: No such file or directory
> MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> p4_31316: p4_error: net_recv read: probable EOF on socket: 1
> p5_27711: p4_error: net_recv read: probable EOF on socket: 1
> p7_16293: p4_error: net_recv read: probable EOF on socket: 1
> p3_647: p4_error: net_recv read: probable EOF on socket: 1
> p2_12796: p4_error: net_recv read: probable EOF on socket: 1
> rm_l_1_1848: p4_error: listener select: -1
> p6_21651: p4_error: net_recv read: probable EOF on socket: 1
> P4 procgroup file is pr_group.
>
>
> All the STDERR files are there but of zero size. The STDOUT files
> have nothing in them of note at the end (just the usual sea ice
> monitor statistic for one of the packages that I am using).
> Something else odd: they all seemed to break down at almost exactly the
> same time (even though I did not start them all this close together
> in time):
>
>
> [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> -rw------- 1 enderton aces 10409734 Dec 2 04:04 AquaC3O10/AqC3O10_C.e51362
> -rw------- 1 enderton aces 55986184 Dec 2 04:04 AquaC3O10/AqC3O10_C.o51362
> [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> -rw------- 1 enderton aces 10104455 Dec 2 04:03 AquaC3O5/AqC3O5_C.e51361
> -rw------- 1 enderton aces 54339284 Dec 2 04:03 AquaC3O5/AqC3O5_C.o51361
> [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> -rw------- 1 enderton aces 9799175 Dec 2 04:04 AquaC3O20/AqC3O20_C.e51363
> -rw------- 1 enderton aces 52693228 Dec 2 04:04 AquaC3O20/AqC3O20_C.o51363
Hi Daniel,
*Good* bug report!
It looks like the kernel ran out of file descriptors. It does not look
like a problem with MITgcm itself [and I'm not just saying that to pass
the blame off as the MNC author ;-)]
The 4:04am time frame is very suspicious. It's *right* after the system
usually kicks off some cron jobs that update the locate database, update
whereis, do the pre-linking, etc. At these times the system can be very
heavily loaded, and it seems that it ran out of file descriptors ("file
handles").
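One way to check this hypothesis on a node is to compare the number of
file handles in use against the limits. The following is only a minimal
sketch, assuming a Linux node that exposes /proc/sys/fs/file-nr (three
integers: allocated handles, allocated-but-unused handles, system-wide
maximum). The helper names parse_file_nr, fraction_in_use and report are
hypothetical and are not part of MITgcm or MNC.

```python
import os
import resource  # Unix-only standard library module


def parse_file_nr(text):
    """Parse the three integers found in /proc/sys/fs/file-nr (Linux)."""
    allocated, unused, maximum = (int(x) for x in text.split()[:3])
    return {"allocated": allocated, "unused": unused, "max": maximum}


def fraction_in_use(stats):
    """Fraction of the system-wide file handle limit currently in use."""
    return (stats["allocated"] - stats["unused"]) / stats["max"]


def report(path="/proc/sys/fs/file-nr"):
    # Per-process limit on open file descriptors (soft and hard).
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("per-process fd limit: soft=%s hard=%s" % (soft, hard))

    # System-wide usage; the path is an assumption and exists only on Linux.
    if os.path.exists(path):
        with open(path) as f:
            stats = parse_file_nr(f.read())
        print("system-wide handles: %d allocated, %d max (%.1f%% in use)"
              % (stats["allocated"], stats["max"],
                 100.0 * fraction_in_use(stats)))
    else:
        print("%s not found; system-wide count unavailable" % path)


if __name__ == "__main__":
    report()
```

Running this from cron a few minutes either side of 4am, and logging the
output, would show whether usage actually approaches the maximum when the
nightly jobs start.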
Here's a relevant link from the magic of Google:
http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html
Ed
--
Edward H. Hill III, PhD
office: MIT Dept. of EAPS; Rm 54-1424; 77 Massachusetts Ave.
Cambridge, MA 02139-4307
emails: eh3 at mit.edu ed at eh3.com
URLs: http://web.mit.edu/eh3/ http://eh3.com/
phone: 617-253-0098
fax: 617-253-4464