[Aces-support] Odd mpi errors relating to MNC package

Ed Hill ed at eh3.com
Thu Dec 2 11:59:23 EST 2004


On Thu, 2004-12-02 at 11:34 -0500, Daniel Enderton wrote:
> I started three jobs last night around 1am within a range of about 20 
> minutes of each other.  They all came back with mpi errors (in the 
> pbs error files) relating to netcdf and mnc that read something like:
> 
> 
> ABNORMAL END: package MNC
> forrtl: severe (28): CLOSE error, unit 60, file "Unknown"
> Image              PC        Routine            Line        Source
> mitgcmuv.O1        081F75F8  Unknown               Unknown  Unknown
> 
> Stack trace terminated abnormally.
>      p4_error: latest msg from perror: Bad file descriptor
> 
> 
> In the pbs standard out file, the problematic part looked like this:
> 
> 
>   NetCDF ERROR: No such file or directory
>   MNC ERROR: ending define mode in S/R MNC_FILE_ENDDEF
> p4_31316:  p4_error: net_recv read:  probable EOF on socket: 1
> p5_27711:  p4_error: net_recv read:  probable EOF on socket: 1
> p7_16293:  p4_error: net_recv read:  probable EOF on socket: 1
> p3_647:  p4_error: net_recv read:  probable EOF on socket: 1
> p2_12796:  p4_error: net_recv read:  probable EOF on socket: 1
> rm_l_1_1848:  p4_error: listener select: -1
> p6_21651:  p4_error: net_recv read:  probable EOF on socket: 1
> P4 procgroup file is pr_group.
> 
> 
> All the STDERR files are there but of zero size.  The STDOUT files 
> have nothing in them of note at the end (just the usual sea ice 
> monitor statistic for one of the packages that I am using). 
> Something else odd: they all seemed to break down at almost exactly the 
> same time (even though I did not start them all this close together 
> in time):
> 
> 
> [enderton at itrda enderton]$ ls -l AquaC3O10/AqC3O10_C.*
> -rw-------  1 enderton aces 10409734 Dec  2 04:04 AquaC3O10/AqC3O10_C.e51362
> -rw-------  1 enderton aces 55986184 Dec  2 04:04 AquaC3O10/AqC3O10_C.o51362
> [enderton at itrda enderton]$ ls -l AquaC3O5/AqC3O5_C.*
> -rw-------  1 enderton aces 10104455 Dec  2 04:03 AquaC3O5/AqC3O5_C.e51361
> -rw-------  1 enderton aces 54339284 Dec  2 04:03 AquaC3O5/AqC3O5_C.o51361
> [enderton at itrda enderton]$ ls -l AquaC3O20/AqC3O20_C.*
> -rw-------  1 enderton aces  9799175 Dec  2 04:04 AquaC3O20/AqC3O20_C.e51363
> -rw-------  1 enderton aces 52693228 Dec  2 04:04 AquaC3O20/AqC3O20_C.o51363


Hi Daniel,

*Good* bug report!

It looks like the kernel ran out of file descriptors.  It does not look
like a problem with MITgcm itself [and I'm not just saying that to pass
the blame off as the MNC author ;-)]

The 4:04am time frame is very suspicious.  It's *right* after the system
usually kicks off some cron jobs that update the locate database, update
whereis, do the pre-linking, etc.  At these times the system can be very
heavily loaded, and it seems that it ran out of file descriptors ("file
handles").
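
A rough way to check for this kind of exhaustion on a Linux node is
sketched below.  It assumes the kernel exposes /proc/sys/fs/file-nr
(three integers: allocated handles, allocated-but-unused handles, and
the system-wide maximum).  The function names are made up for
illustration and are not part of MITgcm or MNC.

```python
import resource


def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr.

    The file holds three whitespace-separated integers:
    allocated handles, allocated-but-unused handles, system-wide maximum.
    """
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,
        "unused": unused,
        "maximum": maximum,
        # Fraction of the system-wide limit currently in use.
        "in_use_fraction": (allocated - unused) / maximum,
    }


def system_fd_usage(path="/proc/sys/fs/file-nr"):
    """Read and parse the kernel's file-handle counters (Linux only)."""
    with open(path) as handle:
        return parse_file_nr(handle.read())


def process_fd_limit():
    """Return the (soft, hard) per-process open-file limit."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    usage = system_fd_usage()
    soft, hard = process_fd_limit()
    print("system-wide: %d of %d handles allocated (%.1f%% in use)"
          % (usage["allocated"], usage["maximum"],
             100.0 * usage["in_use_fraction"]))
    print("per-process limit: soft=%d hard=%d" % (soft, hard))
```

Running something like this from cron every minute or so around 4am
would show whether the system-wide count really does approach the
maximum while the nightly jobs are active.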

Here's a relevant link from the magic of Google:

http://www.complanguages.com/mpirun__net_send_write__bad_file_descriptor-6874840-5500-a.html

Ed

-- 
Edward H. Hill III, PhD
office:  MIT Dept. of EAPS;  Rm 54-1424;  77 Massachusetts Ave.
             Cambridge, MA 02139-4307
emails:  eh3 at mit.edu                ed at eh3.com
URLs:    http://web.mit.edu/eh3/    http://eh3.com/
phone:   617-253-0098
fax:     617-253-4464



