[MITgcm-devel] sx8 testing

Martin Losch Martin.Losch at awi.de
Tue May 5 02:59:52 EDT 2009


Hi Jean-Michel,

I finally managed to have a quick look at the sx8 tests and got them
going again. I had to remove the automatic restart test, which was
somehow stalling, and I had to modify my hack for making the script
wait for the "qsubbed" job to finish. There are still a few failures
that I do not understand. Most of them may be related to my hacks for
testing, but some are not. Here are some error messages and comments
from yesterday's test; maybe you have an idea what's going on with
the first one.
Cheers,
Martin
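
One way to make a test script wait for a "qsubbed" job without risking
an endless stall is to poll with a timeout. The sketch below is only an
illustration and not the hack actually used for these tests: the name
wait_until, its parameters, and the fake completion check are all
hypothetical.

```python
import time

def wait_until(is_done, poll_seconds=30.0, timeout_seconds=3600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll is_done() until it returns True or the timeout expires.

    Returns True if the job was seen to finish, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)

# Stand-in for a real check (hypothetical): reports "finished" on the
# third poll so that the example runs without a batch system.
calls = {"n": 0}

def fake_job_finished():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_until(fake_job_finished, poll_seconds=0.0, timeout_seconds=5.0))
```

In a real script, is_done could test whatever signals completion on the
machine at hand, for example the existence of an output file.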

> sx8::verification> cat adjustment.cs-32x32x1/tr_run.nlfs/STDERR.*
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #     4  
> (bi,bj=   4,   1 ):
> (PID.TID 0000.0001) *** ERROR *** E.Edge has    8 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #     6  
> (bi,bj=   6,   1 ):
> (PID.TID 0000.0001) *** ERROR *** E.Edge has    8 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #     9  
> (bi,bj=   9,   1 ):
> (PID.TID 0000.0001) *** ERROR *** N.Edge has   16 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #    10  
> (bi,bj=  10,   1 ):
> (PID.TID 0000.0001) *** ERROR *** N.Edge has   16 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #    15  
> (bi,bj=  11,   1 ):
> (PID.TID 0000.0001) *** ERROR *** S.Edge has   16 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #    16  
> (bi,bj=  12,   1 ):
> (PID.TID 0000.0001) *** ERROR *** S.Edge has   16 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #    25  
> (bi,bj=  21,   1 ):
> (PID.TID 0000.0001) *** ERROR *** S.Edge has    7 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #    26  
> (bi,bj=  22,   1 ):
> (PID.TID 0000.0001) *** ERROR *** S.Edge has    7 unconnected points  
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** S/R EXCH2_CHECK_DEPTHS: Fatal Error
> (PID.TID 0000.0001) *** ERROR *** occurs    1 time(s) among all  
> Threads and Procs
> (PID.TID 0001.0001) *** ERROR *** occurs    1 time(s) among all  
> Threads and Procs
>
>
> aim.5l_cs:
> sx8::run> cat /home/sx8/mlosch/out_sxf90
> MPI process (universe 0, rank 0) terminated by signal(9); Kill
> sx8-2: mpid: MPI process terminated by signal(9)
> MPI process (universe 0, rank 1) terminated by signal(9); Kill
> sx8-2: mpid: MPI process terminated by signal(9)
>
>
> fizhi-cs-32x32x40:
> unclear, but was never OK
>
> global_ocean.cs32x15.icedyn, global_ocean.cs32x15.thsice:
> complete with output.txt; no idea what went wrong, presumably
> something in my testing scheme
>
> lab_sea:
> cat STDERR.000*
> (PID.TID 0001.0001) *** ERROR *** NetCDF ERROR:
> (PID.TID 0001.0001) *** ERROR *** MNC ERROR: opening 'mnc_test_0001/phiHydLow.0000000000.t004.nc'
> Most likely this has to do with memory issues: for some reason the
> SX8 allocates a lot of memory just for opening a netcdf file. If you
> have too many netcdf files, then you need a lot of memory, so that
> the 32GB I ask for are not enough.
>
>
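
The EXCH2_CHECK_DEPTHS messages are easier to compare when reduced to
one row per tile. The following is a minimal sketch, assuming the
STDERR text has the layout shown in the excerpt above (with or without
mail quoting and line wrapping). The function name
summarize_depth_errors and the regular expression are illustrative and
not part of MITgcm.

```python
import re

# One tile header followed by its edge report, as they appear once
# line wrapping and "> " quoting have been removed.
TILE_EDGE = re.compile(
    r"EXCH2_CHECK_DEPTHS: tile #\s*(\d+)\s*\(bi,bj=\s*(\d+),\s*(\d+)\s*\):"
    r".*?([NSEW])\.Edge has\s+(\d+) unconnected points"
)

def summarize_depth_errors(text):
    """Return a list of (tile, bi, bj, edge, npoints) tuples."""
    # Drop mail quoting, then collapse all whitespace (including newlines)
    # so that each header and its edge report sit on one logical line.
    unquoted = re.sub(r"^>+ ?", "", text, flags=re.MULTILINE)
    flat = " ".join(unquoted.split())
    return [(int(t), int(bi), int(bj), edge, int(n))
            for t, bi, bj, edge, n in TILE_EDGE.findall(flat)]

# Two entries copied from the excerpt above.
sample = """\
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #     4
> (bi,bj=   4,   1 ):
> (PID.TID 0000.0001) *** ERROR *** E.Edge has    8 unconnected points
> with non-zero depth.
> (PID.TID 0000.0001) *** ERROR *** EXCH2_CHECK_DEPTHS: tile #    15
> (bi,bj=  11,   1 ):
> (PID.TID 0000.0001) *** ERROR *** S.Edge has   16 unconnected points
> with non-zero depth.
"""

for row in summarize_depth_errors(sample):
    print(row)
```

Each row lists the tile number, the (bi,bj) indices, the edge, and the
number of unconnected points, which puts the tile number and bi side by
side for every affected tile.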
On Apr 15, 2009, at 8:57 AM, Martin Losch wrote:

> Hi Jean-Michel,
>
> there have been a few problems with the SX8: a file system had  
> crashed and there was some temporary rearrangement of the remaining  
> systems, in particular a scratch system that I use for the tests was  
> probably not available all the time.
>
> What happened last weekend is not clear to me, but I can compile by
> hand the experiments that failed during testreport. It may have
> to do with my scripting (which I haven't changed, but maybe the  
> machine changed). I'll rerun testreport by hand and then we'll see  
> what happens.
>
> Martin
> On Apr 14, 2009, at 8:05 PM, Jean-Michel Campin wrote:
>
>> Hi Martin,
>>
>> Welcome back!
>> I don't know what happened to the sx8 testing (looks like we missed
>> one in late March and another at the beginning of April), and the
>> latest has a sequence of fails in the middle of the list.
>> I made some changes in genmake2 and I hope it's not causing
>> those problems.
>>
>> And just a comment regarding the pkg/seaice stuff: I was feeling
>> a little embarrassed when I insisted on keeping the old code
>> available with a CPP flag, but now I don't regret it too much.
>>
>> Cheers,
>> Jean-Michel
>> _______________________________________________
>> MITgcm-devel mailing list
>> MITgcm-devel at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
>



