[MITgcm-devel] sx8 latest testing

Tue Jan 12 15:33:32 EST 2010

Hi Martin,

OK, that's reassuring.
Regarding the script, maybe one day we should try
an alternative solution: add an option to testreport to
stop after make, then run it a 2nd time on the
compute node (in batch) with option "-q" (just run if
executable is up-to-date). This would save multiple qsub
calls and the pain of synchronising.
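
A rough sketch of that two-phase idea, only to make it concrete. The
"stop after make" option does not exist yet, so BUILD_ONLY_OPT below is a
hypothetical placeholder, and the ./testreport path and the helper names
are assumptions of mine. The functions only print or write the command
lines; they do not submit anything.

```shell
#!/bin/sh
# Sketch only. BUILD_ONLY_OPT is a HYPOTHETICAL placeholder for the proposed
# "stop after make" option; it is not an existing testreport flag.
BUILD_ONLY_OPT=${BUILD_ONLY_OPT:-"<stop-after-make-option>"}

# Phase 1 (head node): print the command line that would build every
# experiment and then stop, without running anything.
build_phase_cmd() {
  echo "./testreport ${BUILD_ONLY_OPT} $*"
}

# Phase 2 (compute node): print the command line that only runs the
# executables that are already up-to-date ("-q", as described above).
run_phase_cmd() {
  echo "./testreport -q $*"
}

# Write a single batch script for phase 2 to the file named in $1.
# Any remaining arguments are passed on to testreport.
write_run_job() {
  job=$1
  shift
  {
    echo '#!/bin/sh'
    echo 'cd ${PBS_O_WORKDIR}'
    run_phase_cmd "$@"
  } > "$job"
}
```

With something like this, only one job (the file written by write_run_job)
would need to go through qsub, instead of one qsub per experiment.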

Just a detail: did you try to add 2 more "sleep 5"
in runit_sxf90, 1 just before "exit" and the 2nd at the end?
I was thinking that if nfs/disk synchronisation is not instantaneous,
the front node might "see" output.txt but not yet its content
(or the content of STDOUT.0000 if you use testreport -mpi).
That would explain a few "missing" outputs from time to time.
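
Instead of a fixed "sleep 5", one could also wait until the file is
non-empty and its size has stopped changing. A minimal sketch, assuming
a POSIX sh; the function name wait_for_stable_file and its defaults are
made up for illustration and are not part of runit_sxf90 or testreport:

```shell
#!/bin/sh
# wait_for_stable_file FILE [INTERVAL] [MAX_TRIES]
# Return 0 once FILE is non-empty and has the same size on two
# consecutive checks; return 1 if that does not happen within
# MAX_TRIES checks. INTERVAL is the pause between checks, in seconds.
wait_for_stable_file() {
  file=$1
  interval=${2:-5}
  max_tries=${3:-60}
  prev_size=-1
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if [ -s "$file" ]; then
      # size in bytes; tr strips the padding some wc versions add
      size=$(wc -c < "$file" | tr -d ' ')
      if [ "$size" = "$prev_size" ]; then
        return 0
      fi
      prev_size=$size
    fi
    tries=$((tries + 1))
    sleep "$interval"
  done
  return 1
}
```

In runit_sxf90 this could be called as "wait_for_stable_file output.txt"
before the script returns. It does not prove that the nfs copy is
complete, it only makes a half-written file less likely to be read.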

Cheers,
Jean-Michel

On Tue, Jan 12, 2010 at 03:34:28PM +0100, Martin Losch wrote:
> Hi Jean-Michel,
> compared to Dec06, 2009 these are the extra fails (compiled but did not 
> run):
>
> deep_anelastic
> fizhi-gridalt-hs
> flt_example
> global_with_exf (2)
> hs94.cs-32x32x5
>
> everything else looks pretty much the same.
>
> The reason for this is unclear but probably unrelated to the model. All 
> experiments finished regularly and only the comparison is missing. (BTW, 
> some of the other experiments that are missing: dome,  
> global_ocean.cs32x15.icedyn and thsice, are complete, too, so it's the  
> same problem there; the other two fizhi experiments fail as usual,  
> probably seg-fault, and the lab_sea and offline_exf_seaice experiments  
> have a different problem that I have not yet found).
>
> Because of cross compiling I need to run testreport on the head node and 
> run the models with individual qsub commands. The qsub on this  
> machine does not have a flag to make it return control to the calling  
> shell only after completion of the job, so that I have to make  
> testreport wait for some specific output file to appear before it  
> continues. This is my jobscript:
>> x8::scripts> less runit_sxf90
>> #!/bin/sh
>> # submit the job
>> qsub -q sx8-r /home/sx8/mlosch/scripts/job_sxf90
>>
>> sleep 10
>> stillruns=`qstat -n -u mlosch | grep testsx8`
>> # wait until the job is finished; do this by waiting for output.txt to appear
>> while [ ! -e output.txt ]
>> do
>>   sleep 10
>>   stillruns=`qstat -n -u mlosch | grep testsx8`
>>   echo "output of qstat "${stillruns}x
>>   if [ "${stillruns}"x = x ] ; then
>>    exit
>>   fi
>> done
>> #
> and in job_sxf90 I do this:
>> #PBS -q sx8-r                            # job queue, not necessary so far
>> #PBS -N testsx8                          # give the job a name
>> #PBS -l cpunum_job=2                     # cpus per node
>> #PBS -l cputim_job=2:00:00               # time limit
>> #PBS -l memsz_job=32gb                   # max accumulated memory, we need this much because of many netcdf files
>> #PBS -j o                                # join i/o
>> #PBS -S /bin/sh
>> #PBS -o /home/sx8/mlosch/out_sxf90       # where to write output
>> #
>>
>> cd ${PBS_O_WORKDIR}
>> (mpirun -np 2 ./mitgcmuv && cp STDOUT.0000 output.txt && echo "NORMAL END" >> run.log) || cp STDOUT.0000 output.txt
>
> So it's not pretty and I assume that for some runs it just does not  
> work. To be honest, I don't feel like finding the problem, because it  
> does not have anything to do with the model and I already tried to fix  
> it with the help of the system administrator, but we were not  
> successful.
>
> BTW the edvir machine is completely out, and replaced by something  
> called iblade (IBM P6). If we need this platform in our tests, please  
> let me know and I'll try to do something there
>
> Martin
>
> On Jan 11, 2010, at 6:50 PM, Jean-Michel Campin wrote:
>
>> Hi Martin,
>>
>> looks like the last testing on sx8 has more "fail" than usual.
>> Do you know why? Is something wrong in the code for this
>> platform?
>>
>> Thanks,
>> Jean-Michel
>>
>> _______________________________________________
>> MITgcm-devel mailing list
>> MITgcm-devel at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-devel