[MITgcm-devel] sx8 latest testing
Jean-Michel Campin
jmc at ocean.mit.edu
Sat Jan 16 15:13:52 EST 2010
Hi Martin,
Sorry, I missed the 2nd message you sent the other day (on Jan 12).
Regarding point 1:
I agree, it would be better to change data.diagnostics to use far fewer
files; go for it.
Not sure about turning off mnc for the standard lab_sea (since it is the only
forward test with full seaice and MNC).
If you also changed its data.diagnostics in the same way, would that
work?
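For point 1, a minimal sketch of what a consolidated data.diagnostics could look like (the field names, file name, and frequency below are only placeholders, not taken from the actual experiments): several fields share one fileName, so the diagnostics package opens a single netcdf file instead of one per variable.

```fortran
 &DIAGNOSTICS_LIST
! several fields written into one file -> only one netcdf file is opened
  fields(1:3,1) = 'SIarea  ','SIheff  ','SIuice  ',
  fileName(1)   = 'seaiceDiag',
  frequency(1)  = 86400.,
 &
```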
Regarding point 2:
I am ready to tweak testreport for the "NORMAL END" check (I am looking at this now).
I am also willing to add an option to testreport for not running the model.
This way we could explore the alternative I mentioned last time:
> Regarding the script, may be one day we should try
> an alternative solution, adding an option to testreport to
> stop after make, and then running a 2nd time on the
> compute node (in batch) with option "-q" (just run if
> executable is up-to-date), would save multiple qsub
> and the pain to synchronise.
Do you know if the "make" command will work on a compute node if there is
nothing to do (because the executable mitgcmuv is up to date)?
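One way to find out without a compiler at all: "make -q" only asks whether the target is up to date and runs no recipe, so it should succeed on a compute node even if no compiler is installed. A sketch, using a stand-in Makefile and target in place of the real MITgcm build (not tested on the SX-8 or the ibm-p6):

```shell
#!/bin/sh
# Demo: "make -q" exits 0 when the target is up to date, without ever
# invoking the recipe (and hence without needing the compiler).
demo=$(mktemp -d)
cd "$demo"
printf 'mitgcmuv: source.F\n\tcc -o mitgcmuv source.F\n' > Makefile
touch source.F
touch mitgcmuv                  # target at least as new as prerequisite
if make -q mitgcmuv; then
    echo "up to date"           # plain "make" would also do nothing here
else
    echo "rebuild needed"
fi
```

If this holds, the batch job could simply call "make -q mitgcmuv || exit 1" before mpirun as a safety check.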
Thanks,
Jean-Michel
On Sat, Jan 16, 2010 at 04:05:07PM +0100, Martin Losch wrote:
> Jean-Michel,
>
> two things:
> 1. what do you think about changing lab_sea output (all mds) and offline_exf_seaice (fewer netcdf files) as described below?
>
> 2. I am trying to install testreport on our new ibm-p6 (successor of the edvir). I am having this problem: I cannot compile in the queue, because the compute nodes do not have any compilers, so the argument to the -command flag is "llsubmit jobscript". This works, but testreport does not recognize the successful run, because it looks for a "NORMAL END" and "NORMAL END" is not part of "$RUNLOG" (which contains just the result of the llsubmit, for example: "llsubmit: The job "iblade1.awi.de.414" has been submitted."). Do you have any idea how I could fix that? (I tried -command "llsubmit jobscript && echo NORMAL END", but somehow the shell swallows the second part.)
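One possible workaround for the swallowed "&&" (a sketch; it guesses that testreport hands the -command string to the system without an intervening shell, so the "&&" is never interpreted): put both steps into a tiny wrapper script, here called submit_and_mark (a hypothetical name), and give testreport a single command with no shell metacharacters.

```shell
#!/bin/sh
# Write a one-line wrapper that submits the job and then appends the
# marker testreport looks for; testreport only sees one plain command.
cat > submit_and_mark <<'EOF'
#!/bin/sh
llsubmit jobscript && echo "NORMAL END"
EOF
chmod +x submit_and_mark
# then:  testreport -command ./submit_and_mark ...
```

Note this only marks a successful llsubmit, not a successful run; the "NORMAL END" for the run itself still has to come from the batch job, as in the SX-8 jobscript below.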
>
> Martin
> On Jan 12, 2010, at 4:05 PM, Martin Losch wrote:
>
> > Me again,
> >
> > unrelated to the recent problem, but lab_sea/run and the two offline_exf_seaice experiments fail for the same reason: too many netcdf files are opened. For some reason, opening a netcdf file requires a lot of RAM on this machine, and I am already at the limit of 32GB of RAM for my test jobs (more RAM would mean a different queue and unnecessary waiting).
> >
> > Why are there so many netcdf files? Because in data.diagnostics there is a file for every variable (and 4 tiles!). While this makes sense if useMNC=.false., it is not useful for netcdf files. I suggest to
> > 1. turn off the mnc package in lab_sea/input/data.pkg (as it is turned off for the other sub-experiments)
> > 2. change "data.diagnostics" for offline_exf_seaice so that only one or two netcdf files are opened by the diagnostics package.
> >
> > Any objection?
> >
> > Martin
> >
> > On Jan 12, 2010, at 3:34 PM, Martin Losch wrote:
> >
> >> Hi Jean-Michel,
> >> compared to Dec06, 2009 these are the extra fails (compiled but did not run):
> >>
> >> deep_anelastic
> >> fizhi-gridalt-hs
> >> flt_example
> >> global_with_exf (2)
> >> hs94.cs-32x32x5
> >>
> >> everything else looks pretty much the same.
> >>
> >> The reason for this is unclear, but probably unrelated to the model. All experiments finished regularly and only the comparison is missing. (BTW, some of the other experiments that are missing, dome, global_ocean.cs32x15.icedyn and thsice, are complete, too, so it's the same problem there; the other two fizhi experiments fail as usual, probably with a seg-fault, and the lab_sea and offline_exf_seaice experiments have a different problem that I have not yet found.)
> >>
> >> Because of cross-compiling I need to run testreport on the head node and run the models with individual qsub commands. The qsub on this machine does not have a flag to make it return control to the calling shell only after completion of the job, so I have to make testreport wait for some specific output file to appear before it continues. This is my jobscript:
> >>> x8::scripts> less runit_sxf90
> >>> #!/bin/sh
> >>> # submit the job
> >>> qsub -q sx8-r /home/sx8/mlosch/scripts/job_sxf90
> >>>
> >>> sleep 10
> >>> # wait until the job is finished; do this by waiting for output.txt to appear
> >>> while [ ! -e output.txt ]
> >>> do
> >>>   sleep 10
> >>>   stillruns=`qstat -n -u mlosch | grep testsx8`
> >>>   echo "output of qstat "${stillruns}x
> >>>   # if the job has disappeared from the queue, stop waiting
> >>>   if [ "${stillruns}"x = x ] ; then
> >>>     exit
> >>>   fi
> >>> done
> >>> #
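The polling idea in the script above can be factored into a small reusable helper (a sketch, not tested on the SX-8; wait_for_job is a hypothetical name): check_cmd should succeed while the job is still queued or running, e.g. 'qstat -n -u mlosch | grep -q testsx8', and marker is the file the job writes on completion (output.txt above).

```shell
#!/bin/sh
# Poll until the marker file appears; bail out if the job has vanished
# from the queue without producing it.
wait_for_job() {
    check_cmd=$1
    marker=$2
    while [ ! -e "$marker" ]; do
        sh -c "$check_cmd" || return 1   # job gone, no output: give up
        sleep 10
    done
    return 0
}

# demo with a marker that already exists, so no waiting happens
: > /tmp/demo_output.txt
wait_for_job 'true' /tmp/demo_output.txt && echo "job finished"
```

The return status lets the caller distinguish a finished job from a crashed one, which the original script silently conflates.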
> >> and in job_sxf90 I do this:
> >>> #PBS -q sx8-r # job queue, not necessary so far
> >>> #PBS -N testsx8 # give the job a name
> >>> #PBS -l cpunum_job=2 # cpus per node
> >>> #PBS -l cputim_job=2:00:00 # time limit
> >>> #PBS -l memsz_job=32gb # max accumulated memory, we need this much because of many netcdf files
> >>> #PBS -j o # join i/o
> >>> #PBS -S /bin/sh
> >>> #PBS -o /home/sx8/mlosch/out_sxf90 # o Where to write output
> >>> #
> >>>
> >>> cd ${PBS_O_WORKDIR}
> >>> (mpirun -np 2 ./mitgcmuv && cp STDOUT.0000 output.txt && echo "NORMAL END" >> run.log) || cp STDOUT.0000 output.txt
> >>
> >> So it's not pretty and I assume that for some runs it just does not work. To be honest, I don't feel like finding the problem, because it does not have anything to do with the model and I already tried to fix it with the help of the system administrator, but we were not successful.
> >>
> >> BTW, the edvir machine is completely gone, replaced by something called iblade (IBM P6). If we need this platform in our tests, please let me know and I'll try to set something up there.
> >>
> >> Martin
> >>
> >> On Jan 11, 2010, at 6:50 PM, Jean-Michel Campin wrote:
> >>
> >>> Hi Martin,
> >>>
> >>> looks like the last testing on sx8 has more "fail"s than usual.
> >>> Do you know why? Something wrong in the code for this
> >>> platform?
> >>>
> >>> Thanks,
> >>> Jean-Michel
> >>>
> >>> _______________________________________________
> >>> MITgcm-devel mailing list
> >>> MITgcm-devel at mitgcm.org
> >>> http://mitgcm.org/mailman/listinfo/mitgcm-devel
> >>
> >>
> >
> >
>
>