[MITgcm-support] model blew up but not terminated

Daiwei (David) Wang daiwei at MIT.EDU
Mon Jun 27 10:56:33 EDT 2011


Hi Martin,

Thanks for the idea. For short test runs, it is a reasonably good solution.
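
For longer runs, one option might be to watch STDOUT from outside the model and cancel the job by hand or from the job script. Below is a minimal Python sketch of that idea, not anything MITgcm provides: the helper name `first_bad_monitor_value` is made up for illustration, and it assumes the `%MON name = value` layout seen in the grep output quoted below. It can only catch a blow-up at the next monitor output, so it is no faster than the monitor frequency allows.

```python
import math
import re

# Matches monitor lines such as
# "(PID.TID 0000.0001) %MON dynstat_eta_max   =   1.2697030002398E+00"
# (layout assumed from the STDOUT excerpt in this thread).
MON_LINE = re.compile(r"%MON\s+(\S+)\s*=\s*(\S+)")


def first_bad_monitor_value(lines):
    """Return (line_number, name, raw_value) for the first %MON value
    that is NaN or infinite, or None if every parsed value is finite."""
    for number, line in enumerate(lines, start=1):
        match = MON_LINE.search(line)
        if match is None:
            continue  # not a monitor line
        name, raw = match.groups()
        try:
            value = float(raw)  # float() accepts "NaN", "Infinity", "1.2E+00"
        except ValueError:
            continue  # unrecognised number format; this sketch skips it
        if not math.isfinite(value):
            return number, name, raw
    return None


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python check_mon.py STDOUT.0000
    with open(sys.argv[1]) as handle:
        bad = first_bad_monitor_value(handle)
    if bad is not None:
        print("non-finite monitor value at line %d: %s = %s" % bad,
              file=sys.stderr)
        sys.exit(1)
```

The script exits with status 1 when it finds a bad value, so a wrapper can poll it periodically and cancel the job with whatever command the local scheduler uses.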

Best,
D.

On 06/27/2011 10:46 AM, Martin Losch wrote:
> David,
>
> Some machines/compilers stop when they encounter NaN, others just continue; I can't think of much that can be done about that. If you increase the frequency of your monitor output, there is a check that stops the run when temperatures start to look unreasonable. Once NaNs are reached, that check no longer works. But you don't want too much monitor output (it slows the run down), so that's not really a good solution.
>
> Martin
>
> On Jun 27, 2011, at 4:09 PM, Daiwei (David) Wang wrote:
>
>> Hi,
>>
>> I found that a few of my jobs on beagle blew up but kept running. By blow-up, I mean that the monitor statistics became populated by NaN, for example:
>>
>> $ grep dynstat_eta STDOUT.0000
>> (PID.TID 0000.0001) %MON dynstat_eta_max              =   1.2697030002398E+00
>> (PID.TID 0000.0001) %MON dynstat_eta_min              =  -1.9409175554150E+00
>> (PID.TID 0000.0001) %MON dynstat_eta_mean             =  -1.0404669740582E-16
>> (PID.TID 0000.0001) %MON dynstat_eta_sd               =   6.2681413921827E-01
>> (PID.TID 0000.0001) %MON dynstat_eta_del2             =   1.1009328164515E-04
>> (PID.TID 0000.0001) %MON dynstat_eta_max              =   1.0775823955686E+00
>> (PID.TID 0000.0001) %MON dynstat_eta_min              =  -1.9467522542860E+00
>> (PID.TID 0000.0001) %MON dynstat_eta_mean             =  -2.9479897598316E-16
>> (PID.TID 0000.0001) %MON dynstat_eta_sd               =   6.2739266263267E-01
>> (PID.TID 0000.0001) %MON dynstat_eta_del2             =   1.1075534674701E-04
>> (PID.TID 0000.0001) %MON dynstat_eta_max              =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_min              =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_mean             =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_sd               =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_del2             =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_max              =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_min              =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_mean             =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_sd               =  NaN
>> (PID.TID 0000.0001) %MON dynstat_eta_del2             =  NaN
>>
>> But the job kept running, in vain of course, until endTime. I wonder if there is a flag to stop the run and write something to standard error when NaN appears. I didn't find one.
>>
>> Thanks,
>> David
>>
>> _______________________________________________
>> MITgcm-support mailing list
>> MITgcm-support at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-support



