[MITgcm-support] jobs died suddenly

Pochini, Enrico epochini at inogs.it
Wed Mar 25 16:14:56 EDT 2020


Hi,

Replying to this:

> The code simply stopped producing any new results, however, it was still
> running
>

I have encountered this issue as well, when running on a cluster with SLURM
queue management (batch script to submit the job). The error is raised and
written to the output files, but execution does not stop until the time
limit is reached.
The model keeps running and burning core hours for nothing.
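
One possible workaround, as a rough sketch only and not something from
MITgcm itself: a small Python watchdog that polls the job's .err file for
the "ABNORMAL END" string seen in this thread and then calls scancel. The
function names, the polling interval, and the idea of starting it in the
background from the batch script are all my assumptions.

```python
import os
import subprocess
import time

# Marker string taken from the .err output quoted in this thread.
MARKER = "ABNORMAL END"


def has_abnormal_end(text):
    """Return True if the log text contains the MITgcm abort marker."""
    return MARKER in text


def watch(err_path, job_id, poll_seconds=60):
    """Poll err_path and cancel the SLURM job once the marker appears.

    Hypothetical helper: err_path and job_id are supplied by the caller,
    e.g. from the batch script. Loops until the marker is found.
    """
    while True:
        if os.path.exists(err_path):
            with open(err_path, errors="replace") as fh:
                if has_abnormal_end(fh.read()):
                    # scancel is the standard SLURM command to cancel a job.
                    subprocess.run(["scancel", str(job_id)], check=False)
                    return True
        time.sleep(poll_seconds)
```

The idea would be to call watch() from a short script started in the
background before srun, passing the .err file name and $SLURM_JOB_ID. It
does not fix the underlying read error, it only stops the wasted core
hours.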

Enrico

On Wed, 25 Mar 2020 at 21:06, Matthew Mazloff <mmazloff at ucsd.edu>
wrote:

> Hello
>
> The code crashed trying to read a file. The file is size NY*NZ*NT so I
> suspect it is an eastern or western boundary condition file. Make sure your
> files are long enough.
>
> -Matt
>
>
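
To check the "long enough" point above, a minimal sketch that compares the
size of a boundary file on disk with NY*NZ*NT values. It assumes flat
binary input at 32-bit precision (4 bytes per value); use 64 if the run
reads real*8 files. The function names are hypothetical, and NT here means
the number of time records the run will actually try to read.

```python
import os


def expected_obcs_bytes(ny, nz, nt, precision_bits=32):
    """Bytes needed for an east/west boundary file of NY*NZ*NT values."""
    return ny * nz * nt * (precision_bits // 8)


def check_file(path, ny, nz, nt, precision_bits=32):
    """Return (is_long_enough, actual_bytes, expected_bytes) for path."""
    expected = expected_obcs_bytes(ny, nz, nt, precision_bits)
    actual = os.path.getsize(path)
    return actual >= expected, actual, expected
```

If check_file() reports a file shorter than expected, the run would hit the
end of the file once it reaches the missing records, which would fit a job
that worked for a while and then failed in MDS_READ_SEC_YZ.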
> On Mar 25, 2020, at 12:24 PM, Yangxin He <y67he at uwaterloo.ca> wrote:
>
> Hello there,
>
> Recently several jobs of mine died for no apparent reason. The error message is
> [y67he at gra-login1 b6]$ more sim-29315632.err
> ABNORMAL END: S/R MDS_READ_SEC_YZ
> srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
> slurmstepd: error: *** JOB 29315632 ON gra228 CANCELLED AT
> 2020-03-25T08:30:08 DUE TO TIME LIMIT ***
>
> slurmstepd: error: *** STEP 29315632.0 ON gra228 CANCELLED AT
> 2020-03-25T08:30:08 DUE TO TIME LIMIT ***
> The time limit was not the problem. The code simply stopped producing any
> new results, however, it was still running.
> This is confusing, because I have been using the same set up for a while
> and this only started to happen in the past few weeks.
>
> I ran my code on Graham at Compute Canada, and the people there suggested
> the problem may be in the code.
> Can anyone shed any light on this?
>
> Thanks
>
> Yangxin
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support
>
>