[MITgcm-support] jobs died suddenly

Yangxin He y67he at uwaterloo.ca
Wed Mar 25 15:24:00 EDT 2020


Hello there,


Recently, several of my jobs died for no apparent reason. The error message is:

[y67he@gra-login1 b6]$ more sim-29315632.err
ABNORMAL END: S/R MDS_READ_SEC_YZ
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** JOB 29315632 ON gra228 CANCELLED AT 2020-03-25T08:30:08 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 29315632.0 ON gra228 CANCELLED AT 2020-03-25T08:30:08 DUE TO TIME LIMIT ***

The time limit itself was not the problem. The code simply stopped producing new results, although it was still running.

This is confusing because I have been using the same setup for a while, and this only started happening in the past few weeks.


I ran my code on Graham at Compute Canada, and the people there suggested that the problem may be in the code.

Can anyone shed any light on this?


Thanks


Yangxin