[MITgcm-support] jobs died suddenly

Yangxin He y67he at uwaterloo.ca
Wed Mar 25 22:10:57 EDT 2020


Hi Matt,


So, for the sudden crash of the run, do you think the problem is in the code, or could it be something with Graham (Compute Canada)?


Yangxin

________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Matthew Mazloff <mmazloff at ucsd.edu>
Sent: Wednesday, March 25, 2020 4:28:30 PM
To: mitgcm-support at mitgcm.org
Subject: Re: [MITgcm-support] jobs died suddenly

That is a separate issue. The model crashed, but the HPC didn’t stop the job. I don’t know how to remedy that; the HPC support should be able to help.
The model was not running. The executable had stopped and that is your primary issue. I am not sure why the model crashed, but my first guess is it happened while trying to read an OBW or OBE file.
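A possible stopgap for the hung job, not something proposed in this thread: a small watchdog that looks for the "ABNORMAL END" marker in the job's .err file and, if found, cancels the job with Slurm's scancel. The function names and the dry_run switch below are hypothetical; only the marker string and scancel itself come from the log above. A minimal sketch under those assumptions:

```python
import subprocess

# Marker string taken from the .err file quoted in this thread.
MARKER = "ABNORMAL END"


def log_has_abnormal_end(err_path):
    """Return True if any line of the error log contains the marker."""
    with open(err_path, errors="replace") as fh:
        return any(MARKER in line for line in fh)


def cancel_if_crashed(err_path, job_id, dry_run=True):
    """Build (and optionally run) an scancel command if the model crashed.

    Returns the command as a list, or None if no marker was found.
    With dry_run=True (the default) nothing is executed.
    """
    if not log_has_abnormal_end(err_path):
        return None
    cmd = ["scancel", str(job_id)]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

Calling cancel_if_crashed("sim-29315632.err", 29315632) periodically would return the scancel command once the marker appears; passing dry_run=False would actually run it. Whether polling like this is acceptable on a given cluster is a question for the HPC support.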

Matt


On Mar 25, 2020, at 1:22 PM, Yangxin He <y67he at uwaterloo.ca> wrote:

Hi Matt,

Yep. This is part of my data file:
#obcs forcing
 periodicExternalForcing=.TRUE.,
 externForcingPeriod=36.,
 externForcingCycle=86148.,

 &
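For reference, the two values above fix how many time records one forcing cycle spans, assuming (as I understand the periodic forcing option) that the cycle must be a whole multiple of the period. The helper name below is hypothetical; this is only the arithmetic, not MITgcm code:

```python
def periodic_record_count(period_s, cycle_s):
    """Number of forcing records in one cycle (cycle / period).

    Raises ValueError if the cycle is not a whole multiple of the period.
    """
    n, remainder = divmod(cycle_s, period_s)
    if remainder != 0:
        raise ValueError("cycle is not a whole multiple of the period")
    return int(n)


# Values from the data file fragment above.
print(periodic_record_count(36.0, 86148.0))  # 2393
```

So under that assumption each boundary file read with these settings would need 2393 records.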

Apart from this, Enrico has the same problem: the run stops producing files but keeps running.
I submitted a ticket to Graham (Compute Canada); they did not know why and suggested I try here.

Yangxin
________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Matthew Mazloff <mmazloff at ucsd.edu>
Sent: Wednesday, March 25, 2020 4:17:53 PM
To: mitgcm-support at mitgcm.org
Subject: Re: [MITgcm-support] jobs died suddenly

Well, it definitely died while trying to read something for OBCS:
ABNORMAL END: S/R MDS_READ_SEC_YZ

Do you also supply boundary files whose start time and period are set in data.exf?
E.g.:
 obcsWstartdate1     = 20081216,
 obcsWstartdate2     = 00000,
 obcsWperiod         = 2629800,

Matt


On Mar 25, 2020, at 1:11 PM, Yangxin He <y67he at uwaterloo.ca> wrote:

Hi Matt,

This would be really confusing.
My file seems to be the right size, and the run died after running fine for 34 tidal periods. If the size of the boundary files were the problem, wouldn't the run have died at the beginning?
Another thing is that I have been using this setup for about a year now, and it ran fine until recently.

Yangxin
________________________________
From: MITgcm-support <mitgcm-support-bounces at mitgcm.org> on behalf of Matthew Mazloff <mmazloff at ucsd.edu>
Sent: Wednesday, March 25, 2020 4:06:38 PM
To: mitgcm-support at mitgcm.org
Subject: Re: [MITgcm-support] jobs died suddenly

Hello

The code crashed trying to read a file. The file has size NY*NZ*NT, so I suspect it is an eastern or western boundary condition file. Make sure your files are long enough.
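One way to check "long enough" is to divide the file size on disk by the size of a single NY*NZ record. The sketch below assumes flat binary files with 4 bytes per value (adjust bytes_per_value for 8-byte reals); the function name is hypothetical:

```python
import os


def obcs_record_count(path, ny, nz, bytes_per_value=4):
    """Return (whole_records, leftover_bytes) for a Y-Z boundary file.

    One record holds ny * nz values. A nonzero leftover means the file
    size is not a whole number of records for these dimensions.
    """
    record_bytes = ny * nz * bytes_per_value
    size = os.path.getsize(path)
    return divmod(size, record_bytes)
```

If the first number returned is smaller than the number of time records the run will ask for (NT), or the second number is not zero, the file is too short or does not match the grid dimensions passed in.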

-Matt


On Mar 25, 2020, at 12:24 PM, Yangxin He <y67he at uwaterloo.ca> wrote:

Hello there,

Recently several of my jobs died for no apparent reason. The error message is:
[y67he at gra-login1 b6]$ more sim-29315632.err
ABNORMAL END: S/R MDS_READ_SEC_YZ
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** JOB 29315632 ON gra228 CANCELLED AT 2020-03-25T08:30:08 DUE TO TIME LIMIT ***

slurmstepd: error: *** STEP 29315632.0 ON gra228 CANCELLED AT 2020-03-25T08:30:08 DUE TO TIME LIMIT ***
The time limit was not the problem. The code simply stopped producing new results, yet the job kept running.
This is confusing, because I have been using the same setup for a while and this only started happening in the past few weeks.

I ran my code on Graham at Compute Canada, and the people there suggested the problem may be in the code.
Can anyone shed any light on this?

Thanks

Yangxin
_______________________________________________
MITgcm-support mailing list
MITgcm-support at mitgcm.org
http://mailman.mitgcm.org/mailman/listinfo/mitgcm-support

