[MITgcm-support] problem with nobackupp8
Menemenlis, Dimitris (3248)
Dimitris.Menemenlis at jpl.nasa.gov
Tue Jan 14 12:45:34 EST 2014
Hi Matt, and thanks for getting back to me.
I am moving this back to the MITgcm support list, since it may be useful to others after all.
In my case, the cause was a temporary problem with the Lustre file system:
the MDS (metadata server) had unmounted. I resubmitted after the Lustre file system
was fixed, and the job is running fine now.
> Hi Dimitris
>
> I have gotten this same error before -- a node is likely hitting a segmentation fault due to your memory load. Judging by your subject line, you already know this.
>
> It's hard to diagnose exactly what the memory limits are, because a node will not always crash under the same load. After experimenting on Stampede, I find it safe to use no more than 80% of the available memory per processor.
>
> It would be nice, at some point, to discuss our trials and tribulations in dealing with large runs… let me know if you want to Skype about it. (Unfortunately, I won't be at the ECCO meeting this year.)
>
> Matt
> On Jan 14, 2014, at 8:58 AM, Menemenlis, Dimitris (3248) wrote:
>
>> The symptom of the problem I just reported by phone, for interactive jobid 3154859,
>> is that every time I attempt to start the model, I get the following error message and it
>> aborts: forrtl: Cannot send after transport endpoint shutdown