[MITgcm-support] problem with nobackupp8

Tue Jan 14 12:45:34 EST 2014

Hi Matt, and thanks for getting back.
I am moving this back to the MITgcm support list, since it may be useful to others after all.
In my case, the cause was a temporary problem with the Lustre file system.
The MDS (metadata server) had unmounted. I resubmitted after the Lustre system
was fixed, and the job is running fine now.
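
For anyone who does hit the memory limit instead, the 80% rule of thumb from Matt's message quoted below can be checked with a few lines of arithmetic. This is only a minimal sketch: the function names and the example figures (2.0 GB available per processor, 1.7 GB needed per process) are hypothetical and are not measurements from Stampede or any other machine.

```python
def memory_budget_gb(available_per_processor_gb, safe_fraction=0.8):
    """Return the memory one process may use under the rule of thumb.

    safe_fraction=0.8 encodes the "no more than 80%" guideline; it is an
    empirical margin, not a documented system limit.
    """
    if available_per_processor_gb <= 0:
        raise ValueError("available memory must be positive")
    if not 0 < safe_fraction <= 1:
        raise ValueError("safe_fraction must be in (0, 1]")
    return available_per_processor_gb * safe_fraction


def fits_budget(needed_per_process_gb, available_per_processor_gb,
                safe_fraction=0.8):
    """True if the per-process footprint stays within the safe budget."""
    budget = memory_budget_gb(available_per_processor_gb, safe_fraction)
    return needed_per_process_gb <= budget


if __name__ == "__main__":
    # Hypothetical figures, for illustration only.
    available = 2.0   # GB available per processor (assumed)
    needed = 1.7      # GB the model needs per process (assumed)
    print("budget (GB):", memory_budget_gb(available))
    print("fits:", fits_budget(needed, available))
```

If the check fails, the usual options are to spread the run over more processors or to leave some cores on each node idle so that each process has more memory available.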

> Hi Dimitris
> 
> I have had this same error before -- a node is likely segfaulting due to your memory load. Judging by your subject, you already know this…
> 
> It's hard to diagnose exactly what the memory limits are, because a node does not always crash under the same load. After experimenting on Stampede, I find it safe not to use more than 80% of the available memory per processor.
> 
> It would be nice, at some point, to discuss our trials and tribulations of dealing with large runs… let me know if you want to Skype about it. (And unfortunately I won't be at the ECCO meeting this year…)
> 
> Matt

> On Jan 14, 2014, at 8:58 AM, Menemenlis, Dimitris (3248) wrote:
> 
>> The symptom of the problem I just reported by phone, for interactive jobid 3154859,
>> is that every time I attempt to start the model, I get the following error message and it
>> aborts:  forrtl: Cannot send after transport endpoint shutdown