[MITgcm-support] memory on optim.x => Exit code -5
Matthew Mazloff
mmazloff at MIT.EDU
Tue Jan 6 13:30:01 EST 2009
Hi Constantinos,
I received this from TACC help:
____________________________________________________________________
FROM: Lockman, John
(Concerning ticket No. 166079)
Matt,
32 GiB is the maximum that you can use for a serial job on a ranger node.
The error you are receiving is indeed a result of using more than the
32 GiB allotted on a node. I'm not sure I understand your idea of spanning
across multiple nodes on a serial job, as that would then make your job
parallel. Perhaps it would be beneficial for you to parallelize this job
so that you can split the work load [and memory] across multiple nodes.
Let me know if I can be of further assistance.
-john
____________________________________________________________________
As I do not wish to take the initiative to parallelize optim.x, I will
look for a remote machine to run it on. I believe my executable needs
about 50GB to run. (What size is the production run at these days?) I'll
ask Bruce about the local machines here... and I guess Stommel will
work if not...
oh well, I knew this would happen eventually
thanks
-Matt
On Jan 6, 2009, at 10:16 AM, Constantinos Evangelinos wrote:
> On Tuesday 06 January 2009 12:57:56 am Matthew Mazloff wrote:
>
>> I run this job for my smaller setup (much less memory needs to be
>> allocated), and everything goes through smoothly.
>> I run this again with everything the same but for my larger setup and
>> it crashes almost immediately with the error:
>> " Exit code -5 signaled from i142-401.ranger.tacc.utexas.edu Killing
>> remote processes..."
>
> If you do
> size optim.x
> what does it give you?
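> (For reference, size prints the static text, data and bss segment sizes
> of the executable; the sum text+data+bss is roughly the static memory
> footprint a node has to accommodate. The numbers below are purely
> illustrative, not from an actual build:)
>
>   $ size optim.x
>      text    data         bss         dec       hex filename
>   1638400  409600 51539607552 51541655552 c001f4000 optim.x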
>
>> Does anyone know what this error means? If this means I am asking
>> for too much memory (which seems the likely case), does anyone know
>> if there is a way to use 2 nodes (and thus reserve 64GB) for one
>> serial job on ranger?
>> (http://www.tacc.utexas.edu/services/userguides/ranger/)
>
> No - there is no way to see the memory across 2 different nodes without
> rewriting optim.x in a distributed memory fashion.
>
>> Or does anyone have a better idea? For example, would it be
>> ridiculous to bring the 4GB ecco_c* files over to Stommel (or some
>> other local machine) and run the linesearch there?
>
> Well ross/weddell do not have more than 32GB of RAM either, so stommel
> would be the only machine that could do it locally. If you decide to use
> stommel make sure to do the file transfers through ao or geo or itrda
> and not via ross/weddell.
>
> Unfortunately TACC does not have a system with more shared memory than
> 32GB. If you cannot do things in those 32GB then for future growth we
> have to rewrite optim.x.
>
> Constantinos
> --
> Dr. Constantinos Evangelinos
> Department of Earth, Atmospheric and Planetary Sciences
> Massachusetts Institute of Technology
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support