[MITgcm-support] memory on optim.x => Exit code -5
Matthew Mazloff
mmazloff at MIT.EDU
Tue Jan 6 13:30:01 EST 2009
Hi Constantinos,
I received this from TACC help:
____________________________________________________________________
FROM: Lockman, John
(Concerning ticket No. 166079)
Matt,
32 GiB is the maximum that you can use for a serial job on a ranger node.
The error you are receiving is indeed a result of using more than the
32 GiB allotted on a node. I'm not sure I understand your idea of spanning
across multiple nodes on a serial job, as that would then make your job
parallel. Perhaps it would be beneficial for you to parallelize this job
so that you can split the work load [and memory] across multiple nodes.
Let me know if I can be of further assistance.
-john
____________________________________________________________________
As I do not wish to take the initiative to parallelize optim.x, I will
look for a remote machine to run it on. I believe my executable needs
about 50GB to run. (What size is the production run at these days?) I'll
ask Bruce about the local machines here... and I guess Stommel will
work if not...
oh well, I knew this would happen eventually
thanks
-Matt
On Jan 6, 2009, at 10:16 AM, Constantinos Evangelinos wrote:
> On Tuesday 06 January 2009 12:57:56 am Matthew Mazloff wrote:
>
>> I run this job for my smaller setup (much less memory needs to be
>> allocated), and everything goes through smoothly.
>> I run this again with everything the same but for my larger setup and
>> it crashes almost immediately with the error:
>> " Exit code -5 signaled from i142-401.ranger.tacc.utexas.edu Killing
>> remote processes..."
>
> If you do
> size optim.x
> what does it give you?
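> (For reference, size prints the static text, data and bss segment sizes
> of the executable; the sum text+data+bss is roughly the static memory
> footprint a node has to accommodate. The numbers below are purely
> illustrative, not from an actual build:)
>
>   $ size optim.x
>      text    data         bss         dec       hex filename
>   1638400  409600 51539607552 51541655552 c001f4000 optim.x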
>
>> Does anyone know what this error means? If this means I am asking
>> for too much memory (which seems the likely case), does anyone know
>> if there is a way to use 2 nodes (and thus reserve 64GB) for one
>> serial job on ranger?
>> (http://www.tacc.utexas.edu/services/userguides/ranger/)
>
> No - there is no way to see the memory across 2 different nodes without
> rewriting optim.x in a distributed memory fashion.
>
>> Or does anyone have a better idea? For example, would it be
>> ridiculous to bring the 4GB ecco_c* files over to Stommel (or some
>> other local machine) and run the linesearch there?
>
> Well ross/weddell do not have more than 32GB of RAM either, so stommel
> would be the only machine that could do it locally. If you decide to use
> stommel make sure to do the file transfers through ao or geo or itrda
> and not via ross/weddell.
>
> Unfortunately TACC does not have a system with more shared memory than
> 32GB. If you cannot do things in those 32GB then for future growth we
> have to rewrite optim.x.
>
> Constantinos
> --
> Dr. Constantinos Evangelinos
> Department of Earth, Atmospheric and Planetary Sciences
> Massachusetts Institute of Technology
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support