[MITgcm-support] CRAY XD1

Martin Losch mlosch at awi-bremerhaven.de
Thu May 12 17:20:55 EDT 2005


Chris,

there has been some progress on the CRAY XD1 problem. To recap: on 
our XD1 the model would not run with MPI (segmentation fault), but 
after removing the -r8 compiler flag I could run two different 
configurations on up to 30 and 32 CPUs (linux_amd64_pgf77+mpi_xd1). The 
next step would have been 50 and 64 CPUs, but with these I got 
segmentation faults again.
The Cray people (whom I am CC'ing on this email) found that adding a write 
statement in mon_out.F makes the 50-CPU run go through, pointing towards 
a stacksize problem. This is what they did:
> diff -c mon_out.F_orig mon_out.F
> *** mon_out.F_orig      Tue May 10 16:45:34 2005
> --- mon_out.F   Tue May 10 16:46:03 2005
> ***************
> *** 194,199 ****
> --- 194,200 ----
>  #endif /* ALLOW_USE_MPI */
>
>            IF (mon_write_stdout) THEN
> +           write(0,*) 'itype =',itype
>              IF (itype .EQ. 1)
>       &           WRITE(msgBuf(36:57),'(1X,I21)')       ival
>
>              IF (itype .EQ. 2)

I can reproduce that. The crash is again close to a "CALL PRINT_MESSAGE" 
statement (as was the original segmentation fault; see my initial email 
below), so I was wondering whether it has something to do with that 
routine. I commented out the entire body of PRINT_MESSAGE, but that did 
not do the trick. Do you have any idea what may be going on?
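
If it really is the stack, the limit can be raised in the shell before 
the job is launched; a minimal sketch (assuming a bash-like shell, with 
the csh/tcsh equivalent for comparison):

   # bash/sh: show the current stack limit (usually in kbytes), then remove it
   ulimit -s
   ulimit -s unlimited

   # csh/tcsh equivalent
   limit stacksize
   limit stacksize unlimited

Whether such a setting actually reaches the remote MPI ranks depends on 
how mpirun starts them, so it may have to go into the shell startup 
files on the compute nodes.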

Martin

PS: Still no test account for you, but I'll keep trying.
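
PPS: For completeness, the -r8 that I removed is the pgf77 flag that 
promotes default REALs to 8 bytes. In the optfile it sits among the 
Fortran flags, roughly like this (a hypothetical excerpt, not the 
literal contents of linux_amd64_pgf77+mpi_xd1):

   # with default-real promotion (hypothetical original setting):
   FFLAGS="-byteswapio -r8"
   # without it (what currently runs on the XD1):
   FFLAGS="-byteswapio"

As far as I can tell the model does not rely on the promotion, since 
its reals are declared explicitly through the _RL/_RS types.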

On Apr 5, 2005, at 1:13 PM, Chris Hill wrote:

> Hi Martin,
>
>  If you can get us an account, that would be great.
>
> Thanks,
>
> Chris
> Martin Losch wrote:
>> Chris,
>> I had help from CRAY support, and they could get rid of the 
>> problem by removing the -r8 option for the pgf77 compiler. Now I only 
>> get segmentation faults if the number of CPUs is too large 
>> (currently 50 is a problem). But the CRAY people can reproduce this 
>> and promised to find the problem. I assume that there's a bug in the 
>> MPI implementation (MPICH). In the meantime, I have asked for test 
>> accounts (again), in case you are interested, but no success so far 
>> (never-ending story). I'll keep you posted.
>> Martin
>> On Apr 1, 2005, at 3:17 PM, Chris Hill wrote:
>>> Hi Martin,
>>>
>>>  Did you get anywhere on this? (I can't see a reply, but I have been 
>>> away.)
>>>  One thing that can make this symptom happen is a stacksize problem in 
>>> the shell. Can you set the stacksize to "unlimited" for your 
>>> shell and see what happens? I can't remember the command for this 
>>> and I'm not properly online; it's something like "ulimit", I think.
>>>
>>> Chris
>>> Martin Losch wrote:
>>>
>>>> Hi,
>>>> we have a new platform, a CRAY XD1 (basically Opteron CPUs with a 
>>>> very fast network). I am having a hard time on this machine:
>>>> I was able to run a global_ocean experiment (180x80x23, the ecco 
>>>> 2deg configuration, with KPP, GMRedi, etc.) with up to 12 CPUs; it 
>>>> scales and everything. But I have a different configuration 
>>>> (1500x1x100 or 256x256x100, Cartesian coordinates, 
>>>> non-hydrostatic, but no other extra packages, no netcdf; it runs fine 
>>>> on many platforms with up to 20 CPUs so far), which ends with a 
>>>> segmentation fault in the routine packages_boot.F (actually, while 
>>>> printing a message with print_message). It works only with MPI turned 
>>>> off.
>>>> For these two cases I use the same build_options file 
>>>> (linux_amd64_pgf77+mpi_xd1).
>>>> Any idea what might be going on?
>>>> Martin
>>>
>>>
>



