[MITgcm-support] File reading error on XT5

Matthew Mazloff mmazloff at MIT.EDU
Thu Jan 22 15:40:17 EST 2009


Hi David,

Yeah, that error usually only shows up on large adjoint runs, for
whatever reason.

It means your MPI barriers are not working properly and basically one
processor is ahead of another.
One processor has finished reading a namelist parameter file (e.g. in
ini_parms) and has written the next scratch file, perhaps for data.exf
or something, while another one is still in ini_parms. That slower one
goes to read the scratch file thinking it holds "data" when it actually
holds data.exf, and it crashes.

I have a hack that makes every processor write its own scratch file,
and then there is no issue. The only drawback is that it will fill your
directory with scratch files.

Here's the hack I did that you can use until someone fixes this  
properly.

First #define TARGET_BGL in ECCO_CPPOPTIONS.h, or I guess CPP_OPTIONS.h
if you don't use ECCO. Or, as you did before, just compile with
-DTARGET_CRAYXT.

In OPEN_COPY_DATA_FILE add

#include "EESUPPORT.h"

       CHARACTER*(MAX_LEN_FNAM) scrname1
       CHARACTER*(MAX_LEN_FNAM) scrname2

and then incorporate the CMM( ... ) stuff:

#if defined (TARGET_BGL) || defined (TARGET_CRAYXT)
CMM(
       WRITE(scrname1,'(3a)') 'scratch',myProcessStr(1:4),'_1'
       WRITE(scrname2,'(3a)') 'scratch',myProcessStr(1:4),'_2'
CMM)
       OPEN(UNIT=scrUnit1,FILE=scrname1,STATUS='UNKNOWN')
       OPEN(UNIT=scrUnit2,FILE=scrname2,STATUS='UNKNOWN')
CMM      OPEN(UNIT=scrUnit1,FILE='scratch1',STATUS='UNKNOWN')
CMM      OPEN(UNIT=scrUnit2,FILE='scratch2',STATUS='UNKNOWN')
#else
       OPEN(UNIT=scrUnit1,STATUS='SCRATCH')
       OPEN(UNIT=scrUnit2,STATUS='SCRATCH')
#endif


You may also find this issue with ini_parms.F and eeset_parms.F, so
you probably want to incorporate the code there too.
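
To try the per-process naming idea outside the model before patching
those routines, something like the following standalone sketch may
help. It is not MITgcm code: the program name SCRDEMO, the variable
procStr (a stand-in for myProcessStr) and its value '0003', the unit
number, and the test record are all made up for illustration, and the
DELETE on close is an addition for this sketch only, not part of the
hack above.

```fortran
C     Standalone sketch, not MITgcm code. Each process builds its own
C     scratch file name so no two processes share a scratch file.
      PROGRAM SCRDEMO
      IMPLICIT NONE
      INTEGER MAXLEN
      PARAMETER ( MAXLEN = 512 )
      CHARACTER*(MAXLEN) scrname1
      CHARACTER*(4) procStr
      CHARACTER*(80) record
      INTEGER scrUnit1

C     Assumed unit number, chosen arbitrarily for this sketch.
      scrUnit1 = 11
C     Hypothetical rank string. In MITgcm the equivalent would be
C     myProcessStr(1:4) from EESUPPORT.h.
      procStr = '0003'

C     Build a per-process name: 'scratch' // rank string // '_1'
      WRITE(scrname1,'(3A)') 'scratch', procStr, '_1'

C     Write one record, rewind, read it back, then delete the file.
      OPEN(UNIT=scrUnit1,FILE=scrname1,STATUS='UNKNOWN')
      WRITE(scrUnit1,'(A)') ' &PARM01'
      REWIND(scrUnit1)
      READ(scrUnit1,'(A)') record
      CLOSE(scrUnit1,STATUS='DELETE')

      PRINT *, 'scratch file: ', scrname1(1:13)
      PRINT *, 'read back   : ', record(1:8)
      END
```

Because the rank string is part of the file name, a process that runs
ahead can only overwrite its own scratch file, never the one a slower
process is still reading.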

I think this is your issue....of course, I could be wrong :-)
-Matt



On Jan 22, 2009, at 12:04 PM, David Hebert wrote:

> Hi everyone,
>
> I have recently got some time on a Cray XT5. Every now and then I  
> seem to get this error message I can't seem to figure out...
>
> PGFIO/stdio: No such file or directory
> PGFIO-F-/OPEN/unit=12/error code returned by host stdio - 2.
> In source file open_copy_data_file.f, at line number 616
>
>
> Is this a namelist reading issue? Has anyone else come across this  
> issue? It seems the problem is intermittent and not always  
> replicated. Compiling with -DTARGET_CRAYXT does not seem to fix the  
> issue. Any help/suggestions are appreciated!
>
> Thanks
>
> David
>
> _______________________________________________
> MITgcm-support mailing list
> MITgcm-support at mitgcm.org
> http://mitgcm.org/mailman/listinfo/mitgcm-support



