[MITgcm-devel] regression test with gfortran v10

Jean-Michel Campin jmc at mit.edu
Mon Jan 4 19:00:35 EST 2021


Hi,

Scott (Blomquist) upgraded the VM "batsi.mit.edu" to the latest Fedora (fc33), which comes
with a recent version of gcc/gfortran: 10.2.1

I had to add another compiler flag to the optfile "linux_amd64_gfortran", namely:
 -fallow-argument-mismatch
(thanks Oliver), because otherwise none of the MPI or Adjoint test experiments would compile.

I have not yet submitted a PR with this new optfile (I might try to modify the
current one to handle both cases), but I have been using it for the daily tests since Dec 16,
and it seems to work (you can check the testing page).
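
For reference, a minimal sketch of how the optfile could pick up the flag only for recent
compilers (the version test below is just a guess on my side; only -fallow-argument-mismatch
itself is what I actually added):

  # inside tools/build_options/linux_amd64_gfortran :
  # query the gfortran major version and add the new flag only for gcc >= 10,
  # where argument-mismatch diagnostics became hard errors by default
  FC_MAJOR=`$FC -dumpversion | cut -d. -f1`
  if [ "$FC_MAJOR" -ge 10 ] ; then
      FFLAGS="$FFLAGS -fallow-argument-mismatch"
  fi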

One remaining issue is that the tests using MPI, and especially the Adjoint tests with MPI,
sometimes fail. This happens most often with the global_ocean.90x40x15 or isomip experiments,
 e.g.: > grep '^Y Y Y N' tr_batsi_20201231_0/summary.txt
Y Y Y N .. .. .. N/O   global_ocean.90x40x15.bottomdrag
Y Y Y N .. .. .. N/O   isomip  (e=0, w=4)
  or:  > grep '^Y Y Y N' tr_batsi_20210102_0/summary.txt
Y Y Y N .. .. .. N/O   global_ocean.90x40x15  (e=0, w=15)
Y Y Y N .. .. .. N/O   isomip.htd
The error is not reproducible: after a failure, I can try again with the same executable and it will either
fail again with a different error or just run fine.
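
To check this, I simply re-run the already-built executable a few times by hand in the run
directory, along these lines (the process count and log names are only illustrative):

  for i in 1 2 3 ; do
      mpirun -np 4 ./mitgcmuv > output.try$i.txt 2>&1
      echo "try $i : exit status = $?"
  done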

The errors I have seen are either:
1) a floating-point exception, like:
> Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
or
> [batsi:1593159:0:1593159] Caught signal 8 (Floating point exception: floating-point invalid operation)
2) an MPI-related error, like this one:
> [1608995370.889564] [batsi:997058:0]           sock.c:344  UCX  ERROR recv(fd=29) failed: Bad address
> [1608995370.889735] [batsi:997061:0]           sock.c:344  UCX  ERROR recv(fd=30) failed: Connection reset by peer
> [1608995371.117227] [batsi:997061:0]           sock.c:344  UCX  ERROR sendv(fd=-1) failed: Bad file descriptor
> [batsi:997061] *** An error occurred in MPI_Send
> [batsi:997061] *** reported by process [2123890689,3]
> [batsi:997061] *** on communicator MPI_COMM_WORLD
> [batsi:997061] *** MPI_ERR_OTHER: known error not in list
> [batsi:997061] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [batsi:997061] ***    and potentially your MPI job)

It might have something to do with batsi itself (or some memory limit), since it is not reproducible.
Any ideas?
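
Two things I could try, although both are just guesses at this point: check the per-process
limits on batsi, and re-run one failing case with Open MPI's UCX layer bypassed to see whether
the sock.c errors go away:

  # check shell limits (stack size, max memory) on batsi
  ulimit -a
  # force Open MPI to use the ob1 PML instead of UCX for one run
  mpirun --mca pml ob1 -np 4 ./mitgcmuv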

Thanks,
Jean-Michel

