[MITgcm-support] A very strange problem (Yuan Lian)
Yuan Lian
lian at ashimaresearch.com
Sun Jun 6 01:20:42 EDT 2010
Hi,
I would suggest you run the code without MPI (if you haven't done so)
and see if you still have the same problem.
Yuan
On 6/5/10 7:39 PM, ouc.edu.cn wrote:
> Hi again,
> Thank you very much for all your suggestions. But this strange problem
> is still unresolved.
> It does not look like a problem of too large a time step: when I reduce
> it so that advcfl_W_hf_max drops below 0.1, the model still
> terminates without any sign of blowing up, and with the same error:
> [node4:16165] *** Process received signal ***
> [node4:16165] Signal: Floating point exception (8)
> [node4:16165] Signal code: Floating point divide-by-zero (3)
> [node4:16165] Failing at address: 0x4b8e76
> [node4:16165] [ 0] /lib64/libpthread.so.0 [0x2b41bcf60c00]
> [node4:16165] [ 1] ./mitgcmuv_32p_gcc [0x4b8e76]
> [node4:16165] *** End of error message ***
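
The signal report above already names the signal code and the failing address, which is usually the most useful starting point. A minimal Python sketch for pulling those fields out of a job log follows; the function name `parse_signal_report` is hypothetical, and the line layout is assumed to match the lines quoted in this thread.

```python
import re

# Matches report lines such as
# "[node4:16165] Signal code: Floating point divide-by-zero (3)".
# The layout is taken from the lines quoted in this thread (an assumption
# for any other MPI version or launcher).
_FIELD = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+"
    r"(?P<key>Signal|Signal code|Failing at address):\s+(?P<value>.+)"
)


def parse_signal_report(text):
    """Group signal, signal code and failing address by (host, pid)."""
    reports = {}
    for line in text.splitlines():
        # Drop e-mail quote markers so that quoted logs parse as well.
        line = line.lstrip("> ").strip()
        match = _FIELD.match(line)
        if match:
            key = (match["host"], int(match["pid"]))
            reports.setdefault(key, {})[match["key"]] = match["value"].strip()
    return reports


sample = """\
> [node4:16165] *** Process received signal ***
> [node4:16165] Signal: Floating point exception (8)
> [node4:16165] Signal code: Floating point divide-by-zero (3)
> [node4:16165] Failing at address: 0x4b8e76
> [node4:16165] *** End of error message ***
"""

for (host, pid), fields in parse_signal_report(sample).items():
    print(host, pid, fields)
```

If the executable was built with debug symbols, the failing address can then be handed to a tool such as `addr2line -e ./mitgcmuv_32p_gcc 0x4b8e76` to look for the source line of the division. Whether that resolves to something useful depends on how the binary was compiled.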
> Again, I cannot see any sign of blowing up in the 'STDOUT' files. As
> for Yuan's suggestion, that is not the problem: all the initial files
> are good and well initialized, and double precision is used.
> I suspect this may be a problem with the cluster itself. Two months
> ago my model ran correctly with the very same configuration, but
> now this strange problem appears. I have little knowledge of
> cluster matters, so any advice is appreciated.
> Cheers,
> Dwight
>
> On 2010-06-04 02:46:56, mitgcm-support-request at mitgcm.org wrote:
> >Send MITgcm-support mailing list submissions to
> > mitgcm-support at mitgcm.org
> >
> >To subscribe or unsubscribe via the World Wide Web, visit
> > http://mitgcm.org/mailman/listinfo/mitgcm-support
> >or, via email, send a message with subject or body 'help' to
> > mitgcm-support-request at mitgcm.org
> >
> >You can reach the person managing the list at
> > mitgcm-support-owner at mitgcm.org
> >
> >When replying, please edit your Subject line so it is more specific
> >than "Re: Contents of MITgcm-support digest..."
> >
> >
> >Today's Topics:
> >
> > 1. Re: A very strange problem (m. r. schaferkotter)
> > 2. Re: A very strange problem (Yuan Lian)
> >
> >
> >----------------------------------------------------------------------
> >
> >Message: 1
> >Date: Thu, 3 Jun 2010 13:28:03 -0500
> >From: "m. r. schaferkotter"<schaferk at bellsouth.net>
> >To: mitgcm-support at mitgcm.org
> >Subject: Re: [MITgcm-support] A very strange problem
> >Message-ID:<0707528E-9AF5-4435-B0D4-AC0C1471343D at bellsouth.net>
> >Content-Type: text/plain; charset="gb2312"; Format="flowed";
> > DelSp="yes"
> >
> >Try reducing the time step, since
> >
> >(PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01
> >
> >could indicate that a smaller time step is needed.
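
To put a number on that suggestion: the two monitor blocks quoted in this thread are 12.5 s apart (time_secondsf 3950.0 and 3962.5), and an advective CFL number grows linearly with the time step as long as the velocities and the grid stay the same. Under that assumption, a rough Python sketch of the time step needed to reach a chosen target looks like this. The target of 0.1 and the function name `rescaled_time_step` are illustrative choices, not MITgcm settings.

```python
def rescaled_time_step(dt, cfl_now, cfl_target):
    """Scale dt so that a CFL number proportional to dt reaches cfl_target.

    Assumes the velocity field and grid spacing are unchanged, so that
    CFL = |w| * dt / dz depends on dt alone.
    """
    return dt * cfl_target / cfl_now


# Time step inferred from two consecutive time_secondsf values in the log.
dt_now = 3962.5 - 3950.0

# advcfl_W_hf_max at time step 316, as quoted in this thread.
cfl_now = 2.2335820994742e-01

dt_new = rescaled_time_step(dt_now, cfl_now, cfl_target=0.1)
print(f"current dt = {dt_now} s, suggested dt = {dt_new:.2f} s")
```

With the values above this gives a time step of roughly 5.6 s. It is only a first guess, because the velocities themselves change once the time step does.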
> >
> >
> >On Jun 3, 2010, at 4:08 AM, ouc.edu.cn wrote:
> >
> >> Hi Jean-Michel,
> >> Thank you for your reply.
> >> From the 'STDOUT.00xx' files I'm sure the model didn't blow up,
> >> because everything looks normal, and all 32 files are much
> >> the same and contain almost the same information. Below is the
> >> output of the last two iterations:
> >> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
> >> (PID.TID 0000.0001) // =======================================================
> >> (PID.TID 0000.0001) %MON time_tsnumber = 316
> >> (PID.TID 0000.0001) %MON time_secondsf = 3.9500000000000E+03
> >> (PID.TID 0000.0001) %MON dynstat_eta_max = 4.2351103155441E-01
> >> (PID.TID 0000.0001) %MON dynstat_eta_min = -2.9095841140222E-01
> >> (PID.TID 0000.0001) %MON dynstat_eta_mean = 2.2266055703883E-05
> >> (PID.TID 0000.0001) %MON dynstat_eta_sd = 2.4506462290779E-03
> >> (PID.TID 0000.0001) %MON dynstat_eta_del2 = 1.3655537149458E-03
> >> (PID.TID 0000.0001) %MON dynstat_uvel_max = 2.9081925049449E+00
> >> (PID.TID 0000.0001) %MON dynstat_uvel_min = -2.3945333213539E+00
> >> (PID.TID 0000.0001) %MON dynstat_uvel_mean = -5.9096408449438E-05
> >> (PID.TID 0000.0001) %MON dynstat_uvel_sd = 3.4343631952692E-03
> >> (PID.TID 0000.0001) %MON dynstat_uvel_del2 = 4.7379871670678E-03
> >> (PID.TID 0000.0001) %MON dynstat_vvel_max = 2.5933981864330E+00
> >> (PID.TID 0000.0001) %MON dynstat_vvel_min = -3.2231876263118E+00
> >> (PID.TID 0000.0001) %MON dynstat_vvel_mean = 1.4348687349928E-06
> >> (PID.TID 0000.0001) %MON dynstat_vvel_sd = 1.2624775236425E-03
> >> (PID.TID 0000.0001) %MON dynstat_vvel_del2 = 3.8178362135143E-03
> >> (PID.TID 0000.0001) %MON dynstat_wvel_max = 2.9846705628259E-01
> >> (PID.TID 0000.0001) %MON dynstat_wvel_min = -2.3853942232519E-01
> >> (PID.TID 0000.0001) %MON dynstat_wvel_mean = -2.1033468915427E-09
> >> (PID.TID 0000.0001) %MON dynstat_wvel_sd = 1.9417325836487E-04
> >> (PID.TID 0000.0001) %MON dynstat_wvel_del2 = 9.3761000810601E-04
> >> (PID.TID 0000.0001) %MON dynstat_theta_max = 3.7963165120273E+01
> >> (PID.TID 0000.0001) %MON dynstat_theta_min = -2.1436513114643E+01
> >> (PID.TID 0000.0001) %MON dynstat_theta_mean = 5.9453514759821E+00
> >> (PID.TID 0000.0001) %MON dynstat_theta_sd = 6.2570503287055E+00
> >> (PID.TID 0000.0001) %MON dynstat_theta_del2 = 2.5262750966653E-02
> >> (PID.TID 0000.0001) %MON dynstat_salt_max = 3.8107582424356E+01
> >> (PID.TID 0000.0001) %MON dynstat_salt_min = 0.0000000000000E+00
> >> (PID.TID 0000.0001) %MON dynstat_salt_mean = 3.4534555031168E+01
> >> (PID.TID 0000.0001) %MON dynstat_salt_sd = 1.2887956836604E-01
> >> (PID.TID 0000.0001) %MON dynstat_salt_del2 = 2.5964953005350E-03
> >> (PID.TID 0000.0001) %MON advcfl_uvel_max = 1.4540962524725E-01
> >> (PID.TID 0000.0001) %MON advcfl_vvel_max = 4.0289845328898E-02
> >> (PID.TID 0000.0001) %MON advcfl_wvel_max = 2.1666787542532E-01
> >> (PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01
> >> (PID.TID 0000.0001) %MON pe_b_mean = 1.2621239225557E-08
> >> (PID.TID 0000.0001) %MON ke_max = 4.6011781196852E+00
> >> (PID.TID 0000.0001) %MON ke_mean = 4.6615962365024E-06
> >> (PID.TID 0000.0001) %MON ke_vol = 9.9365095001242E+17
> >> (PID.TID 0000.0001) %MON vort_r_min = -7.2511662603239E-03
> >> (PID.TID 0000.0001) %MON vort_r_max = 6.4558430600778E-03
> >> (PID.TID 0000.0001) %MON vort_a_mean = 5.0662999065604E-05
> >> (PID.TID 0000.0001) %MON vort_a_sd = 7.4966761920324E-07
> >> (PID.TID 0000.0001) %MON vort_p_mean = 7.3612614840296E-05
> >> (PID.TID 0000.0001) %MON vort_p_sd = 5.9336048842539E-05
> >> (PID.TID 0000.0001) %MON surfExpan_theta_mean = -9.9981862111201E-08
> >> (PID.TID 0000.0001) %MON surfExpan_salt_mean = -1.1922103976760E-07
> >> (PID.TID 0000.0001) // =======================================================
> >> (PID.TID 0000.0001) // End MONITOR dynamic field statistics
> >> (PID.TID 0000.0001) // =======================================================
> >> cg2d: Sum(rhs),rhsMax = -4.85042010003101E+00 3.71869645807112E-02
> >> (PID.TID 0000.0001) cg2d_init_res = 4.74339332785140E-02
> >> (PID.TID 0000.0001) cg2d_iters = 126
> >> (PID.TID 0000.0001) cg2d_res = 9.36545477468715E-14
> >> cg3d: Sum(rhs),rhsMax = 3.24606069488422E-10 1.20245356700100E-09
> >> (PID.TID 0000.0001) cg3d_init_res = 1.02771503214190E-01
> >> (PID.TID 0000.0001) cg3d_iters = 20
> >> (PID.TID 0000.0001) cg3d_res = 3.71423006414093E-03
> >> (PID.TID 0000.0001) // =======================================================
> >> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
> >> (PID.TID 0000.0001) // =======================================================
> >> (PID.TID 0000.0001) %MON time_tsnumber = 317
> >> (PID.TID 0000.0001) %MON time_secondsf = 3.9625000000000E+03
> >> (PID.TID 0000.0001) %MON dynstat_eta_max = 4.2644526849910E-01
> >>
> >> I cannot see anything abnormal. (I increased cg3dMaxIters to a
> >> much larger number this afternoon, and the model stopped after even
> >> fewer time steps.)
> >> In sum, according to the STDOUT files the model did not blow up, but
> >> the machine did give an error message in the file "my_job.o1562", i.e.,
> >> [node7:03388] Signal: Floating point exception (8)
> >> [node7:03388] Signal code: Floating point divide-by-zero (3)
> >> This problem never happened when I set the Coriolis parameter f0=0 and
> >> the initial V velocity to zero. However, when I change either of these
> >> two parameters, the problem appears. I'm really confused. Do you
> >> think there is a bug in the model or something?
> >> Best Wishes,
> >> Dwight
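
With 32 STDOUT files it is easier to scan the monitor output by script than by eye. The Python sketch below reads `%MON name = value` lines and lists every `advcfl_*` entry above a chosen limit. It assumes each monitor block starts with `time_tsnumber`, as in the excerpt quoted in this message, and the helper names `read_monitor` and `cfl_exceeding` are hypothetical.

```python
import re

# One monitor entry per line, e.g.
# "(PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01"
_MON = re.compile(r"%MON\s+(\S+)\s*=\s*(\S+)")


def read_monitor(lines):
    """Return one dict per monitor block, keyed by the %MON name.

    Assumes time_tsnumber is the first %MON entry of each block.
    """
    records = []
    for line in lines:
        match = _MON.search(line)
        if not match:
            continue
        name, value = match.group(1), float(match.group(2))
        if name == "time_tsnumber":
            records.append({})
        if records:
            records[-1][name] = value
    return records


def cfl_exceeding(records, limit):
    """List (time step, name, value) for every advcfl_* entry above limit."""
    hits = []
    for record in records:
        for name, value in record.items():
            if name.startswith("advcfl_") and value > limit:
                hits.append((int(record["time_tsnumber"]), name, value))
    return hits


sample_lines = [
    "(PID.TID 0000.0001) %MON time_tsnumber = 316",
    "(PID.TID 0000.0001) %MON advcfl_uvel_max = 1.4540962524725E-01",
    "(PID.TID 0000.0001) %MON advcfl_vvel_max = 4.0289845328898E-02",
    "(PID.TID 0000.0001) %MON advcfl_wvel_max = 2.1666787542532E-01",
    "(PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01",
    "(PID.TID 0000.0001) %MON time_tsnumber = 317",
]

for step, name, value in cfl_exceeding(read_monitor(sample_lines), limit=0.2):
    print(step, name, value)
```

For a real run the same functions can be fed `open("STDOUT.0000")`, or looped over all of the per-process files to compare the last time step each one reports.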
> >>
> >>
> >> _______________________________________________
> >> MITgcm-support mailing list
> >> MITgcm-support at mitgcm.org
> >> http://mitgcm.org/mailman/listinfo/mitgcm-support
> >
> >
> >------------------------------
> >
> >Message: 2
> >Date: Thu, 03 Jun 2010 11:46:45 -0700
> >From: Yuan Lian<lian at ashimaresearch.com>
> >To: mitgcm-support at mitgcm.org
> >Subject: Re: [MITgcm-support] A very strange problem
> >Message-ID:<4C07F895.3080000 at ashimaresearch.com>
> >Content-Type: text/plain; charset="gb2312"
> >
> >The number doesn't seem to be very large ...
> >
> >Maybe there is a truncation error that makes the "divide by zero" occur?
> >Have you checked whether the initial data files have been generated and
> >read by MITgcm correctly, i.e., that all initial data files use double
> >precision?
> >
> >Yuan
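
One quick way to check the precision question is to compare each input file's size with the grid size, since a flat binary field holds one 4-byte or 8-byte real per grid point. The Python sketch below does only that. The grid dimensions and the file name are hypothetical placeholders, and the result still has to agree with whatever precision the model is told to read (in MITgcm this is normally the `readBinaryPrec` setting in the `data` file).

```python
import os


def precision_from_size(size_bytes, nx, ny, nz=1):
    """Return 64 or 32 if size_bytes matches one real per grid point, else None."""
    n_values = nx * ny * nz
    if size_bytes == 8 * n_values:
        return 64
    if size_bytes == 4 * n_values:
        return 32
    return None


def file_precision(path, nx, ny, nz=1):
    """Same check, applied to a file on disk."""
    return precision_from_size(os.path.getsize(path), nx, ny, nz)


# Hypothetical grid: 400 x 200 points with 50 levels.
nx, ny, nz = 400, 200, 50
print(precision_from_size(8 * nx * ny * nz, nx, ny, nz))  # double-precision 3-D field
print(precision_from_size(4 * nx * ny, nx, ny))           # single-precision 2-D field

# For a real file (the name is a placeholder):
# print(file_precision("T.init", nx, ny, nz))
```

A size that matches neither case usually means the file was written for a different grid, or with record markers from a Fortran sequential write.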
> >
> >
> >
> >------------------------------
> >
> >End of MITgcm-support Digest, Vol 84, Issue 4
> >*********************************************
>