[MITgcm-support] A very strange problem
ouc.edu.cn
ouc.edu.cn at 163.com
Sun Jun 6 03:48:43 EDT 2010
Hi Yuan,
It would not be possible to run my model without MPI, because the model domain is rather large (200*3000*100 grid points). Do you think this is a problem with MPI? Why would that be?
Thanks.
Cheers,
Dwight
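
(For reference, a quick way to check whether a 200*3000 horizontal grid can even be split evenly across the 32 MPI processes used here is to enumerate the candidate process grids. The snippet below is only an illustrative sketch in plain Python, not part of MITgcm; the grid size and process count are taken from this thread.)

    # Illustrative only: which (nPx, nPy) process grids tile a
    # 200 x 3000 horizontal domain evenly across 32 MPI ranks?
    NX, NY, NPROCS = 200, 3000, 32

    for npx in range(1, NPROCS + 1):
        if NPROCS % npx:
            continue
        npy = NPROCS // npx
        if NX % npx == 0 and NY % npy == 0:
            print(f"nPx={npx:2d}  nPy={npy:2d}  ->  tile size {NX // npx} x {NY // npy}")

    # For these numbers only nPx=4/nPy=8 and nPx=8/nPy=4 divide evenly.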
On 2010-06-06 13:20:59, mitgcm-support-request at mitgcm.org wrote:
>Today's Topics:
>
> 1. Re: A very strange problem (Yuan Lian) (Yuan Lian)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Sat, 05 Jun 2010 22:20:42 -0700
>From: Yuan Lian <lian at ashimaresearch.com>
>To: mitgcm-support at mitgcm.org
>Subject: Re: [MITgcm-support] A very strange problem (Yuan Lian)
>
>Hi,
>
>I would suggest you run the code without MPI (if you haven't done so)
>and see if you still have the same problem.
>
>Yuan
>
>
>On 6/5/10 7:39 PM, ouc.edu.cn wrote:
>> Hi again,
>> Thank you very much for all your suggestions, but this strange problem
>> is still unresolved.
>> It does not look like a problem with too large a time step: when I reduce
>> it so that advcfl_W_hf_max drops well below 0.1, the model still
>> terminates without any sign of blowing up, and with the same error:
>> [node4:16165] *** Process received signal ***
>> [node4:16165] Signal: Floating point exception (8)
>> [node4:16165] Signal code: Floating point divide-by-zero (3)
>> [node4:16165] Failing at address: 0x4b8e76
>> [node4:16165] [ 0] /lib64/libpthread.so.0 [0x2b41bcf60c00]
>> [node4:16165] [ 1] ./mitgcmuv_32p_gcc [0x4b8e76]
>> [node4:16165] *** End of error message ***
>> Again, I cannot see any sign of blowing up in the 'STDOUT' files. As
>> for Yuan's suggestion, that is not the problem: all the initial files
>> are good and well initialized, and double precision is used.
>> I suspect this may be a problem with the cluster itself. Two months
>> ago my model ran correctly with the very same configuration, but
>> now this strange problem appears. Since I have little knowledge of
>> cluster administration, any advice is appreciated.
>> Cheers,
>> Dwight
>>
>> On 2010-06-04 02:46:56, mitgcm-support-request at mitgcm.org wrote:
>> >Today's Topics:
>> >
>> > 1. Re: A very strange problem (m. r. schaferkotter)
>> > 2. Re: A very strange problem (Yuan Lian)
>> >
>> >
>> >----------------------------------------------------------------------
>> >
>> >Message: 1
>> >Date: Thu, 3 Jun 2010 13:28:03 -0500
>> >From: "m. r. schaferkotter" <schaferk at bellsouth.net>
>> >To: mitgcm-support at mitgcm.org
>> >Subject: Re: [MITgcm-support] A very strange problem
>> >
>> >Try reducing the time step, as
>> >
>> >(PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01
>> >
>> >could indicate that a smaller time step is needed.
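
(To put a rough number on that suggestion: the monitor output quoted below shows time_secondsf going from 3.9500E+03 at step 316 to 3.9625E+03 at step 317, i.e. a time step of 12.5 s, with advcfl_W_hf_max of about 0.223. A plain-Python sketch of the scaling, with the target CFL of 0.1 chosen arbitrarily:)

    # Rough arithmetic only; values taken from the monitor output below.
    dt_old   = 12.5      # s, inferred from time_secondsf at steps 316 -> 317
    cfl_now  = 0.2234    # advcfl_W_hf_max reported in the log
    cfl_goal = 0.1       # arbitrary, conservative target

    dt_new = dt_old * cfl_goal / cfl_now
    print(f"suggested deltaT <= {dt_new:.1f} s")   # roughly 5.6 s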
>> >
>> >
>> >On Jun 3, 2010, at 4:08 AM, ouc.edu.cn wrote:
>> >
>> >> Hi Jean-Michel,
>> >> Thank you for your reply.
>> >> From the 'STDOUT.00xx' files I'm sure the model didn't blow up,
>> >> because everything looks normal, and all these 32 files are much the
>> >> same and contain almost the same information. Below is the
>> >> information from the last two iterations:
>> >> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
>> >> (PID.TID 0000.0001) // =======================================================
>> >> (PID.TID 0000.0001) %MON time_tsnumber = 316
>> >> (PID.TID 0000.0001) %MON time_secondsf = 3.9500000000000E+03
>> >> (PID.TID 0000.0001) %MON dynstat_eta_max = 4.2351103155441E-01
>> >> (PID.TID 0000.0001) %MON dynstat_eta_min = -2.9095841140222E-01
>> >> (PID.TID 0000.0001) %MON dynstat_eta_mean = 2.2266055703883E-05
>> >> (PID.TID 0000.0001) %MON dynstat_eta_sd = 2.4506462290779E-03
>> >> (PID.TID 0000.0001) %MON dynstat_eta_del2 = 1.3655537149458E-03
>> >> (PID.TID 0000.0001) %MON dynstat_uvel_max = 2.9081925049449E+00
>> >> (PID.TID 0000.0001) %MON dynstat_uvel_min = -2.3945333213539E+00
>> >> (PID.TID 0000.0001) %MON dynstat_uvel_mean = -5.9096408449438E-05
>> >> (PID.TID 0000.0001) %MON dynstat_uvel_sd = 3.4343631952692E-03
>> >> (PID.TID 0000.0001) %MON dynstat_uvel_del2 = 4.7379871670678E-03
>> >> (PID.TID 0000.0001) %MON dynstat_vvel_max = 2.5933981864330E+00
>> >> (PID.TID 0000.0001) %MON dynstat_vvel_min = -3.2231876263118E+00
>> >> (PID.TID 0000.0001) %MON dynstat_vvel_mean = 1.4348687349928E-06
>> >> (PID.TID 0000.0001) %MON dynstat_vvel_sd = 1.2624775236425E-03
>> >> (PID.TID 0000.0001) %MON dynstat_vvel_del2 = 3.8178362135143E-03
>> >> (PID.TID 0000.0001) %MON dynstat_wvel_max = 2.9846705628259E-01
>> >> (PID.TID 0000.0001) %MON dynstat_wvel_min = -2.3853942232519E-01
>> >> (PID.TID 0000.0001) %MON dynstat_wvel_mean = -2.1033468915427E-09
>> >> (PID.TID 0000.0001) %MON dynstat_wvel_sd = 1.9417325836487E-04
>> >> (PID.TID 0000.0001) %MON dynstat_wvel_del2 = 9.3761000810601E-04
>> >> (PID.TID 0000.0001) %MON dynstat_theta_max = 3.7963165120273E+01
>> >> (PID.TID 0000.0001) %MON dynstat_theta_min = -2.1436513114643E+01
>> >> (PID.TID 0000.0001) %MON dynstat_theta_mean = 5.9453514759821E+00
>> >> (PID.TID 0000.0001) %MON dynstat_theta_sd = 6.2570503287055E+00
>> >> (PID.TID 0000.0001) %MON dynstat_theta_del2 = 2.5262750966653E-02
>> >> (PID.TID 0000.0001) %MON dynstat_salt_max = 3.8107582424356E+01
>> >> (PID.TID 0000.0001) %MON dynstat_salt_min = 0.0000000000000E+00
>> >> (PID.TID 0000.0001) %MON dynstat_salt_mean = 3.4534555031168E+01
>> >> (PID.TID 0000.0001) %MON dynstat_salt_sd = 1.2887956836604E-01
>> >> (PID.TID 0000.0001) %MON dynstat_salt_del2 = 2.5964953005350E-03
>> >> (PID.TID 0000.0001) %MON advcfl_uvel_max = 1.4540962524725E-01
>> >> (PID.TID 0000.0001) %MON advcfl_vvel_max = 4.0289845328898E-02
>> >> (PID.TID 0000.0001) %MON advcfl_wvel_max = 2.1666787542532E-01
>> >> (PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01
>> >> (PID.TID 0000.0001) %MON pe_b_mean = 1.2621239225557E-08
>> >> (PID.TID 0000.0001) %MON ke_max = 4.6011781196852E+00
>> >> (PID.TID 0000.0001) %MON ke_mean = 4.6615962365024E-06
>> >> (PID.TID 0000.0001) %MON ke_vol = 9.9365095001242E+17
>> >> (PID.TID 0000.0001) %MON vort_r_min = -7.2511662603239E-03
>> >> (PID.TID 0000.0001) %MON vort_r_max = 6.4558430600778E-03
>> >> (PID.TID 0000.0001) %MON vort_a_mean = 5.0662999065604E-05
>> >> (PID.TID 0000.0001) %MON vort_a_sd = 7.4966761920324E-07
>> >> (PID.TID 0000.0001) %MON vort_p_mean = 7.3612614840296E-05
>> >> (PID.TID 0000.0001) %MON vort_p_sd = 5.9336048842539E-05
>> >> (PID.TID 0000.0001) %MON surfExpan_theta_mean = -9.9981862111201E-08
>> >> (PID.TID 0000.0001) %MON surfExpan_salt_mean = -1.1922103976760E-07
>> >> (PID.TID 0000.0001) // =======================================================
>> >> (PID.TID 0000.0001) // End MONITOR dynamic field statistics
>> >> (PID.TID 0000.0001) // =======================================================
>> >> cg2d: Sum(rhs),rhsMax = -4.85042010003101E+00 3.71869645807112E-02
>> >> (PID.TID 0000.0001) cg2d_init_res = 4.74339332785140E-02
>> >> (PID.TID 0000.0001) cg2d_iters = 126
>> >> (PID.TID 0000.0001) cg2d_res = 9.36545477468715E-14
>> >> cg3d: Sum(rhs),rhsMax = 3.24606069488422E-10 1.20245356700100E-09
>> >> (PID.TID 0000.0001) cg3d_init_res = 1.02771503214190E-01
>> >> (PID.TID 0000.0001) cg3d_iters = 20
>> >> (PID.TID 0000.0001) cg3d_res = 3.71423006414093E-03
>> >> (PID.TID 0000.0001) // =======================================================
>> >> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
>> >> (PID.TID 0000.0001) // =======================================================
>> >> (PID.TID 0000.0001) %MON time_tsnumber = 317
>> >> (PID.TID 0000.0001) %MON time_secondsf = 3.9625000000000E+03
>> >> (PID.TID 0000.0001) %MON dynstat_eta_max = 4.2644526849910E-01
>> >>
>> >> And I cannot see anything abnormal. (I increased cg3dMaxIters to a
>> >> much larger number this afternoon, and the model then stopped after
>> >> even fewer time steps.)
>> >> In sum, according to the STDOUT files, the model did not blow up, but
>> >> the machine did give an error message, i.e.,
>> >> [node7:03388] Signal: Floating point exception (8)
>> >> [node7:03388] Signal code: Floating point divide-by-zero (3)
>> >> in file "my_job.o1562"
>> >> This problem never happened when I set the Coriolis parameter f0=0 and
>> >> the initial V velocity to zero. However, when I change either of these
>> >> two parameters, the problem appears. I'm really confused. Do you
>> >> think there is a bug in the model, or something else?
>> >> Best Wishes,
>> >> Dwight
>> >>
>> >>
>> >------------------------------
>> >
>> >Message: 2
>> >Date: Thu, 03 Jun 2010 11:46:45 -0700
>> >From: Yuan Lian <lian at ashimaresearch.com>
>> >To: mitgcm-support at mitgcm.org
>> >Subject: Re: [MITgcm-support] A very strange problem
>> >
>> >The number doesn't seem to be very large ...
>> >
>> >Maybe there is a truncation error that causes the divide by zero?
>> >Have you checked whether the initial data files were generated and read
>> >by MITgcm correctly, i.e., that all initial data files use double precision?
>> >
>> >Yuan
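
(If it helps, one quick way to check that point is to read the raw input files directly. The sketch below is plain Python with numpy, assuming the inputs are the usual flat big-endian binaries; the file name and grid dimensions are only placeholders.)

    import numpy as np

    # Placeholder name and shape; adjust to the actual input file and grid.
    nx, ny, nz = 200, 3000, 100
    fname = "V_init.bin"

    data = np.fromfile(fname, dtype=">f8")     # 64-bit big-endian reals
    expected = nx * ny * nz
    print("values read:", data.size, "expected:", expected)
    if data.size != expected:
        print("size mismatch: file may be single precision or a different shape")
    if data.size:
        print("NaNs:", int(np.isnan(data).sum()),
              " min/max:", data.min(), data.max())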
>> >
>> >
>> >
>> >On 6/3/10 11:28 AM, m. r. schaferkotter wrote:
>> >> Try reducing the time step, as
>> >>
>> >> (PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01
>> >>
>> >> could indicate that a smaller time step is needed.
>> >>
>> >>
>> >> On Jun 3, 2010, at 4:08 AM, ouc.edu.cn wrote:
>> >>
>> >>> Hi Jean-Michel,
>> >>> Thank you for your reply.
>> >>> From the 'STDOUT.00xx' files I'm sure the model didn't blow up, because
>> >>> everything looks normal, and all these 32 files are much the same
>> >>> and contain almost the same information. Below is the information from
>> >>> the last two iterations:
>> >>> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
>> >>> (PID.TID 0000.0001) //
>> >>> =======================================================
>> >>> (PID.TID 0000.0001) %MON time_tsnumber = 316
>> >>> (PID.TID 0000.0001) %MON time_secondsf = 3.9500000000000E+03
>> >>> (PID.TID 0000.0001) %MON dynstat_eta_max = 4.2351103155441E-01
>> >>> (PID.TID 0000.0001) %MON dynstat_eta_min = -2.9095841140222E-01
>> >>> (PID.TID 0000.0001) %MON dynstat_eta_mean = 2.2266055703883E-05
>> >>> (PID.TID 0000.0001) %MON dynstat_eta_sd = 2.4506462290779E-03
>> >>> (PID.TID 0000.0001) %MON dynstat_eta_del2 = 1.3655537149458E-03
>> >>> (PID.TID 0000.0001) %MON dynstat_uvel_max = 2.9081925049449E+00
>> >>> (PID.TID 0000.0001) %MON dynstat_uvel_min = -2.3945333213539E+00
>> >>> (PID.TID 0000.0001) %MON dynstat_uvel_mean = -5.9096408449438E-05
>> >>> (PID.TID 0000.0001) %MON dynstat_uvel_sd = 3.4343631952692E-03
>> >>> (PID.TID 0000.0001) %MON dynstat_uvel_del2 = 4.7379871670678E-03
>> >>> (PID.TID 0000.0001) %MON dynstat_vvel_max = 2.5933981864330E+00
>> >>> (PID.TID 0000.0001) %MON dynstat_vvel_min = -3.2231876263118E+00
>> >>> (PID.TID 0000.0001) %MON dynstat_vvel_mean = 1.4348687349928E-06
>> >>> (PID.TID 0000.0001) %MON dynstat_vvel_sd = 1.2624775236425E-03
>> >>> (PID.TID 0000.0001) %MON dynstat_vvel_del2 = 3.8178362135143E-03
>> >>> (PID.TID 0000.0001) %MON dynstat_wvel_max = 2.9846705628259E-01
>> >>> (PID.TID 0000.0001) %MON dynstat_wvel_min = -2.3853942232519E-01
>> >>> (PID.TID 0000.0001) %MON dynstat_wvel_mean = -2.1033468915427E-09
>> >>> (PID.TID 0000.0001) %MON dynstat_wvel_sd = 1.9417325836487E-04
>> >>> (PID.TID 0000.0001) %MON dynstat_wvel_del2 = 9.3761000810601E-04
>> >>> (PID.TID 0000.0001) %MON dynstat_theta_max = 3.7963165120273E+01
>> >>> (PID.TID 0000.0001) %MON dynstat_theta_min = -2.1436513114643E+01
>> >>> (PID.TID 0000.0001) %MON dynstat_theta_mean = 5.9453514759821E+00
>> >>> (PID.TID 0000.0001) %MON dynstat_theta_sd = 6.2570503287055E+00
>> >>> (PID.TID 0000.0001) %MON dynstat_theta_del2 = 2.5262750966653E-02
>> >>> (PID.TID 0000.0001) %MON dynstat_salt_max = 3.8107582424356E+01
>> >>> (PID.TID 0000.0001) %MON dynstat_salt_min = 0.0000000000000E+00
>> >>> (PID.TID 0000.0001) %MON dynstat_salt_mean = 3.4534555031168E+01
>> >>> (PID.TID 0000.0001) %MON dynstat_salt_sd = 1.2887956836604E-01
>> >>> (PID.TID 0000.0001) %MON dynstat_salt_del2 = 2.5964953005350E-03
>> >>> (PID.TID 0000.0001) %MON advcfl_uvel_max = 1.4540962524725E-01
>> >>> (PID.TID 0000.0001) %MON advcfl_vvel_max = 4.0289845328898E-02
>> >>> (PID.TID 0000.0001) %MON advcfl_wvel_max = 2.1666787542532E-01
>> >>> (PID.TID 0000.0001) %MON advcfl_W_hf_max = 2.2335820994742E-01
>> >>> (PID.TID 0000.0001) %MON pe_b_mean = 1.2621239225557E-08
>> >>> (PID.TID 0000.0001) %MON ke_max = 4.6011781196852E+00
>> >>> (PID.TID 0000.0001) %MON ke_mean = 4.6615962365024E-06
>> >>> (PID.TID 0000.0001) %MON ke_vol = 9.9365095001242E+17
>> >>> (PID.TID 0000.0001) %MON vort_r_min = -7.2511662603239E-03
>> >>> (PID.TID 0000.0001) %MON vort_r_max = 6.4558430600778E-03
>> >>> (PID.TID 0000.0001) %MON vort_a_mean = 5.0662999065604E-05
>> >>> (PID.TID 0000.0001) %MON vort_a_sd = 7.4966761920324E-07
>> >>> (PID.TID 0000.0001) %MON vort_p_mean = 7.3612614840296E-05
>> >>> (PID.TID 0000.0001) %MON vort_p_sd = 5.9336048842539E-05
>> >>> (PID.TID 0000.0001) %MON surfExpan_theta_mean = -9.9981862111201E-08
>> >>> (PID.TID 0000.0001) %MON surfExpan_salt_mean = -1.1922103976760E-07
>> >>> (PID.TID 0000.0001) //
>> >>> =======================================================
>> >>> (PID.TID 0000.0001) // End MONITOR dynamic field statistics
>> >>> (PID.TID 0000.0001) //
>> >>> =======================================================
>> >>> cg2d: Sum(rhs),rhsMax = -4.85042010003101E+00 3.71869645807112E-02
>> >>> (PID.TID 0000.0001) cg2d_init_res = 4.74339332785140E-02
>> >>> (PID.TID 0000.0001) cg2d_iters = 126
>> >>> (PID.TID 0000.0001) cg2d_res = 9.36545477468715E-14
>> >>> cg3d: Sum(rhs),rhsMax = 3.24606069488422E-10 1.20245356700100E-09
>> >>> (PID.TID 0000.0001) cg3d_init_res = 1.02771503214190E-01
>> >>> (PID.TID 0000.0001) cg3d_iters = 20
>> >>> (PID.TID 0000.0001) cg3d_res = 3.71423006414093E-03
>> >>> (PID.TID 0000.0001) //
>> >>> =======================================================
>> >>> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
>> >>> (PID.TID 0000.0001) //
>> >>> =======================================================
>> >>> (PID.TID 0000.0001) %MON time_tsnumber = 317
>> >>> (PID.TID 0000.0001) %MON time_secondsf = 3.9625000000000E+03
>> >>> (PID.TID 0000.0001) %MON dynstat_eta_max = 4.2644526849910E-01
>> >>> And I cannot see anything abnormal. (I increased cg3dMaxIters to a
>> >>> much larger number this afternoon, and the model then stopped after
>> >>> even fewer time steps.)
>> >>> In sum, according to the STDOUT files, the model did not blow up, but
>> >>> the machine did give an error message, i.e.,
>> >>> [node7:03388] Signal: Floating point exception (8)
>> >>> [node7:03388] Signal code: Floating point divide-by-zero (3)
>> >>> in file "my_job.o1562"
>> >>> This problem never happened when I set the Coriolis parameter f0=0 and
>> >>> the initial V velocity to zero. However, when I change either of these
>> >>> two parameters, the problem appears. I'm really confused. Do you
>> >>> think there is a bug in the model, or something else?
>> >>> Best Wishes,
>> >>> Dwight
>> >>>
>> >>>
>> >>
>> >>
>> >>
>> >
>>
>>
>>
>>
>>
>>
>