[MITgcm-support] A very strange problem (Yuan Lian)

ouc.edu.cn ouc.edu.cn at 163.com
Sat Jun 5 22:39:38 EDT 2010


Hi again,
  Thank you very much for all your suggestions, but this strange problem is still unresolved.
  It does not look like a problem of too large a time step: when I reduce it so that advcfl_W_hf_max drops well below 0.1, the model still terminates without any sign of blowing up, and with the same error:
[node4:16165] *** Process received signal ***
[node4:16165] Signal: Floating point exception (8)
[node4:16165] Signal code: Floating point divide-by-zero (3)
[node4:16165] Failing at address: 0x4b8e76
[node4:16165] [ 0] /lib64/libpthread.so.0 [0x2b41bcf60c00]
[node4:16165] [ 1] ./mitgcmuv_32p_gcc [0x4b8e76]
[node4:16165] *** End of error message ***
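
On the time-step point: for a fixed velocity field and grid, the advective CFL number scales linearly with the time step, so the step needed to reach a target CFL can be estimated from the monitor output. A minimal sketch, assuming the 12.5 s step implied by time_secondsf in the quoted output (3950.0 to 3962.5); the function name is hypothetical and not part of MITgcm:

```python
def max_timestep_for_cfl(dt_current, cfl_current, cfl_target):
    """Estimate the largest time step that keeps the CFL number at or below
    cfl_target, assuming CFL is proportional to the time step."""
    if dt_current <= 0 or cfl_current <= 0 or cfl_target <= 0:
        raise ValueError("all arguments must be positive")
    return dt_current * cfl_target / cfl_current


# Values taken from the quoted monitor output.
dt = 3962.5 - 3950.0              # time step in seconds (12.5 s)
cfl = 2.2335820994742e-01         # advcfl_W_hf_max at time step 316
dt_new = max_timestep_for_cfl(dt, cfl, cfl_target=0.1)
print(f"time step for CFL <= 0.1: about {dt_new:.1f} s")
```

This gives roughly 5.6 s. It is only a first estimate, because the velocities themselves change as the run proceeds.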
 
Again, I cannot see any sign of blowing up in the 'STDOUT' files. Yuan's suggestion does not seem to be the cause either: all the initial files are correctly initialized, and double precision is used.
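
Checking 32 STDOUT files by eye is error-prone, so the check can be scripted. The following is a rough sketch, not an MITgcm tool: it collects every `%MON` value from monitor text and reports entries that are non-finite or exceed a user-chosen bound. The function names and the limit used in the example are assumptions for illustration only.

```python
import math
import re

# Matches lines such as "(PID.TID 0000.0001) %MON ke_max = 4.60E+00".
MON_RE = re.compile(r"%MON\s+(\S+)\s*=\s*(\S+)")


def parse_monitor(text):
    """Return {name: [values, ...]} for every %MON line in the text."""
    series = {}
    for match in MON_RE.finditer(text):
        name, raw = match.groups()
        try:
            value = float(raw)       # also accepts 'NaN' and 'Infinity'
        except ValueError:
            value = math.nan         # e.g. '*****' from a Fortran overflow
        series.setdefault(name, []).append(value)
    return series


def suspicious(series, limits):
    """List (name, index, value) where a value is non-finite or its
    magnitude exceeds the limit given for that name."""
    hits = []
    for name, values in series.items():
        limit = limits.get(name)
        for i, v in enumerate(values):
            if not math.isfinite(v) or (limit is not None and abs(v) > limit):
                hits.append((name, i, v))
    return hits


sample = """
(PID.TID 0000.0001) %MON time_tsnumber                 =                   316
(PID.TID 0000.0001) %MON dynstat_uvel_max             =    2.9081925049449E+00
(PID.TID 0000.0001) %MON advcfl_W_hf_max              =    2.2335820994742E-01
"""
series = parse_monitor(sample)
print(suspicious(series, {"advcfl_W_hf_max": 0.1}))
```

To apply it to a run, read each file (for example with `glob.glob("STDOUT.*")`) and pass its contents to `parse_monitor`.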
I suspect this may be a problem with the cluster itself. Two months ago my model ran correctly with the very same configuration, but now this strange problem appears. I know little about cluster administration, so any advice is appreciated.
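
One way to narrow this down before blaming the cluster is to map the failing address in the report above to a source location with the binutils `addr2line` tool. The sketch below only builds the command lines from the signal report; it does not run them. It assumes the executable was compiled with debugging symbols (for example `-g`), and the helper name is hypothetical.

```python
import re

# Matches backtrace frames such as "[ 1] ./mitgcmuv_32p_gcc [0x4b8e76]".
FRAME_RE = re.compile(r"\[\s*\d+\]\s+(\S+)\s+\[(0x[0-9a-fA-F]+)\]")


def addr2line_commands(report, skip_prefixes=("/lib", "/usr/lib")):
    """Build one addr2line command per backtrace frame, skipping frames
    that belong to system libraries."""
    commands = []
    for line in report.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue
        binary, address = match.groups()
        if binary.startswith(skip_prefixes):
            continue
        # -f prints the function name, -e names the executable to inspect.
        commands.append(["addr2line", "-f", "-e", binary, address])
    return commands


report = """\
[node4:16165] Signal: Floating point exception (8)
[node4:16165] Failing at address: 0x4b8e76
[node4:16165] [ 0] /lib64/libpthread.so.0 [0x2b41bcf60c00]
[node4:16165] [ 1] ./mitgcmuv_32p_gcc [0x4b8e76]
"""
for command in addr2line_commands(report):
    print(" ".join(command))
```

Running the printed command in the run directory should name the routine and line where the division occurred, if symbols are available.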
Cheers,
Dwight




On 2010-06-04 02:46:56, mitgcm-support-request at mitgcm.org wrote:
>
>Today's Topics:
>
>   1. Re: A very strange problem (m. r. schaferkotter)
>   2. Re: A very strange problem (Yuan Lian)
>
>
>----------------------------------------------------------------------
>
>Message: 1
>Date: Thu, 3 Jun 2010 13:28:03 -0500
>From: "m. r. schaferkotter" <schaferk at bellsouth.net>
>To: mitgcm-support at mitgcm.org
>Subject: Re: [MITgcm-support] A very strange problem
>Message-ID: <0707528E-9AF5-4435-B0D4-AC0C1471343D at bellsouth.net>
>Content-Type: text/plain; charset="gb2312"; Format="flowed";
>	DelSp="yes"
>
>Try reducing the time step, as
>
>(PID.TID 0000.0001) %MON advcfl_W_hf_max              =    2.2335820994742E-01
>
>could indicate that a smaller time step is needed.
>
>
>On Jun 3, 2010, at 4:08 AM, ouc.edu.cn wrote:
>
>> Hi Jean-Michel,
>>   Thank you for your reply.
>>   From the 'STDOUT.00xx' files I'm sure the model didn't blow up,
>> because everything looks normal, and all 32 files are much
>> the same and contain almost the same information. Below is the
>> output of the last two iterations:
>> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
>> (PID.TID 0000.0001) // =======================================================
>> (PID.TID 0000.0001) %MON time_tsnumber                =                   316
>> (PID.TID 0000.0001) %MON time_secondsf                =    3.9500000000000E+03
>> (PID.TID 0000.0001) %MON dynstat_eta_max              =    4.2351103155441E-01
>> (PID.TID 0000.0001) %MON dynstat_eta_min              =   -2.9095841140222E-01
>> (PID.TID 0000.0001) %MON dynstat_eta_mean             =    2.2266055703883E-05
>> (PID.TID 0000.0001) %MON dynstat_eta_sd               =    2.4506462290779E-03
>> (PID.TID 0000.0001) %MON dynstat_eta_del2             =    1.3655537149458E-03
>> (PID.TID 0000.0001) %MON dynstat_uvel_max             =    2.9081925049449E+00
>> (PID.TID 0000.0001) %MON dynstat_uvel_min             =   -2.3945333213539E+00
>> (PID.TID 0000.0001) %MON dynstat_uvel_mean            =   -5.9096408449438E-05
>> (PID.TID 0000.0001) %MON dynstat_uvel_sd              =    3.4343631952692E-03
>> (PID.TID 0000.0001) %MON dynstat_uvel_del2            =    4.7379871670678E-03
>> (PID.TID 0000.0001) %MON dynstat_vvel_max             =    2.5933981864330E+00
>> (PID.TID 0000.0001) %MON dynstat_vvel_min             =   -3.2231876263118E+00
>> (PID.TID 0000.0001) %MON dynstat_vvel_mean            =    1.4348687349928E-06
>> (PID.TID 0000.0001) %MON dynstat_vvel_sd              =    1.2624775236425E-03
>> (PID.TID 0000.0001) %MON dynstat_vvel_del2            =    3.8178362135143E-03
>> (PID.TID 0000.0001) %MON dynstat_wvel_max             =    2.9846705628259E-01
>> (PID.TID 0000.0001) %MON dynstat_wvel_min             =   -2.3853942232519E-01
>> (PID.TID 0000.0001) %MON dynstat_wvel_mean            =   -2.1033468915427E-09
>> (PID.TID 0000.0001) %MON dynstat_wvel_sd              =    1.9417325836487E-04
>> (PID.TID 0000.0001) %MON dynstat_wvel_del2            =    9.3761000810601E-04
>> (PID.TID 0000.0001) %MON dynstat_theta_max            =    3.7963165120273E+01
>> (PID.TID 0000.0001) %MON dynstat_theta_min            =   -2.1436513114643E+01
>> (PID.TID 0000.0001) %MON dynstat_theta_mean           =    5.9453514759821E+00
>> (PID.TID 0000.0001) %MON dynstat_theta_sd             =    6.2570503287055E+00
>> (PID.TID 0000.0001) %MON dynstat_theta_del2           =    2.5262750966653E-02
>> (PID.TID 0000.0001) %MON dynstat_salt_max             =    3.8107582424356E+01
>> (PID.TID 0000.0001) %MON dynstat_salt_min             =    0.0000000000000E+00
>> (PID.TID 0000.0001) %MON dynstat_salt_mean            =    3.4534555031168E+01
>> (PID.TID 0000.0001) %MON dynstat_salt_sd              =    1.2887956836604E-01
>> (PID.TID 0000.0001) %MON dynstat_salt_del2            =    2.5964953005350E-03
>> (PID.TID 0000.0001) %MON advcfl_uvel_max              =    1.4540962524725E-01
>> (PID.TID 0000.0001) %MON advcfl_vvel_max              =    4.0289845328898E-02
>> (PID.TID 0000.0001) %MON advcfl_wvel_max              =    2.1666787542532E-01
>> (PID.TID 0000.0001) %MON advcfl_W_hf_max              =    2.2335820994742E-01
>> (PID.TID 0000.0001) %MON pe_b_mean                    =    1.2621239225557E-08
>> (PID.TID 0000.0001) %MON ke_max                       =    4.6011781196852E+00
>> (PID.TID 0000.0001) %MON ke_mean                      =    4.6615962365024E-06
>> (PID.TID 0000.0001) %MON ke_vol                       =    9.9365095001242E+17
>> (PID.TID 0000.0001) %MON vort_r_min                   =   -7.2511662603239E-03
>> (PID.TID 0000.0001) %MON vort_r_max                   =    6.4558430600778E-03
>> (PID.TID 0000.0001) %MON vort_a_mean                  =    5.0662999065604E-05
>> (PID.TID 0000.0001) %MON vort_a_sd                    =    7.4966761920324E-07
>> (PID.TID 0000.0001) %MON vort_p_mean                  =    7.3612614840296E-05
>> (PID.TID 0000.0001) %MON vort_p_sd                    =    5.9336048842539E-05
>> (PID.TID 0000.0001) %MON surfExpan_theta_mean         =   -9.9981862111201E-08
>> (PID.TID 0000.0001) %MON surfExpan_salt_mean          =   -1.1922103976760E-07
>> (PID.TID 0000.0001) // =======================================================
>> (PID.TID 0000.0001) // End MONITOR dynamic field statistics
>> (PID.TID 0000.0001) // =======================================================
>>  cg2d: Sum(rhs),rhsMax =  -4.85042010003101E+00  3.71869645807112E-02
>> (PID.TID 0000.0001)                    cg2d_init_res =   4.74339332785140E-02
>> (PID.TID 0000.0001)                       cg2d_iters =   126
>> (PID.TID 0000.0001)                         cg2d_res =   9.36545477468715E-14
>>  cg3d: Sum(rhs),rhsMax =   3.24606069488422E-10  1.20245356700100E-09
>> (PID.TID 0000.0001)                    cg3d_init_res =   1.02771503214190E-01
>> (PID.TID 0000.0001)                       cg3d_iters =    20
>> (PID.TID 0000.0001)                         cg3d_res =   3.71423006414093E-03
>> (PID.TID 0000.0001) // =======================================================
>> (PID.TID 0000.0001) // Begin MONITOR dynamic field statistics
>> (PID.TID 0000.0001) // =======================================================
>> (PID.TID 0000.0001) %MON time_tsnumber                =                   317
>> (PID.TID 0000.0001) %MON time_secondsf                =    3.9625000000000E+03
>> (PID.TID 0000.0001) %MON dynstat_eta_max              =    4.2644526849910E-01
>>
>> I cannot see anything abnormal. (I increased cg3dMaxIters to a
>> much larger number this afternoon, and the model stopped after
>> even fewer time steps.)
>> In sum, according to the STDOUT files, the model did not blow up, but
>> the machine did give an error message, i.e.,
>> [node7:03388] Signal: Floating point exception (8)
>> [node7:03388] Signal code: Floating point divide-by-zero (3)
>> in file "my_job.o1562".
>> This problem never happened when I set the Coriolis parameter f0=0 and
>> the initial V velocity to zero. However, when I change either of these
>> two parameters, the problem appears. I'm really confused. Do you
>> think there is a bug in the model or something?
>> Best Wishes,
>> Dwight
>>
>>
>> _______________________________________________
>> MITgcm-support mailing list
>> MITgcm-support at mitgcm.org
>> http://mitgcm.org/mailman/listinfo/mitgcm-support
>
>------------------------------
>
>Message: 2
>Date: Thu, 03 Jun 2010 11:46:45 -0700
>From: Yuan Lian <lian at ashimaresearch.com>
>To: mitgcm-support at mitgcm.org
>Subject: Re: [MITgcm-support] A very strange problem
>Message-ID: <4C07F895.3080000 at ashimaresearch.com>
>Content-Type: text/plain; charset="gb2312"
>
>The number doesn't seem to be very large ...
>
>Maybe there is a truncation error that makes the "divide by zero" occur?
>Have you checked whether the initial data files were generated and read by
>MITgcm correctly, i.e., that all initial data files use double precision?
>
>Yuan
>
>
>
>
>------------------------------
>
>
>End of MITgcm-support Digest, Vol 84, Issue 4
>*********************************************