[MITgcm-devel] solve_tri/pentadiagonal and adjoint

Martin Losch Martin.Losch at awi.de
Thu Oct 13 10:54:29 EDT 2011


Done,

my performance rambling is totally unqualified, but I have talked to people who address the performance issues associated with so-called memory-bandwidth limitation (typical for codes with few operations per memory access, such as the MITgcm) by making the k-loops the innermost loops. The effect is supposedly that you touch main memory only once per (i,j)-point: since k-loops are typically short, the local cache is big enough to hold the data for an entire k-loop, so you can evaluate, say, horizontal differences without going back to main memory for every (i,j). Does that make sense? I am not sure whether it is also useful for the vertical solvers, though.
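
Roughly what I mean, as a generic sketch (made-up subroutine, array names, and sizes, not actual MITgcm code; the stencil is just a placeholder):

      subroutine loop_order_sketch( phi, dfx )
c     contrast the two loop orders discussed above on a toy
c     horizontal-difference stencil
      implicit none
      integer snx, sny, nr
      parameter ( snx = 30, sny = 30, nr = 15 )
      real*8 phi(snx,sny,nr), dfx(snx,sny,nr)
      integer i, j, k
c     1) k-loop outside: long stride-1 sweeps over i per level,
c        the form that vectorizes well
      do k = 1, nr
       do j = 1, sny
        do i = 2, snx
         dfx(i,j,k) = phi(i,j,k) - phi(i-1,j,k)
        enddo
       enddo
      enddo
c     2) k-loop inside: the short vertical column at each (i,j)
c        can stay in the local cache, so (the claim goes) main
c        memory is touched only once per (i,j)-point
      do j = 1, sny
       do i = 2, snx
        do k = 1, nr
         dfx(i,j,k) = phi(i,j,k) - phi(i-1,j,k)
        enddo
       enddo
      enddo
      return
      end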

Martin


On Oct 13, 2011, at 4:29 PM, Jean-Michel Campin wrote:

> Hi Martin,
> 
> I propose that you check this one in. And later on, if it becomes clear
> that one of the 3 versions is of no use, we will clean it up
> (though I have to admit that I was not convinced by the performance
> gain you mentioned relative to the k-loop-inside version).
> (And maybe in the meantime Gael will have read his email?)
> 
> Cheers,
> Jean-Michel
> 
> On Thu, Oct 13, 2011 at 04:17:05PM +0200, Martin Losch wrote:
>> OK, now I have working code for solve_pentadiagonal and solve_tridiagonal that contains:
>> 
>> a) the original version (used if no extra flags are specified; identical to the original except for one reorganized step, which avoids too many lines of differing code)
>> b) the k-loop-inside version (requires defining ALLOW_SOLVER_KLOOPINSIDE somewhere outside these routines; see the snippet after this list. It means that the ECCO setups that already use this option should add the flag to their CPP_OPTIONS.h. I understand that that can be annoying.)
>> c) my new version with up to 5 more 3-D local arrays (only turned on with the combination ALLOW_AUTODIFF && TARGET_NEC_SX)
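>> 
>> For (b), the snippet mentioned above would simply be something like this (illustrative only, assuming the flag name as spelled here):
>> 
>> C     in the experiment's CPP_OPTIONS.h:
>> #define ALLOW_SOLVER_KLOOPINSIDE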
>> 
>> Does that sound good to you?
>> 
>> Martin
>> 
>> 
>> 
>> On Oct 13, 2011, at 3:18 PM, Martin Losch wrote:
>> 
>>> Hi Jean-Michel,
>>> 
>>> yes, that was my plan. For solve_tridiagonal it is just one extra 3-D array (similar to impldiff.F), but for solve_pentadiagonal it is five. I'll try to fix this before checking in.
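>>> 
>>> For the record, the tridiagonal case is essentially the standard Thomas algorithm with the elimination coefficients promoted to a 3-D field, so that the k-loops stay outside and the i/j-loops vectorize. A minimal sketch (made-up names and dimensions, no pivot checks, not the actual routine):
>>> 
>>>       subroutine solve_tri_sketch( a, b, c, y )
>>> c     a: sub-, b: main, c: super-diagonal; y: rhs on input,
>>> c     solution on output
>>>       implicit none
>>>       integer snx, sny, nr
>>>       parameter ( snx = 30, sny = 30, nr = 15 )
>>>       real*8 a(snx,sny,nr), b(snx,sny,nr), c(snx,sny,nr)
>>>       real*8 y(snx,sny,nr)
>>> c     the one extra 3-D local: stored elimination coefficients,
>>> c     kept around so the back substitution (and its adjoint)
>>> c     can reuse them without recomputation
>>>       real*8 gam(snx,sny,nr)
>>>       real*8 bet(snx,sny)
>>>       integer i, j, k
>>> c     forward elimination: k outside, i/j inside (vectorizable)
>>>       do j = 1, sny
>>>        do i = 1, snx
>>>         bet(i,j) = b(i,j,1)
>>>         y(i,j,1) = y(i,j,1) / bet(i,j)
>>>        enddo
>>>       enddo
>>>       do k = 2, nr
>>>        do j = 1, sny
>>>         do i = 1, snx
>>>          gam(i,j,k) = c(i,j,k-1) / bet(i,j)
>>>          bet(i,j)   = b(i,j,k) - a(i,j,k)*gam(i,j,k)
>>>          y(i,j,k)   = ( y(i,j,k) - a(i,j,k)*y(i,j,k-1) ) / bet(i,j)
>>>         enddo
>>>        enddo
>>>       enddo
>>> c     back substitution
>>>       do k = nr-1, 1, -1
>>>        do j = 1, sny
>>>         do i = 1, snx
>>>          y(i,j,k) = y(i,j,k) - gam(i,j,k+1)*y(i,j,k+1)
>>>         enddo
>>>        enddo
>>>       enddo
>>>       return
>>>       end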
>>> 
>>> M.
>>> 
>>> On Oct 13, 2011, at 3:06 PM, Jean-Michel Campin wrote:
>>> 
>>>> Hi Martin,
>>>> 
>>>> to recap:
>>>> a) original version
>>>> b) k-loop inside
>>>> c) your new version (with 5 more 3-D local arrays)
>>>> 
>>>> so (c) will replace (a)?
>>>> And what if (in the forward model) I would like to reduce memory
>>>> usage but keep vectorisation?
>>>> 
>>>> Cheers,
>>>> Jean-Michel
>>>> 
>>>> On Thu, Oct 13, 2011 at 10:38:27AM +0200, Martin Losch wrote:
>>>>> Hi there,
>>>>> 
>>>>> I am going to check in modified solve_tri/pentadiagonal routines, where I remove the hard-wiring of the CPP flag ALLOW_SOLVER_KLOOPINSIDE to #ifdef ALLOW_AUTODIFF.
>>>>> Instead, I found a way to make the adjoint work with the original code that leads to vectorizable adjoint code (it does require more memory than Gael's solution: 5 additional local 3-D fields).
>>>>> 
>>>>> I find the flag ALLOW_SOLVER_KLOOPINSIDE very useful; we probably need to think about a flag like this to move more k-loops "inside" everywhere in the code, in order to do more computations out of the local cache on these multi-processor (memory-bandwidth-limited) chips. But I also think that such a flag should be set outside the individual routines and not hardwired to other flags, as sketched below.
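>>>>>
>>>>> What goes away inside the routines is, illustratively (not the actual diff):
>>>>>
>>>>> #ifdef ALLOW_AUTODIFF
>>>>> # define ALLOW_SOLVER_KLOOPINSIDE
>>>>> #endif
>>>>>
>>>>> and instead a setup opts in with a single line in its CPP_OPTIONS.h:
>>>>>
>>>>> #define ALLOW_SOLVER_KLOOPINSIDE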
>>>>> 
>>>>> Any objections?
>>>>> 
>>>>> Martin
>>>>> PS. There are no tests in verification that exercise this code, unfortunately.
>>>>> 
>>>>> On Oct 10, 2011, at 3:09 PM, Martin Losch wrote:
>>>>> 
>>>>>> Hi Gael (and others),
>>>>>> 
>>>>>> I know I am a little exotic in that I always want vectorizable (adjoint) code. Here's another sequel to this story:
>>>>>> 
>>>>>> in solve_tridiagonal and solve_pentadiagonal, you introduced code that moves the k-loop inside the i/j-loops in the case of ALLOW_AUTODIFF_TAMC. Your check-in comment (Aug 2010) was:
>>>>>> Adjoint related modifications -- allowing the use of implicit vertical advection in adjoint model.
>>>>>> Can you remember why this is necessary? The recomputations seem to be OK, but my experiment blows up. Where does it go wrong, and is there maybe an alternative way to fix it?
>>>>>> 
>>>>>> Martin
>>>>>> 
>>>>>> PS. You can imagine that moving the k-loop inside the i/j-loops is terrible for performance on a vector computer, but it also makes the adjoint code even worse from a vectorization point of view, so that effectively I cannot use implVertAdv in adjoint mode if it only works like this.
>>>>>> 
>>>> 
>>>
>>
> 