LAMA
Use of the LAMA Interface

The Idea of the LAMA Interface

The following code shows an example of a subroutine that inverts the elements of a LAMA array.

template<typename T>
void invert( LAMAArray<T>& values )
{
    HostWriteAccess<T> wValues( values );
    T one = static_cast<T>( 1.0 );
#pragma omp parallel for schedule(LAMA_OMP_SCHEDULE)
    for ( IndexType i = 0; i < wValues.size(); ++i )
    {
        wValues[ i ] = one / wValues[ i ];
    }
}

This subroutine is always executed on the Host, so the data must always be transferred there. As a first step, the algorithm itself should be separated from the data transfers.

template<typename T>
void OpenMPUtils::invert( T array[], const IndexType n )
{
    T one = static_cast<T> ( 1.0 );
#pragma omp parallel for schedule( LAMA_OMP_SCHEDULE )
    for ( IndexType i = 0; i < n; ++i )
    {
        array[ i ] = one / array[ i ];
    }
}

template<typename T>
void invert( LAMAArray<T>& values )
{
    HostWriteAccess<T> wValues( values );
    OpenMPUtils::invert( wValues.get(), wValues.size() );
}

When using other devices like the GPU via CUDA, a corresponding CUDA routine might be called.

template<typename T>
void invertCUDA( LAMAArray<T>& values )
{
    ContextPtr loc = ContextFactory::getContext( Context::CUDA );
    WriteAccess<T> wValues( values, loc );
    CUDAUtils::invert( wValues.get(), wValues.size() );
}

As this approach would result either in many different subroutines or in many case statements, in both cases depending on the number of supported devices, LAMA provides an interface for each supported device.

template<typename T>
void invert( LAMAArray<T>& values )
{
    ContextPtr loc = values.getValidContext();
    WriteAccess<T> wValues( values, loc );
    loc->getInterface().Utils.Math.invert<T>()( wValues.get(), wValues.size() );
}

This code version will run on each device and does not have to be changed when future versions support additional devices.
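
For example, the generic routine can be called in the same way no matter where the array data currently resides. The following usage sketch assumes a LAMAArray constructor that takes a size and an initial value:

    LAMAArray<double> values( 1024, 2.0 );   // 1024 entries, all set to 2.0

    invert( values );   // executed on the context that currently holds valid data of values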

The LAMA Interface

For each type of context there is a different object of the class LAMAInterface.

    const LAMAInterface& interface = loc->getInterface();
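
For example, the Host context and the CUDA context each return their own interface object (a short sketch; it assumes that both contexts are available in the installation):

    ContextPtr host = ContextFactory::getContext( Context::Host );
    ContextPtr cuda = ContextFactory::getContext( Context::CUDA );

    const LAMAInterface& hostInterface = host->getInterface();
    const LAMAInterface& cudaInterface = cuda->getInterface();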

The LAMAInterface itself is structured into several parts, each grouping a number of methods that implement related functionality. A method of a group is selected by its name and by the value type of the data it operates on; the value type is specified as a template argument.

    Utils.Math<T>.invert invert = interface.Utils.Math.invert<T>();

The variable invert is a function pointer. The type of this function is Utils.Math<ValueType>.invert. The inner structure Math is required because C++ does not directly support templated typedefs. The getter routine that returns the actual function pointer has exactly the same name.

As this use is rather complicated, we provide a macro.

    LAMA_INTERFACE_FN_T( invert, loc, Utils, Math, T );
    invert( wValues.get(), wValues.size() );

Using the Default Interface

As a consequence, supporting a new device or context in LAMA would imply that every method provided by the interface has to be implemented for this device. Due to the rather large number of routines, this would be a very tedious task. Therefore a fallback is provided that uses default routines as long as a routine has not been implemented yet.

If an interface does not support a certain routine, the getter routine for the method returns a NULL pointer. The macro LAMA_INTERFACE_FN_T verifies that the method exists on the device and throws an exception otherwise. Instead of throwing an exception, however, it is also possible to execute the routine on another device that supports it; every routine is provided at least on the Host.
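
The idea behind this fallback can be sketched as follows (a minimal sketch of what the default lookup does, not the actual macro expansion; it reuses the pseudo type notation from above):

    // query the function pointer at the preferred location
    Utils.Math<T>.invert invert = loc->getInterface().Utils.Math.invert<T>();

    if ( invert == NULL )
    {
        // fall back to the Host context, which provides every routine
        loc    = ContextFactory::getContext( Context::Host );
        invert = loc->getInterface().Utils.Math.invert<T>();
    }

Using the corresponding macro, this reads as follows: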

    LAMA_INTERFACE_FN_DEFAULT_T( invert, loc, Utils, Math, T );
    invert( wValues.get(), wValues.size() );

This macro does not throw an exception but instead searches for another location where the routine can be executed. Its use has to take into account that the variable loc may be modified in this case. The final implementation of the invert routine therefore looks as follows:

template<typename T>
void invert( LAMAArray<T>& values )
{
    ContextPtr loc = values.getValidContext();
    LAMA_INTERFACE_FN_DEFAULT_T( invert, loc, Utils, Math, T );
    WriteAccess<T> wValues( values, loc );
    LAMA_CONTEXT_ACCESS( loc );
    invert( wValues.get(), wValues.size() );
}

If a subroutine uses multiple interface routines, it might happen that the first function is available at the preferred location but the second one is not. In this case, the first function pointer must be queried again if a different location has been chosen for the second one.
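
One possible handling is sketched below (a hedged sketch that reuses the pseudo type notation from above; scale is only a placeholder name for a second routine): both pointers are queried at the preferred location, and if one of them is missing, both are queried again for a location that provides them, e.g. the Host.

    ContextPtr loc = values.getValidContext();

    // query both function pointers at the preferred location
    Utils.Math<T>.invert invert = loc->getInterface().Utils.Math.invert<T>();
    Utils.Math<T>.scale  scale  = loc->getInterface().Utils.Math.scale<T>();   // scale: placeholder

    if ( invert == NULL || scale == NULL )
    {
        // fall back to a location that provides both routines
        loc    = ContextFactory::getContext( Context::Host );
        invert = loc->getInterface().Utils.Math.invert<T>();
        scale  = loc->getInterface().Utils.Math.scale<T>();
    }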

Todo:
Handling for subroutines that use multiple interface routines

Function Groups of the Interface

OpenMP Implementations

For the Host interface, LAMA provides implementations of all routines, parallelized with OpenMP.

CUDA Implementations

For the CUDA interface, LAMA provides implementations for the most important routines:

  • BLAS1
  • BLAS2
  • BLAS3
  • Utils
  • ELLUtils
  • JDSUtils: only some conversions and matrix-vector multiplication

Asynchronous Execution

Other devices might provide more convenient solutions for asynchronous execution than using a separate thread. This is especially true for CUDA, where kernels can be executed asynchronously in streams.
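
For illustration, the following plain CUDA sketch (independent of LAMA's own asynchronous execution support; the kernel and helper names are made up for this example) launches a kernel asynchronously in a stream and synchronizes with it later:

// illustrative kernel: elementwise inversion of a device array
__global__ void invertKernel( double* array, int n )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if ( i < n )
    {
        array[ i ] = 1.0 / array[ i ];
    }
}

void invertAsync( double* deviceArray, int n )
{
    cudaStream_t stream;
    cudaStreamCreate( &stream );

    int blockSize = 256;
    int numBlocks = ( n + blockSize - 1 ) / blockSize;

    // the launch returns immediately; the kernel runs asynchronously in the stream
    invertKernel<<<numBlocks, blockSize, 0, stream>>>( deviceArray, n );

    // ... other host work can overlap with the kernel execution here ...

    cudaStreamSynchronize( stream );   // wait until the kernel has finished
    cudaStreamDestroy( stream );
}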

Todo:
Describe asynchronous execution