libsmm: a library for small matrix multiplies.

In order to deal efficiently with small matrix multiplies,
often involving 'special' matrix dimensions such as 5,13,17,22, 
a dedicated matrix library can be generated that outperforms (or matches)
general purpose (optimized) blas libraries.

Generation requires extensive compilation and timing runs, and is machine specific, 
i.e. the library should be constructed on the architecture it is supposed to run.

Users can modify the values inside the file config.in to set which kind of library
they want to generate. Furthermore, they can modify (or add) the files inside
the config directory to set the compiler options used to build the
library. They can use the existing files as template.

There are several options for building the library. Run ./generate -h to see them.
Below you can find the detailed instructions for some examples.

====================================================================================================================
a) How to generate the library running several jobs in a cluster, where each
   node allows for both execution and compilation. 
   For this example we will use a CRAY system with GNU compiler and SLURM.
   Run "./generate -h" to see the meaning of the options.

   1) Run: ./generate -c config/cray.gnu -j 100 -t 16 -w slurm tiny1
      This command submits 100 jobs in batch. Wait until their completion.

   2) Run: ./generate -c config/cray.gnu tiny2
      This command collects all results produced in the tiny1 phase and it
      generates a file tiny_gen_optimal_dnn_cray.gnu.out

   3) As done in 1) and 2), run: ./generate -c config/cray.gnu -j 20 -t 16 -w slurm small1
      This command submits 20 jobs in batch. Wait until their completion.
      Then run: ./generate -c config/cray.gnu small2
      This command collects all results produced in the small1 phase and it
      generates a file small_gen_optimal_dnn_cray.gnu.out
      
   4) Run: ./generate -c config/cray.gnu -t 16 -w slurm lib
      This commman submit in batch a single job that compiles the library.
      At the end the library is produced inside the directory lib/
      (libsmm_dnn_cray.gnu.a). 

   5) It is highly recommended to run the final test to check the correctness of the library.
      Run: ./generate -c config/cray.gnu -j 20 -w slurm check1
      After the batch jobs completion, run: ./generate -c config/cray.gnu -j 20 check2
      Note that it is important to use the same number of jobs specified in
      check1 phase. Finally check test_smm_dnn_cray.gnu.out for performance and correctness.

   6) Intermediate files (but not some key output and the library itself)
      might be removed using ./generate clean


====================================================================================================================
b) How to generate the library running a single job interactively. 
   For this example we will use a Linux system with GNU compiler.
   Run "./generate -h" to see the meaning of the options.

   1) Run: ./generate -c config/linux.gnu -j 10 -t 16 -w none tiny1
      This command generates, compiles and executes the tiny kernels 
      in 10 groups. Please increase the number of groups (-j <#> option)
      if you get the error "Argument list too long".
   
   2) Run: ./generate -c config/linux.gnu tiny2
      This command collects all results produced in the tiny1 phase and it
      generates a file tiny_gen_optimal_dnn_linux.gnu.out

   3) Run: ./generate -c config/linux.gnu -j 0 -t 16 small1
      This command generates a file small_gen_optimal_dnn_linux.gnu.out

   4) Run: ./generate -c config/linux.gnu -j 0 -t 16 -w slurm lib
      This command produces the llibrary inside the directory lib/
      (libsmm_dnn_linux.gnu.a).

   5) It is highly recommended to run the final test to check the correctness
      of the library.
      Run: ./generate -c config/linux.gnu -j 0 -w slurm check1
      Finally check test_smm_dnn_linux.gnu.out for performance
      and correctness.

   6) Intermediate files (but not some key output and the library itself)
      might be removed using ./generate clean

====================================================================================================================
c) How to generate the library for the Intel Xeon Phi in batch mode.

   For this example we will use a cluster with SLURM, where each node has a
   Intel Xeon Phi card.
   Run "./generate -h" to see the meaning of the options.
   We use the config file mic.intel (inside the directory config).
   Check if all options are OK for your case, in particular:
    - the target_compile variable with the flag "-offload-attribute-target=mic". 
    - the target_compile_offload variable with the flag "-offload=mandatory".
    - Set the MIC_OMP_NUM_THREADS variable to the number of cores on the card.

   Note that the library is produced by offloading the kernels to the Xeon
   Phi. Performance output files are written in the same directory where the
   library is executed on the host, therefore this directory must be exported
   to the Xeon Phi with the right permission (read/write).
   
   1) Run: ./generate -c config/mic.intel -j 100 -t 16 -w slurm tiny1
      This command submits 100 jobs in batch. Each job offloads executions
      to the Intel Xeon Phi card (MIC_OMP_NUM_THREADS threads). Wait until
      completion of all jobs. 

   2) Run: ./generate -c config/mic.intel tiny2
      This command collects all results of the tiny1 phase and it generates
      the file tiny_gen_optimal_dnn_mic.intel.out.

   3) As done in 1) and 2), run: ./generate -c config/mic.intel -j 100 -t 16 -w slurm small1
      This command submits 100 jobs in batch, where each job offloads
      executions to the Intel Xeon Phi card (MIC_OMP_NUM_THREADS
      threads). Wait until their completion. Then run: ./generate -c config/mic.intel small2
      This command collects all results produced in the small1 phase and it
      generates a file small_gen_optimal_dnn_mic.intel.out

   4) Run: ./generate -c config/mic.intel -t 16 -w slurm lib
      This commman submit in batch a single job that compiles the library.
      At the end the library is produced inside the directory lib/
      (libsmm_dnn_mic.intel.a). 

   5) It is highly recommended to run the final test to check the correctness of the library.
      Run: ./generate -c config/mic.intel -j 200 -w slurm check1
      After the batch jobs completion, run: ./generate -c config/mic.intel -j 200 check2
      Note that it is important to use the same number of jobs specified in
      check1 phase. Finally check test_smm_dnn_mic.intel.out for performance and correctness.

   6) Intermediate files (but not some key output and the library itself)
      might be removed using ./generate clean


The following copyright covers code and generated library
!====================================================================================================================
! * Copyright (c) 2015 Joost VandeVondele and Alfio Lazzaro
! * All rights reserved.
! *
! * Redistribution and use in source and binary forms, with or without
! * modification, are permitted provided that the following conditions are met:
! *     * Redistributions of source code must retain the above copyright
! *       notice, this list of conditions and the following disclaimer.
! *     * Redistributions in binary form must reproduce the above copyright
! *       notice, this list of conditions and the following disclaimer in the
! *       documentation and/or other materials provided with the distribution.
! *
! * THIS SOFTWARE IS PROVIDED BY Joost VandeVondele ''AS IS'' AND ANY
! * EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
! * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
! * DISCLAIMED. IN NO EVENT SHALL Joost VandeVondele BE LIABLE FOR ANY
! * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
! * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
! * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
! * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
! * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
! * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
! *
!====================================================================================================================

