Improving Performance by Aligning Data

The vectorizer can generate faster code when operating on aligned data. In this activity you will improve vectorizer performance by aligning the arrays a, b, and c in driver.f90 on a 16-byte boundary. The vectorizer can then use aligned load instructions for all arrays instead of the slower unaligned load instructions, and it can avoid runtime tests of alignment. Defining the ALIGNED macro inserts an alignment directive for a, b, and c in driver.f90 with the following syntax:

!dir$ attributes align : 16 :: a,b,c

This instructs the compiler to create arrays that are aligned on a 16-byte boundary, which should facilitate the use of SSE aligned load instructions.

In addition, the column height of the matrix a needs to be padded out to be a multiple of 16 bytes, so that each individual column of a maintains the same 16-byte alignment. In practice, maintaining a constant alignment between columns is much more important than aligning the start of the arrays.
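The padding arithmetic can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from driver.f90: the program name, the variable names (rows, cols, rows_padded), and the example sizes are hypothetical, and the element size assumes 8-byte reals as produced by -real-size 64.

```fortran
program pad_columns
  implicit none
  ! Assumptions for illustration only: 8-byte reals and a 16-byte
  ! alignment target. All names and sizes here are hypothetical.
  integer, parameter :: dp = selected_real_kind(15)
  integer, parameter :: elem_bytes  = 8
  integer, parameter :: align_bytes = 16
  integer, parameter :: elems_per_block = align_bytes / elem_bytes
  integer, parameter :: rows = 101   ! logical column height
  integer, parameter :: cols = 4
  ! Round the column height up to a whole number of 16-byte blocks.
  integer, parameter :: rows_padded = &
       ((rows + elems_per_block - 1) / elems_per_block) * elems_per_block

  real(dp) :: a(rows_padded, cols)
  ! Intel-specific directive; other compilers treat it as a comment.
  !dir$ attributes align : 16 :: a

  a = 0.0_dp
  print '(a,i0)', 'padded column height: ', rows_padded
end program pad_columns
```

With rows = 101, the padded height is 102 elements, so each column occupies 816 bytes, a multiple of 16. If the first column starts on a 16-byte boundary, every later column does too.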

To derive the maximum benefit from this alignment, we also need to tell the vectorizer that it can safely assume the arrays in matvec.f90 are aligned, by using the directive:

!dir$ vector aligned

Note

If you use !dir$ vector aligned, you must be sure that all the arrays or subarrays in the loop are 16-byte aligned. Otherwise, you may get a runtime error. Aligning data may still give a performance benefit even if !dir$ vector aligned is not used. See the code under the ALIGNED macro in matvec.f90.

If your compilation targets the Intel® AVX instruction set, you should try to align data on a 32-byte boundary. This may result in improved performance. In this case, !dir$ vector aligned advises the compiler that the data is 32-byte aligned.
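As a rough sketch of where the directive goes, the following hypothetical routine places !dir$ vector aligned immediately before the inner loop. It is not the actual matvec.f90 source: the routine name, the argument names, and the loop body are assumptions chosen for illustration. The sketch is only safe if the caller has aligned a and c and has padded the leading dimension ld as described above.

```fortran
! Hypothetical sketch; not the actual matvec.f90 source.
subroutine matvec_sketch(ld, nrows, ncols, a, b, c)
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, intent(in)   :: ld, nrows, ncols
  real(dp), intent(in)  :: a(ld, ncols)   ! ld = padded column height
  real(dp), intent(in)  :: b(ncols)
  real(dp), intent(out) :: c(nrows)
  integer :: i, j

  c = 0.0_dp
  do j = 1, ncols
     ! Assertion to the compiler: every array reference in the loop
     ! below starts on an aligned address. Compilers that do not
     ! recognize the directive treat this line as a comment.
     !dir$ vector aligned
     do i = 1, nrows
        c(i) = c(i) + a(i, j) * b(j)
     end do
  end do
end subroutine matvec_sketch
```

The directive applies to the loop that follows it, so it sits inside the outer loop, directly above the inner loop that the vectorizer is expected to handle. If ld were not padded, the columns a(:, j) would not all share the same alignment and the assertion would be false.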

Recompile the program with the ALIGNED macro defined to ensure consistently aligned data:

ifort -real-size 64 -vec-report2 -DALIGNED matvec.f90 driver.f90 -o MatVector

matvec.f90(32) (col. 3): remark: LOOP WAS VECTORIZED.
matvec.f90(33) (col. 3): remark: loop was not vectorized: not inner loop.
matvec.f90(38) (col. 6): remark: LOOP WAS VECTORIZED.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: loop was not vectorized: subscript too complex.
driver.f90(59) (col. 5): remark: loop was not vectorized: not inner loop.
driver.f90(59) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(61) (col. 5): remark: loop was not vectorized: vectorization possible but seems inefficient.
driver.f90(61) (col. 5): remark: LOOP WAS VECTORIZED.
driver.f90(63) (col. 21): remark: loop was not vectorized: not inner loop.
driver.f90(63) (col. 21): remark: LOOP WAS VECTORIZED.
driver.f90(80) (col. 29): remark: LOOP WAS VECTORIZED.
driver.f90(74) (col. 7): remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.