Programming

Multi-core

and

Many-core Systems

Assignment 1

Jonas Theis

Kian Paimani

February 2018

Goals

Implement an optimized sequential version
Compile it with different flags and tools
Implement a manually vectorized version using Intel SSE
Compare, benchmark and evaluate both

Sequential Version

Code Motion - Share common expression

for (int row_idx = 0; row_idx < N; row_idx++) {
    for (int col_idx = 0; col_idx < M; col_idx++) {
        matrix[row_idx*M + col_idx] = computed_new_value;
    }
}

int current_row_idx; 
for (int row_idx = 0; row_idx < N; row_idx++) {
    current_row_idx = row_idx*M
    for (int col_idx = 0; col_idx < M; col_idx++) {
        matrix[current_row_idx + col_idx] = computed_new_value; 
        // update diff etc.
        }
    }
}

Sequential Version

Code Motion - Share common expression

int current_row_idx, current_cell_index; 
for (int row_idx = 0; row_idx < N; row_idx++) {
    current_row_idx = row_idx*M
    for (int col_idx = 0; col_idx < M; col_idx++) {
        current_cell_index = current_row_idx + col_idx
        matrix[current_cell_index] = computed_new_value; 
        // update diff etc.
        int diff = matrix[current_cell_index] - o_matrix[current_cell_index]
        }
    }
}

Sequential Version

Avoiding Optimization Blockers
- Function calls
- Conditions

for (int row_idx = 0; row_idx < N; row_idx++) {
    for (int col_idx = 1; col_idx < M-1; col_idx++) {
        matrix[row_idx*M + col_idx] = computed_new_value; 
    }
	// do cell update for col_idx = 0
	matrix[row_idx*M] = computed_new_value; 
	// do cell update for col_idx = M-1
	matrix[row_idx*M + M-1] = computed_new_value; 
}

Sequential Version

Avoiding Optimization Blockers
- Removing another subtle conditional statement

// diff = t_surface_index - temp_index;
// abs_diff = diff > 0 ? diff : diff * -1;

union Abs {
  double d;
  long i;
};
const long abs_bitmask = ~0x8000000000000000;
    
// calculate absolute diff between new and old value
abs_diff.d = t_surface_index - temp_index;
abs_diff.i = abs_bitmask & abs_diff.i;

Sequential Version

Memory Access Pattern

int current_row_idx, current_cell_index; 
for (int row_idx = 0; row_idx < N; row_idx++) {
    current_row_idx = row_idx*M
    for (int col_idx = 0; col_idx < M; col_idx++) {
        current_cell_index = current_row_idx + col_idx
        matrix[current_cell_index] = computed_new_value; 
        // update diff etc.
        int diff = matrix[current_cell_index] - o_matrix[current_cell_index]
        }
    }
}

Sequential Version

Memory Access Pattern

    int current_row_idx, current_cell_index, temp_value; 
    for (int row_idx = 0; row_idx < N; row_idx++) {
    	current_row_idx = row_idx*M
    	for (int col_idx = 0; col_idx < M; col_idx++) {
            current_cell_index = current_row_idx + col_idx
            temp_value = computed_new_value; 
            
            // update diff etc.
            int diff = temp_value - o_matrix[current_cell_index]
            
            // write to memory once
            matrix[current_cell_index] = temp_value
        }
    }

SIMD Version

Evaluations

Sequential - Different size

Evaluations

Sequential Different Compilers

Evaluations

Vectorized version

Conclusions

There is a noticeable difference between the choice of compiler version.
Recent versions of compilers do a better job at optimizing the code.
- While our vectorized version could outperform gcc4 and icc, gcc6 still produced a faster executable for the sequential version.
failed attempt at vectorizing a sequence of additions => SSE can handle only up to 2 double values. Not always worth the time!

pmms-1

By Kian Peymani

pmms-1

7 years ago
514

Kian Peymani

kianenigma

Goals

Sequential Version

Sequential Version

Sequential Version

Sequential Version

Sequential Version

Sequential Version

SIMD Version

Evaluations

Evaluations

Evaluations

Conclusions

pmms-1

More from Kian Peymani