Kian Paimani
Zakarias Nordfäldt-Laws
June 2018
PERFORMANCE ENGINEERING
Recap: Gradient Descent
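As a reminder of the standard formulation (a sketch for reference; ADAM refines this basic rule), each iteration moves the weights against the gradient of the cost function J, scaled by the learning rate alpha:

    w_{t+1} = w_t - \alpha \, \nabla_w J(w_t)

Stochastic and mini-batch variants replace the full gradient with an estimate computed on a subset of the data.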
Recap: Goals
Reference Implementation
for (i=1; i<iterations+1; i++){
    for (z=0; z<length; z+=batch_size) {
        update_gradients_batch(/* ... */);
        for (n=0; n<dim; n++) {
            /* Eventually, update the weight */
            par->weights[n] += (alpha * m_hat) / (sqrt(r_hat) + eps);
        }
    }
}
void update_gradients_batch(){
    for (i=start; i<start+batch_size; i++){
        for (n=0; n<dim; n++) {
            /* 1. Make a prediction */
            /* 2. Compute error */
            /* 3. Calculate gradient using the cost function */
        }
    }
}
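The m_hat and r_hat used in the weight update above come from ADAM's bias-corrected moment estimates. A minimal sketch of the standard formulation, with the sign convention matching the `+=` update above (the helper name adam_delta and its parameters are assumptions, not taken from the original code):

    #include <math.h>

    /* Standard ADAM moment update for one weight; g is the gradient
     * accumulated by update_gradients_batch, t is the iteration count. */
    double adam_delta(double g, double *m, double *r, int t,
                      double alpha, double beta1, double beta2, double eps)
    {
        *m = beta1 * (*m) + (1.0 - beta1) * g;      /* 1st moment (mean)     */
        *r = beta2 * (*r) + (1.0 - beta2) * g * g;  /* 2nd moment (variance) */
        double m_hat = *m / (1.0 - pow(beta1, t));  /* bias correction       */
        double r_hat = *r / (1.0 - pow(beta2, t));
        return (alpha * m_hat) / (sqrt(r_hat) + eps);
    }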
Analytical Model
[Diagram: execution timeline split into Setup, iteration over all data, and Finalize, contrasted for a small batch_size vs. a large batch_size]
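A simple cost model consistent with the diagram (T_setup, T_finalize, t_grad, and t_upd are assumed names for the phase costs):

    T \approx T_{setup}
      + iterations \cdot \frac{length}{batch\_size}
        \cdot \left( batch\_size \cdot dim \cdot t_{grad} + dim \cdot t_{upd} \right)
      + T_{finalize}

The gradient term is independent of batch_size (batch_size cancels against length/batch_size), while the weight-update term shrinks as batch_size grows, so a large batch_size amortizes the per-batch update cost.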
CPU-Based Parallelization
/* Initial Version - Generic Gradient Descent */
void gradient_descent(struct parameters *par);
void stochastic_gradient_descent(struct parameters *par);
/* ADAM Versions */
void adam(struct parameters *par);
void adam_seq_opt(struct parameters *par);
void adam_data_opt(struct parameters *par);
void adam_omp(struct parameters *par);
void adam_omp_simd(struct parameters *par);
CPU-Based Parallelization
adam_seq_opt()
Loop unrolling, function call removal, code motion, etc. (see the sketch below)
adam_data_opt()
Cache optimization through a better data-access pattern.
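As an illustration of the kind of transformations adam_seq_opt() could apply (a sketch, not the actual source; predict_term and the row-major layout of X are assumptions):

    /* Before: a helper call per element, call overhead on every trip. */
    double predict_naive(const struct parameters *par, int i, int dim)
    {
        double guess = 0.0;
        for (int n = 0; n < dim; n++)
            guess += predict_term(par, i, n);   /* hypothetical helper */
        return guess;
    }

    /* After: helper inlined (function removal), loop-invariant pointers
     * hoisted (code motion), body unrolled by 4 so the independent
     * partial sums can be pipelined. */
    double predict_opt(const struct parameters *par, int i, int dim)
    {
        const double *x = &par->X[i * dim];     /* hoisted out of the loop */
        const double *w = par->weights;
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int n;
        for (n = 0; n + 3 < dim; n += 4) {
            s0 += x[n]     * w[n];
            s1 += x[n + 1] * w[n + 1];
            s2 += x[n + 2] * w[n + 2];
            s3 += x[n + 3] * w[n + 3];
        }
        for (; n < dim; n++)                    /* remainder loop */
            s0 += x[n] * w[n];
        return s0 + s1 + s2 + s3;
    }

adam_data_opt() presumably goes further by arranging X so that this inner loop walks memory contiguously and stays in cache.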
CPU-Based Parallelization
adam_omp()
Two loops exposed to OpenMP parallelism (see the sketch below)
[Diagram: execution timeline with Setup and Finalize phases repeated around the iteration over all data]
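A minimal sketch of what the two exposed loops could look like (the array-section reduction and all names are assumptions; requires OpenMP 4.5):

    #include <math.h>
    #include <omp.h>

    /* Loop 1: accumulate the batch gradient in parallel.
     * Loop 2: update each weight independently in parallel.
     * X is row-major [length x dim]; grad, m, r are per-dimension arrays. */
    void adam_batch_omp(const double *X, const double *Y, double *w,
                        double *grad, double *m, double *r,
                        int z, int batch_size, int dim, int t,
                        double alpha, double beta1, double beta2, double eps)
    {
        #pragma omp parallel for reduction(+ : grad[0:dim])
        for (int n = z; n < z + batch_size; n++) {
            double guess = 0.0;
            for (int a = 0; a < dim; a++)
                guess += X[n * dim + a] * w[a];
            double error = Y[n] - guess;
            for (int a = 0; a < dim; a++)
                grad[a] += error * X[n * dim + a];
        }

        #pragma omp parallel for
        for (int n = 0; n < dim; n++) {
            m[n] = beta1 * m[n] + (1.0 - beta1) * grad[n];
            r[n] = beta2 * r[n] + (1.0 - beta2) * grad[n] * grad[n];
            double m_hat = m[n] / (1.0 - pow(beta1, t));
            double r_hat = r[n] / (1.0 - pow(beta2, t));
            w[n] += (alpha * m_hat) / (sqrt(r_hat) + eps);
        }
    }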
CPU-Based Parallelization
adam_omp_simd()
for (i=1; i<iterations+1; i++){
    for (z=0; z<length; z+=MIN(batch_size, length-z)) {
        /* ... */
        #pragma omp parallel
        {
            #pragma omp for
            for (n=z; n<MIN(z+batch_size, length); n++){
                for (a=0; a<dim-7; a+=8) { /* Vectorized Execution: Calculate Guess */ }
                error = par->Y[n] - guess;
                for (a=0; a<dim-7; a+=8) { /* Vectorized Execution: Update Gradients */ }
            }
            #pragma omp critical
            {
                /* Vectorized Execution: Reduction */
            }
            #pragma omp barrier
            #pragma omp for schedule(static) private(n, m_hat, r_hat)
            for (n=0; n<dim; n++) { /* Update weights */ }
        }
    }
}
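The 8-wide stride in the inner loops suggests AVX on single-precision data; a sketch of what the "Calculate Guess" loop could look like with intrinsics (an assumption about how the vectorization was done, which may instead rely on compiler auto-vectorization):

    #include <immintrin.h>

    /* 8-lane dot product for the guess, assuming float data and AVX. */
    float guess_avx(const float *x, const float *w, int dim)
    {
        __m256 acc = _mm256_setzero_ps();
        int a;
        for (a = 0; a < dim - 7; a += 8) {
            __m256 xv = _mm256_loadu_ps(&x[a]);
            __m256 wv = _mm256_loadu_ps(&w[a]);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(xv, wv));
        }
        float lanes[8];                          /* horizontal sum */
        _mm256_storeu_ps(lanes, acc);
        float guess = lanes[0] + lanes[1] + lanes[2] + lanes[3]
                    + lanes[4] + lanes[5] + lanes[6] + lanes[7];
        for (; a < dim; a++)                     /* scalar remainder */
            guess += x[a] * w[a];
        return guess;
    }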
GPU-Based Parallelization
/* GPU Versions */
void adam_cuda_global_mem(struct parameters *par);
void adam_cuda_global_mem_unrolled(struct parameters *par);
void adam_cuda_shared_mem(struct parameters *par);
void adam_cuda_shared_mem_stream(struct parameters *par);
void adam_cuda_shared_mem_stream_pinned(struct parameters *par);
void adam_cuda_shared_mem_stream_pinned_unrolled(struct parameters *par);
GPU-Based Parallelization
adam_cuda_global_mem()
[Diagram: per-batch timeline with Setup and Finalize phases around the GPU computation]
Implies: bigger batches => better speedup
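A minimal sketch of a kernel consistent with the name, with one thread per sample and everything, including the gradient accumulation, kept in global memory (the names and the atomicAdd strategy are assumptions):

    __global__ void grad_batch_global(const float *X, const float *Y,
                                      const float *w, float *grad,
                                      int batch_size, int dim)
    {
        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= batch_size) return;

        float guess = 0.0f;
        for (int a = 0; a < dim; a++)       /* every read hits global memory */
            guess += X[n * dim + a] * w[a];
        float error = Y[n] - guess;

        for (int a = 0; a < dim; a++)       /* contended global atomics */
            atomicAdd(&grad[a], error * X[n * dim + a]);
    }

Kernel launch and host-device transfer costs are paid once per batch, which is why bigger batches amortize better.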
GPU-Based Parallelization
adam_cuda_shared_mem()
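A sketch of how the shared-memory variant could stage data, caching the weight vector in shared memory so each block reads it from DRAM only once (tile handling and names are assumptions):

    __global__ void grad_batch_shared(const float *X, const float *Y,
                                      const float *w, float *grad,
                                      int batch_size, int dim)
    {
        extern __shared__ float w_s[];           /* dim floats per block */
        for (int a = threadIdx.x; a < dim; a += blockDim.x)
            w_s[a] = w[a];
        __syncthreads();

        int n = blockIdx.x * blockDim.x + threadIdx.x;
        if (n >= batch_size) return;

        float guess = 0.0f;
        for (int a = 0; a < dim; a++)
            guess += X[n * dim + a] * w_s[a];    /* shared-memory reads */
        float error = Y[n] - guess;
        for (int a = 0; a < dim; a++)
            atomicAdd(&grad[a], error * X[n * dim + a]);
    }

    /* Launched with the dynamic shared-memory size as the third parameter:
     * grad_batch_shared<<<blocks, threads, dim * sizeof(float)>>>(...); */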
GPU-Based Parallelization
adam_cuda_shared_mem_stream()
Ratio: 58 / 42
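A minimal sketch of the streaming idea, double-buffering batches across two CUDA streams so the host-to-device copy of one batch overlaps the kernel of the previous one (the buffer layout and names are assumptions):

    /* d_X[0] and d_X[1] are device buffers of batch_size*dim floats,
     * allocated up front with cudaMalloc. */
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int threads = 256;
    int blocks  = (batch_size + threads - 1) / threads;
    for (int z = 0, k = 0; z < length; z += batch_size, k ^= 1) {
        size_t bytes = (size_t)batch_size * dim * sizeof(float);
        cudaMemcpyAsync(d_X[k], h_X + (size_t)z * dim, bytes,
                        cudaMemcpyHostToDevice, s[k]);
        grad_batch_shared<<<blocks, threads, dim * sizeof(float), s[k]>>>(
            d_X[k], d_Y + z, d_w, d_grad, batch_size, dim);
    }
    cudaDeviceSynchronize();

Note that cudaMemcpyAsync only truly overlaps with compute when the host buffer is page-locked, which is what motivates the _pinned variants below.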
GPU-Based Parallelization
adam_cuda_shared_mem_stream_pinned_unrolled()
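The pinned variants presumably allocate the host buffers with cudaMallocHost so the async copies can run through DMA without staging (a sketch; h_X is the assumed host buffer from the streaming loop above):

    float *h_X;
    cudaMallocHost((void **)&h_X,               /* page-locked allocation */
                   (size_t)length * dim * sizeof(float));
    /* ... fill h_X, run the streamed loop ... */
    cudaFreeHost(h_X);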
Results: Base
Results: Optimized
Modeling: Some Explanation
Memory Bound
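A back-of-the-envelope check of why the computation ends up memory-bound: each element of X is multiplied and accumulated exactly once per pass over the data, so the streamed input yields about 2 FLOPs per 8-byte double,

    I = \frac{2\ \text{FLOPs}}{8\ \text{bytes}} = 0.25\ \text{FLOP/byte}

which sits well below the compute-to-bandwidth balance point of typical CPUs and GPUs of the period (the exact balance figure is machine-dependent and assumed here), so performance is capped by memory bandwidth rather than arithmetic throughput.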