CUDA

Syed Asad Zaman

BS Computer Science

FAST National University of Computing & Emerging Sciences

Parallel & Distributed Computing

Abstract

CPU

  • small number of cores
  • thread creation is expensive
  • typically only a handful of hardware threads (e.g. 8–12)

GPU

  • large number of cores
  • hundreds of CUDA cores (e.g. 768 on a mid-range card)
  • traditionally used for graphics rendering
  • well suited to a divide & conquer approach

A Question Arises

How to do general-purpose tasks on GPU?

Answer

CUDA Programming

Architecture

Requirements

  • NVIDIA GPU
  • CUDA Toolkit
  • PyCUDA (optional, for Python)

Installation

sudo apt install nvidia-cuda-toolkit
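
After installing, the toolkit can be sanity-checked from the command line; PyCUDA (listed under requirements) can be added via pip. A minimal sketch, assuming an Ubuntu-style system with the NVIDIA driver already installed:

```shell
# Verify the CUDA compiler is on the PATH
nvcc --version

# Confirm the driver can see the GPU
nvidia-smi

# Optionally install PyCUDA for the Python bindings
pip install pycuda
```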

Sample Code

#include <stdio.h>
#include <math.h>

// SAXPY kernel: y = a*x + y, one element per thread
__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;  // 1M elements
  float *x, *y, *d_x, *d_y;

  // Allocate host memory
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  // Allocate device memory
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Copy inputs to the device
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements: 256 threads per block,
  // with enough blocks to cover all N elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  // Copy the result back to the host
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  // Verify: every element should be 2*1 + 2 = 4
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
  return 0;
}
nvcc -o saxpy saxpy.cu && ./saxpy

Demo
