Dagger.jl for Task Parallel HPC

By Julian Samaroo

What is Dagger?

  • Pure-Julia task parallelism library
  • Unified task API supports multiple high-level APIs (arrays, tables, graphs, streaming, data dependencies)
  • Supports multi-threading, multi-node, multi-GPU out of the box
  • Machine learning-based scheduler optimizes computational DAGs based on learned task behavior
  • Advanced data dependency system automatically parallelizes operations on overlapping data regions (like OpenMP's "task depend" on steroids)
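As a rough illustration of the unified task API, the sketch below spawns a few tasks with `Dagger.@spawn` and lets Dagger infer the DAG from which task handles are passed as arguments. The helper functions `add1`, `add2`, and `combine` are hypothetical names defined only for this example.

```julia
using Dagger

# Hypothetical helpers, defined here only for illustration
add1(x) = x + 1
add2(x) = x + 2
combine(xs...) = sum(xs)

p = Dagger.@spawn add1(4)           # runs asynchronously and returns a task handle
q = Dagger.@spawn add2(p)           # passing a task as an argument makes q depend on p
r = Dagger.@spawn add1(3)           # independent of p and q, so it may run concurrently
s = Dagger.@spawn combine(p, q, r)  # waits on all three

@assert fetch(s) == 16              # (4 + 1) + (5 + 2) + (3 + 1)
```

Nothing here names a thread, process, or device; placement is left to the scheduler.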

Why Dagger?

  • Highly scalable for multi-CPU, multi-GPU, and multi-node
  • Powerful alternative to libraries like ScaLAPACK, Elemental, PLASMA
  • Custom APIs are easy to build on top of the task API
  • No need for inconsistent vendor-specific compilers
  • Can (eventually) do away with MPI

Highly Scalable

Dagger's Superpower: ??????

# Cholesky

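using LinearAlgebra: BLAS, LAPACK

# Right-looking tiled Cholesky (lower triangle); M is an mt × nt grid of tiles
function cholesky!(M, mt, nt)
  for k in range(1, mt)
    # Factor the diagonal tile in place
    LAPACK.potrf!('L', M[k, k])
    # Triangular solve for every tile below the diagonal in column k
    for m in range(k+1, mt)
      BLAS.trsm!('R', 'L', 'T', 'N', 1.0,
                 M[k, k], M[m, k])
    end
    # Update the trailing tiles with the freshly computed column
    for n in range(k+1, nt)
      BLAS.syrk!('L', 'N', -1.0,
                 M[n, k], 1.0,
                 M[n, n])
      for m in range(n+1, mt)
        BLAS.gemm!('N', 'T', -1.0,
                   M[m, k], M[n, k],
                   1.0, M[m, n])
      end
    end
  end
end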
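One way to sanity-check the tiled loop nest is to run it on a small symmetric positive definite matrix and compare the result against `cholesky` from LinearAlgebra. The sketch below assumes `M` is a plain matrix of `Matrix{Float64}` tiles; the name `tiled_cholesky!`, the tile size `nb`, and the random test matrix are illustrative choices, not part of the talk.

```julia
using LinearAlgebra

# Same loop nest as the slide, under a different (hypothetical) name so it
# does not clash with LinearAlgebra.cholesky!
function tiled_cholesky!(M, mt, nt)
    for k in 1:mt
        LAPACK.potrf!('L', M[k, k])
        for m in k+1:mt
            BLAS.trsm!('R', 'L', 'T', 'N', 1.0, M[k, k], M[m, k])
        end
        for n in k+1:nt
            BLAS.syrk!('L', 'N', -1.0, M[n, k], 1.0, M[n, n])
            for m in n+1:mt
                BLAS.gemm!('N', 'T', -1.0, M[m, k], M[n, k], 1.0, M[m, n])
            end
        end
    end
    return M
end

nb, mt = 4, 3                 # tile size and tiles per side (arbitrary)
N = nb * mt
X = rand(N, N)
A = X * X' + N * I            # symmetric positive definite test matrix

# Copy each nb × nb block into its own Matrix so BLAS/LAPACK can mutate it in place
M = [A[(i-1)*nb+1:i*nb, (j-1)*nb+1:j*nb] for i in 1:mt, j in 1:mt]
tiled_cholesky!(M, mt, mt)

# Reassemble the lower-triangular factor; tiles above the diagonal are never touched
L = zeros(N, N)
for i in 1:mt, j in 1:i
    L[(i-1)*nb+1:i*nb, (j-1)*nb+1:j*nb] = i == j ? LowerTriangular(M[i, j]) : M[i, j]
end

@assert L * L' ≈ A
@assert L ≈ cholesky(Symmetric(A)).L
```

Each BLAS/LAPACK call reads and writes whole tiles, which is what makes the per-tile access pattern easy to annotate later.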

Dagger's Superpower: Datadeps

# Cholesky written with Dagger Datadeps

using Dagger
using LinearAlgebra: BLAS, LAPACK

# In (read-only) and InOut (read-write) are the argument wrappers used in
# Dagger's Datadeps documentation; they play the role of Read/ReadWrite
Dagger.spawn_datadeps() do
  for k in range(1, mt)
    Dagger.@spawn LAPACK.potrf!('L', InOut(M[k, k]))
    for m in range(k+1, mt)
      Dagger.@spawn BLAS.trsm!('R', 'L', 'T', 'N', 1.0,
                               In(M[k, k]), InOut(M[m, k]))
    end
    for n in range(k+1, nt)
      Dagger.@spawn BLAS.syrk!('L', 'N', -1.0,
                               In(M[n, k]), 1.0,
                               InOut(M[n, n]))
      for m in range(n+1, mt)
        Dagger.@spawn BLAS.gemm!('N', 'T', -1.0,
                                 In(M[m, k]), In(M[n, k]),
                                 1.0, InOut(M[m, n]))
      end
    end
  end
end

What is Datadeps?

  • API for automatic parallelism of serial code
  • Data aliasing/locality aware (CPU, GPU, node, etc.)
  • Handles data transfers automatically and efficiently
  • Supports data larger than RAM/VRAM ("out-of-core")
  • A single algorithm scales from one core to multi-GPU, multi-node systems without being rewritten
  • (Coming Soon) Auto MPI
  • (Coming Soon) Auto multi-precision
  • (Coming Soon) Optimal scheduler
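A minimal sketch of the idea on plain vectors, using the `In` (read-only) and `InOut` (read-write) argument wrappers from Dagger's Datadeps documentation. The helpers `scale_by!` and `add_into!` are hypothetical names introduced for this example. The code is written in serial order; Datadeps is free to overlap the first two tasks because they touch different arrays, while the third must wait for both.

```julia
using Dagger

# Hypothetical in-place helpers, defined here only for illustration
scale_by!(x, a) = (x .*= a; nothing)
add_into!(y, x) = (y .+= x; nothing)

A = ones(4)
B = ones(4)

Dagger.spawn_datadeps() do
    Dagger.@spawn scale_by!(InOut(A), 2.0)     # writes A
    Dagger.@spawn scale_by!(InOut(B), 3.0)     # writes B; independent of A
    Dagger.@spawn add_into!(InOut(A), In(B))   # reads B, writes A; ordered after both
end

# The region returns once all tasks have finished, and the result matches
# what the same three calls would produce when run serially
@assert A == fill(5.0, 4)
@assert B == fill(3.0, 4)
```

The access annotations are the only change relative to the serial calls, which is the same pattern the tiled Cholesky follows.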
