FlowPM
IDRIS Hackathon
Team CosmoStat
What are we trying to do?
 The Universe is difficult to describe analytically, for many applications in Cosmology, we need to simulate it!
 In our case we rely on a class of simple Nbody simulations, based on a ParticleMesh scheme. We alternate between:
 Interpolating particles on a 3D grid
 Computing forces on the grid by 3D Fast FourierTransform
Example of reconstructing the initial conditions of the Universe by gradient descent
FlowPM: Distributed PM Solver in TensorFlow
import tensorflow as tf
import numpy as np
import flowpm
cosmo = flowpm.cosmology.Planck15()
stages = np.linspace(0.1, 1.0, 10, endpoint=True)
initial_conditions = flowpm.linear_field(32, # size of the cube
100, # Physical size of the cube
ipklin, # Initial power spectrum
batch_size=16)
# Sample particles
state = flowpm.lpt_init(cosmo, initial_conditions, a0=0.1)
# Evolve particles down to z=0
final_state = flowpm.nbody(cosmo, state, stages, 32)
# Retrieve final density field
final_field = flowpm.cic_paint(tf.zeros_like(initial_conditions), final_state[0])
FlowPM on singleGPU
Scaling up: Mesh TensorFlow
 With a single GPU, 16GB of RAM can run an approximate 128^3 simulation => Far from enough for practical applications
 We adopted the Mesh TensorFlow framework for model parallelism.
#A simple 2 layer fully connected network
#y=Relu(xw+bias)v
#Define dimensions
batch = mtf.Dimension("batch", b)
io = mtf.Dimension("io", d_io)
hidden = mtf.Dimension("hidden", d_h)
# x.shape == [batch, io]
#Get tensors
w = mtf.get_variable("w", shape=[io, hidden])
bias = mtf.get_variable("bias", shape=[hidden])
v = mtf.get_variable("v", shape=[hidden, io])
#Define operations
h = mtf.relu(mtf.einsum(x, w, output_shape=[batch, hidden]) + bias)
y = mtf.einsum(h, v, output_shape=[batch, io])
batch = mtf.Dimension("batch", b)
io = mtf.Dimension("io", d_io)
hidden = mtf.Dimension("hidden", d_h)
#Define mesh dimensions and layout
mesh_shape = [("processor_rows", 2), ("processor_cols", 2)]
layout_rules = [("batch", "processor_rows"),
("hidden", "processor_cols")]
And it works!
Comparison of distributed simulation on 4 GPUs
But... we are hitting performance and scalability issues
The problem with Mesh TensorFlow on GPU clusters
 Mesh TensorFlow is designed for Google's TPU pods.
 SIMD mesh implementation relying on TPUspecific collectives
 SIMD mesh implementation relying on TPUspecific collectives
 For GPUs, Mesh TensorFlow ships with what's called "DevicePlacementMeshImpl"
gpus = tf.config.experimental.list_logical_devices('GPU')
if gpus:
# Replicate your computation on multiple GPUs
c = []
for gpu in gpus:
with tf.device(gpu.name):
a = tf.constant([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0],
[3.0, 4.0],
[5.0, 6.0]])
c.append(tf.matmul(a, b))
with tf.device('/CPU:0'):
matmul_sum = tf.add_n(c)
print(matmul_sum)
Limitations of Device Placement

How it works:
 One master TensorFlow process defines and runs the compute graph

One client TensorFlow process per node just exposes its GPUs over network

Limitations:

A very large and complex graph is executed by a single process
 Interprocess comms happen through the gRPC layer

A very large and complex graph is executed by a single process
# Resolve the cluster from SLURM environment
cluster = tf.distribute.cluster_resolver.SlurmClusterResolver(
{"mesh": mesh_shape.size // FLAGS.gpus_per_task},
port_base=8822,
gpus_per_node=FLAGS.gpus_per_node,
gpus_per_task=FLAGS.gpus_per_task,
tasks_per_node=FLAGS.tasks_per_node)
cluster_spec = cluster.cluster_spec()
# Create a server for all mesh members
server = tf.distribute.Server(cluster_spec, "mesh",
cluster.task_id)
# Only he master job takes care of the graph building,
# everyone else can just chill for now
if cluster.task_id > 0:
server.join()
# Otherwise we are the main task, let's define the devices
mesh_devices = [
"/job:mesh/task:%d/device:GPU:%d" % (i, j)
for i in range(cluster_spec.num_tasks("mesh"))
for j in range(FLAGS.gpus_per_task)
]
print("List of devices", mesh_devices)
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
mesh_shape, layout_rules, mesh_devices)
Illustration of timings for 3D FFTs
The goal: SPMD Mesh TensorFlow using NCCL
 Level I: Use Horovod as the interface layer between NCCL and TensorFlow
 Level II: Implement a Mesh TensorFlow backend on modified Horovod
 Level III:Implement and Benchmark distributed FlowPM on new Mesh TF
Level I: Horovod
 Since late last year Horovod includes all2all collectives based on NCCL. BUT, only supports one global communicator.
 With Myriam and Jin, we worked on a PR to include support for multiple communicators.
from mpi4py import MPI
import horovod.tensorflow as hvd
# This will be our baseline world communicator
comm = MPI.COMM_WORLD
# Split COMM_WORLD into subcommunicators
subcomm = MPI.COMM_WORLD.Split(color=MPI.COMM_WORLD.rank % 2,
key=MPI.COMM_WORLD.rank)
# Initialize horovod with an array of communicators
hvd.init(comm=[comm, subcomm])
[....]
# AlltoAll over the WORLD communicator
hvd.alltoall(a, communicator_id=0)
# AlltoAll over second (splitted) communicator
hvd.alltoall(a, communicator_id=1)
Level I goals
 Further test implementation (check for deadlocks in particular), and try to get PR approved for merging in Horovod.
 [Very Optional] Implement missing collectives like shifts along one mesh axis
Level II: Mesh TensorFlow
 We already have a prototype here of a Horovodbased Mesh TensorFlow backend, already runs but with limited functionality)
mesh_impl = HvdSimdMeshImpl(mtf.convert_to_shape(mesh_shape),
mtf.convert_to_layout_rules(layout_rules))
# And internally, HvdSimdMeshImpl essentially does:
# Use MPI to define needed communicators
cart_comm = MPI.COMM_WORLD.Create_cart(dims=dims,
periods=[False]*len(dims))
[...]
# Initializing horovod with our set of communicators
hvd.init(comm=[c for _, c in comms.items()])
# When collectives are called:
t = hvd.alltoall(t, communicator_id=self._comms_id[name_dim])
Level II goals
 Finish and test Mesh TF Horovod implementation
 Benchmark and optimize main ops needed for simulation: FFT3D and halo exchange
GPU utilisation for 3D FFT in device placement Mesh TF
Level III: FlowPM and Mesh TF applications
 We have never have been able to run FlowPM simulations on grids larger than 1024^3, so we want to test the scaling of our endapplication under the new backend
 Some recent functionalities of FlowPM are not yet distributed, in particular Gravitational Lensing. They should be reimplemented in Mesh TF, and tested and optimized for the new backend
 Our lab is also working on 3D brain imaging with NeuroSpin, we have a prototype Mesh TF reconstruction code that can be transfered to new backend
Level III goals
 Benchmark final application
 Implement missing Mesh TF FlowPM components
 [Bonus goal] Proof of Concept of 3D brain imaging with Mesh TF model parallelism on new backend
IDRIS Hackathon Intro
By eiffl
IDRIS Hackathon Intro
 720