Analyzing the Energy-Time Trade-Off in High-Performance Computing Applications

Presented by Kevin Gee

Based on the paper by Freeh, Lowenthal, Pan, Kappiah, and Springer,

published in IEEE Transactions on Parallel and Distributed Systems,

Vol. 18, No. 6, June 2007

Purpose

Demonstrate the energy-time trade-off for a wide range of applications, serial and parallel, on a power-scalable cluster
Introduce metrics to predict memory and communication bottlenecks

Content

Introduction
Experimental methodology
Single node results
Multi node results
Metrics
Conclusion

Introduction: facts

HPC (High-Performance Computing) tends to push performance at all costs

The last 10% increase in performance requires a disproportionately large amount of resources
The Earth Simulator consumes 7 MW of power

Introduction: solution

Low-power high-performance clusters to save energy
Green Destiny consumes about 3 times less energy than the Accelerated Strategic Computing Initiative (ASCI) Q machine
The ASCI Q machine is about 15 times faster per node

Introduction: solution

Use high-performance commodity microprocessors with frequency and voltage scaling
An increase in frequency results in:

a proportionally smaller increase in application performance
an increase in CPU stalls
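The two effects listed above can be illustrated with a very simple model. This is only a sketch under an assumption that is not taken from the paper: execution time is split into a compute part that scales with frequency and a memory-stall part that does not. The function name and all numbers are hypothetical.

```python
def modeled_time(f_mhz, f_max_mhz, cpu_seconds, stall_seconds):
    """Execution time at frequency f under a two-part model (an assumption).

    cpu_seconds: compute time measured at f_max; assumed to scale with f_max / f.
    stall_seconds: time stalled on memory; assumed not to depend on f.
    """
    return cpu_seconds * (f_max_mhz / f_mhz) + stall_seconds


# Hypothetical program: 6 s of computation and 4 s of memory stalls at 2000 MHz.
t_fast = modeled_time(2000, 2000, cpu_seconds=6.0, stall_seconds=4.0)
t_slow = modeled_time(1000, 2000, cpu_seconds=6.0, stall_seconds=4.0)

# Doubling the frequency gives less than a 2x gain in performance ...
gain = t_slow / t_fast

# ... and a larger share of the run is spent stalled at the higher frequency.
stall_share_slow = 4.0 / t_slow
stall_share_fast = 4.0 / t_fast
```

In this model the frequency doubles but the stalled share of the run grows, which is why a slower gear can cost little time for memory-bound programs.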

Introduction: prediction

It is possible to reduce power consumption without a significant increase in execution time

Methodology: Programs

Single node benchmark sets

NASA Advanced Supercomputing (NAS)
Standard Performance Evaluation Corporation (SPEC) integer
SPEC float

Multiple node benchmark sets

NAS MPI
ASCI

Methodology: Cluster

10 nodes connected by a 100 Mbps network,

each has:

AMD Athlon 64 CPU
1 GB main memory
128 KB L1 cache
512 KB L2 cache

Methodology: Gears

Idle and active power for each AMD-64 node

Single node results

Overall results

Panels: NAS, SPEC INT, SPEC FP

Single node results

Best energy-time trade-off in each set with different programs

Single node results

Worst energy-time trade-off using different programs

Multiple node results

Bottlenecks:

Memory
Communication

Multiple node results

Possible shapes of the energy-time graph:

Poor speedup - more energy, less time
Perfect or superlinear speedup
Speedup is good and there are significant memory operations
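The three cases above can be checked with simple energy arithmetic. The sketch below assumes every node draws the same average power and that cluster energy is nodes x time x power per node; all numbers are hypothetical and are not measurements from the paper. The third case is read here as "more nodes at a slower gear", which is one interpretation of the slide.

```python
def cluster_energy(nodes, seconds, watts_per_node):
    """Total energy in joules, assuming equal average power on every node."""
    return nodes * seconds * watts_per_node


# Hypothetical baseline: 1 node, 100 s, 100 W.
base = cluster_energy(1, 100.0, 100.0)

# Poor speedup: 2 nodes finish in 80 s, so less time but more energy.
poor = cluster_energy(2, 80.0, 100.0)

# Perfect speedup: 2 nodes finish in 50 s, so energy is unchanged.
perfect = cluster_energy(2, 50.0, 100.0)

# Good speedup with memory pressure: 2 nodes at a slower gear
# (assumed 85 W each) finish in 55 s, so both time and energy drop.
slow_gear = cluster_energy(2, 55.0, 85.0)

for name, joules in [("base", base), ("poor", poor),
                     ("perfect", perfect), ("slow gear", slow_gear)]:
    print(f"{name}: {joules:.0f} J")
```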

Multiple node results

Poor speedup (more energy, less time):

Multiple node results

Perfect speedup (no energy-time trade-off):

Multiple node results

Good speedup:

Metrics: CPU criticality

Tmax - execution time at the maximum frequency

T - execution time at the current frequency

fmax - maximum frequency

f - current frequency

If beta (the CPU criticality) == 0, the execution time does not depend on CPU frequency

(T == Tmax)
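The formula that relates these symbols is not reproduced on this slide. A definition that is consistent with the symbols and with the beta == 0 case above is T / Tmax = beta * (fmax / f - 1) + 1; treat it as an assumption rather than a quotation from the paper. Under that assumption, beta can be computed from two timed runs as sketched below (the function name and the timings are hypothetical).

```python
def cpu_criticality(t, t_max, f, f_max):
    """Solve T / Tmax = beta * (fmax / f - 1) + 1 for beta (assumed definition).

    t, t_max: execution times at the current and the maximum frequency.
    f, f_max: the current and the maximum frequency (same unit).
    """
    if f >= f_max:
        raise ValueError("f must be lower than f_max to estimate beta")
    return (t / t_max - 1.0) / (f_max / f - 1.0)


# Time doubles when the frequency is halved: fully CPU bound, beta == 1.
beta_cpu_bound = cpu_criticality(t=20.0, t_max=10.0, f=1000, f_max=2000)

# Time does not change at all: independent of the CPU, beta == 0.
beta_independent = cpu_criticality(t=10.0, t_max=10.0, f=1000, f_max=2000)
```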

Metrics: MPO

MPO - Misses Per Operation
Measures memory pressure
MPO = L2 cache misses (which are memory accesses) / number of operations retired
Stays constant when the frequency changes
If MPO decreases, the application becomes more CPU bound
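As a minimal sketch, MPO is a single division over two hardware counter readings. How the counters are read is outside this example; the function name and the counter values are hypothetical.

```python
def misses_per_operation(l2_misses, ops_retired):
    """MPO: L2 cache misses divided by operations retired."""
    if ops_retired <= 0:
        raise ValueError("ops_retired must be positive")
    return l2_misses / ops_retired


# Hypothetical counter readings for two programs over the same interval.
mpo_memory_heavy = misses_per_operation(l2_misses=5_000_000,
                                        ops_retired=1_000_000_000)
mpo_cpu_heavy = misses_per_operation(l2_misses=50_000,
                                     ops_retired=1_000_000_000)

# The lower MPO indicates the more CPU-bound program.
more_cpu_bound = "cpu_heavy" if mpo_cpu_heavy < mpo_memory_heavy else "memory_heavy"
```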

Metrics: slack

Predicts communication bottleneck
Slack - ratio of the total time a node is blocked in MPI calls to the total execution time
Greater slack suggests shifting to a slower gear
Collected on the fly
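The slack ratio defined above can be sketched as follows. The example assumes the time blocked in each MPI call has already been recorded, for instance by a wrapper around the calls; the recording itself is not shown, and the function name and durations are hypothetical.

```python
def slack(blocked_seconds, total_seconds):
    """Ratio of total time blocked in MPI calls to total execution time."""
    if total_seconds <= 0:
        raise ValueError("total_seconds must be positive")
    return sum(blocked_seconds) / total_seconds


# Hypothetical node: three blocking MPI calls during a 20 s run.
node_slack = slack(blocked_seconds=[0.5, 1.5, 2.0], total_seconds=20.0)
print(f"slack = {node_slack:.2f}")
```

A node that never blocks has a slack of 0; a node that spends the whole run waiting has a slack of 1.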

Conclusion

In the best case on one node, reducing the CPU speed makes it possible to use 20 percent less energy while increasing time by only 3 percent.
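Because energy is average power multiplied by time, the two percentages above also imply how far average power falls. The short calculation below uses only the figures quoted in this slide; the variable names are illustrative.

```python
energy_ratio = 0.80  # 20 percent less energy
time_ratio = 1.03    # 3 percent more time

# Energy = average power x time, so the power ratio is energy / time.
power_ratio = energy_ratio / time_ratio
power_drop_percent = (1.0 - power_ratio) * 100.0

print(f"average power drops by about {power_drop_percent:.1f} percent")
```

The implied drop in average power is roughly 22 percent, slightly more than the energy saving, because the run takes a little longer.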

On the other hand, a program that is largely CPU bound should be run at the fastest gear possible.

MPO and slack can be used to choose an appropriate gear.

Your Questions