Characterizing Performance and Power Efficiency on CloudLab

Dmitry Duplyakin

University of Colorado at Boulder

 

dmitry.duplyakin@colorado.edu

Supercomputing 2015, 11/18/2015

About CloudLab

Dmitry Duplyakin, University of Colorado

Characterizing Performance and Power Efficiency on CloudLab

11/18/2015

  • Mission: cloud testbed with complete control, visibility and scientific fidelity
  • Available now: 2520 ARM cores, 2160 x86 cores
  • Highly heterogeneous hardware
  • More on hardware: https://www.cloudlab.us/hardware.php

Outline

  • Background
  • Motivation for using configuration management
  • Chef as a powerful building block
  • Power analysis
  • Performance analysis
  • Useful topologies
  • Summary

 Performance and Power Analysis: Questions?

  • How to provide consistency, transparency, repeatability?

  • What are the right building blocks?

  • Platform-wide, experiment-wide, node-wide?

  • Useful topologies and recommended user practices?

    • start with a provided profile or extend an existing profile? 

Performance Analysis: Challenges

  • Performance analysis is not trivial
    • Well-known in the Supercomputing community
  • Often done in an ad hoc manner
  • Single-node and multi-node techniques are different
  • CloudLab: 3 different platforms
    • Ivy Bridge, Haswell, ARMv8

 Power Analysis: Challenges

  • Extremely platform-specific

  • Data collection is prone to failures and offers no debugging support

  • Raw data: missing data, noise, unknown granularity

  • Need mechanisms for validation

  • From power to energy: need appropriate numerical integration

  • Need different experiment-wide and platform-wide analysis tools
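
The step from power to energy mentioned above can be sketched with the trapezoidal rule over timestamped samples. This is a minimal illustration, not the tooling used in the talk: the function name `energy_joules`, the `max_gap_s` parameter, and the choice to skip intervals that span missing data are all assumptions made for the example.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float],
                  max_gap_s: float = float("inf")) -> float:
    """Integrate power (W) over time (s) with the trapezoidal rule.

    Intervals longer than max_gap_s are treated as missing data and
    skipped (an assumption of this sketch), so they add no energy.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        if dt > max_gap_s:
            continue  # gap in the trace: do not interpolate across it
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical trace: constant 40 W for 10 s, sampled unevenly.
print(energy_joules([0.0, 2.0, 5.0, 10.0], [40.0, 40.0, 40.0, 40.0]))  # 400.0
```

Working on timestamps rather than a fixed sampling period keeps the estimate meaningful when the granularity of the raw data is unknown or uneven.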

Performance and Power in the Context of CloudLab

  • Initial experience: new profile for a new cause
    • Rapid development, no conflicts  
  • Scalability problems:
    • duplication of work
    • rapid growth of the number of profiles and images 
  • Can’t "merge" images
  • Can't experiment with different combinations of tools

Key Proposals

  • Invest time and effort in code (not images)
  • Avoid ad hoc scripting
  • Use a configuration management system as a tool for enabling consistency, transparency, repeatability
  • Leverage previous experience with Chef and evaluate it on CloudLab
  • Employ the latest features of Chef: e.g., reporting and push jobs  

Benefits of Using Chef

  • Nice structure: roles, cookbooks, recipes, nodes, etc.
  • Idempotence
  • All artifacts are code: reusable and extendable
  • Over 2,000 community-developed cookbooks at Supermarket
  • Vision: single codebase for all sites within CloudLab

Configuration Management with Chef: Architecture

Chef Terminology

  • Client -- node that is being configured
  • Server -- node that performs configuration
  • Workstation (knife) -- management utility
    • Can run on the server; multiple workstations are allowed     

 

  • Recipe = script
  • Cookbook -- collection of recipes
  • Role -- collection of cookbooks, typically assigned to clients
  • Environment -- cluster, aggregation, region, etc.
  • Attributes -- variables set for specific nodes, environments, etc.

Chef: Closer Look

Demo:

Chef: Push Jobs

  • Push job wrapped in a role (get_power.rb):

name "get_power"
description "Role applied to nodes that need to download power data"
override_attributes(
  "push_jobs" => {
    "whitelist" => {
      # Whitelisted job name => shell command run on the node
      "get_power" => "cd /tmp ; git clone https://github.com/dmdu/power-client.git ;\
                      /bin/bash -x power-client/power-client.sh -s clemson -l 12h"
    }
  }
)
run_list [ "push-jobs" ]

  • Submit role:

# knife role from file get_power.rb

  • Assign to a node:

# knife node run_list add head "role[get_power]"

  • Run the job:

# knife job start get_power head
Started.  Job ID: 3f0ae42b88ea60365f7d07c64e30ff54
Running (1/1 in progress) ...
Running (0/1 in progress) ...

Big Picture with Chef

  • From individual software tools to toolchains coordinated by Chef
  • Tradeoff: additional complexity, dependence on Chef
  • Argument 1: long-term investment
  • Argument 2: Chef takes stability seriously 
    • Example: the procedure for installing Chef is itself a Chef cookbook, an artifact that is subject to rigorous testing
  • Argument 3: Chef Zero allows running cookbooks with no additional overhead, without a server/workstation:
    • chef-client -z -o <name of the cookbook>

Evolution of Chef on CloudLab

  • Prerequisite: experiment-wide keys
  • Created a profile with Chef 11 
  • Upgraded to Chef 12 and enabled push jobs and reporting
  • Scripted profile creation and added parameters
  • Added support for ARM 64-bit
  • Future: make integration with existing profiles transparent 

Power Analysis

| Site, Hardware | CPU | Power, Frequency |
|---|---|---|
| Wisconsin, Cisco UCS C220 M4 | Two Intel E5-2630 v3 8-core CPUs at 2.40 GHz (Haswell w/ EM64T) | TDP: 85 W; Turbo: 3.2 GHz |
| Clemson, Dell PowerEdge C8220 | Two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge) | TDP: 95 W; Turbo: 3 GHz |
| Utah, HP ProLiant m400 | One ARMv8 64-bit (Atlas/A57) 8-core CPU at 2.4 GHz (APM X-GENE) | TDP: ?; No frequency scaling |

  • Developed supporting cookbooks:
    • emulab-R, emulab-shiny, emulab-powervis 

Sources of Power Data

  • On-node power (available on ARM at Utah only):

IOUT_hex = ipmi-raw --no-probing --driver-type=SSIF \
--driver-address=0x10 --driver-device=/dev/i2c-0 \
0 6 0x52 0x05 0x40 0x02 0x8C

VIN_hex = ipmi-raw --no-probing --driver-type=SSIF \
--driver-address=0x10 --driver-device=/dev/i2c-0 \
0 6 0x52 0x05 0x40 0x02 0x88

IOUT = int(IOUT_hex,16) * 0.01239 - 25.3717
VIN = int(VIN_hex,16) * 0.005208

POWER = VIN*IOUT

  • Chassis Manager (CM) power data (available for all 3 sites):
    • Collectors and database managed by Emmanuel Cecchet
    • power-client.sh -s clemson -l 12h
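
The conversion from raw register readings to watts can be sketched in Python. This assumes the hexadecimal values have already been extracted from the `ipmi-raw` output as plain strings. The function name `on_node_power_w` and the sample inputs are hypothetical; only the scaling constants come from the formulas on this slide.

```python
def on_node_power_w(iout_hex: str, vin_hex: str) -> float:
    """Convert raw hex readings to power using the slide's scaling constants."""
    iout = int(iout_hex, 16) * 0.01239 - 25.3717  # scaled current reading
    vin = int(vin_hex, 16) * 0.005208             # scaled voltage reading
    return vin * iout


# Hypothetical raw values, chosen only to exercise the arithmetic.
print(round(on_node_power_w("0900", "0930"), 2))  # 38.89
```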

Power Analysis on ARM - Workload 1

Workload 1 - Gradual Load

Benchmark: HPGMG-FE

Scaling: idle to 8 cores incrementally

Blue: on-node power

Green: CM power

Observations:

  • Time lag
  • Consistent estimates of the total energy
  • >10x difference between sampling rates

Power Analysis on ARM - Workload 2

Workload 2 - Abrupt Load

Benchmark: HPGMG-FE

Scaling: idle to 8 cores

Blue: on-node power

Green: CM power

Observations:

  • The time lag can be comparable to periods of high load  
  • At small time scales, the difference between energy estimates is noticeable

Sampling On-Node Power Draw at Different Rates 

Figure panels: Workload 1, Workload 2

  • Sampling at high rates: estimates of total energy are consistent
  • Sampling at low rates: estimates blow up
  • Can find this threshold programmatically
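
One way to look for such a threshold programmatically is to subsample the on-node trace at increasing strides and compare each energy estimate with the full-rate one. The sketch below is an illustration under stated assumptions, not the method used in the talk: the function names, stride-based subsampling, the 2% tolerance, and the synthetic trace are all hypothetical.

```python
def trapezoid_energy(times_s, power_w):
    """Trapezoidal-rule energy (J) from timestamped power samples (W)."""
    return sum(0.5 * (power_w[i] + power_w[i - 1]) * (times_s[i] - times_s[i - 1])
               for i in range(1, len(times_s)))


def max_safe_stride(times_s, power_w, tolerance=0.02, max_stride=64):
    """Largest stride such that it and every smaller stride stay within
    `tolerance` relative error of the full-rate energy estimate."""
    reference = trapezoid_energy(times_s, power_w)
    safe = 1
    for stride in range(2, max_stride + 1):
        idx = list(range(0, len(times_s), stride))
        if idx[-1] != len(times_s) - 1:
            idx.append(len(times_s) - 1)  # always keep the end of the trace
        estimate = trapezoid_energy([times_s[i] for i in idx],
                                    [power_w[i] for i in idx])
        if abs(estimate - reference) > tolerance * abs(reference):
            break  # estimates start to diverge at this stride
        safe = stride
    return safe


# Synthetic trace with a one-sample burst: any subsampling distorts the estimate.
times = list(range(9))
power = [0.0, 0.0, 0.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0]
print(max_safe_stride(times, power))  # prints 1
```

Multiplying the returned stride by the native sampling period gives the coarsest sampling interval that still meets the chosen tolerance for that trace.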

Comparing Different Energy Estimates

  • Mean and variance of the difference between energy estimates decrease as the interval grows
  • "Elbow" is at ~1000 s; above that, the difference is ~2%

Figure panels: Workload 1, Workload 2

Interactive Analysis of Power Data

powervis

https://github.com/emulab/shiny-server

  • Interactive dashboard for analysis and visualization of experiment power data
  • Uses R and Shiny (by RStudio)
  • Demo

Power: Summary

  • Gained confidence with using power samples
  • Investigated granularity and accuracy of total energy estimates
  • Can experiment with energy optimization
  • Can make reasonable predictions

| Single Node Power Draw | Min, W | Mean, W | Max, W | sigma^2 |
|---|---|---|---|---|
| Wisconsin, Idle | 104.0 | 107.1 | 106.0 | 73.2 |
| Wisconsin, 32 Threads BLIS | 240.0 | 344.4 | 384.0 | 1089.24 |
| Clemson, Idle | 52.0 | 72.8 | 106.0 | 107.1 |
| Clemson, 20 Threads BLIS | 254.0 | 257.5 | 262.0 | 2.5 |
| Utah, Idle | 38 | 38.5 | 39.0 | 0.25 |
| Utah, 8 Threads BLIS | 64 | 88.6 | 97 | 84.15 |
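
Summary statistics of this kind can be computed from a list of power samples as sketched below. The function name `power_summary` and the two sample values are hypothetical, and the use of population variance is an assumption: the slides do not say which variance estimator was used.

```python
import statistics


def power_summary(samples_w):
    """Return (min, mean, max, variance) of power samples in watts."""
    return (min(samples_w), statistics.fmean(samples_w), max(samples_w),
            statistics.pvariance(samples_w))  # population variance (assumed)


# Hypothetical samples, not measured data.
print(power_summary([38.0, 39.0]))  # (38.0, 38.5, 39.0, 0.25)
```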

Performance Analysis

Performance of CloudLab Resources

09/30/2015

| Site, Hardware | CPU | Clock Rate |
|---|---|---|
| Wisconsin, Cisco UCS C220 M4 | Two Intel E5-2630 v3 8-core CPUs at 2.40 GHz (Haswell w/ EM64T) | Normal: 2.4 GHz; Turbo: 3.2 GHz; AVX Normal: 2.1 GHz; AVX Turbo: 3.2 GHz |
| Clemson, Dell PowerEdge C8220 | Two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge) | Normal: 2.2 GHz; Turbo: 3.0 GHz |
| Utah, HP ProLiant m400 | One ARMv8 64-bit (Atlas/A57) 8-core CPU at 2.4 GHz (APM X-GENE) | 2.4 GHz; No frequency scaling |
  • Developed supporting cookbooks:
    • emulab-gcc, emulab-powervis, emulab-blis 

BLIS DGEMM: Single Core

Plot annotations: 41.70 GF, 16.23 GF, 3.17 GF

Single-core theoretical peak in DP (Wisconsin, Haswell):

  • 3.2 GHz (AVX, Turbo)
  • 2 FMAs per cycle
  • 2 flops in FMA
  • 4 doubles in vector units

Total: 51.2 GF

Single-core theoretical peak in DP (Clemson, Ivy Bridge):

  • 3.0 GHz (Turbo)
  • 1 FMA per cycle
  • 2 flops in FMA
  • 4 doubles in vector units

Total: 24.0 GF

Single-core theoretical peak in DP (Utah, ARMv8):

  • 2.4 GHz (Normal)
  • 2 cycles per FMA
  • 2 flops in FMA
  • 2 doubles in vector units

Total: 4.8 GF
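
Each total above is the product of the listed factors. The arithmetic can be checked with a short sketch; the function name `peak_gflops` and its parameter names are hypothetical, and "2 cycles per FMA" is expressed as 0.5 FMA per cycle.

```python
def peak_gflops(ghz: float, fma_per_cycle: float, simd_doubles: int,
                cores: int = 1, flops_per_fma: int = 2) -> float:
    """Theoretical double-precision peak in GF (GFLOP/s)."""
    return ghz * fma_per_cycle * flops_per_fma * simd_doubles * cores


print(round(peak_gflops(3.2, 2, 4), 1))    # 51.2 (3.2 GHz, AVX Turbo)
print(round(peak_gflops(3.0, 1, 4), 1))    # 24.0 (3.0 GHz, Turbo)
print(round(peak_gflops(2.4, 0.5, 2), 1))  # 4.8  (2.4 GHz, 2 cycles per FMA)
```

Passing `cores` and the non-turbo clock gives the per-CPU and per-node peaks, for example `peak_gflops(2.1, 2, 4, cores=16)` for a node with two 8-core CPUs at 2.1 GHz.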

BLIS DGEMM: CPU and Node

| Site, Hardware | Theoretical Peak | BLIS DGEMM Performance | BLIS DGEMM Energy Efficiency |
|---|---|---|---|
| Wisconsin, Cisco UCS C220 M4; 2 Intel E5-2630 v3 8-core Haswell CPUs at 2.40 GHz | 8 cores; 2.1 GHz (AVX, Normal); 2 FMAs per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 268.8 GF; Total node (2 CPUs): 537.6 GF | 32 threads: 466.5 GF (87% of peak) | 466.5 GF / 344.4 W; 1.34 GF/W |
| Clemson, Dell PowerEdge C8220; 2 Intel E5-2660 v2 10-core Ivy Bridge CPUs at 2.20 GHz | 10 cores; 2.2 GHz (Normal); 1 FMA per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 176.0 GF; Total node (2 CPUs): 352.0 GF | 20 threads: 313.4 GF (89% of peak) | 313.4 GF / 275.5 W; 1.14 GF/W |
| Utah, HP ProLiant m400; 1 ARMv8 64-bit (Atlas/A57) 8-core APM X-GENE CPU at 2.4 GHz | 8 cores; 2.4 GHz (Normal); 2 cycles per FMA; 2 flops in FMA; 2 doubles in vector units; Total CPU/node: 38.4 GF | 8 threads: 22.6 GF (58% of peak) | 22.6 GF / 88.6 W; 0.26 GF/W |
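
The energy-efficiency column is sustained performance divided by mean node power, and the percentage is sustained performance over theoretical peak. A minimal sketch of both ratios follows; the function names are hypothetical, and the inputs are the Utah and Clemson figures quoted in this table.

```python
def gflops_per_watt(gflops: float, mean_power_w: float) -> float:
    """Energy efficiency as sustained GF divided by mean node power (W)."""
    if mean_power_w <= 0:
        raise ValueError("mean power must be positive")
    return gflops / mean_power_w


def percent_of_peak(gflops: float, peak_gflops: float) -> float:
    """Sustained performance as a percentage of theoretical peak."""
    return 100.0 * gflops / peak_gflops


print(round(gflops_per_watt(22.6, 88.6), 2))  # 0.26 (Utah, 8 threads)
print(round(percent_of_peak(313.4, 352.0)))   # 89   (Clemson, 20 threads)
```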
 

HPGMG-FE

| Site, Hardware | Theoretical Peak | HPGMG-FE | HPGMG-FE Energy Efficiency |
|---|---|---|---|
| Wisconsin, Cisco UCS C220 M4; 2 Intel E5-2630 v3 8-core Haswell CPUs at 2.40 GHz | 8 cores; 2.1 GHz (AVX, Normal); 2 FMAs per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 268.8 GF; Total node (2 CPUs): 537.6 GF | 32 threads: 93.57 GF (17.4% of peak, 20.1% of DGEMM) | 93.57 GF / 302.2 W; 0.31 GF/W |
| Clemson, Dell PowerEdge C8220; 2 Intel E5-2660 v2 10-core Ivy Bridge CPUs at 2.20 GHz | 10 cores; 2.2 GHz (Normal); 1 FMA per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 176.0 GF; Total node (2 CPUs): 352.0 GF | Estimated at: 20 threads: 73 GF (20.1% of peak, 23.3% of DGEMM) | Estimated at: 73 GF / 217 W; 0.34 GF/W |
| Utah, HP ProLiant m400; 1 ARMv8 64-bit (Atlas/A57) 8-core APM X-GENE CPU at 2.4 GHz | 8 cores; 2.4 GHz (Normal); 2 cycles per FMA; 2 flops in FMA; 2 doubles in vector units; Total CPU/node: 38.4 GF | 8 threads: 10.1 GF (26% of peak, 44.7% of DGEMM) | 10.1 GF / 70 W; 0.14 GF/W |
 

Measuring Performance of BLIS DGEMM on ARMv8

  • Can use PAPI hardware counters on ARMv8 to accurately measure application performance without code instrumentation

Running HPGMG-FE at Different Sites

  • Consistent performance at each site

Performance: Summary

  • Evaluated single-core, CPU, and node performance at every site
    • Theoretical peak, BLIS DGEMM, and HPGMG-FE
  • Estimated power efficiency (GF/W)
  • Used PAPI to estimate performance of HPGMG-FE on ARMv8
  • Proposal 1: use Periscope Tuning Framework (PTF) for compiler flag optimization
  • Proposal 2: use Extra-P for performance modeling of real applications

Topology: Deployed Chef Client on ARM

  • The site stitching feature of CloudLab allows building profiles in which a Chef Server on x86 manages Chef Clients on ARM

Topology: Desired Configuration for Benchmarking

  • head: runs Chef Server, SLURM, and powervis
  • node-X: Chef Clients and SLURM workers
  • Need 3 different work queues in SLURM
  • Profile parameters will allow scaling of each worker pool 

Summary

  • Presented initial analysis of performance and energy efficiency
  • Argued for doing configuration management with Chef
  • Demoed the developed tools and techniques
  • Outlined future development with performance modeling, power analysis and automation with Chef

Thank you!

Questions?

 

dmitry.duplyakin@colorado.edu
