Characterizing Performance and Power Efficiency on CloudLab

Dmitry Duplyakin

University of Colorado at Boulder

dmitry.duplyakin@colorado.edu

Supercomputing 2015, 11/18/2015

About CloudLab

Mission: cloud testbed with complete control, visibility and scientific fidelity
Available now: 2520 ARM cores, 2160 x86 cores
Highly heterogeneous hardware
More on hardware: https://www.cloudlab.us/hardware.php

Outline

Background
Motivation for using configuration management
Chef as a powerful building block
Power analysis
Performance analysis
Useful topologies
Summary

Performance and Power Analysis: Questions?

How to provide consistency, transparency, repeatability?
What are the right building blocks?
Platform-wide, experiment-wide, node-wide?
Useful topologies and recommended user practices?
- start with a provided profile or extend an existing profile?

Performance Analysis: Challenges

Performance analysis is not trivial
- Well-known in the Supercomputing community
Often done in an ad hoc manner
Single-node and multi-node techniques are different
CloudLab: 3 different platforms
- Ivy Bridge, Haswell, ARMv8

Power Analysis: Challenges

Extremely platform-specific
Data collection is prone to failures and offers no debugging support
Raw data: missing data, noise, unknown granularity
Need mechanisms for validation
From power to energy: need appropriate numerical integration
Need different experiment-wide and platform-wide analysis tools
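
The step from power samples to energy can be sketched with the trapezoidal rule, which tolerates uneven spacing between samples. This is a minimal illustration under assumptions, not the integration method used in the talk; the function name `energy_joules` and the sample values are hypothetical.

```python
from typing import Sequence


def energy_joules(times_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate power over time with the trapezoidal rule.

    Samples may be unevenly spaced; a missing reading simply widens
    the trapezoid between its two neighbors.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times and power must have the same length")
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical samples: the reading at t = 2 s is missing.
times = [0.0, 1.0, 3.0, 4.0]
watts = [40.0, 40.0, 80.0, 80.0]
print(energy_joules(times, watts))
```

Raising an error on non-increasing timestamps is one simple form of validation; noisy or duplicated samples would need additional filtering before integration.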

Performance and Power in the Context of CloudLab

Initial experience: a new profile for each new purpose
- Rapid development, no conflicts
Scalability problems:
- duplication of work
- rapid growth of the number of profiles and images
Can't "merge" images
Can't experiment with different combinations of tools

Key Proposals

Invest time and effort in code (not images)
Avoid ad hoc scripting
Use a configuration management system as a tool for enabling consistency, transparency, repeatability
Leverage previous experience with Chef and evaluate it on CloudLab
Employ the latest features of Chef: e.g., reporting and push jobs

Benefits of Using Chef

Nice structure: roles, cookbooks, recipes, nodes, etc.
Idempotence
All artifacts are code: reusable and extendable
Over 2,000 community-developed cookbooks at Supermarket
Vision: single codebase for all sites within CloudLab

Configuration Management with Chef: Architecture

Chef Terminology

Client -- node that is being configured
Server -- node that performs configuration
Workstation (knife) -- management utility
- Can run on the server; multiple workstations are allowed

Recipe = script
Cookbook -- collection of recipes
Role -- collection of cookbooks, typically assigned to clients
Environment -- cluster, aggregation, region, etc.
Attributes -- variables set for specific nodes, environments, etc.

Chef: Closer Look

Demo:

Emulab chef-repo
emulab-nfs cookbook and related roles
Reporting and WUI

Chef: Push Jobs

Push job wrapped in a role:

name "get_power"
description "Role applied to nodes that need to download power data"
override_attributes(
  "push_jobs" => {
    "whitelist" => {
      "get_power" => "cd /tmp ; git clone https://github.com/dmdu/power-client.git ;\
                      /bin/bash -x power-client/power-client.sh -s clemson -l 12h"
    }
  }
)
run_list [ "push-jobs" ]

Submit role:

# knife role from file get_power.rb

Assign to a node:

# knife node run_list add head "role[get_power]"

Run the job:

# knife job start get_power head

Started. Job ID: 3f0ae42b88ea60365f7d07c64e30ff54

Running (1/1 in progress) ...

Running (0/1 in progress) ...

Big Picture with Chef

From individual software tools to toolchains coordinated by Chef
Tradeoff: additional complexity, dependence on Chef
Argument 1: long-term investment
Argument 2: Chef takes stability seriously
- Example: the procedure for installing Chef is itself a Chef cookbook, an artifact that is subject to rigorous testing
Argument 3: Chef Zero allows running cookbooks with no additional overhead, without a server/workstation:

```
chef-client -z -o <name of the cookbook>
```

Evolution of Chef on CloudLab

Prerequisite: experiment-wide keys
Created a profile with Chef 11
Upgraded to Chef 12 and enabled push jobs and reporting
Scripted profile creation and added parameters
Added support for ARM 64-bit
Future: make integration with existing profiles transparent

Power Analysis

Site, Hardware	CPU	Power, Frequency
Wisconsin, Cisco UCS C220 M4	Two Intel E5-2630 v3 8-core CPUs at 2.40 GHz (Haswell w/ EM64T)	TDP: 85 W; Turbo: 3.2 GHz
Clemson, Dell PowerEdge C8220	Two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge)	TDP: 95 W; Turbo: 3.0 GHz
Utah, HP ProLiant m400	One ARMv8 64-bit (Atlas/A57) 8-core CPU at 2.4 GHz (APM X-GENE)	TDP: unknown; no frequency scaling

Developed supporting cookbooks:
- emulab-R, emulab-shiny, emulab-powervis

Sources of Power Data

On-node power (available on ARM at Utah only):

IOUT_hex = ipmi-raw --no-probing --driver-type=SSIF \
--driver-address=0x10 --driver-device=/dev/i2c-0 \
0 6 0x52 0x05 0x40 0x02 0x8C

VIN_hex = ipmi-raw --no-probing --driver-type=SSIF \
--driver-address=0x10 --driver-device=/dev/i2c-0 \
0 6 0x52 0x05 0x40 0x02 0x88

IOUT = int(IOUT_hex,16) * 0.01239 - 25.3717
VIN = int(VIN_hex,16) * 0.005208

POWER = VIN*IOUT

Chassis Manager (CM) power data (available for all 3 sites):
- Collectors and database managed by Emmanuel Cecchet

```
power-client.sh -s clemson -l 12h
```
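
The on-node conversion from raw readings to power shown above can be written as a small Python helper. This is a sketch under assumptions: it presumes each reading has already been reduced to a single hexadecimal string (how to extract that string from the ipmi-raw output is not shown in the slides), and the function names and sample readings are hypothetical. Only the scale factors and the offset come from the slide.

```python
def decode_iout(raw_hex: str) -> float:
    """Scale a raw IOUT reading using the factor and offset from the slide."""
    return int(raw_hex, 16) * 0.01239 - 25.3717


def decode_vin(raw_hex: str) -> float:
    """Scale a raw VIN reading using the factor from the slide."""
    return int(raw_hex, 16) * 0.005208


def on_node_power(iout_hex: str, vin_hex: str) -> float:
    """POWER = VIN * IOUT, as on the slide."""
    return decode_vin(vin_hex) * decode_iout(iout_hex)


# Hypothetical raw readings, chosen only to exercise the arithmetic.
print(on_node_power("0A00", "0900"))
```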

Power Analysis on ARM - Workload 1

Workload 1 - Gradual Load

Benchmark: HPGMG-FE

Scaling: idle to 8 cores incrementally

Blue: on-node power

Green: CM power

Observations:

Time lag
Consistent estimates of the total energy
>10x difference between sampling rates

Power Analysis on ARM - Workload 2

Workload 2 - Abrupt Load

Benchmark: HPGMG-FE

Scaling: idle to 8 cores

Blue: on-node power

Green: CM power

Observations:

The time lag can be comparable to periods of high load
At small time scales, the difference between energy estimates is noticeable

Sampling On-Node Power Draw at Different Rates

Plots: Workload 1 and Workload 2

Sampling at high rates: estimates of total energy are consistent
Sampling at low rates: estimates blow up
Can find this threshold programmatically
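
One way to look for that threshold programmatically is sketched below, under stated assumptions. It integrates the full-rate trace as a reference, re-integrates subsampled copies, and reports the largest stride whose energy estimate stays within a tolerance. The function names, the 2% tolerance, and the synthetic trace are all hypothetical; this is not the analysis code behind the slides.

```python
def energy_joules(times_s, power_w):
    """Trapezoidal integration of power over (possibly uneven) timestamps."""
    return sum(
        0.5 * (power_w[i] + power_w[i - 1]) * (times_s[i] - times_s[i - 1])
        for i in range(1, len(times_s))
    )


def subsample(times_s, power_w, stride):
    """Keep every stride-th sample, always retaining the final one
    so that every estimate covers the same time span."""
    idx = list(range(0, len(times_s), stride))
    if idx[-1] != len(times_s) - 1:
        idx.append(len(times_s) - 1)
    return [times_s[i] for i in idx], [power_w[i] for i in idx]


def max_consistent_stride(times_s, power_w, strides, tolerance=0.02):
    """Largest stride, scanning upward, before the relative error
    against the full-rate estimate first exceeds the tolerance."""
    reference = energy_joules(times_s, power_w)
    if reference == 0:
        raise ValueError("reference energy is zero")
    best = 1
    for stride in sorted(strides):
        t, p = subsample(times_s, power_w, stride)
        if abs(energy_joules(t, p) - reference) / reference > tolerance:
            break
        best = stride
    return best


# Synthetic 1 Hz trace: 30 s idle at 40 W alternating with 30 s load at 90 W.
times = [float(t) for t in range(600)]
watts = [90.0 if (t // 30) % 2 == 1 else 40.0 for t in range(600)]
print(max_consistent_stride(times, watts, [1, 2, 5, 10, 30, 60]))
```

Stopping at the first stride that breaks the tolerance is a deliberate simplification; a periodic workload can alias with the sampling interval, so a real analysis may want to examine every candidate rate.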

Comparing Different Energy Estimates

Mean and variance of the difference between energy estimates decrease as the interval grows
"Elbow" is at ~1000 s; above that, the difference is ~2%

Plots: Workload 1 and Workload 2

Interactive Analysis of Power Data

powervis

https://github.com/emulab/shiny-server

Interactive dashboard for analysis and visualization of experiment power data
Uses R and Shiny (by RStudio)
Demo

Power: Summary

Gained confidence in using power samples
Investigated granularity and accuracy of total energy estimates
Can experiment with energy optimization
- For instance, using Periscope Tuning Framework (PTF)
Can make reasonable predictions

Single Node Power Draw	Min, W	Mean, W	Max, W	sigma^2
Wisconsin, Idle	104.0	107.1	106.0	73.2
Wisconsin, 32 Threads BLIS	240.0	344.4	384.0	1089.24
Clemson, Idle	52.0	72.8	106.0	107.1
Clemson, 20 Threads BLIS	254.0	257.5	262.0	2.5
Utah, Idle	38	38.5	39.0	0.25
Utah, 8 Threads BLIS	64	88.6	97	84.15

Performance Analysis

Performance of CloudLab Resources

09/30/2015

Site, Hardware	CPU	Clock Rate
Wisconsin, Cisco UCS C220 M4	Two Intel E5-2630 v3 8-core CPUs at 2.40 GHz (Haswell w/ EM64T)	Normal: 2.4 GHz; Turbo: 3.2 GHz; AVX Normal: 2.1 GHz; AVX Turbo: 3.2 GHz
Clemson, Dell PowerEdge C8220	Two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge)	Normal: 2.2 GHz; Turbo: 3.0 GHz
Utah, HP ProLiant m400	One ARMv8 64-bit (Atlas/A57) 8-core CPU at 2.4 GHz (APM X-GENE)	2.4 GHz; no frequency scaling

Developed supporting cookbooks:
- emulab-gcc, emulab-powervis, emulab-blis

BLIS DGEMM: Single Core

Measured single-core BLIS DGEMM performance (plot labels): 41.70 GF, 16.23 GF, 3.17 GF

Single-core theoretical peak in DP, Wisconsin (Haswell):

3.2 GHz (AVX, Turbo)
2 FMAs per cycle
2 flops in FMA
4 doubles in vector units

Total: 51.2 GF

Single-core theoretical peak in DP, Clemson (Ivy Bridge):

3.0 GHz (Turbo)
1 FMA per cycle
2 flops in FMA
4 doubles in vector units

Total: 24.0 GF

Single-core theoretical peak in DP, Utah (ARMv8):

2.4 GHz (Normal)
2 cycles per FMA
2 flops in FMA
2 doubles in vector units

Total: 4.8 GF
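
The three totals above are products of clock rate, FMAs per cycle, flops per FMA, and vector width in doubles. The sketch below reproduces that arithmetic; the function and parameter names are illustrative and not from the talk, and "2 cycles per FMA" is expressed as 0.5 FMAs per cycle.

```python
def peak_gflops(ghz, fmas_per_cycle, flops_per_fma, vector_doubles,
                cores=1, cpus=1):
    """Theoretical double-precision peak in GF for the given configuration."""
    return ghz * fmas_per_cycle * flops_per_fma * vector_doubles * cores * cpus


# Single-core peaks, using the figures listed on the slide.
single_core = {
    "3.2 GHz, 2 FMAs/cycle, 4 doubles": peak_gflops(3.2, 2, 2, 4),
    "3.0 GHz, 1 FMA/cycle, 4 doubles": peak_gflops(3.0, 1, 2, 4),
    "2.4 GHz, 0.5 FMA/cycle, 2 doubles": peak_gflops(2.4, 0.5, 2, 2),
}

for label, gf in single_core.items():
    print(f"{label}: {gf:.1f} GF")
```

The same function covers whole CPUs and nodes by passing `cores` and `cpus`, for example `peak_gflops(2.1, 2, 2, 4, cores=8, cpus=2)` for a two-socket node at the AVX base clock.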

BLIS DGEMM: CPU and Node

Site, Hardware	Theoretical Peak	BLIS DGEMM Performance	BLIS DGEMM Energy Efficiency
Wisconsin, Cisco UCS C220 M4, 2 Intel E5-2630 v3 8-core Haswell CPUs at 2.40 GHz	8 cores; 2.1 GHz (AVX, Normal); 2 FMAs per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 268.8 GF; Total node (2 CPUs): 537.6 GF	32 threads: 466.5 GF (87% of peak)	466.5 GF / 344.4 W (1.34 GF/W)
Clemson, Dell PowerEdge C8220, 2 Intel E5-2660 v2 10-core Ivy Bridge CPUs at 2.20 GHz	10 cores; 2.2 GHz (Normal); 1 FMA per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 176.0 GF; Total node (2 CPUs): 352.0 GF	20 threads: 313.4 GF (89% of peak)	313.4 GF / 275.5 W (1.14 GF/W)
Utah, HP ProLiant m400, 1 ARMv8 64-bit (Atlas/A57) 8-core APM X-GENE CPU at 2.4 GHz	8 cores; 2.4 GHz (Normal); 2 cycles per FMA; 2 flops in FMA; 2 doubles in vector units; Total CPU/node: 38.4 GF	8 threads: 22.6 GF (58% of peak)	22.6 GF / 88.6 W (0.26 GF/W)

HPGMG-FE

Site, Hardware	Theoretical Peak	HPGMG-FE	HPGMG-FE Energy Efficiency
Wisconsin, Cisco UCS C220 M4, 2 Intel E5-2630 v3 8-core Haswell CPUs at 2.40 GHz	8 cores; 2.1 GHz (AVX, Normal); 2 FMAs per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 268.8 GF; Total node (2 CPUs): 537.6 GF	32 threads: 93.57 GF (17.4% of peak, 20.1% of DGEMM)	93.57 GF / 302.2 W (0.31 GF/W)
Clemson, Dell PowerEdge C8220, 2 Intel E5-2660 v2 10-core Ivy Bridge CPUs at 2.20 GHz	10 cores; 2.2 GHz (Normal); 1 FMA per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 176.0 GF; Total node (2 CPUs): 352.0 GF	Estimated at: 20 threads: 73 GF (20.1% of peak, 23.3% of DGEMM)	Estimated at: 73 GF / 217 W (0.34 GF/W)
Utah, HP ProLiant m400, 1 ARMv8 64-bit (Atlas/A57) 8-core APM X-GENE CPU at 2.4 GHz	8 cores; 2.4 GHz (Normal); 2 cycles per FMA; 2 flops in FMA; 2 doubles in vector units; Total CPU/node: 38.4 GF	8 threads: 10.1 GF (26% of peak, 44.7% of DGEMM)	10.1 GF / 70 W (0.14 GF/W)
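
The energy-efficiency column is the measured performance divided by the mean power draw during the run. A short sketch of that division using the HPGMG-FE figures from the table; the function name and the dictionary layout are illustrative, not from the talk.

```python
def gflops_per_watt(gflops: float, mean_watts: float) -> float:
    """Energy efficiency as sustained GF divided by mean power in W."""
    if mean_watts <= 0:
        raise ValueError("mean power must be positive")
    return gflops / mean_watts


# (HPGMG-FE GF, mean W) per site, copied from the table above.
hpgmg_fe = {
    "Wisconsin": (93.57, 302.2),
    "Clemson (estimated)": (73.0, 217.0),
    "Utah": (10.1, 70.0),
}

for site, (gf, watts) in hpgmg_fe.items():
    print(f"{site}: {gflops_per_watt(gf, watts):.2f} GF/W")
```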

Measuring Performance of BLIS DGEMM on ARMv8

Can use PAPI hardware counters on ARMv8 to accurately measure application performance without code instrumentation

Running HPGMG-FE at Different Sites

Consistent performance at each site

Performance: Summary

Evaluated single-core, CPU, and node performance at every site
- Theoretical peak, BLIS DGEMM, and HPGMG-FE
Estimated power efficiency (GF/W)
Used PAPI to estimate performance of HPGMG-FE on ARMv8
Proposal 1: use Periscope Tuning Framework (PTF) for compiler flag optimization
Proposal 2: use Extra-P for performance modeling of real applications

Topology: Deployed Chef Client on ARM

The site-stitching feature of CloudLab allows building profiles in which a Chef Server on x86 manages Chef Clients on ARM

Topology: Desired Configuration for Benchmarking

head: runs Chef Server, SLURM, and powervis
node-X: Chef Clients and SLURM workers
Need 3 different work queues in SLURM
Profile parameters will allow scaling of each worker pool

Summary

Presented initial analysis of performance and energy efficiency
Argued for doing configuration management with Chef
Demoed the developed tools and techniques
Outlined future development with performance modeling, power analysis and automation with Chef

Thank you!

Questions?

dmitry.duplyakin@colorado.edu
