Characterizing Performance
and Power Efficiency
on CloudLab
Dmitry Duplyakin
University of Colorado at Boulder
dmitry.duplyakin@colorado.edu
Supercomputing 2015, 11/18/2015
About CloudLab
- Mission: cloud testbed with complete control, visibility and scientific fidelity
- Available now: 2520 ARM cores, 2160 x86 cores
- Highly heterogeneous hardware
- More on hardware: https://www.cloudlab.us/hardware.php
Outline
- Background
- Motivation for using configuration management
- Chef as a powerful building block
- Power analysis
- Performance analysis
- Useful topologies
- Summary
Performance and Power Analysis: Questions?
- How to provide consistency, transparency, repeatability?
- What are the right building blocks?
- Platform-wide, experiment-wide, node-wide?
- Useful topologies and recommended user practices?
- Start with a provided profile or extend an existing profile?
Performance Analysis: Challenges
- Performance analysis is not trivial
- Well-known in the Supercomputing community
- Often done in an ad hoc manner
- Single-node and multi-node techniques are different
- CloudLab: 3 different platforms
- Ivy Bridge, Haswell, ARMv8
Power Analysis: Challenges
- Extremely platform-specific
- Collection is prone to failures, with no means of debugging
- Raw data: missing data, noise, unknown granularity
- Need mechanisms for validation
- From power to energy: need appropriate numerical integration
- Need different experiment-wide and platform-wide analysis tools
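As an illustration of the "power to energy" step, here is a minimal sketch of trapezoidal integration over irregularly spaced samples. The function name, the use of `None` for missing readings, and the sample trace are assumptions made for this example, not part of the tools presented in the talk.

```python
from typing import Optional, Sequence


def energy_joules(times_s: Sequence[float],
                  power_w: Sequence[Optional[float]]) -> float:
    """Integrate power (W) over time (s) with the trapezoidal rule.

    Samples may be irregularly spaced. Missing readings (None) are dropped,
    so a trapezoid simply spans the gap between the neighbouring samples.
    """
    points = sorted((t, p) for t, p in zip(times_s, power_w) if p is not None)
    energy = 0.0
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        energy += 0.5 * (p0 + p1) * (t1 - t0)
    return energy


# Hypothetical trace: the reading at t = 20 s is missing.
times = [0.0, 10.0, 20.0, 30.0]
power = [40.0, 60.0, None, 80.0]
total = energy_joules(times, power)
```

Bridging a gap with a single trapezoid is only one possible policy; for long gaps it may be safer to flag the interval instead of integrating across it.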
Performance and Power in the Context of CloudLab
- Initial experience: new profile for a new cause
- Rapid development, no conflicts
- Scalability problems:
- duplication of work
- rapid growth of the number of profiles and images
- Can’t "merge" images
- Can't experiment with different combinations of tools
Key Proposals
- Invest time and effort in code (not images)
- Avoid ad hoc scripting
- Use a configuration management system as a tool for enabling consistency, transparency, repeatability
- Leverage previous experience with Chef and evaluate it on CloudLab
- Employ the latest features of Chef: e.g., reporting and push jobs
Benefits of Using Chef
- Nice structure: roles, cookbooks, recipes, nodes, etc.
- Idempotence
- All artifacts are code: reusable and extendable
- Over 2,000 community-developed cookbooks at Supermarket
- Vision: single codebase for all sites within CloudLab
Configuration Management with Chef: Architecture
Chef Terminology
- Client -- node that is being configured
- Server -- node that performs configuration
- Workstation (knife) -- management utility
- Can run on the server; multiple workstations are allowed
- Recipe = script
- Cookbook -- collection of recipes
- Role -- collection of cookbooks, typically assigned to clients
- Environment -- cluster, aggregation, region, etc.
- Attributes -- variables set for specific nodes, environments, etc.
Chef: Closer Look
Demo:
- Emulab chef-repo
- emulab-nfs cookbook and related roles
- Reporting and WUI
Chef: Push Jobs
- Push job wrapped in a role (get_power.rb):
# Role that whitelists the "get_power" push job
name "get_power"
description "Role applied to nodes that need to download power data"
override_attributes(
  "push_jobs" => {
    "whitelist" => {
      # Job name => shell command executed on the node
      "get_power" => "cd /tmp ; git clone https://github.com/dmdu/power-client.git ;\
/bin/bash -x power-client/power-client.sh -s clemson -l 12h"
    }
  }
)
run_list [ "push-jobs" ]
- Submit role:
# knife role from file get_power.rb
- Assign to a node:
# knife node run_list add head "role[get_power]"
- Run the job:
# knife job start get_power head
Started. Job ID: 3f0ae42b88ea60365f7d07c64e30ff54
Running (1/1 in progress) ...
Running (0/1 in progress) ...
Big Picture with Chef
- From individual software tools to toolchains coordinated by Chef
- Tradeoff: additional complexity, dependence on Chef
- Argument 1: long-term investment
- Argument 2: Chef takes stability seriously
- Example: the procedure for installing Chef is itself a Chef cookbook, an artifact that is subject to rigorous testing
- Argument 3: Chef Zero allows running cookbooks with no additional overhead, without a server/workstation:
- chef-client -z -o <name of the cookbook>
Evolution of Chef on CloudLab
- Prerequisite: experiment-wide keys
- Created a profile with Chef 11
- Upgraded to Chef 12 and enabled push jobs and reporting
- Scripted profile creation and added parameters
- Added support for ARM 64-bit
- Future: make integration with existing profiles transparent
Power Analysis
Site, Hardware | CPU | Power, Frequency |
---|---|---|
Wisconsin, Cisco UCS C220 M4 | Two Intel E5-2630 v3 8-core CPUs at 2.40 GHz (Haswell w/ EM64T) | TDP: 85 W; Turbo: 3.2 GHz |
Clemson, Dell PowerEdge C8220 | Two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge) | TDP: 95 W; Turbo: 3 GHz |
Utah, HP ProLiant m400 | One ARMv8 64-bit (Atlas/A57) 8-core CPU at 2.4 GHz (APM X-GENE) | TDP: ?; No frequency scaling |
- Developed supporting cookbooks:
- emulab-R, emulab-shiny, emulab-powervis
Sources of Power Data
- On-node power (available on ARM at Utah only):
IOUT_hex = ipmi-raw --no-probing --driver-type=SSIF \
    --driver-address=0x10 --driver-device=/dev/i2c-0 \
    0 6 0x52 0x05 0x40 0x02 0x8C
VIN_hex = ipmi-raw --no-probing --driver-type=SSIF \
    --driver-address=0x10 --driver-device=/dev/i2c-0 \
    0 6 0x52 0x05 0x40 0x02 0x88
IOUT = int(IOUT_hex,16) * 0.01239 - 25.3717
VIN = int(VIN_hex,16) * 0.005208
POWER = VIN*IOUT
- Chassis Manager (CM) power data (available for all 3 sites)
- Collectors and database managed by Emmanuel Cecchet
- power-client.sh -s clemson -l 12h
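The on-node conversion above can be sketched in Python as follows. The scale and offset constants are taken from the slide; the function name and the example readings are hypothetical, and the sketch assumes the raw readings have already been extracted from the `ipmi-raw` output as hexadecimal strings. It also assumes the current is in amperes and the voltage in volts, so that the product is in watts.

```python
def on_node_power(iout_hex: str, vin_hex: str) -> float:
    """Convert raw hexadecimal current and voltage readings to power.

    Constants come from the slide. Units (A, V, W) are an assumption.
    """
    iout = int(iout_hex, 16) * 0.01239 - 25.3717  # output current
    vin = int(vin_hex, 16) * 0.005208             # input voltage
    return vin * iout


# Hypothetical raw readings, chosen only to exercise the formula.
power_w = on_node_power("0x0900", "0x0900")
```

Parsing the actual `ipmi-raw` response is deliberately left out, since its byte layout is not shown in the slides.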
Power Analysis on ARM - Workload 1
Workload 1 - Gradual Load
Benchmark: HPGMG-FE
Scaling: idle to 8 cores incrementally
Blue: on-node power
Green: CM power
Observations:
- Time lag
- Consistent estimates of the total energy
- >10x difference between sampling rates
Power Analysis on ARM - Workload 2
Workload 2 - Abrupt Load
Benchmark: HPGMG-FE
Scaling: idle to 8 cores
Blue: on-node power
Green: CM power
Observations:
- The time lag can be comparable to periods of high load
- At small time scales, the difference between energy estimates is noticeable
Sampling On-Node Power Draw at Different Rates
Plot panels: Workload 1, Workload 2
- Sampling at high rates: estimates of total energy are consistent
- Sampling at low rates: estimates blow up
- Can find this threshold programmatically
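One way such a threshold could be found programmatically is sketched below: subsample a densely sampled trace at increasing strides and stop when the energy estimate drifts from the full-rate estimate by more than a tolerance. The function names, the 2% default tolerance, and the synthetic trace are all assumptions made for illustration; the method actually used for the plots is not described in the slides.

```python
def trapezoid(times, power):
    """Trapezoidal energy estimate (J) from time (s) and power (W) samples."""
    return sum(0.5 * (p0 + p1) * (t1 - t0)
               for t0, t1, p0, p1 in zip(times, times[1:], power, power[1:]))


def energy_at_stride(times, power, stride):
    """Energy estimate using every stride-th sample, always keeping the last."""
    idx = list(range(0, len(times), stride))
    if idx[-1] != len(times) - 1:
        idx.append(len(times) - 1)
    return trapezoid([times[i] for i in idx], [power[i] for i in idx])


def max_consistent_stride(times, power, strides, tolerance=0.02):
    """Largest stride reached before the relative error first exceeds tolerance.

    Strides are scanned in increasing order. Returns None if even the
    smallest stride is too coarse. Assumes the reference energy is non-zero.
    """
    reference = trapezoid(times, power)
    best = None
    for stride in sorted(strides):
        estimate = energy_at_stride(times, power, stride)
        if abs(estimate - reference) / reference > tolerance:
            break
        best = stride
    return best


# Synthetic 1 Hz trace: 40 W baseline with a one-sample spike at t = 5 s.
times = list(range(11))
power = [40.0] * 11
power[5] = 240.0
threshold = max_consistent_stride(times, power, strides=[1, 2, 5])
```

With a short spike like this one, any stride that skips the spike underestimates the energy, which is the kind of behaviour the "estimates blow up" bullet describes.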
Comparing Different Energy Estimates
- Mean and variance decrease as the interval grows
- "Elbow" is at ~1000s; above that, the difference is at ~2%
Plot panels: Workload 1, Workload 2
Interactive Analysis of Power Data
powervis
https://github.com/emulab/shiny-server
- Interactive dashboard for analysis and visualization of experiment power data
- Uses R and Shiny (by RStudio)
- Demo
Power: Summary
- Gained confidence with using power samples
- Investigated granularity and accuracy of total energy estimates
- Can experiment with energy optimization
- For instance, using Periscope Tuning Framework (PTF)
- Can make reasonable predictions
Single Node Power Draw | Min, W | Mean, W | Max, W | sigma^2 |
---|---|---|---|---|
Wisconsin, Idle | 104.0 | 107.1 | 106.0 | 73.2 |
Wisconsin, 32 Threads BLIS | 240.0 | 344.4 | 384.0 | 1089.24 |
Clemson, Idle | 52.0 | 72.8 | 106.0 | 107.1 |
Clemson, 20 Threads BLIS | 254.0 | 257.5 | 262.0 | 2.5 |
Utah, Idle | 38 | 38.5 | 39.0 | 0.25 |
Utah, 8 Threads BLIS | 64 | 88.6 | 97 | 84.15 |
Performance Analysis
Performance of CloudLab Resources
09/30/2015
Site, Hardware | CPU | Clock Rate |
---|---|---|
Wisconsin, Cisco UCS C220 M4 | Two Intel E5-2630 v3 8-core CPUs at 2.40 GHz (Haswell w/ EM64T) | Normal: 2.4 GHz; Turbo: 3.2 GHz; AVX Normal: 2.1 GHz; AVX Turbo: 3.2 GHz |
Clemson, Dell PowerEdge C8220 | Two Intel E5-2660 v2 10-core CPUs at 2.20 GHz (Ivy Bridge) | Normal: 2.2 GHz; Turbo: 3.0 GHz |
Utah, HP ProLiant m400 | One ARMv8 64-bit (Atlas/A57) 8-core CPU at 2.4 GHz (APM X-GENE) | 2.4 GHz; No frequency scaling |
- Developed supporting cookbooks:
- emulab-gcc, emulab-powervis, emulab-blis
BLIS DGEMM: Single Core
41.70 GF
16.23 GF
3.17 GF
Single-core theoretical peak in DP (Haswell, Wisconsin):
- 3.2 GHz (AVX, Turbo)
- 2 FMAs per cycle
- 2 flops in FMA
- 4 doubles in vector units
Total: 51.2 GF
Single-core theoretical peak in DP (Ivy Bridge, Clemson):
- 3.0 GHz (Turbo)
- 1 FMA per cycle
- 2 flops in FMA
- 4 doubles in vector units
Total: 24.0 GF
Single-core theoretical peak in DP (ARMv8, Utah):
- 2.4 GHz (Normal)
- 2 cycles per FMA
- 2 flops in FMA
- 2 doubles in vector units
Total: 4.8 GF
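The peak figures above are products of clock rate, FMA throughput, flops per FMA, and vector width. A minimal sketch of that arithmetic follows; the function and parameter names are chosen for this example, and "2 cycles per FMA" is expressed as 0.5 FMAs per cycle.

```python
def peak_gflops(clock_ghz, fmas_per_cycle, flops_per_fma=2,
                vector_doubles=4, cores=1, cpus=1):
    """Theoretical double-precision peak in GF (GFLOP/s)."""
    return (clock_ghz * fmas_per_cycle * flops_per_fma
            * vector_doubles * cores * cpus)


# Single-core peaks, using the per-platform parameters from the slide.
haswell_core = peak_gflops(3.2, 2)                    # 3.2 GHz, AVX Turbo
ivy_bridge_core = peak_gflops(3.0, 1)                 # 3.0 GHz, Turbo
armv8_core = peak_gflops(2.4, 0.5, vector_doubles=2)  # 2 cycles per FMA

# Node-level peak for the two-socket Haswell node at the AVX normal clock.
haswell_node = peak_gflops(2.1, 2, cores=8, cpus=2)

for name, value in [("Haswell core", haswell_core),
                    ("Ivy Bridge core", ivy_bridge_core),
                    ("ARMv8 core", armv8_core),
                    ("Haswell node", haswell_node)]:
    print(f"{name}: {value:.1f} GF")
```

The same function covers the per-CPU and per-node totals by setting `cores` and `cpus`.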
BLIS DGEMM: CPU and Node
Site, Hardware | Theoretical Peak | BLIS DGEMM Performance | BLIS DGEMM Energy Efficiency |
---|---|---|---|
Wisconsin, Cisco UCS C220 M4, 2 Intel E5-2630 v3 8-core Haswell CPUs at 2.40 GHz | 8 cores; 2.1 GHz (AVX, Normal); 2 FMAs per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 268.8 GF; Total node (2 CPUs): 537.6 GF | 32 threads: 466.5 GF (87% of peak) | 466.5 GF / 344.4 W: 1.34 GF/W |
Clemson, Dell PowerEdge C8220, 2 Intel E5-2660 v2 10-core Ivy Bridge CPUs at 2.20 GHz | 10 cores; 2.2 GHz (Normal); 1 FMA per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 176.0 GF; Total node (2 CPUs): 352.0 GF | 20 threads: 313.4 GF (89% of peak) | 313.4 GF / 275.5 W: 1.14 GF/W |
Utah, HP ProLiant m400, 1 ARMv8 64-bit (Atlas/A57) 8-core APM X-GENE CPU at 2.4 GHz | 8 cores; 2.4 GHz (Normal); 2 cycles per FMA; 2 flops in FMA; 2 doubles in vector units; Total CPU/node: 38.4 GF | 8 threads: 22.6 GF (58% of peak) | 22.6 GF / 88.6 W: 0.26 GF/W |
HPGMG-FE
Site, Hardware | Theoretical Peak | HPGMG-FE | HPGMG-FE Energy Efficiency |
---|---|---|---|
Wisconsin, Cisco UCS C220 M4, 2 Intel E5-2630 v3 8-core Haswell CPUs at 2.40 GHz | 8 cores; 2.1 GHz (AVX, Normal); 2 FMAs per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 268.8 GF; Total node (2 CPUs): 537.6 GF | 32 threads: 93.57 GF (17.4% of peak, 20.1% of DGEMM) | 93.57 GF / 302.2 W: 0.31 GF/W |
Clemson, Dell PowerEdge C8220, 2 Intel E5-2660 v2 10-core Ivy Bridge CPUs at 2.20 GHz | 10 cores; 2.2 GHz (Normal); 1 FMA per cycle; 2 flops in FMA; 4 doubles in vector units; Total CPU: 176.0 GF; Total node (2 CPUs): 352.0 GF | Estimated at: 20 threads: 73 GF (20.1% of peak, 23.3% of DGEMM) | Estimated at: 73 GF / 217 W: 0.34 GF/W |
Utah, HP ProLiant m400, 1 ARMv8 64-bit (Atlas/A57) 8-core APM X-GENE CPU at 2.4 GHz | 8 cores; 2.4 GHz (Normal); 2 cycles per FMA; 2 flops in FMA; 2 doubles in vector units; Total CPU/node: 38.4 GF | 8 threads: 10.1 GF (26% of peak, 44.7% of DGEMM) | 10.1 GF / 70 W: 0.14 GF/W |
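The derived columns in the table above are simple ratios. A short sketch of how they could be recomputed from the reported HPGMG-FE rate, mean power draw, and BLIS DGEMM rate is given below; the dictionary layout and function names are assumptions made for this example, while the numbers are copied from the slides.

```python
def gflops_per_watt(gflops, mean_power_w):
    """Energy efficiency in GF/W."""
    return gflops / mean_power_w


def percent_of(value, reference):
    """value as a percentage of reference."""
    return 100.0 * value / reference


# site: (HPGMG-FE GF, mean power W, BLIS DGEMM GF), values from the slides.
hpgmg = {
    "Wisconsin": (93.57, 302.2, 466.5),
    "Clemson": (73.0, 217.0, 313.4),  # reported as an estimate
    "Utah": (10.1, 70.0, 22.6),
}

efficiency = {site: gflops_per_watt(gf, watts)
              for site, (gf, watts, _) in hpgmg.items()}
of_dgemm = {site: percent_of(gf, dgemm)
            for site, (gf, _, dgemm) in hpgmg.items()}

for site in hpgmg:
    print(f"{site}: {efficiency[site]:.2f} GF/W, "
          f"{of_dgemm[site]:.1f}% of DGEMM")
```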
Measuring Performance of BLIS DGEMM on ARMv8
- Can use PAPI hardware counters on ARMv8 to accurately measure application performance without code instrumentation
Running HPGMG-FE at Different Sites
- Consistent performance at each site
Performance: Summary
- Evaluated single-core, CPU, and node performance at every site
- Theoretical peak, BLIS DGEMM, and HPGMG-FE
- Estimated power efficiency (GF/W)
- Used PAPI to estimate performance of HPGMG-FE on ARMv8
- Proposal 1: use Periscope Tuning Framework (PTF) for compiler flag optimization
- Proposal 2: use Extra-P for performance modeling of real applications
Topology: Deployed Chef Client on ARM
- Site stitching feature of CloudLab allows building profiles where Chef Server runs on x86 and manages Chef Clients on ARM
Topology: Desired Configuration for Benchmarking
- head: runs Chef Server, SLURM, and powervis
- node-X: Chef Clients and SLURM workers
- Need 3 different work queues in SLURM
- Profile parameters will allow scaling of each worker pool
Summary
- Presented initial analysis of performance and energy efficiency
- Argued for doing configuration management with Chef
- Demoed the developed tools and techniques
- Outlined future development with performance modeling, power analysis and automation with Chef
Thank you!
Questions?
dmitry.duplyakin@colorado.edu