Processing In Memory

Nayan Deshmukh

Memory Wall

Since 1980, CPU performance has outpaced DRAM performance

The Bandwidth Wall

The increase in pin count is not proportional to the increase in transistor density

What is the solution?

Cache Compression
DRAM Cache
Link Compression
Sector Cache

Reduce the memory traffic

But these are roundabout ways that only work around the actual problem

Let's look more closely at the DRAM

Hybrid Memory Cube

<DETOUR>

HMC Structure

</DETOUR>

How to integrate PIM?

What modifications are needed in our existing architecture?

Tesseract

A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing

PageRank

// Conventional version: each update to w.next_pagerank may touch remote memory
for (v: graph.vertices) {
    value = 0.85 * v.pagerank / v.out_degree;
    for (w: v.successors) {
        w.next_pagerank += value;
    }
}

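// Tesseract version: put() is a non-blocking remote function call that
// runs the update on the core whose memory holds vertex w
list_for (v: graph.vertices) {
    value = 0.85 * v.pagerank / v.out_degree;
    for (w: v.successors) {
        put(w.id, function() { w.next_pagerank += value; });
    }
}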
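The difference between the two loops can be sketched in software. The following is a minimal Python simulation, not the Tesseract implementation: the placement rule (vertex id modulo the number of partitions), the per-partition queues, and the names pagerank_step_with_put and num_partitions are assumptions made for illustration. Like the slide code, it omits the constant 0.15 term and only accumulates the contributions from predecessors.

```python
from collections import defaultdict


def pagerank_step_with_put(successors, pagerank, num_partitions=4, damping=0.85):
    """One PageRank accumulation step where every update goes through put().

    successors: dict mapping each integer vertex id to a list of successor ids.
    pagerank:   dict mapping each vertex id to its current rank.
    """
    # Assumed placement: vertex v is stored in partition v % num_partitions.
    queues = defaultdict(list)
    next_pagerank = {v: 0.0 for v in successors}

    def put(target_id, func):
        # Non-blocking: enqueue the function at the partition that owns target_id.
        queues[target_id % num_partitions].append(func)

    def make_update(w, value):
        # Build a closure that keeps its own copy of w and value.
        def update():
            next_pagerank[w] += value
        return update

    for v, succ in successors.items():
        if not succ:
            continue  # a vertex without successors contributes nothing
        value = damping * pagerank[v] / len(succ)
        for w in succ:
            put(w, make_update(w, value))

    # Barrier: every partition drains its queue, touching only its own vertices.
    for funcs in queues.values():
        for func in funcs:
            func()
    return next_pagerank


graph = {0: [1, 2], 1: [2], 2: [0]}
ranks = {v: 1.0 / len(graph) for v in graph}
result = pagerank_step_with_put(graph, ranks)
```

In this sketch the sender never reads or writes next_pagerank of a remote vertex directly; it only ships a small function to the owner, which is the property that lets the update run next to the data.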

Tesseract exploits:

Memory-level parallelism
DRAM internal bandwidth
Offloading

Normal Code

PIM Code

Performance

DDR3-OoO: 32 four-wide out-of-order cores at 4 GHz, connected to a DDR3 memory system
HMC-OoO: 32 four-wide out-of-order cores at 4 GHz
HMC-MC: 512 single-issue, in-order cores externally connected to 16 memory cubes
Tesseract: 512 single-issue, in-order cores with prefetchers on the logic layer of the memory cubes (32 cores per cube)

What's the catch?

Needs many changes to the software stack
Fails to utilize the large on-chip caches
Higher average memory access time (AMAT) when results are accessed by the host processor
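The AMAT point can be made concrete with the standard formula, AMAT = hit time + miss rate × miss penalty. The sketch below is only an illustration: the latencies and miss rates are hypothetical numbers, not measurements from Tesseract, chosen to show how AMAT grows when data produced in memory is not yet present in the host caches.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time plus the expected cost of a miss."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Hypothetical values in nanoseconds, for illustration only.
warm_cache = amat(hit_time=10.0, miss_rate=0.1, miss_penalty=100.0)
# Results written by in-memory cores are absent from the host caches,
# so most of the host's first accesses to them miss.
cold_results = amat(hit_time=10.0, miss_rate=0.9, miss_penalty=100.0)
```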

Can we do better?

PIM-Enabled Instructions

(PEI)

Potential of ISA Extension as PIM Interface

The key to coordination between PIM and the host processor is the single-cache-block restriction

Each PEI can access at most one last-level cache block

Benefits:

Localization: each PEI is bound to one memory module
Interoperability: easier support for cache coherence and virtual memory
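A short sketch shows why the single-cache-block restriction gives localization. It assumes a 64-byte last-level cache block, 16 memory modules, and a simple block-interleaved address mapping; these values and the function names are hypothetical and are not taken from the PEI design.

```python
CACHE_BLOCK_BYTES = 64  # assumed last-level cache block size
NUM_MODULES = 16        # assumed number of memory modules


def fits_in_one_block(addr, size):
    """Return True if the byte range [addr, addr + size) stays in one cache block."""
    if size <= 0:
        raise ValueError("size must be positive")
    first_block = addr // CACHE_BLOCK_BYTES
    last_block = (addr + size - 1) // CACHE_BLOCK_BYTES
    return first_block == last_block


def target_module(addr, size):
    """Return the single memory module that would serve this operand.

    Assumes consecutive cache blocks are interleaved across the modules.
    """
    if not fits_in_one_block(addr, size):
        raise ValueError("operand spans more than one cache block")
    return (addr // CACHE_BLOCK_BYTES) % NUM_MODULES
```

Because an operand that passes the check lies in exactly one block, it maps to exactly one module under this mapping, and coherence or address translation only has to be handled for that one block.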

Architecture

Memory-side PEI Execution

Host-side PEI Execution

Conclusion

The speed gap between CPU, memory, and mass storage continues to widen, so we need to rethink our memory systems. Processing in memory is one possible way to fight the memory wall.

Acknowledgements

https://people.inf.ethz.ch/omutlu/pub/pim-enabled-instructons-for-low-overhead-pim_isca15-talk.pdf
https://people.inf.ethz.ch/omutlu/pub/tesseract-pim-architecture-for-graph-processing_isca15-talk.pdf
http://www.eecs.umich.edu/courses/eecs573/lectures/nmc_slides-main.pdf
http://extremecomputingtraining.anl.gov/files/2015/03/kogge-jul29-1115.pdf
http://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.18.3-memory-FPGA/HC23.18.320-HybridCube-Pawlowski-Micron.pdf
http://www2.sbc.org.br/sbac/2015/files/Keynote_OnurMutlu.pdf

COE218 Advanced Computer Systems Architecture
