Nayan Deshmukh
Pin count does not grow in proportion to transistor density, so off-chip memory bandwidth lags behind compute
But these are roundabout ways to avoid the actual problem
What modifications are needed in our existing architecture?
A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing
for (v: graph.vertices) {
    value = 0.85 * v.pagerank / v.out_degree;  // v's share for each successor
    for (w: v.successors) {
        w.next_pagerank += value;  // random access to an arbitrary vertex w
    }
}
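As a concrete, runnable version of the loop above, here is a minimal Python sketch of one PageRank scatter iteration on a toy graph (the three-vertex graph is illustrative; the damping factor 0.85 and field roles mirror the pseudocode):

```python
# One PageRank scatter iteration: each vertex spreads a share of its
# rank to its successors, mirroring the pseudocode above.

graph = {  # adjacency list: vertex id -> successor ids (toy example)
    0: [1, 2],
    1: [2],
    2: [0],
}

pagerank = {v: 1.0 / len(graph) for v in graph}   # uniform starting rank
next_pagerank = {v: 0.0 for v in graph}           # accumulators for this pass

for v, successors in graph.items():
    value = 0.85 * pagerank[v] / len(successors)  # v's contribution per edge
    for w in successors:
        next_pagerank[w] += value                 # random access to vertex w
```

The next_pagerank[w] += value line touches an essentially arbitrary vertex on every edge, which is why this loop is memory-bound on large graphs.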
list_for (v: graph.vertices) {  // loop variant that enables list prefetching
    value = 0.85 * v.pagerank / v.out_degree;
    for (w: v.successors) {
        // ship the update to the cube that owns w instead of accessing w here
        put(w.id, function() { w.next_pagerank += value; });
    }
}
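Under the hood, put(w.id, f) sends the update to the memory cube that owns vertex w instead of performing a remote random access. A minimal Python sketch of this message-passing model, assuming a simple modulo partitioning of vertices across cubes (the cube count and partitioning scheme are illustrative, not the paper's):

```python
from collections import defaultdict

NUM_CUBES = 4

def owner(vertex_id):
    # Illustrative partitioning: maps a vertex to the cube holding its data.
    return vertex_id % NUM_CUBES

graph = {v: [(v + 1) % 8, (v + 3) % 8] for v in range(8)}  # toy graph
pagerank = {v: 1.0 / len(graph) for v in graph}
next_pagerank = {v: 0.0 for v in graph}

# Phase 1: each cube scans its own vertices and *enqueues* update
# functions for the owning cube, instead of touching remote memory.
queues = defaultdict(list)           # cube id -> pending updates
for v, successors in graph.items():
    value = 0.85 * pagerank[v] / len(successors)
    for w in successors:
        # put(w.id, function() { w.next_pagerank += value; })
        queues[owner(w)].append((w, value))

# Phase 2 (after a barrier): each cube drains its queue; every
# update now touches only that cube's local memory.
for cube in range(NUM_CUBES):
    for w, value in queues[cube]:
        next_pagerank[w] += value
```

The point of the restructuring: scattered fine-grained remote accesses become batched local updates, so each cube's cores only ever touch memory inside their own cube.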
Tesseract exploits the large internal bandwidth of 3D-stacked memory by moving computation onto the logic layer of the memory cubes
Normal Code vs. PIM Code: the only change is that the remote update w.next_pagerank += value becomes a put() message executed at the cube that owns w
DDR3-OoO: 32 four-wide out-of-order cores at 4 GHz connected to a DDR3 memory system
HMC-OoO: 32 four-wide out-of-order cores at 4 GHz connected to an HMC memory system
HMC-MC: 512 single-issue, in-order cores externally connected to 16 memory cubes
Tesseract: 512 single-issue, in-order cores with prefetchers on the logic layer of the memory cubes, 32 cores per cube
PIM-Enabled Instructions (PEI)
The key to coordination between PIM operations and the host processor is the single-cache-block restriction: each PEI touches at most one cache block, so it can interoperate with the existing cache coherence and virtual memory mechanisms
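One way to picture the restriction: an operation such as an in-memory atomic add is only a legal PEI if all of its operands lie inside a single cache block, so the host can maintain coherence by tracking just that one block. A hypothetical Python sketch (the block size, addresses, and function name are illustrative, not the paper's API):

```python
CACHE_BLOCK = 64  # bytes; a typical cache block size (assumption)

def fits_one_block(addr, size):
    """Model of the single-cache-block check: a PEI operand is valid
    only if it lies entirely within one cache block, so coherence can
    be handled by operating on a single block per instruction."""
    first_block = addr // CACHE_BLOCK
    last_block = (addr + size - 1) // CACHE_BLOCK
    return first_block == last_block   # True -> valid PEI operand

print(fits_one_block(0x40, 8))   # 8-byte counter inside one block -> True
print(fits_one_block(0x7C, 8))   # straddles a block boundary -> False
```

An 8-byte counter like w.next_pagerank satisfies the check, which is why per-vertex PageRank updates map naturally onto PEIs.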
The speed gap between the CPU, memory, and mass storage continues to widen. We need to rethink our memory systems; Processing in Memory is one promising way to fight the memory wall