Principles of Computer Systems
Spring 2019
Stanford University
Computer Science Department
Instructors: Chris Gregg and Philip Levis
We wanted to finish up the quarter with a look at the relevance of systems in today's computing world. It is hard to think of a modern computing task that hasn't needed high-quality systems ideas to make it possible. Whether it is speeding up specific computing tasks (e.g., for AI), improving networking performance, scaling to datacenter-sized installations, or building robust, secure, redundant storage solutions, systems work has played a major role. We are going to look at the following technologies and discuss how systems has played its part in each:
We have used Linux exclusively in class, and Linux is used at most of the companies you will be working for, at least for back-end servers. Virtually all datacenters run Linux on the back end (even Microsoft's data centers).
The fact that Linux is open source and exposes hooks into the lowest levels of the kernel (indeed, you can read the kernel source code whenever you want) means that it is ripe for extracting peak performance when you need it -- this is important in many businesses and technologies that rely on huge numbers of highly connected computers (e.g., companies that use lots of AI).
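As a quick illustration (our own sketch, not anything from the lecture or a particular company's tooling), here is how easily a Python script can peek at kernel-level state through the /proc filesystem, one of the low-level hooks Linux exposes:

    # Our own sketch: Linux exposes low-level kernel state as plain files
    # under /proc, which is one reason it is so easy to instrument and tune.

    def read_proc(path):
        """Return the contents of a /proc pseudo-file as a string."""
        with open(path) as f:
            return f.read()

    if __name__ == "__main__":
        # Aggregate CPU time spent in user/system/idle, straight from the scheduler.
        print(read_proc("/proc/stat").splitlines()[0])
        # Total system memory, as tracked by the kernel's VM subsystem.
        print(read_proc("/proc/meminfo").splitlines()[0])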
image source: https://storage.googleapis.com/nexttpu/index.html
John Hennessy and David Patterson, Computer Architecture: A Quantitative Approach, 6/e, 2017
Image: Google's Tensor Processing Unit 3.0 https://en.wikipedia.org/wiki/Tensor_processing_unit#/media/File:Tensor_Processing_Unit_3.0.jpg
"If it’s raining outside, you probably don’t need to know exactly how many droplets of water are falling per second — you just wonder whether it’s raining lightly or heavily. Similarly, neural network predictions often don't require the precision of floating point calculations." (same link as above)
Floor plan of the TPU die (yellow = compute, blue = data, green = I/O, red = control)
Oh...one other thing: TPUs can work together in pods inside a data center.
The latest version also handles floating point operations
Cloud TPU v3 Pod (beta)
100+ petaflops
32 TB High Bandwidth Memory
2-D toroidal mesh network
What's a petaflop? FLOPS stands for "floating point operations per second," so a petaflop is 10^15 floating point operations per second. Your laptop is in the tens-of-gigaflops (10^10) range, so a TPU Pod can perform roughly 10 million times as many floating point operations per second as your laptop...but it can't browse Reddit.
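A quick back-of-the-envelope check of that claim (the 100-petaflop and tens-of-gigaflops figures come straight from the numbers above):

    # Back-of-the-envelope check of the "10 million times" claim.
    pod_flops    = 100e15   # 100+ petaflops for a Cloud TPU v3 Pod
    laptop_flops = 10e9     # tens of gigaflops for a typical laptop

    print(pod_flops / laptop_flops)   # 10000000.0 -- about 10^7, i.e. 10 million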
While we are on Google and their machine learning tools, let's talk about TensorFlow for a moment. TensorFlow is an open-source library for machine learning that works on many platforms (Linux, Windows, macOS, Raspberry Pi, Android, and JavaScript (!)) and on many devices (CPUs, GPUs, TPUs). It is written in Python, C++, and CUDA, meaning that a ton of systems work goes into it (especially when interfacing with GPUs through CUDA). In fact, if you look at the GitHub repository, you'll see lots of language-specific folders:
In other words, if you want to work on the TensorFlow library, you probably have to understand a lot of different languages, and be well-versed in systems.
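For a sense of what the top of that stack looks like, here is a tiny tf.keras example (ours, not from the lecture); the same few lines of Python can run on a CPU, a GPU, or (with a bit of extra setup) a TPU, with all the device-specific systems work hidden in the C++/CUDA layers underneath:

    import tensorflow as tf

    # A two-layer classifier in tf.keras. The Python here is just the front end;
    # the heavy lifting happens in TensorFlow's C++/CUDA back ends.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()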
Oh -- as Phil has mentioned before, the lead of Google AI, Jeff Dean (remember the MapReduce paper?), is a systems person, and he has played a big role in TensorFlow.
What if you want to run a server? What do you do?
Let's say you take one of the options above and you set up TheCutestKittenPics.com, and all of a sudden it becomes a huge hit. Your web server gets slammed -- what do you do?
Those were the options (mostly) until Amazon launched its Elastic Compute Cloud (EC2) service in 2006. EC2 allows users to run virtual machines or bare-hardware servers in the cloud, and it enables websites (for example) to scale quickly and easily.
You might be surprised at some of the companies and other organizations that use EC2: Netflix, NASA, Slack, Adobe, Reddit, and many more.
Amazon (and its systems and IT people) handles all of the following:
You have complete sudo access to the machines, and if you break a virtual machine, you just start up a new one. Need GPUs for your workload? EC2 has those, too.
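As a rough sketch of what "scaling quickly and easily" looks like in code, here is how you might launch one more web server with boto3, the AWS SDK for Python (the AMI ID, region, and instance type below are placeholders we made up for the example):

    import boto3

    # Hypothetical: launch one more copy of the kitten-pics web server.
    # The AMI ID is a placeholder; a real one would have your server image baked in.
    ec2 = boto3.client("ec2", region_name="us-west-2")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder image ID
        InstanceType="t3.micro",
        MinCount=1,
        MaxCount=1,
    )
    print("Launched", response["Instances"][0]["InstanceId"])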
Amazon also offers its Lambda service, a serverless compute solution: you don't manage any servers; you simply upload your code to Amazon, they run it, and they give you back the results.
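A hypothetical Lambda function really is about this small (the function and field names here are our own example, not Amazon's):

    # A complete AWS Lambda function in Python: Amazon provisions, runs, and
    # tears down the servers behind this; you only ever see the function.
    def handler(event, context):
        # 'event' holds the request data; 'context' describes the runtime.
        name = event.get("name", "world")
        return {"statusCode": 200, "body": "Hello, {}!".format(name)}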
Amazon also offers a storage solution for Internet users, called the Simple Storage Service (S3).
S3 provides the following features:
Amazon touts the following about S3:
All of the above features require a well-oiled system, and Amazon employs many systems people to innovate and to keep things running smoothly.
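For a flavor of how simple S3 looks from the programmer's side, here is a hedged boto3 sketch; the bucket and object names are made up for the example (real S3 bucket names are globally unique):

    import boto3

    # Hypothetical bucket and key names; store a file in S3, then read it back.
    s3 = boto3.client("s3")

    with open("kitten1.jpg", "rb") as f:
        s3.put_object(Bucket="cutest-kitten-pics", Key="kitten1.jpg", Body=f)

    obj = s3.get_object(Bucket="cutest-kitten-pics", Key="kitten1.jpg")
    print(len(obj["Body"].read()), "bytes retrieved")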
Who owns the Internet?
Who owns the Internet? No one!
Well...lots of companies own data lines (mostly fiber optic) that connect the world through the Internet.
Source (overlaid on a map of the U.S.): InterTubes: A Study of the US Long-haul Fiber-optic Infrastructure
“The map shows the paths taken by the long-distance fiber-optic cables that carry Internet data across the continental U.S. The exact routes of those cables, which belong to major telecommunications companies such as AT&T and Level 3, have not been previously publicly viewable, despite the fact that they are effectively critical public infrastructure” (https://www.eteknix.com/first-detailed-public-map-u-s-internet-infrastructure-released/)
The so-called Internet backbone is a nebulous set of routes between major routers around the world that allow Internet traffic to flow.
Many of the owners of these network connections are telephone companies, mainly because the Internet grew up using the same physical infrastructure (telephone poles, underground cable runs, etc.). So AT&T owns a lot of the network, as do Verizon, Sprint, and CenturyLink.