CS110 Lecture 24: Networking Functions and MapReduce
CS110: Principles of Computer Systems
Winter 2021-2022
Stanford University
Instructors: Nick Troccoli and Jerry Cain
CS110 Topic 4: How can we write programs that communicate over a network with other programs?
Learning About Networking
- Introduction to Networking (Lecture 20)
- Servers / HTTP (Lecture 21)
- HTTP and APIs (Lecture 22/23)
- Networking System Calls / Library Functions (Today)
assign6: implement an HTTP Proxy that sits between a client device and a web server to monitor, block or modify web traffic.
Learning Goals
- Learn about the implementations of createClientSocket and createServerSocket
- Apply our knowledge of networking and concurrency to understand MapReduce
- Learn about the MapReduce library and how it parallelizes operations
- Understand how to write a program that can be run with MapReduce
Plan For Today
- Implementing createClientSocket
- Implementing createServerSocket
- Extra topic: MapReduce
createClientSocket and createServerSocket
Let's see the underlying system calls and library functions needed to implement createClientSocket and createServerSocket!
- Goal: to see the kinds of functions required (you won't have to re-implement createClientSocket or createServerSocket)
- Goal: to see the design decisions and language workarounds involved
Clients
We have used createClientSocket in client programs so far to connect to servers. It gives us back a descriptor we can use to read/write data.
But how is the createClientSocket helper function actually implemented?
int main(int argc, char *argv[]) {
    // Open a connection to the server
    int socketDescriptor = createClientSocket("myth64.stanford.edu", 12345);

    // Read in the data from the server (sockbuf destructor closes descriptor)
    sockbuf socketBuffer(socketDescriptor);
    iosockstream socketStream(&socketBuffer);
    string timeline;
    getline(socketStream, timeline);

    // Print the data from the server
    cout << timeline << endl;
    return 0;
}
createClientSocket
- Check that the specified server and port are valid
- Create a new socket descriptor
- Associate this socket descriptor with a connection to that server
- Return the socket descriptor
int createClientSocket(const string& host, unsigned short port);
createClientSocket
- Check that the specified server and port are valid - gethostbyname()
- Create a new socket descriptor - socket()
- Associate this socket descriptor with a connection to that server - connect()
- Return the socket descriptor
int createClientSocket(const string& host, unsigned short port);
- We check the validity of the host by attempting to look up its IP address
- gethostbyname() gets IPv4 host info for the given name (e.g. "www.facebook.com")
- gethostbyname2() can get IPv4 or IPv6 host info for the given name - the second param can be AF_INET (for IPv4) or AF_INET6 (for IPv6)
- Technically deprecated in favor of getaddrinfo, but still prevalent and good to know
- Both return a statically allocated struct hostent with the host's info (or NULL on error)
struct hostent *gethostbyname(const char *name);
struct hostent *gethostbyname2(const char *name, int af);
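As a rough illustration of the lookup step, here is a minimal sketch that asks gethostbyname for a host's IPv4 addresses and converts each one to dotted-quad text. The helper name lookupIPv4Addresses is made up for this example (it is not part of the course library), and it reads the h_addr_list field of struct hostent, a NULL-terminated list of addresses.

```cpp
#include <arpa/inet.h>   // inet_ntoa
#include <netdb.h>       // gethostbyname, struct hostent
#include <netinet/in.h>  // struct in_addr
#include <sys/socket.h>  // AF_INET
#include <cstring>       // memcpy
#include <string>
#include <vector>

// Hypothetical helper (name chosen for this example): returns every IPv4
// address that gethostbyname reports for `host`, as dotted-quad strings.
// An empty vector means the lookup failed.
std::vector<std::string> lookupIPv4Addresses(const std::string& host) {
    std::vector<std::string> addresses;
    struct hostent *he = gethostbyname(host.c_str());
    if (he == NULL || he->h_addrtype != AF_INET) return addresses;

    // h_addr_list is a NULL-terminated array. For IPv4, each entry points at
    // the bytes of a struct in_addr, so we copy them into one.
    for (char **entry = he->h_addr_list; *entry != NULL; entry++) {
        struct in_addr ip;
        memcpy(&ip, *entry, sizeof(ip));
        addresses.push_back(inet_ntoa(ip));
    }
    return addresses;
}
```

Calling lookupIPv4Addresses("www.facebook.com") would need a working DNS setup. Passing a dotted-quad string such as "127.0.0.1" skips the lookup, because gethostbyname simply copies an IPv4 address string into the result.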
createClientSocket
- Check that the specified server and port are valid - gethostbyname()
int createClientSocket(const string& host, unsigned short port) {
struct hostent *he = gethostbyname(host.c_str());
if (he == NULL) return -1;
...
createClientSocket
- Check that the specified server and port are valid - gethostbyname()
- Create a new socket descriptor - socket()
int socket(int domain, int type, int protocol);
int createClientSocket(const string& host, unsigned short port) {
...
int s = socket(AF_INET, SOCK_STREAM, 0);
if (s < 0) return -1;
...
The socket function creates a socket endpoint and returns a descriptor.
- The first parameter is the protocol family (IPv4, IPv6, Bluetooth, etc.).
- The second parameter is the type of the connection - do we want a reliable 2-way connection, unreliable (but faster) connection, etc.?
- The third parameter is the protocol (0 for default)
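To make the three parameters concrete, the sketch below opens a socket for a given family and type, then asks the kernel (via getsockopt with SO_TYPE) what type it actually created. The helper names socketTypeOf and canOpenSocket are assumptions made for this example.

```cpp
#include <sys/socket.h>  // socket, getsockopt, AF_INET, SOCK_STREAM, SOCK_DGRAM
#include <unistd.h>      // close

// Hypothetical helper: returns the socket type (e.g. SOCK_STREAM) that the
// kernel reports for descriptor fd, or -1 if fd is not a valid socket.
int socketTypeOf(int fd) {
    int type = 0;
    socklen_t length = sizeof(type);
    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &length) != 0) return -1;
    return type;
}

// Hypothetical helper: tries to create a socket with the given protocol
// family and connection type, using protocol 0 (the default for that pair).
// Returns true if the socket was created with the requested type.
bool canOpenSocket(int domain, int type) {
    int s = socket(domain, type, 0);
    if (s < 0) return false;
    bool matches = (socketTypeOf(s) == type);
    close(s);  // descriptors are a limited resource, so always close them
    return matches;
}
```

For example, canOpenSocket(AF_INET, SOCK_STREAM) requests a reliable 2-way (TCP) socket, while canOpenSocket(AF_INET, SOCK_DGRAM) requests an unreliable datagram (UDP) socket.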
createClientSocket
- Check that the specified server and port are valid - gethostbyname()
- Create a new socket descriptor - socket()
- Associate this socket descriptor with a connection to that server - connect()
int connect(int clientfd, const struct sockaddr *addr, socklen_t addrlen);
connect connects the specified socket to the specified address.
- Wait a minute - we could be using IPv4 or IPv6. How can we have the same parameter types for both?
connect()
int connect(int clientfd, const struct sockaddr *addr, socklen_t addrlen);
There are actually multiple different address types we may want to pass in: sockaddr_in (IPv4) and sockaddr_in6 (IPv6). How can we handle these possibilities? C doesn't support inheritance or templates.
- First idea: we could make a new version of connect for each type (not great)
- Second idea: we could specify the parameter type as void * (but then how would we know the real type?)
- Third idea: we could make the parameter type a "parent type" called sockaddr, whose memory layout begins the same way as sockaddr_in and sockaddr_in6.
- Its structure is a 2 byte type field followed by 14 bytes of something.
- Both sockaddr_in and sockaddr_in6 start with that 2 byte type field and use the bytes after it for whatever they want (sockaddr_in6 needs more than 14 of them, so connect also takes the struct's size as addrlen).
- connect can then check the type field before casting to the appropriate type
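Here is a small sketch of the "check the type field, then cast" pattern described above, written from the point of view of a function that receives a generic sockaddr pointer. The function name describeAddress and its output format are invented for this example; connect itself is implemented in the kernel and is not shown.

```cpp
#include <arpa/inet.h>   // inet_ntop, ntohs
#include <netinet/in.h>  // struct sockaddr_in, struct sockaddr_in6
#include <sys/socket.h>  // struct sockaddr, AF_INET, AF_INET6
#include <string>

// Hypothetical helper: formats a generic socket address as "ip:port" for
// IPv4 or "[ip]:port" for IPv6, deciding which struct it really is by
// inspecting the shared family field first.
std::string describeAddress(const struct sockaddr *addr) {
    char text[INET6_ADDRSTRLEN];  // large enough for either address family
    if (addr->sa_family == AF_INET) {
        const struct sockaddr_in *v4 =
            reinterpret_cast<const struct sockaddr_in *>(addr);
        if (inet_ntop(AF_INET, &v4->sin_addr, text, sizeof(text)) == NULL) return "invalid";
        return std::string(text) + ":" + std::to_string(ntohs(v4->sin_port));
    }
    if (addr->sa_family == AF_INET6) {
        const struct sockaddr_in6 *v6 =
            reinterpret_cast<const struct sockaddr_in6 *>(addr);
        if (inet_ntop(AF_INET6, &v6->sin6_addr, text, sizeof(text)) == NULL) return "invalid";
        return "[" + std::string(text) + "]:" + std::to_string(ntohs(v6->sin6_port));
    }
    return "unknown family";
}
```

The caller casts in the other direction, passing (struct sockaddr *)&address for whichever concrete struct it filled in, which is exactly what calls to connect and bind look like.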
connect()
int connect(int clientfd, const struct sockaddr *addr, socklen_t addrlen);
We will make the parameter type a "parent type" called sockaddr, whose memory layout begins the same way as sockaddr_in and sockaddr_in6. Its structure is a 2 byte type field followed by 14 bytes of something. Both sockaddr_in and sockaddr_in6 start with that 2 byte type field and use the bytes after it for whatever they want (sockaddr_in6 needs more than 14 of them, so connect also takes the struct's size as addrlen).
struct sockaddr { // generic socket
    unsigned short sa_family; // protocol family for socket
    char sa_data[14];         // address data (and defines full size to be 16 bytes)
};
struct sockaddr_in { // IPv4 socket address record
    unsigned short sin_family;
    unsigned short sin_port;
    struct in_addr sin_addr;
    unsigned char sin_zero[8];
};
struct sockaddr_in6 { // IPv6 socket address record
    unsigned short sin6_family;
    unsigned short sin6_port;
    unsigned int sin6_flowinfo;
    struct in6_addr sin6_addr;
    unsigned int sin6_scope_id;
};
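One way to check the layout claims is to ask the compiler directly. The sketch below (helper names are made up for this example) compares where the family field sits in each struct and whether a record fits in the 16-byte generic sockaddr. It uses the real system definitions from the headers, which may differ in small details from the simplified structs shown on the slide.

```cpp
#include <cstddef>       // offsetof, size_t
#include <netinet/in.h>  // struct sockaddr_in, struct sockaddr_in6
#include <sys/socket.h>  // struct sockaddr

// Hypothetical helper: true if the family field sits at the same offset in
// all three structs, which is what makes "check the family, then cast" safe.
bool familyFieldsLineUp() {
    return offsetof(struct sockaddr, sa_family) == offsetof(struct sockaddr_in, sin_family) &&
           offsetof(struct sockaddr, sa_family) == offsetof(struct sockaddr_in6, sin6_family);
}

// Hypothetical helper: true if an address record of `size` bytes fits inside
// the generic struct sockaddr.
bool fitsInGenericSockaddr(size_t size) {
    return size <= sizeof(struct sockaddr);
}
```

On common platforms, fitsInGenericSockaddr(sizeof(struct sockaddr_in)) is true while fitsInGenericSockaddr(sizeof(struct sockaddr_in6)) is false, because an IPv6 address alone takes 16 bytes. That is why functions like connect take an explicit addrlen argument.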
sockaddr_in
struct sockaddr_in { // IPv4 socket address record
unsigned short sin_family;
unsigned short sin_port;
struct in_addr sin_addr;
unsigned char sin_zero[8];
};
- The sin_family field should store AF_INET for IPv4
- The sin_port field stores a port number in network byte order.
- Different machines may store multi-byte values in different orders (big endian, little endian). But network data must be sent in a consistent format.
- The sin_addr field stores the IPv4 address
- The sin_zero field represents the remaining 8 bytes that are unused.
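To see what "network byte order" means for sin_port, this sketch converts a port with htons and then looks at the two bytes exactly as they sit in memory. The helper name portBytesInNetworkOrder is an assumption made for this example.

```cpp
#include <arpa/inet.h>  // htons, ntohs
#include <array>
#include <cstdint>
#include <cstring>      // memcpy

// Hypothetical helper: returns the two bytes of `port` in the order they are
// stored after htons, i.e. the order in which they would appear in sin_port.
std::array<uint8_t, 2> portBytesInNetworkOrder(uint16_t port) {
    uint16_t networkOrder = htons(port);  // host order -> network order
    std::array<uint8_t, 2> bytes;
    memcpy(bytes.data(), &networkOrder, sizeof(networkOrder));
    return bytes;
}
```

Network byte order is big endian, so for port 12345 (0x3039) the first byte is 0x30 and the second is 0x39, regardless of whether the host machine is little endian or big endian. ntohs performs the reverse conversion when reading a port back out of a struct.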
sockaddr_in6
- The sin6_family field should store AF_INET6 for IPv6
- The sin6_port field stores a port number in network byte order.
- The sin6_addr field stores the IPv6 address
- sin6_flowinfo and sin6_scope_id are beyond the scope of what we need, so we'll ignore them.
struct sockaddr_in6 { // IPv6 socket address record
unsigned short sin6_family;
unsigned short sin6_port;
unsigned int sin6_flowinfo;
struct in6_addr sin6_addr;
unsigned int sin6_scope_id;
};
createClientSocket
- Check that the specified server and port are valid - gethostbyname()
- Create a new socket descriptor - socket()
- Associate this socket descriptor with a connection to that server - connect()
int createClientSocket(const string& host, unsigned short port) {
...
struct sockaddr_in address;
memset(&address, 0, sizeof(address));
address.sin_family = AF_INET;
address.sin_port = htons(port);
address.sin_addr = ???;
...
htons is "host to network short" - it converts to network byte order, which may or may not be the same as the byte order your machine uses.
Specify:
- family
- address
- port
We can get the IP address for the server from the struct hostent * from gethostbyname.
- Wait a minute - gethostbyname and gethostbyname2 will give back different info (IPv4 vs. IPv6 addresses). How can the return type be the same?
- Key Idea: struct hostent will have a generic field in it which is a list of addresses; depending on whether it's IPv4 or IPv6, the list will be of a different type, and we can cast it to that type.
- Why char **? void * wasn't used for generics back then, so char ** it is.
createClientSocket
struct hostent *gethostbyname(const char *name);
struct hostent *gethostbyname2(const char *name, int af);
struct hostent {
...
// NULL-terminated list of IP addresses
// This is really a struct in_addr ** when hostent contains IPv4 addresses
// This is really a struct in6_addr ** when hostent contains IPv6 addresses
char **h_addr_list;
...
};
// h_addr is #define for h_addr_list[0]
struct in_addr first_ip = *((struct in_addr *)he->h_addr);
createClientSocket
- Check that the specified server and port are valid - gethostbyname()
- Create a new socket descriptor - socket()
- Associate this socket descriptor with a connection to that server - connect()
int createClientSocket(const string& host, unsigned short port) {
...
struct sockaddr_in address;
memset(&address, 0, sizeof(address));
address.sin_family = AF_INET;
address.sin_port = htons(port);
// h_addr is #define for h_addr_list[0]
address.sin_addr = *((struct in_addr *)he->h_addr);
if (connect(s, (struct sockaddr *) &address, sizeof(address)) == 0) return s;
...
createClientSocket
- Check that the specified server and port are valid - gethostbyname()
- Create a new socket descriptor - socket()
- Associate this socket descriptor with a connection to that server - connect()
- Return the socket descriptor
int createClientSocket(const string& host, unsigned short port) {
    struct hostent *he = gethostbyname(host.c_str());
    if (he == NULL) return -1;

    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;

    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_port = htons(port);
    // h_addr is #define for h_addr_list[0]
    address.sin_addr = *((struct in_addr *)he->h_addr);

    if (connect(s, (struct sockaddr *) &address, sizeof(address)) == 0) return s;
    close(s);
    return -1;
}
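Since gethostbyname is technically deprecated in favor of getaddrinfo, here is a sketch of how the same four steps might look with getaddrinfo. This is an alternative written for illustration, not the course's implementation, and the function name createClientSocketWithGetaddrinfo is made up. One practical difference is that getaddrinfo hands back ready-made sockaddr records, so the same loop handles IPv4 and IPv6.

```cpp
#include <netdb.h>       // getaddrinfo, freeaddrinfo, struct addrinfo
#include <netinet/in.h>  // struct sockaddr_in (handy for callers and tests)
#include <sys/socket.h>  // socket, connect
#include <unistd.h>      // close
#include <string>

// Hypothetical variant of createClientSocket built on getaddrinfo.
// Returns a connected descriptor, or -1 if no candidate address worked.
int createClientSocketWithGetaddrinfo(const std::string& host, unsigned short port) {
    struct addrinfo hints = {};
    hints.ai_family = AF_UNSPEC;      // accept IPv4 or IPv6 results
    hints.ai_socktype = SOCK_STREAM;  // reliable 2-way connection

    // getaddrinfo takes the port as a string (a "service" name or number)
    std::string service = std::to_string(port);
    struct addrinfo *results = NULL;
    if (getaddrinfo(host.c_str(), service.c_str(), &hints, &results) != 0) return -1;

    // Try each candidate address until one connects
    int connected = -1;
    for (struct addrinfo *candidate = results; candidate != NULL; candidate = candidate->ai_next) {
        int s = socket(candidate->ai_family, candidate->ai_socktype, candidate->ai_protocol);
        if (s < 0) continue;
        if (connect(s, candidate->ai_addr, candidate->ai_addrlen) == 0) {
            connected = s;
            break;
        }
        close(s);
    }
    freeaddrinfo(results);  // the result list is heap-allocated, unlike hostent
    return connected;
}
```

Unlike the statically allocated struct hostent, the list returned by getaddrinfo must be released with freeaddrinfo.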
createServerSocket
- Create a new socket descriptor - socket()
- Bind this socket to a given port and IP address - bind()
- Make the socket descriptor passive to listen for incoming requests - listen()
- Return socket descriptor
int createServerSocket(unsigned short port, int backlog = kDefaultBacklog);
createServerSocket
- Create a new socket descriptor - socket()
int createServerSocket(unsigned short port, int backlog = kDefaultBacklog);
int createServerSocket(unsigned short port, int backlog) {
int s = socket(AF_INET, SOCK_STREAM, 0);
if (s < 0) return -1;
...
}
createServerSocket
2. Bind this socket to a given port and IP address - bind()
int createServerSocket(unsigned short port, int backlog) {
...
struct sockaddr_in address;
memset(&address, 0, sizeof(address));
address.sin_family = AF_INET;
address.sin_addr.s_addr = htonl(INADDR_ANY);
address.sin_port = htons(port);
if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
...
}
bind "associates a name with a socket"
- says to bind the specified socket to the specified address(es)
- we specify INADDR_ANY to have this socket associated with all of this machine's local IP addresses. Clients can connect to any of the machine's IP addresses via the specified port.
Specify:
- family
- address
- port
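As a small experiment with bind, the sketch below binds a socket to INADDR_ANY and then uses getsockname to read back the port the kernel actually assigned. The helper name bindToAllAddresses is invented for this example. Passing port 0 asks the kernel to pick any free port, which is convenient for trying this out without colliding with a port already in use.

```cpp
#include <arpa/inet.h>   // htonl, htons, ntohs
#include <netinet/in.h>  // struct sockaddr_in, INADDR_ANY
#include <sys/socket.h>  // socket, bind, getsockname
#include <unistd.h>      // close
#include <cstring>       // memset

// Hypothetical helper: creates a TCP socket and binds it to `port` on all of
// this machine's local IP addresses. On success, returns the descriptor and
// stores the port that was actually bound in *boundPort. Returns -1 on error.
int bindToAllAddresses(unsigned short port, unsigned short *boundPort) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;

    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;                 // family
    address.sin_addr.s_addr = htonl(INADDR_ANY);  // address: any local IP
    address.sin_port = htons(port);               // port, in network byte order

    socklen_t length = sizeof(address);
    if (bind(s, (struct sockaddr *)&address, sizeof(address)) != 0 ||
        getsockname(s, (struct sockaddr *)&address, &length) != 0) {
        close(s);
        return -1;
    }
    *boundPort = ntohs(address.sin_port);  // convert back to host byte order
    return s;
}
```

Note that a bound socket is not yet accepting connections. That only happens once listen is called on it.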
createServerSocket
3. Make the socket descriptor passive to listen for incoming requests - listen()
int createServerSocket(unsigned short port, int backlog) {
...
if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
listen(s, backlog) == 0) return s;
...
}
listen makes the specified socket passive: it waits for incoming connections, which the server then picks up with accept.
- backlog is how large the queue of pending connections can get before new ones are dropped.
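To show how a listening socket is used afterwards, here is a sketch that sets up a listening socket on the loopback address and then handles a single client by echoing back what it sent. Both helper names (listenOnLoopback, echoOneClient) are assumptions for this example, and binding to INADDR_LOOPBACK with port 0 is a choice made here so the sketch can run on any machine. The lecture's createServerSocket uses INADDR_ANY and a caller-supplied port instead.

```cpp
#include <arpa/inet.h>   // htonl, htons, ntohs
#include <netinet/in.h>  // struct sockaddr_in, INADDR_LOOPBACK
#include <sys/socket.h>  // socket, bind, listen, accept, getsockname
#include <unistd.h>      // read, write, close
#include <cstring>       // memset
#include <string>

// Hypothetical helper: listens on 127.0.0.1 at a kernel-chosen port, allowing
// up to `backlog` pending connections. Stores the chosen port in *port.
int listenOnLoopback(int backlog, unsigned short *port) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    address.sin_port = htons(0);  // 0 = let the kernel pick a free port
    socklen_t length = sizeof(address);
    if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
        listen(s, backlog) == 0 &&
        getsockname(s, (struct sockaddr *)&address, &length) == 0) {
        *port = ntohs(address.sin_port);
        return s;
    }
    close(s);
    return -1;
}

// Hypothetical helper: accepts one pending connection, reads up to 1024
// bytes from it, writes them back, and returns what was received.
// Returns an empty string on failure.
std::string echoOneClient(int listeningSocket) {
    int client = accept(listeningSocket, NULL, NULL);  // blocks until a client arrives
    if (client < 0) return "";
    char buffer[1024];
    std::string received;
    ssize_t count = read(client, buffer, sizeof(buffer));
    if (count > 0) {
        received.assign(buffer, count);
        if (write(client, buffer, count) != count) received.clear();
    }
    close(client);
    return received;
}
```

A client that connects before the server calls accept simply waits in the pending queue, which is the queue that backlog limits.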
createServerSocket
int createServerSocket(unsigned short port, int backlog) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;

    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_ANY);
    address.sin_port = htons(port);

    if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
        listen(s, backlog) == 0) return s;
    close(s);
    return -1;
}
- Create a new socket descriptor - socket()
- Bind this socket to a given port and IP address - bind()
- Make the socket descriptor passive to listen for incoming requests - listen()
- Return socket descriptor
CS110 Extra Topic: How can we parallelize data processing across many machines?
Plan For Today
- Implementing createClientSocket
- Implementing createServerSocket
- Extra topic: MapReduce
- Motivation: parallelizing computation
- What is MapReduce?
- [Extra] Further Research
Distributed Systems and Computation
- We have learned how concurrency can let us split up large tasks to perform simultaneously.
- We have learned how networking can let us write programs that communicate across machines.
- What happens if we put these two ideas together?
- Key Idea: take a task that isn't feasible to perform on one machine, and split it up over many machines that coordinate over the network.
- Distributed systems: systems that are spread out over multiple machines that coordinate with each other.
- MapReduce is a system that lets us easily split certain kinds of tasks among many machines. But first, let's explore the general idea of distributed computation.
Parallelizing Programs
Task: we want to count the frequency of words in a document.
Possible Approach: program that reads document and builds a word -> frequency map
How can we parallelize this?
Idea: split document into pieces, count words in each piece concurrently
Problem: what if a word appears in multiple pieces? We need to then merge the counts.
Idea: combine all the output, sort it, split into pieces, combine in each one concurrently
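The first idea (count each piece concurrently, then merge) can be sketched on a single machine with threads standing in for separate workers. This is an illustration only: the names WordCounts, countWords and countWordsInParallel are made up for this example, and a real distributed version would run each piece on a different machine rather than a different thread.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

typedef std::map<std::string, int> WordCounts;

// Builds a word -> frequency map for one piece of the document
WordCounts countWords(const std::string& piece) {
    WordCounts counts;
    std::istringstream words(piece);
    std::string word;
    while (words >> word) counts[word]++;
    return counts;
}

// Counts each piece on its own thread, then merges the per-piece counts.
// Each thread writes only to its own slot in `partial`, so no lock is needed.
WordCounts countWordsInParallel(const std::vector<std::string>& pieces) {
    std::vector<WordCounts> partial(pieces.size());
    std::vector<std::thread> workers;
    for (size_t i = 0; i < pieces.size(); i++) {
        workers.push_back(std::thread([&partial, &pieces, i] {
            partial[i] = countWords(pieces[i]);
        }));
    }
    for (std::thread& worker : workers) worker.join();

    // Merge step: a word may appear in several pieces, so sum its counts
    WordCounts total;
    for (const WordCounts& counts : partial) {
        for (const auto& entry : counts) total[entry.first] += entry.second;
    }
    return total;
}
```

Notice that the merge step here still runs on one thread. That is the part the second idea (combine, sort, split, combine again) tries to parallelize.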
Example: Counting Word Frequencies
Idea: split document into pieces, count words in each piece concurrently. Then, combine all the text output, sort it, split into pieces, sum each one concurrently.
Example: "the very very quick fox greeted the brown fox"
Split into pieces:
- the very very
- quick fox greeted
- the brown fox
Count words in each piece concurrently:
- piece 1: the, 1; very, 2
- piece 2: quick, 1; fox, 1; greeted, 1
- piece 3: the, 1; brown, 1; fox, 1
Combined: the, 1; very, 2; quick, 1; fox, 1; greeted, 1; the, 1; brown, 1; fox, 1
Sorted (then split into pieces to sum concurrently): brown, 1; fox, 1; fox, 1; greeted, 1; quick, 1; the, 1; the, 1; very, 2
Summed: brown, 1; fox, 2; greeted, 1; quick, 1; the, 2; very, 2
2 "phases" where we parallelize work
- map the input to some intermediate data representation
- reduce the intermediate data representation into final result
Example: Counting Word Frequencies
The first phase focuses on finding, and the second phase focuses on summing. So the first phase should only output 1s, and leave the summing for later.
Example: "the very very quick fox greeted the brown fox"
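Split into pieces:
- the very very
- quick fox greeted
- the brown fox
First-phase output before (summing within each piece):
- piece 1: the, 1; very, 2
- piece 2: quick, 1; fox, 1; greeted, 1
- piece 3: the, 1; brown, 1; fox, 1
First-phase output after (1s only), shown for piece 1:
- piece 1: the, 1; very, 1; very, 1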
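The two phases can be sketched as two small functions: a map step that emits a "word, 1" pair for every occurrence, and a reduce step that sums the counts of adjacent equal words in sorted input. This is a single-process illustration with invented names (KeyValue, mapPiece, reduceSorted, wordFrequencies), not the MapReduce library's API.

```cpp
#include <algorithm>  // sort
#include <sstream>
#include <string>
#include <utility>    // pair
#include <vector>

typedef std::pair<std::string, int> KeyValue;

// Map phase: only finds words. Emits (word, 1) for each occurrence and
// leaves all summing for later.
std::vector<KeyValue> mapPiece(const std::string& piece) {
    std::vector<KeyValue> emitted;
    std::istringstream words(piece);
    std::string word;
    while (words >> word) emitted.push_back(KeyValue(word, 1));
    return emitted;
}

// Reduce phase: given pairs sorted by word, sums the counts of equal words.
// Sorting guarantees that equal words are adjacent.
std::vector<KeyValue> reduceSorted(const std::vector<KeyValue>& sorted) {
    std::vector<KeyValue> totals;
    for (const KeyValue& kv : sorted) {
        if (!totals.empty() && totals.back().first == kv.first) {
            totals.back().second += kv.second;
        } else {
            totals.push_back(kv);
        }
    }
    return totals;
}

// Runs the whole pipeline in one process: map each piece, combine, sort, reduce.
std::vector<KeyValue> wordFrequencies(const std::vector<std::string>& pieces) {
    std::vector<KeyValue> combined;
    for (const std::string& piece : pieces) {
        std::vector<KeyValue> emitted = mapPiece(piece);
        combined.insert(combined.end(), emitted.begin(), emitted.end());
    }
    std::sort(combined.begin(), combined.end());
    return reduceSorted(combined);
}
```

Each call to mapPiece is independent of the others, which is what makes the map phase easy to spread across threads or machines.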
Example: Counting Word Frequencies
Split into pieces:
- the very very
- quick fox greeted
- the brown fox
Map output (one "word, 1" pair per occurrence):
- piece 1: the, 1; very, 1; very, 1
- piece 2: quick, 1; fox, 1; greeted, 1
- piece 3: the, 1; brown, 1; fox, 1
Combined: the, 1; very, 1; very, 1; quick, 1; fox, 1; greeted, 1; the, 1; brown, 1; fox, 1
Sorted (then split into pieces to sum concurrently): brown, 1; fox, 1; fox, 1; greeted, 1; quick, 1; the, 1; the, 1; very, 1; very, 1
Reduce output: brown, 1; fox, 2; greeted, 1; quick, 1; the, 2; very, 2
2 "phases" where we parallelize work
- map the input to some intermediate data representation
- reduce the intermediate data representation into final result
Question: is there a way to parallelize this operation as well?
Recap
- Implementing createClientSocket
- Implementing createServerSocket
- Extra topic: MapReduce
- Motivation: parallelizing computation
Next time: more MapReduce and CS110 wrap-up / systems principles
CS110 Lecture 24: Networking System Calls and MapReduce
By Nick Troccoli