CS110 Lecture 24: Networking Functions and MapReduce

CS110: Principles of Computer Systems

Winter 2021-2022

Stanford University

Instructors: Nick Troccoli and Jerry Cain

CS110 Topic 4: How can we write programs that communicate over a network with other programs?

Learning About Networking

  • Introduction to Networking (Lecture 20)
  • Servers / HTTP (Lecture 21)
  • HTTP and APIs (Lecture 22/23)
  • Networking System Calls / Library Functions (Today)

assign6: implement an HTTP Proxy that sits between a client device and a web server to monitor, block or modify web traffic.

Learning Goals

  • Learn about the implementations of createClientSocket and createServerSocket
  • Apply our knowledge of networking and concurrency to understand MapReduce
  • Learn about the MapReduce library and how it parallelizes operations
  • Understand how to write a program that can be run with MapReduce

Plan For Today

  • Implementing createClientSocket
  • Implementing createServerSocket
  • Extra topic: MapReduce

createClientSocket and createServerSocket

Let's see the underlying system calls and library functions needed to implement createClientSocket and createServerSocket!

  • Goal: to see the kinds of functions required (you won't have to re-implement createClientSocket or createServerSocket)
  • Goal: to see the design decisions and language workarounds involved

Clients

We have used createClientSocket in client programs so far to connect to servers.  It gives us back a descriptor we can use to read/write data.

But how is the createClientSocket helper function actually implemented?

int main(int argc, char *argv[]) {
  // Open a connection to the server
  int socketDescriptor = createClientSocket("myth64.stanford.edu", 12345);

  // Read in the data from the server (sockbuf destructor closes descriptor)
  sockbuf socketBuffer(socketDescriptor);
  iosockstream socketStream(&socketBuffer);
  string timeline;
  getline(socketStream, timeline);

  // Print the data from the server
  cout << timeline << endl;

  return 0;
}

createClientSocket

  1. Check that the specified server and port are valid
  2. Create a new socket descriptor
  3. Associate this socket descriptor with a connection to that server
  4. Return the socket descriptor
int createClientSocket(const string& host, unsigned short port);

createClientSocket

  1. Check that the specified server and port are valid - gethostbyname()
  2. Create a new socket descriptor - socket()
  3. Associate this socket descriptor with a connection to that server - connect()
  4. Return the socket descriptor
int createClientSocket(const string& host, unsigned short port);
  • We check the validity of the host by attempting to look up its IP address
  • gethostbyname() gets IPv4 host info for the given name (e.g. "www.facebook.com")
  • gethostbyname2() can get IPv6 host info for the given name - second param can be AF_INET (for IPv4) or AF_INET6 (for IPv6)
  • Technically deprecated in favor of getaddrinfo, but still prevalent and good to know
  • Both return a pointer to a statically allocated struct hostent with the host's info (or NULL if error)
struct hostent *gethostbyname(const char *name);
struct hostent *gethostbyname2(const char *name, int af);
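Since the bullets above mention getaddrinfo as the modern replacement, here is a minimal sketch of the same "look up the host" step using it. The helper name lookupIPv4 is hypothetical (it is not part of the lecture code), and the sketch only looks at the first IPv4 result.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>

// Hypothetical helper (not from the lecture): resolves `host` to its first
// IPv4 address and returns it in dotted-quad form, or "" on failure.
std::string lookupIPv4(const std::string& host) {
    struct addrinfo hints = {};
    hints.ai_family = AF_INET;        // IPv4 only, like gethostbyname
    hints.ai_socktype = SOCK_STREAM;  // we only care about TCP-style results

    struct addrinfo *results = nullptr;
    if (getaddrinfo(host.c_str(), nullptr, &hints, &results) != 0) return "";

    // Each result carries a generic sockaddr *; for AF_INET it is really a sockaddr_in.
    const struct sockaddr_in *addr =
        reinterpret_cast<const struct sockaddr_in *>(results->ai_addr);
    char buffer[INET_ADDRSTRLEN];
    const char *text = inet_ntop(AF_INET, &addr->sin_addr, buffer, sizeof(buffer));
    std::string ip = (text != nullptr) ? text : "";

    freeaddrinfo(results);  // unlike gethostbyname, the caller owns the result list
    return ip;
}
```

Calling lookupIPv4 with a name such as "www.facebook.com" needs a working resolver; calling it with a numeric string such as "127.0.0.1" does not.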

createClientSocket

  1. Check that the specified server and port are valid - gethostbyname()

int createClientSocket(const string& host, unsigned short port) {
    struct hostent *he = gethostbyname(host.c_str());
    if (he == NULL) return -1;
    ...

createClientSocket

  1. Check that the specified server and port are valid - gethostbyname()
  2. Create a new socket descriptor - socket()

int socket(int domain, int type, int protocol);

int createClientSocket(const string& host, unsigned short port) {
    ...
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;
    ...

The socket function creates a socket endpoint and returns a descriptor.  

  • The first parameter is the protocol family (IPv4, IPv6, Bluetooth, etc.).  
  • The second parameter is the type of the connection - do we want a reliable 2-way connection, unreliable (but faster) connection, etc.?
  • The third parameter is the protocol (0 for default)
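To make the second parameter concrete, the following sketch creates a socket of a requested type and asks the kernel (via getsockopt with SO_TYPE) which type it recorded. The helper name socketTypeOf is an assumption made for illustration only.

```cpp
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: creates a socket with the given family and type
// (protocol 0 = default), then reports the type the kernel recorded for it.
// Returns -1 if the socket could not be created or queried.
int socketTypeOf(int domain, int type) {
    int s = socket(domain, type, 0);
    if (s < 0) return -1;

    int actual = -1;
    socklen_t length = sizeof(actual);
    if (getsockopt(s, SOL_SOCKET, SO_TYPE, &actual, &length) != 0) actual = -1;

    close(s);  // we only wanted to inspect the socket, not use it
    return actual;
}
```

SOCK_STREAM is the reliable 2-way option used by createClientSocket; SOCK_DGRAM is the unreliable (but lower overhead) datagram option.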

createClientSocket

  1. Check that the specified server and port are valid - gethostbyname()
  2. Create a new socket descriptor - socket()
  3. Associate this socket descriptor with a connection to that server - connect()
int connect(int clientfd, const struct sockaddr *addr, socklen_t addrlen);

connect connects the specified socket to the specified address.

  • Wait a minute - we could be using IPv4 or IPv6.  How can we have the same parameter types for both?

connect()

int connect(int clientfd, const struct sockaddr *addr, socklen_t addrlen);

There are actually multiple different types of address structs we may want to pass in: sockaddr_in (IPv4) and sockaddr_in6 (IPv6).  How can we handle these possibilities?  C doesn't support inheritance or templates.

  • First idea: we could make a new version of connect for each type (not great)
  • Second idea: we could specify the parameter type as void * (but then how would we know the real type?)
  • Third idea: we could make the parameter type a "parent type" called sockaddr, whose leading field has the same layout as the leading field of sockaddr_in and sockaddr_in6.
    • Its structure is a 2 byte type field followed by 14 bytes of something.
    • Both sockaddr_in and sockaddr_in6 will start with that 2 byte type field, and use the bytes after it for whatever they want (sockaddr_in6 needs more than 14 of them, which is one reason connect also takes addrlen).
    • connect can then check the type field before casting to the appropriate type
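The "check the type field, then cast" idea can be sketched from the receiving side. The function portOf below is a hypothetical example of what any function that accepts a generic sockaddr * might do; it is not how connect itself is implemented.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

// Hypothetical helper: given a generic socket address, returns its port in
// host byte order, or -1 if the address family is not one we understand.
int portOf(const struct sockaddr *addr) {
    switch (addr->sa_family) {  // the shared leading "type" field
        case AF_INET:
            // Safe to cast: the family says this is really a sockaddr_in.
            return ntohs(reinterpret_cast<const struct sockaddr_in *>(addr)->sin_port);
        case AF_INET6:
            // Likewise, the family says this is really a sockaddr_in6.
            return ntohs(reinterpret_cast<const struct sockaddr_in6 *>(addr)->sin6_port);
        default:
            return -1;
    }
}
```

Callers go the other direction: they fill in a sockaddr_in or sockaddr_in6 and cast its address to struct sockaddr * when passing it in.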

For reference, here are the generic sockaddr and its two concrete variants.  All three start with a 2 byte family field.  sockaddr_in fits in the same 16 bytes as sockaddr, while sockaddr_in6 is larger, so code that receives a sockaddr * must check the family field before casting.

struct sockaddr { // generic socket
    unsigned short sa_family; // protocol family for socket
    char sa_data[14]; // address data (and defines full size to be 16 bytes)
};
struct sockaddr_in { // IPv4 socket address record
    unsigned short sin_family;
    unsigned short sin_port;
    struct in_addr sin_addr;
    unsigned char sin_zero[8];
};
struct sockaddr_in6 { // IPv6 socket address record
    unsigned short sin6_family;
    unsigned short sin6_port;
    unsigned int sin6_flowinfo;
    struct in6_addr sin6_addr;
    unsigned int sin6_scope_id;
};

sockaddr_in

  • The sin_family field should store AF_INET for IPv4
  • The sin_port field stores a port number in network byte order.
    • Different machines may store multi-byte values in different orders (big endian, little endian).  But network data must be sent in a consistent format, and network byte order is big endian.
  • The sin_addr field stores the IPv4 address
  • The sin_zero field represents the remaining 8 bytes that are unused.
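As a small illustration of network byte order, this sketch converts a port with htons and exposes the two bytes exactly as they sit in memory. The helper name networkBytes is hypothetical.

```cpp
#include <arpa/inet.h>
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the two bytes of `port` as they are laid out
// in memory after converting to network byte order with htons.
std::array<unsigned char, 2> networkBytes(uint16_t port) {
    uint16_t converted = htons(port);
    std::array<unsigned char, 2> bytes;
    std::memcpy(bytes.data(), &converted, sizeof(converted));
    return bytes;
}
```

For port 12345 (0x3039), the bytes come back as 0x30 then 0x39 on both big-endian and little-endian hosts, because htons swaps only when the host order differs from network order.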

sockaddr_in6

  • The sin6_family field should store AF_INET6 for IPv6
  • The sin6_port field stores a port number in network byte order.
  • The sin6_addr field stores the IPv6 address
  • sin6_flowinfo and sin6_scope_id are beyond the scope of what we need, so we'll ignore them.

createClientSocket

  1. Check that the specified server and port are valid - gethostbyname()
  2. Create a new socket descriptor - socket()
  3. Associate this socket descriptor with a connection to that server - connect()
int createClientSocket(const string& host, unsigned short port) {
    ...
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_port = htons(port);
    address.sin_addr = ???;
    ...

htons is "host to network short" - it converts to network byte order, which may or may not be the same as the byte order your machine uses.

Specify:

  • family
  • address
  • port

We can get the IP address for the server from the struct hostent * returned by gethostbyname.

  • Wait a minute - gethostbyname and gethostbyname2 can give back different info (IPv4 vs. IPv6 addresses).  How can the return type be the same?
  • Key Idea: struct hostent will have a generic field in it which is a list of addresses; depending on whether it's IPv4 or IPv6, the list will be of a different type, and we can cast it to that type.
    • Why char **?  void * was not available as a generic pointer type when this API was designed, so char ** it is.

createClientSocket

struct hostent *gethostbyname(const char *name);
struct hostent *gethostbyname2(const char *name, int af);
struct hostent {
    ...

    // NULL-terminated list of IP addresses
    // This is really a struct in_addr ** when hostent contains IPv4 addresses
    // This is really a struct in6_addr ** when hostent contains IPv6 addresses
    char **h_addr_list; 
    
    ...
}; 
// h_addr is #define for h_addr_list[0]
struct in_addr first_ip = *((struct in_addr *)he->h_addr);
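The cast above reads only the first address. A minimal sketch that walks the whole NULL-terminated h_addr_list and formats each IPv4 address might look like the following; the helper name allIPv4Addresses is hypothetical, and the sketch assumes an IPv4-only lookup via gethostbyname.

```cpp
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <string>
#include <vector>

// Hypothetical helper: returns every IPv4 address gethostbyname reports for
// `host`, in dotted-quad form. Returns an empty vector if the lookup fails.
std::vector<std::string> allIPv4Addresses(const std::string& host) {
    std::vector<std::string> addresses;
    struct hostent *he = gethostbyname(host.c_str());
    if (he == NULL || he->h_addrtype != AF_INET) return addresses;

    // h_addr_list is declared as char **, but for AF_INET it is really a
    // NULL-terminated array of struct in_addr *.
    struct in_addr **list = reinterpret_cast<struct in_addr **>(he->h_addr_list);
    for (size_t i = 0; list[i] != NULL; i++) {
        char buffer[INET_ADDRSTRLEN];
        if (inet_ntop(AF_INET, list[i], buffer, sizeof(buffer)) != NULL) {
            addresses.push_back(buffer);
        }
    }
    return addresses;
}
```

Because the struct hostent is statically allocated, the sketch copies the addresses out into strings before returning rather than holding on to the pointer.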

createClientSocket

  1. Check that the specified server and port are valid - gethostbyname()
  2. Create a new socket descriptor - socket()
  3. Associate this socket descriptor with a connection to that server - connect()
int createClientSocket(const string& host, unsigned short port) {
    ...
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_port = htons(port);

    // h_addr is #define for h_addr_list[0]
    address.sin_addr = *((struct in_addr *)he->h_addr); 
    if (connect(s, (struct sockaddr *) &address, sizeof(address)) == 0) return s;
    ...

createClientSocket

  1. Check that the specified server and port are valid - gethostbyname()
  2. Create a new socket descriptor - socket()
  3. Associate this socket descriptor with a connection to that server - connect()
  4. Return the socket descriptor
int createClientSocket(const string& host, unsigned short port) {
    struct hostent *he = gethostbyname(host.c_str());
    if (he == NULL) return -1;
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_port = htons(port);

    // h_addr is #define for h_addr_list[0]
    address.sin_addr = *((struct in_addr *)he->h_addr); 
    if (connect(s, (struct sockaddr *) &address, sizeof(address)) == 0) return s;

    close(s);
    return -1;
}

createServerSocket

  1. Create a new socket descriptor - socket()
  2. Bind this socket to a given port and IP address - bind()
  3. Make the socket descriptor passive to listen for incoming requests - listen()
  4. Return socket descriptor
int createServerSocket(unsigned short port, int backlog = kDefaultBacklog);

createServerSocket

  1. Create a new socket descriptor - socket()
int createServerSocket(unsigned short port, int backlog = kDefaultBacklog);
int createServerSocket(unsigned short port, int backlog) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;
    ...
}

createServerSocket

2. Bind this socket to a given port and IP address - bind()

int createServerSocket(unsigned short port, int backlog) {
    ...
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_ANY);
    address.sin_port = htons(port);
    if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
            ...

}

bind "associates a name with a socket"

  • says to bind the specified socket to the specified address(es)
  • we specify INADDR_ANY to have this socket associated with all of this machine's local IP addresses.   Clients can connect to any of the machine's IP addresses via the specified port.

Specify:

  • family
  • address
  • port
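One way to see what bind did with the family, address and port we specified is to ask the socket afterwards with getsockname. The helper describeBoundAddress below is a hypothetical sketch, not part of createServerSocket.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Hypothetical helper: binds a fresh socket to INADDR_ANY on `port` and
// returns the local address reported by getsockname as "ip:port", or "" on
// failure. Passing port 0 asks the kernel to choose a free port.
std::string describeBoundAddress(unsigned short port) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return "";

    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_ANY);
    address.sin_port = htons(port);

    std::string description;
    struct sockaddr_in bound;
    socklen_t length = sizeof(bound);
    char buffer[INET_ADDRSTRLEN];
    if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
        getsockname(s, (struct sockaddr *)&bound, &length) == 0 &&
        inet_ntop(AF_INET, &bound.sin_addr, buffer, sizeof(buffer)) != NULL) {
        description = std::string(buffer) + ":" + std::to_string(ntohs(bound.sin_port));
    }

    close(s);
    return description;
}
```

With INADDR_ANY the reported IP is 0.0.0.0, which is the wildcard meaning "any of this machine's local addresses".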

createServerSocket

3. Make the socket descriptor passive to listen for incoming requests - listen()

int createServerSocket(unsigned short port, int backlog) {
    ...
    if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
            listen(s, backlog) == 0) return s;

    ...
}

listen makes the specified socket passive - one that waits for incoming connections, which the server then picks up via accept.

  • backlog is how large the queue of pending connections can get before new ones are dropped or refused.

createServerSocket

int createServerSocket(unsigned short port, int backlog) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_ANY);
    address.sin_port = htons(port);
    if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
            listen(s, backlog) == 0) return s;

    close(s);
    return -1;
}
  1. Create a new socket descriptor - socket()
  2. Bind this socket to a given port and IP address - bind()
  3. Make the socket descriptor passive to listen for incoming requests - listen()
  4. Return socket descriptor
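To see the client-side and server-side steps work together, here is a minimal single-process sketch that listens on the loopback address, connects to itself, and passes one message across. Both function names are hypothetical, and unlike createServerSocket the sketch binds to INADDR_LOOPBACK rather than INADDR_ANY so that it never accepts outside traffic.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Same socket/bind/listen steps as createServerSocket, but loopback-only.
static int listenOnLoopback(unsigned short port, int backlog) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) return -1;
    struct sockaddr_in address;
    memset(&address, 0, sizeof(address));
    address.sin_family = AF_INET;
    address.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    address.sin_port = htons(port);  // port 0 asks the kernel for a free port
    if (bind(s, (struct sockaddr *)&address, sizeof(address)) == 0 &&
        listen(s, backlog) == 0) return s;
    close(s);
    return -1;
}

// Hypothetical demo: sends `message` from a client socket to a server socket
// in the same process and returns what the server side received.
std::string loopbackRoundTrip(const std::string& message) {
    int server = listenOnLoopback(0, 1);
    if (server < 0) return "";

    // Ask the server socket which address and port it ended up with.
    struct sockaddr_in bound;
    socklen_t length = sizeof(bound);
    if (getsockname(server, (struct sockaddr *)&bound, &length) != 0) {
        close(server);
        return "";
    }

    // Client side: socket() then connect(), as in createClientSocket.
    int client = socket(AF_INET, SOCK_STREAM, 0);
    if (client < 0) { close(server); return ""; }
    if (connect(client, (struct sockaddr *)&bound, sizeof(bound)) != 0) {
        close(client);
        close(server);
        return "";
    }

    // The pending connection waits in the listen queue until we accept it.
    int connection = accept(server, NULL, NULL);
    std::string received;
    if (connection >= 0) {
        if (write(client, message.data(), message.size()) == (ssize_t)message.size()) {
            shutdown(client, SHUT_WR);  // signal end of data to the reader
            char buffer[256];
            ssize_t count;
            while ((count = read(connection, buffer, sizeof(buffer))) > 0) {
                received.append(buffer, count);
            }
        }
        close(connection);
    }
    close(client);
    close(server);
    return received;
}
```

The sketch relies on the message being small enough to fit in the socket's send buffer, so the single write does not block before the read loop starts.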

CS110 Extra Topic: How can we parallelize data processing across many machines?

Plan For Today

  • Implementing createClientSocket
  • Implementing createServerSocket
  • Extra topic: MapReduce
    • Motivation: parallelizing computation
    • What is MapReduce?
    • [Extra] Further Research

Distributed Systems and Computation

  • We have learned how concurrency can let us split up large tasks to perform simultaneously.
  • We have learned how networking can let us write programs that communicate across machines.
  • What happens if we put these two ideas together?
  • Key Idea: take a task that isn't feasible to perform on one machine, and split it up over many machines that coordinate over the network.
    • Distributed systems: systems that are spread out over multiple machines that coordinate with each other.
  • MapReduce is a system that lets us easily split certain kinds of tasks among many machines.  But first, let's explore the general idea of distributed computation.

Parallelizing Programs

Task: we want to count the frequency of words in a document.

Possible Approach: program that reads document and builds a word -> frequency map

How can we parallelize this?

Idea: split document into pieces, count words in each piece concurrently

Problem: what if a word appears in multiple pieces?  We need to then merge the counts.

Idea: combine all the output, sort it, split into pieces, combine in each one concurrently
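A minimal sketch of the first idea and its problem: count each piece on its own, then merge the per-piece maps so that words appearing in several pieces get a single total. The names WordCounts, countWords and mergeCounts are hypothetical, and the sketch runs the pieces one after another; each countWords call is independent, so each could run in its own thread or on its own machine.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

using WordCounts = std::map<std::string, int>;

// Builds a word -> frequency map for one piece of the document.
WordCounts countWords(const std::string& piece) {
    WordCounts counts;
    std::istringstream stream(piece);
    std::string word;
    while (stream >> word) counts[word]++;
    return counts;
}

// Merges the per-piece maps, summing counts for words seen in several pieces.
WordCounts mergeCounts(const std::vector<WordCounts>& partials) {
    WordCounts total;
    for (const WordCounts& partial : partials) {
        for (const auto& entry : partial) total[entry.first] += entry.second;
    }
    return total;
}
```

Note that mergeCounts is a single sequential step here, which is the part the second idea tries to split up.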

Example: Counting Word Frequencies

Idea: split document into pieces, count words in each piece concurrently.  Then, combine all the text output, sort it, split into pieces, sum each one concurrently.

Example: "the very very quick fox greeted the brown fox"

Split into pieces, and count the words in each piece concurrently:

  • "the very very" -> the, 1 / very, 2
  • "quick fox greeted" -> quick, 1 / fox, 1 / greeted, 1
  • "the brown fox" -> the, 1 / brown, 1 / fox, 1

Combined:

the, 1
very, 2
quick, 1
fox, 1
greeted, 1
the, 1
brown, 1
fox, 1

Sorted:

brown, 1
fox, 1
fox, 1
greeted, 1
quick, 1
the, 1
the, 1
very, 2

Split the sorted list into pieces, and sum each piece concurrently:

brown, 1
fox, 2
greeted, 1
quick, 1
the, 2
very, 2

2 "phases" where we parallelize work

  1. map the input to some intermediate data representation
  2. reduce the intermediate data representation into final result
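The steps between the two phases can be sketched as follows: gather the intermediate pairs, sort them so equal words become adjacent, then sum each run of equal words. The names Pair, combineAndSort and reduceSorted are hypothetical, and this is a single-machine illustration rather than the MapReduce library's actual interface.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, int>;

// Concatenates the output of every map task and sorts it by word.
std::vector<Pair> combineAndSort(const std::vector<std::vector<Pair>>& mapOutputs) {
    std::vector<Pair> all;
    for (const auto& output : mapOutputs) {
        all.insert(all.end(), output.begin(), output.end());
    }
    std::sort(all.begin(), all.end());
    return all;
}

// Reduces a sorted list: adjacent pairs with the same word are summed.
std::vector<Pair> reduceSorted(const std::vector<Pair>& sorted) {
    std::vector<Pair> result;
    for (const Pair& pair : sorted) {
        if (!result.empty() && result.back().first == pair.first) {
            result.back().second += pair.second;
        } else {
            result.push_back(pair);
        }
    }
    return result;
}
```

Because sorting puts all pairs for one word next to each other, the sorted list can be cut at word boundaries and each piece handed to a separate reduce task.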

Example: Counting Word Frequencies

The first phase focuses on finding, and the second phase focuses on summing.  So the first phase should only output 1s, and leave the summing for later.
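A map step that only finds words and leaves all summing for later can be sketched like this. The function name mapPiece is hypothetical; it emits one (word, 1) pair per occurrence, in the order the words appear.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Emits (word, 1) for every word occurrence in the piece; no summing here.
std::vector<std::pair<std::string, int>> mapPiece(const std::string& piece) {
    std::vector<std::pair<std::string, int>> emitted;
    std::istringstream stream(piece);
    std::string word;
    while (stream >> word) emitted.emplace_back(word, 1);
    return emitted;
}
```

For the piece "the very very", this emits the, 1 then very, 1 then very, 1 instead of the, 1 and very, 2.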

Example: "the very very quick fox greeted the brown fox"

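Before (each piece outputs per-word counts):

  • "the very very" -> the, 1 / very, 2
  • "quick fox greeted" -> quick, 1 / fox, 1 / greeted, 1
  • "the brown fox" -> the, 1 / brown, 1 / fox, 1

After (each piece outputs a 1 for every occurrence), shown for the first piece:

  • "the very very" -> the, 1 / very, 1 / very, 1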

Example: Counting Word Frequencies

Split into pieces, and output a 1 for each word in each piece concurrently:

  • "the very very" -> the, 1 / very, 1 / very, 1
  • "quick fox greeted" -> quick, 1 / fox, 1 / greeted, 1
  • "the brown fox" -> the, 1 / brown, 1 / fox, 1

Combined:

the, 1
very, 1
very, 1
quick, 1
fox, 1
greeted, 1
the, 1
brown, 1
fox, 1

Sorted:

brown, 1
fox, 1
fox, 1
greeted, 1
quick, 1
the, 1
the, 1
very, 1
very, 1

Split the sorted list into pieces, and sum each piece concurrently:

brown, 1
fox, 2
greeted, 1
quick, 1
the, 2
very, 2

2 "phases" where we parallelize work

  1. map the input to some intermediate data representation
  2. reduce the intermediate data representation into final result

Question: is there a way to parallelize the combine and sort steps as well?

Recap

  • Implementing createClientSocket
  • Implementing createServerSocket
  • Extra topic: MapReduce
    • Motivation: parallelizing computation

Next time: more MapReduce and CS110 wrap-up / systems principles