Kirk Haines
khaines@engineyard.com
RailsConf 2016
Let's Build a Web Server!
https://github.com/engineyard/railsconf2016-webservers
Kirk Haines
• Rubyist since 2001
• First professional Ruby web app in 2002
• Dozens of sites and apps, and several web servers since
• Engine Yard since 2008
• Former Ruby 1.8.6 maintainer
khaines@engineyard.com
@wyhaines
What Is a Web Server?
A web server is an information technology that processes requests via HTTP, the basic network protocol used to distribute information on the World Wide Web.
https://en.wikipedia.org/wiki/Web_server
A web server is just a server that accepts HTTP requests, and returns HTTP responses.
In The Beginning...
- Tim Berners-Lee, of CERN
- Information sharing between scientists
- Original web browser and web server implementations
- NeXT and Apple Macintosh - 50€ / server
- Source code - 50€ / site
- Open sourced on April 30, 1993
What Does It Look Like?
HTTP
HyperText Transfer Protocol
Defines:
How to make a request to an HTTP server (a web server)
How the server should respond to requests
Very Simple Beginnings
Request a document
GET document-path
Valid response was simply to return the document.
"A Shell Server for HTTP"
(slightly modified from the original)
http://info.cern.ch/hypertext/WWW/Provider/ShellScript.html
#! /bin/sh
read get docid junk
cat `echo "$docid"` | \
ruby -p -e '$_.gsub(/^\/ [\r\n]/,"").chomp'
Run the Server
Access the Server
That's Terrifying
i.e. Don't Actually Do This
(super_simple.sh)
A (scary, limited) web server!
This is easy!!!
#! /bin/sh
read get docid junk
cat `echo "$docid" | \
ruby -p -e '$_.gsub!(/^\/ [\r\n]/,"").chomp'`
netcat -l -p 5000 -e ./super_simple.sh
That Original Server...
Text
for (;;) {
status = read(soc, command, COMMAND_SIZE);
if (status<=0) {
if (TRACE) printf("read returned %i, errno=%i\n", status, errno);
return status; /* Return and close file */
}
command[status]=0; /* terminate the string */
#ifdef VM
{
char * p;
for (p=command; *p; p++) {
*p = FROMASCII(*p);
if (*p == '\n') *p = ' ';
}
}
#endif
if (TRACE) printf("read string is `%s'\n", command);
arg=index(command,' ');
if (arg) {
*arg++ = 0; /* Terminate command & move on */
arg = strip(arg); /* Strip leading & trailing space */
if (0==strcmp("GET", command)) { /* Get a file */
/* Remove host and any punctuation. (Could use HTParse to remove access too @)
*/
filename = arg;
if (arg[0]=='/') {
if (arg[1]=='/') {
filename = index(arg+2, '/'); /* Skip //host/ */
if (!filename) filename=arg+strlen(arg);
} else {
filename = arg+1; /* Assume root: skip slash */
}
}
if (*filename) { /* (else return test message) */
keywords=index(filename, '?');
if (keywords) *keywords++=0; /* Split into two */
https://github.com/NotTheRealTimBL/WWWDaemon/blob/master/old/V0.1/daemon.c#L214
Monolithic Early Versions
/* TCP/IP based server for HyperText TCPServer.c
** ---------------------------------
**
** History:
** 2 Oct 90 Written TBL. Include filenames for VM from RTB.
*/
It's a file, daemon.c, that changed the direction of the internet forever.
https://github.com/NotTheRealTimBL/WWWDaemon/blob/master/old/V0.1/daemon.c#L214
By Open Source Release, Better Organized
Anchor.h HTFTP.h HTStyle.h HyperAccess.h
NewsAccess.m TextToy.m Anchor.m HTFile.c
HTStyle.m HyperAccess.m ParseHTML.h WWW.h
FileAccess.h HTFile.h HTTCP.c HyperManager.h
StyleToy.h WWWPageLayout.h FileAccess.m HTParse.c
HTTCP.h HyperManager.m StyleToy.m WWWPageLayout.m
HTAccess.c HTParse.h HTTP.c HyperText.h
TcpAccess.h WorldWideWeb_main.m HTAccess.h HTString.c
HTTP.h HyperText.m TcpAccess.m tcp.h
HTFTP.c HTString.h HTUtils.h NewsAccess.h
TextToy.h
This code set the stage for the internet as we know it.
Just Because....
/* TCP/IP based server for HyperText HTDaemon.c
** ---------------------------------
**
**
** Compilation options:
** RULES If defined, use rule file and translation table
** DIR_OPTIONS If defined, -d options control directory access
**
** Authors:
** TBL Tim Berners-Lee, CERN
** JFG Jean-Francois Groff, CERN
** JS Jonathan Streets, FNAL
**
** History:
** Sep 91 TBL Made from earlier daemon files. (TBL)
** 26 Feb 92 JFG Bug fixes for Multinet.
** 8 Jun 92 TBL Bug fix: Perform multiple reads in case we don't get
** the whole command first read.
** 25 Jun 92 JFG Added DECNET option through TCP socket emulation.
** 6 Jan 93 TBL Plusses turned to spaces between keywords
** 7 Jan 93 JS Bug fix: addrlen had not been set for accept() call
*/
/* (c) CERN WorldWideWeb project 1990-1992. See Copyright.html for details */
Things Evolved...
HTTP 0.9 written as implemented
- Assume TCP; default to port 80
- Server accepts connection
- Client sends ASCII request: GET document-address\r\n
- or a search: GET document-address?search+term\r\n
- Server returns LF or CR/LF delimited HTML lines in response
- Plain text responses are preceded by a <PLAINTEXT> line
- Errors are human readable HTML and aren't differentiated from a non-error response
- Server disconnects when the document is transmitted
https://www.w3.org/Protocols/HTTP/AsImplemented.html
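The request format listed above is small enough to sketch in a few lines of Ruby. This is only an illustration of the HTTP 0.9 request line as described in the list; `parse_http09_request` and the keys of the hash it returns are names made up for this example, not anything from the CERN code.

```ruby
# Hypothetical helper: parse an HTTP 0.9 request line such as
# "GET /docs/index.html\r\n" or "GET /index?search+term\r\n".
# Returns nil for anything that is not a bare GET request.
def parse_http09_request(line)
  match = line.match(/\AGET +(\S+)\r?\n?\z/)
  return nil unless match

  # Everything after the first "?" is a search; terms are separated by "+".
  address, query = match[1].split('?', 2)
  terms = query ? query.split('+') : []
  { address: address, search_terms: terms }
end
```

A request line that carries a protocol version (as in HTTP 1.0) does not match here, which is the point: HTTP 0.9 had nothing after the document address.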
CERN provided:
pseudocode of implementation architecture
very basic examples, including shell script examples
and some amazingly worded advice:
If you know the perl language, then that is a powerful (if otherwise incomprehensible) language with which to hack together a server.
Easy to Implement
An Important Rule
If an implementation receives a header that it does not understand, it must ignore the header.
HTTP 1.0
Described in RFC 1945, from May 1996, HTTP 1.0 documented "common usage" instead of being a formal specification. That is, like HTTP 0.9, HTTP 1.0 described core features of extant usage of HTTP.
20 years later, the capabilities and protocol structure described in HTTP 1.0 still form the core of communication content and structure between HTTP clients and servers
http://www.isi.edu/in-notes/rfc1945.txt
Wildly Successful
More Diverse Request Methods
- GET
- POST
- HEAD
- PUT
- DELETE
- OPTIONS
- CONNECT
- other extensions allowed
Additional Data Headers
Allowed the client to pass additional information to the server via "HTTP Headers". These are composed of lines, terminated with CR/LF, and formatted as:
key: value
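As a rough sketch of the "key: value" format, the following Ruby reads a block of header lines into a hash. The name `parse_headers` is hypothetical, and lower-casing the keys is a choice made for this example so that lookups are case-insensitive.

```ruby
# Hypothetical helper: turn raw "key: value" header lines into a Hash.
# Parsing stops at the blank line that ends the header section.
def parse_headers(raw)
  headers = {}
  raw.each_line do |line|
    line = line.chomp # removes a trailing CR/LF or LF
    break if line.empty?

    key, value = line.split(':', 2)
    next if value.nil? # not a "key: value" line; skip it

    headers[key.strip.downcase] = value.strip
  end
  headers
end
```

Headers the server does not recognize simply sit in the hash unused, which is one easy way to follow the "ignore what you don't understand" rule.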
Core HTTP 1.0 Request Headers
- Authorization
- From
- If-Modified-Since
- Referer
- User-Agent
- Pragma: no-cache
Expanded Server Responses
Status Line
All responses are initiated with a status line.
Indicates the HTTP protocol version in effect, a numeric code for the type of response, and a phrase providing a reason for the response. Lines are terminated with LF or CR/LF.
For example:
HTTP/1.0 500 Internal Server Error
Status Codes
Let the client know what kind of response this is.
Code Range | Purpose |
---|---|
1xx: Informational | Not used, but reserved for future use |
2xx: Success | The action was successfully received, understood, and accepted. |
3xx: Redirection | Further action must be taken in order to complete the request |
4xx: Client Error | The request contains bad syntax or cannot be fulfilled |
5xx: Server Error | The server failed to fulfill an apparently valid request |
Specific Status Codes
Code | Description | Code | Description |
---|---|---|---|
200 | OK | 400 | Bad Request |
201 | Created | 401 | Unauthorized |
202 | Accepted | 403 | Forbidden |
204 | No Content | 404 | Not Found |
301 | Moved Permanently | 500 | Internal Server Error |
302 | Moved Temporarily | 501 | Not Implemented |
304 | Not Modified | 502 | Bad Gateway |
 | | 503 | Service Unavailable |
Response/Entity Headers
Allow the server to provide additional information about the response itself, or the payload of the response, to the client. These are ASCII text terminated with a LF or CR/LF, in the same format as the request headers.
Response Header | Entity Header |
---|---|
Location | Allow |
Server | Content-Encoding |
WWW-Authenticate | Content-Length |
 | Content-Type |
 | Expires |
 | Last-Modified |
Use What Makes Sense
Practical Minimum Header Sets
200 OK - Barebones Response
Content-Length: 12345
Content-Type: text/html
Server: WEBrick/1.3.1 (Ruby/2.3.0/2015-12-25)
Content-Length: 12345
Content-Type: text/html
Expires: Sun, 08 May 2016 16:07:32 GMT
Last-Modified: Sun, 17 Apr 2016 16:06:50 GMT
302 Moved Temporarily
Location: http://foo.com/blah.html
Entity Body
- Preceded by a single CR/LF line
- Encoded as appropriate for the Content-Encoding and Content-Type
- If Content-Length is provided, client will wait for that many bytes of data, and may disregard any extra data. Otherwise client accepts data until server disconnects.
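Putting the status line, headers, and entity body together, a response can be assembled as a single string. The sketch below is an assumption-laden illustration: `build_response` and `REASON_PHRASES` are names invented here, and only a handful of the status codes from the tables above are included.

```ruby
# Reason phrases for a few of the status codes listed earlier.
REASON_PHRASES = {
  200 => 'OK',
  302 => 'Moved Temporarily',
  404 => 'Not Found',
  500 => 'Internal Server Error'
}.freeze

# Hypothetical helper: status line, then headers, then a blank CR/LF line,
# then the entity body. Content-Length is added whenever there is a body.
def build_response(code, headers, body = '')
  lines = ["HTTP/1.0 #{code} #{REASON_PHRASES.fetch(code)}"]
  headers = headers.merge('Content-Length' => body.bytesize) unless body.empty?
  headers.each { |name, value| lines << "#{name}: #{value}" }
  lines.join("\r\n") + "\r\n\r\n" + body
end
```

Using `bytesize` rather than `length` matters once the body contains multi-byte characters, since Content-Length counts bytes.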
Putting It All Together
HTTP/0.9
HTTP 0.9 Request
Connect to server
Simple Response
No Status Line
No Headers
Putting It All Together
HTTP/1.0
HTTP 1.0 Request
Connect to server
Status line
Headers
Putting It All Together
HTTP/2.0
HTTP 1.0 Request
Connect to server
Status line
Headers
Doesn't immediately disconnect.....
Connection: Keep-Alive
Putting It All Together
Did You Notice?
HTTP/1.1 200 OK
Etag: 372b7ab-be-5713bcf2
Content-Type: text/html
Content-Length: 190
Last-Modified: Sun, 17 Apr 2016 16:42:26 GMT
Server: WEBrick/1.3.1 (Ruby/2.3.0/2015-12-25)
Date: Sun, 17 Apr 2016 17:15:54 GMT
Connection: close
<html>
<head>
<title>Example Doc for HTTP 1.0 Request</title>
</head>
<body>
<p>Here lies content.</p>
<p>It lived a simple life.</p>
</body>
</html>
HTTP/1.1 ?
What are these?
Etag
Date
Connection
A Server Isn't Strictly Constrained by Request Protocol
- Clients ignore what they don't understand.
- HTTP is generally backwards compatible.
- HTTP 1.0 amended to HTTP 1.1 after 3 years; this added more now-ubiquitous near-essential headers.
- Date - The time from the server point of view
- ETag - Hash value of the content; helps caches
- Connection - What happens after the response is returned
- WEBrick returns HTTP 1.1 responses for any HTTP 1.x requests.
https://tools.ietf.org/html/rfc2616
HTTP 1.1
The De Facto Modern Standard
RFC 2616, from June 1999
HTTP 1.0 was still lacking in some features that turned out to be very useful. HTTP 1.1 filled those gaps, and 17 years later it's still the backbone of HTTP server capability.
https://www.ietf.org/rfc/rfc2616.txt
Key Protocol Changes
- Send a Date if at all possible
- Connection Management
- Host Identification
- Message Transmission
- Bandwidth Optimization
- Caching
- Security
- Content Negotiation
- Future Proofing
- Errors
Connection: Keep-Alive
- A single page may require many assets to assemble it.
- Setup/teardown of many TCP requests is expensive.
- Expensive == slow
"Connection: Keep-Alive" permits multiple requests to reuse a single network connection between client and server.
Not part of RFC 1945; effectively bolted onto HTTP 1.0 because of its utility, but actually part of HTTP 1.1.
Connection: Keep-Alive
HTTP 1.0 clients must send "Connection: Keep-Alive" if they
support it. Otherwise, an HTTP 1.1+ server will assume no support and send:
Connection: close
"Close" tells the client that the connection will be closed after the complete response has been sent.
Connection: Keep-Alive
"Connection: close" is always safe. A client that has specified Keep-Alive will cope if the server doesn't honor it and returns "Connection: close"
HTTP 2.0 has much expanded Keep-Alive, supporting concurrent requests over a single connection.
HTTP 1.1+ servers should assume Keep-Alive unless client tells it otherwise. However...
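The decision described on these slides can be sketched as a single predicate. `keep_alive?` is a hypothetical name, and the sketch assumes the server has already extracted the HTTP version string and the raw value of the Connection header (or nil when it was absent).

```ruby
# Hypothetical helper: should the connection stay open after this response?
# - HTTP 1.0 clients must opt in with "Connection: Keep-Alive".
# - HTTP 1.1 clients get a persistent connection unless they send "close".
def keep_alive?(http_version, connection_header)
  tokens = connection_header.to_s.downcase.split(',').map(&:strip)
  if http_version == '1.0'
    tokens.include?('keep-alive')
  else
    !tokens.include?('close')
  end
end
```

A server that does not want to deal with persistent connections at all can skip this check and always answer with "Connection: close".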
Connection and
Single-Hop Headers
HTTP 1.1 allows special handling for headers that apply for only a single hop in a message's path.
For example, proxy servers that might add headers separate from what came from the originating client.
Any headers listed in the Connection header are removed by
the recipient before forwarding the message to the next hop.
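For a proxy, the rule above might look like the following sketch. It assumes headers are held in a hash with lower-cased names; `strip_hop_headers` is a name invented for this example.

```ruby
# Hypothetical helper: drop the Connection header and every header it names
# before a message is forwarded to the next hop.
# Assumes header names in the hash are already lower-cased.
def strip_hop_headers(headers)
  listed = headers.fetch('connection', '').split(',').map { |token| token.strip.downcase }
  headers.reject { |name, _value| name == 'connection' || listed.include?(name) }
end
```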
Host
Prior to HTTP 1.1, an HTTP server assumed that all requests to it were for the same host/site.
e.g. if site.a.foo.com and site.b.foo.com both resolve to a server at 192.168.23.23, an HTTP 1.0 server treats them both the same.
An HTTP 1.1+ server, though, can differentiate requests on 192.168.23.23 for site.a.foo.com and site.b.foo.com, and serve different responses for each.
Virtual Hosting
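A minimal sketch of virtual hosting: use the Host header to pick a document root. The `SITES` table, the directory paths, and `docroot_for` are all assumptions made up for illustration.

```ruby
# Hypothetical mapping from host name to document root.
SITES = {
  'site.a.foo.com' => '/var/www/site_a',
  'site.b.foo.com' => '/var/www/site_b'
}.freeze

# Hypothetical helper: choose a document root from the Host header value.
# Any port suffix ("site.a.foo.com:8080") is ignored, and unknown or
# missing hosts fall back to a default root.
def docroot_for(host_header, default = '/var/www/default')
  host = host_header.to_s.split(':').first.to_s.downcase
  SITES.fetch(host, default)
end
```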
Message Transmission
Imagine:
GET /media/giant_file.mpeg HTTP/1.0
HTTP/1.0 200 OK
Content-Type: video/mp4
Content-Length: 540387995
Connection: close
That is a lot of data to send in one big chunk, or for the server to potentially process in one big chunk. A Ruby string containing more than 500 megabytes would use a lot of RAM, for example.
Message Transmission
Wouldn't it be nice if a very large response, or one where the length isn't known when transmission starts, could be sent a little bit at a time?
Message Transmission
Chunked encoding to the rescue.
Send content in smaller pieces, by prepending the length of each chunk, in hexadecimal, on a line preceding the chunk itself.
Transfer-Encoding: chunked
<= Recv data, 105 bytes (0x69)
0000: 10
0004: I am some conten
0016: 10
001a: t that will be p
002c: 10
0030: arted out into a
0042: 10
0046: bunch of small
0058: 7
005b: chunks.
0064: 0
0067:
Message Transmission
- Lines terminated with CRLF
- Chunk is a length line, expressed in hexadecimal, followed by a content line.
- Chunks are terminated by an empty chunk with a length of zero.
- Content-Length must not be sent when using chunked encoding.
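The rules above can be sketched as a small encoder. `chunk_encode` is a hypothetical helper; a real server would more likely write each chunk to the socket as it becomes available rather than build one string, and the 16-byte default is only there to mirror the tiny chunks in the example dump.

```ruby
# Hypothetical helper: encode a body using chunked transfer encoding.
# Each chunk is its size in hexadecimal, CRLF, the data, CRLF.
# A zero-length chunk followed by a blank line ends the body.
def chunk_encode(body, chunk_size = 16)
  out = ''.dup
  offset = 0
  while offset < body.bytesize
    piece = body.byteslice(offset, chunk_size)
    out << piece.bytesize.to_s(16) << "\r\n" << piece << "\r\n"
    offset += chunk_size
  end
  out << "0\r\n\r\n"
end
```

Sizes are counted in bytes, so `byteslice` and `bytesize` are used instead of character-based slicing.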
Message Transmission
Beware Content-Length + Transfer-Encoding: chunked
If you are writing a server, chunked encoding is very useful for large content. However, beware code that blindly computes and attaches a Content-Length to all HTTP responses.
It "works", but it recently quit working in some cases - Windows 7 Chrome streaming to an external app like Adobe Reader, for example.
It's invalid HTTP, and so what "works" now may not tomorrow.
Bandwidth Optimization
Bandwidth is a very limited resource.
HTTP 1.0 had limited tools to allow clients and servers to make more efficient use of their bandwidth.
Compression was supported, but poorly.
Partial requests, or incremental requests, were not supported at all.
Bandwidth Optimization
HTTP 1.0 didn't distinguish between content encodings, such as compression, that applied to a message end-to-end versus hop-to-hop.
HTTP 1.0 had poor support for negotiating compression.
HTTP 1.1 specifically and extensively defines the protocol for both negotiation, and disambiguation between encoding inherent in the format of the message (end-to-end encoding) and encoding applied only for a single hop.
Compression
Bandwidth Optimization
Content-Encoding
This describes an encoding which is an inherent quality of the resource. It can be used by a server when returning a message to a client.
Compression
Bandwidth Optimization
Accept-Encoding
HTTP 1.1 carefully defines this header, which is used by a client to tell the server about encodings (such as gzip) that it can support for Content-Encoding, and its preferences.
Compression
Accept-Encoding:gzip, deflate, sdch
Bandwidth Optimization
Transfer-Encoding
This is intended to describe the hop-by-hop transport layer encoding of the resource. The intention of the HTTP protocol is that this header will be used when a server compresses content, such as a web page, on the fly, before transmission to a client in order to save bandwidth.
Compression
Bandwidth Optimization
TE
The TE header is akin to Accept-Encoding, but applies to the use of the Transfer-Encoding header. In addition to defining encodings, it can also be used to indicate whether the client is willing to accept trailer fields with chunked encoding. Read the RFC for more information about trailer fields.
Compression
TE:trailers, gzip; q=0.8, deflate; q=0.5
Bandwidth Optimization
Content-Encoding
Compression
Transfer-Encoding
Bandwidth Optimization
Content-Encoding vs Transfer-Encoding
Compression
Actual implementation in the real world of HTTP sometimes varies.
Imagine, if you will...
A client is willing to accept on-the-fly compression of web pages and other uncompressed assets.
A server is willing to send them.
Bandwidth Optimization
Content-Encoding vs Transfer-Encoding
Compression
By RFC, client can send:
TE: trailers, gzip; q=0.8, deflate; q=0.5
And server may respond:
Transfer-Encoding: gzip
Bandwidth Optimization
Content-Encoding vs Transfer-Encoding
Compression
In practice, your mileage may vary; most implementations use Accept-Encoding and Content-Encoding to negotiate and deliver on-the-fly compression of resources.
Bandwidth Optimization
Content-Encoding vs Transfer-Encoding
Compression
Client actually sends:
Accept-Encoding: gzip, deflate, sdch
Server then responds with:
Content-Encoding: gzip
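The common real-world exchange can be sketched with Ruby's zlib library. `encode_body` is a hypothetical name, and the sketch deliberately ignores q-values (including q=0, which means "not acceptable"), so treat it as a starting point rather than a complete negotiation.

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: gzip the body when the client's Accept-Encoding
# header lists gzip. Returns the (possibly compressed) body plus any
# headers that should be added to the response.
def encode_body(body, accept_encoding)
  encodings = accept_encoding.to_s.split(',').map { |e| e.split(';').first.to_s.strip.downcase }
  return [body, {}] unless encodings.include?('gzip')

  io = StringIO.new(String.new) # binary buffer for the compressed bytes
  gz = Zlib::GzipWriter.new(io)
  gz.write(body)
  gz.close
  [io.string, { 'Content-Encoding' => 'gzip' }]
end
```

Content-Length, if sent, has to describe the compressed bytes, not the original body.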
Bandwidth Optimization
Content-Encoding vs Transfer-Encoding
Compression
HTTP gets complicated very quickly in the real world, so be careful with features you choose to support and implement.
Another example. The deflate encoding is defined by RFC to be data compressed with the deflate (RFC 1951) algorithm, formatted into a zlib (RFC 1950) stream. Microsoft clients and servers historically treated it as a raw deflate stream, however, which was incompatible with an RFC compliant implementation.
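The difference between the two "deflate" interpretations is easy to see with Ruby's zlib bindings. In this sketch the helper names are made up; the negative window-bits argument is zlib's way of asking for a raw deflate stream with no zlib header or checksum.

```ruby
require 'zlib'

# zlib-wrapped deflate (RFC 1950 container around RFC 1951 data),
# which is what the HTTP RFC means by "deflate".
def zlib_deflate(data)
  Zlib::Deflate.deflate(data)
end

# Raw deflate (RFC 1951 only), as some implementations historically sent.
def raw_deflate(data)
  deflater = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS)
  deflater.deflate(data, Zlib::FINISH)
ensure
  deflater.close if deflater
end

# Decode a raw deflate stream.
def raw_inflate(data)
  inflater = Zlib::Inflate.new(-Zlib::MAX_WBITS)
  inflater.inflate(data)
ensure
  inflater.close if inflater
end
```

The two streams are not interchangeable, which is why a decoder expecting one format can choke on the other.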
Bandwidth Optimization
TL;DR Summary
Compression
Read the RFC, then check how it's actually implemented.
Bandwidth Optimization
Range Requests
Allows retrieval of only a fragment of a resource.
- Continue a failed transfer
- Tail a large, growing file (e.g. a web version of tail -f /var/log/messages)
- Retrieve document previews
- Retrieve a very large file over an extended period of time via range requests for smaller, manageable chunks.
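HTTP 1.1 expresses these requests with a Range header such as "bytes=0-499". The sketch below handles only a single byte range; `parse_byte_range` is a hypothetical helper, and multi-range requests ("bytes=0-99,200-299") are simply treated as unsupported.

```ruby
# Hypothetical helper: turn a Range header value into a Ruby Range of byte
# offsets for a resource of the given size, or nil if it cannot be satisfied.
#   "bytes=0-499"  first 500 bytes
#   "bytes=500-"   everything from offset 500 onward
#   "bytes=-100"   the last 100 bytes
def parse_byte_range(header, size)
  match = /\Abytes=(\d*)-(\d*)\z/.match(header.to_s.strip)
  return nil unless match

  first, last = match[1], match[2]
  return nil if first.empty? && last.empty?

  if first.empty? # suffix form: the last N bytes
    count = last.to_i
    return nil if count.zero?
    start = [size - count, 0].max
    finish = size - 1
  else
    start = first.to_i
    finish = last.empty? ? size - 1 : [last.to_i, size - 1].min
  end
  return nil if start > finish || start >= size

  start..finish
end
```

A server would answer a satisfiable range with a 206 (Partial Content) status and only the requested slice of the file.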
Content-Negotiation
The world has many languages, character sets, encodings, and media types. HTTP 1.0 provided a mechanism for clients to express content preferences, but it was ambiguously specified.
HTTP 1.1 is much more explicit and expansive.
Accept-Language: en, es;q=0.5, nl;q=0.1
English, Spanish, or Dutch as a last resort
Accept: text/html;q=1.0, text/*;q=0.8,
image/png;q=0.7, image/*;q=0.5,
*/*;q=0.1
HTML, or other text formats, PNGs, other image formats, or anything else.
Accept-Charset: iso-8859-1, utf-8;q=0.9
ISO-8859-1 or UTF-8
Accept-Encoding: gzip, compress;q=0.8, deflate;q=0.1
GZip, compress, or deflate as a last resort.
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
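All of these headers share the same "value;q=weight" shape, so one small parser can order the client's preferences. This is a sketch under simplifying assumptions: `parse_quality_list` is a made-up name, a missing q is treated as 1.0, and other parameters are ignored.

```ruby
# Hypothetical helper: parse a header such as "en, es;q=0.5, nl;q=0.1"
# into [value, quality] pairs, most preferred first. Entries with equal
# quality keep the order in which the client listed them.
def parse_quality_list(header)
  entries = header.to_s.split(',').each_with_index.map do |item, index|
    value, *params = item.split(';').map(&:strip)
    q = params.map { |param| param[/\Aq=([0-9.]+)\z/, 1] }.compact.first
    [value, q ? q.to_f : 1.0, index]
  end
  entries.sort_by { |_value, quality, index| [-quality, index] }
         .map { |value, quality, _index| [value, quality] }
end
```

A server can then walk the list in order and pick the first variant it is actually able to produce.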
Caching
Caching is love.
HTTP 1.0 supported caching.
HTTP 1.1 improves it.
- ETag
- Conditional Headers
- Cache-Control
- Vary
Caching
Last-Modified
A server should send a Last-Modified header with a response, provided that there is a reasonable and consistent way to determine this.
Ruby makes it simple to generate a properly formatted date:
require 'time'
File.mtime( resource ).httpdate
Tue, 15 Sep 2015 13:21:54 GMT
Caching
ETag
HTTP/1.1 200 OK
ETag: ab788a046ac8c135891669d8531d6fa9
Content-Type: image/jpeg
Content-Length: 114962
Transfer-Encoding: chunked
Date: Thu, 28 Apr 2016 12:25:35 GMT
An ETag is an entity tag. It is an opaque indicator of the uniqueness of a resource.
i.e. it is a hash
A server may construct the ETag in any way. The only requirement is that the ETag be comparable to others from the same server, and that identical tags indicate an identical resource.
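Since the construction is left up to the server, one simple option is a digest of the content. The sketch below assumes MD5 is acceptable for this purpose (it is used as a fingerprint, not for security) and wraps the value in double quotes, which is the form the HTTP RFCs give for entity tags; `etag_for` is a hypothetical name.

```ruby
require 'digest/md5'

# Hypothetical helper: derive an ETag from the bytes of a response body.
# Identical content always yields an identical tag.
def etag_for(content)
  %("#{Digest::MD5.hexdigest(content)}")
end
```

For large files, hashing on every request is wasteful; a tag built from cheaper attributes such as file size and modification time is a common alternative.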
Caching
Conditional Headers
As with everything in HTTP, the full specification is pretty involved: https://tools.ietf.org/html/rfc7232
- If-Match
- If-None-Match
- If-Modified-Since
- If-Unmodified-Since
- If-Range
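As a partial illustration, here is how a server might use two of these headers to decide between a full response and a 304 Not Modified. It is a sketch only: `not_modified?` is a made-up name, headers are assumed to be in a hash with lower-cased names, and the weak/strong validator distinctions from the RFC are ignored.

```ruby
require 'time'

# Hypothetical helper: true when the client's cached copy is still good.
# If-None-Match is checked first; If-Modified-Since is only a fallback.
def not_modified?(headers, etag, mtime)
  if (candidates = headers['if-none-match'])
    tags = candidates.split(',').map(&:strip)
    return tags.include?('*') || tags.include?(etag)
  end

  if (since = headers['if-modified-since'])
    begin
      return mtime.to_i <= Time.httpdate(since).to_i
    rescue ArgumentError
      return false # unparseable date: behave as if the header were absent
    end
  end

  false
end
```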
Caching
Cache-Control
Also complex:
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9
May be used by both clients and servers, in both requests and responses, to modify caching behavior.
Directives such as max-age can ensure relatively fresh responses, even in the face of clock skew, while the private and no-store directives help to hedge against caches keeping private data, or data that is fresh only for the immediate response. See the RFC for full details.
Caching
Vary
Content negotiation means that a single URL can potentially return multiple different resources.
Vary tells a caching implementation which request headers to consider, in addition to the URL, when deciding whether a cached response matches a new request.
For example:
Vary: Accept-Language
Factor the contents of Accept-Language into the generated ETag.
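From a cache's point of view, Vary can be pictured as extra ingredients in the lookup key. The following is a sketch of that idea, not how any particular cache is implemented; `cache_key` is a hypothetical helper and request headers are assumed to be in a hash with lower-cased names.

```ruby
# Hypothetical helper: build a cache lookup key from the URL plus the value
# of every request header named in the response's Vary header.
def cache_key(url, vary_header, request_headers)
  names = vary_header.to_s.split(',').map { |name| name.strip.downcase }.sort
  parts = names.map { |name| "#{name}=#{request_headers[name]}" }
  ([url] + parts).join('|')
end
```

Two requests for the same URL with different Accept-Language values then land in different cache entries, while requests that agree on every listed header share one.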
Security
Authentication
HTTP/1.0 defined Basic authentication, a challenge/response mechanism for authenticated access to a resource.
Server responds to a request with a WWW-Authenticate header.
HTTP/1.1 401 Unauthorized
Date: Sat, 30 Apr 2016 23:39:59 GMT
WWW-Authenticate: Basic realm="sekrit_stuff"
Expires: Sat, 30 Apr 2016 23:39:59 GMT
Last-Modified: Thu, 28 Apr 2016 19:54:23 GMT
Content-Length: 0
Content-Type: text/html
Client (typically) queries user for username/password, then reissues the request with an Authorization header containing those attributes. Client may cache and continue sending these attributes for any requests for the same realm.
Weakness of this is that the username / password is only Base64 encoded, so it is effectively transmitted in clear text.
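On the server side, handling Basic authentication comes down to issuing the challenge and decoding what the client sends back. The helper names below are invented for this sketch; the credentials arrive as "user:password" encoded with Base64, which Ruby can decode with `unpack('m')`.

```ruby
# Hypothetical helper: the challenge header a server sends with a 401.
def basic_challenge(realm)
  %(WWW-Authenticate: Basic realm="#{realm}"\r\n)
end

# Hypothetical helper: extract [user, password] from the value of an
# Authorization header, or nil if it is not usable Basic credentials.
def basic_credentials(header)
  scheme, encoded = header.to_s.split(' ', 2)
  return nil unless scheme && encoded && scheme.casecmp('basic').zero?

  user, password = encoded.unpack('m').first.split(':', 2)
  return nil if password.nil?

  [user, password]
end
```

Because decoding is this trivial for anyone who can see the traffic, Basic authentication is only reasonable over an encrypted connection.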
Security
Digest Authentication
Full details at: https://tools.ietf.org/html/rfc2617
Same fundamental sequence as Basic Authentication, but it includes more information, and uses hashing algorithms:
Server Response
HTTP/1.1 401 Unauthorized
Date: Sat, 30 Apr 2016 23:39:59 GMT
WWW-Authenticate: Digest realm="sekrit_stuff",
qop="auth",
nonce="58647bd2acd7935c4b058702c363e872",
opaque="05dd324d8d5129a65a4e0c8d34b9e1f3"
Content-Type: text/html
Content-Length: 0
Client Authenticated Request
GET /index.html HTTP/1.1
Host: localhost
Authorization: Digest username="Mufasa",
realm="testrealm@host.com",
nonce="f527aab47bb12114c31725da7df9fb7e",
uri="/index.html",
qop=auth,
nc=00000001,
cnonce="f1c4a297",
response="6629fae49393a05397450978507c4ef1",
opaque="05dd324d8d5129a65a4e0c8d34b9e1f3"
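The "response" value in the client request is a hash of hashes. A sketch of the calculation for the qop=auth case, following the formula in RFC 2617, is shown below; `digest_response` and its keyword names are made up for this example, and the server would compare its own result with what the client sent.

```ruby
require 'digest/md5'

# Hypothetical helper: compute the Digest "response" value for qop=auth.
#   HA1      = MD5(username:realm:password)
#   HA2      = MD5(method:uri)
#   response = MD5(HA1:nonce:nc:cnonce:qop:HA2)
def digest_response(username:, realm:, password:, http_method:, uri:, nonce:, nc:, cnonce:, qop: 'auth')
  ha1 = Digest::MD5.hexdigest("#{username}:#{realm}:#{password}")
  ha2 = Digest::MD5.hexdigest("#{http_method}:#{uri}")
  Digest::MD5.hexdigest([ha1, nonce, nc, cnonce, qop, ha2].join(':'))
end
```

The password itself never crosses the wire; only a hash that also depends on the server's nonce does, which is the main improvement over Basic authentication.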
Security
Proxy Authentication
Hop-by-hop version of the WWW-Authenticate and Authorization headers, intended for proxy usage.
Proxy-Authenticate
Proxy-Authorization
Use 407 Proxy Authentication Required instead of a 401 status line.
Errors
HTTP/1.1 added 24 new status codes, plus a mechanism for returning warnings on an ostensibly successful response.
Notable new status codes:
- 409 (Conflict), returned when a request would conflict with the current state of the resource. For example, a PUT request might violate a versioning policy.
- 410 (Gone), used when a resource has been removed permanently from a server, and to aid in the deletion of any links to the resource.
Errors
https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.46
Warnings
Warning headers provide caveats or additional information about an ostensibly successful response.
Can be used to provide additional information on caching operations or transformations. See the RFC if you are implementing complex cache/transformation behaviors.
It's Complex
HTTP is a mix of specification and description of common usage.
HTTP/1.1 https://www.w3.org/Protocols/rfc2616/rfc2616.html
It has evolved substantially over time.
HTTP/2.0 https://tools.ietf.org/html/rfc7540
HTTP downgrades well when features that a client wants are missing, so for any server implementation, start simply, then layer HTTP features, referring to the RFCs frequently. When in doubt, research how other implementations do it.
Web Server Architecture
There is nothing special about web server architecture.
It is just server architecture.
A web server is nothing more than a server that receives HTTP requests and returns HTTP responses, typically through standard socket communications.
require 'socket'

def run
  simple_server = TCPServer.new("0.0.0.0", 8080)
  loop do
    connection = simple_server.accept
    handle connection
  end
end

def get_request connection
  connection.gets
end

def process request
  "OK"
end

def handle connection
  request = get_request connection
  response = process request
  connection.puts response
  connection.close
end

run
Basic TCP Server
OK.... Server Architecture
• Setup
• Command line arguments
• Configuration
• Listen on socket(s)
• Enter main loop
• Accept socket connection
• Handle connection
• Be prepared to handle signals for clean shutdown
Setup
Command Line Arguments
Ruby has an absurd wealth of different libraries and frameworks for handling CLI option parsing and app creation.
OptionParser | http://ruby-doc.org/stdlib-2.3.0/libdoc/optparse/rdoc/OptionParser.html |
Highline | https://github.com/JEG2/highline |
Methadone | https://github.com/davetron5000/methadone |
GetOpt | https://github.com/djberg96/getopt |
Belafonte | https://github.com/ess/belafonte |
Main | https://github.com/ahoward/main |
GLI | https://github.com/davetron5000/gli |
Thor | https://github.com/erikhuda/thor |
Samovar | https://github.com/ioquatix/samovar |
...and about 30 others. Use what you like. Ruby has options!
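As one concrete example, here is roughly what option handling for a small server could look like with OptionParser from the standard library. The option names, defaults, and the `parse_options` helper are assumptions chosen for this sketch.

```ruby
require 'optparse'

# Hypothetical helper: parse server options from an argv-style array.
# Returns a hash with :host, :port and :docroot, using defaults for
# anything not given on the command line.
def parse_options(argv)
  options = { host: '0.0.0.0', port: 8080, docroot: '.' }
  OptionParser.new do |opts|
    opts.banner = 'Usage: server.rb [options]'
    opts.on('-a', '--address HOST', 'Address to bind to') { |value| options[:host] = value }
    opts.on('-p', '--port PORT', Integer, 'Port to listen on') { |value| options[:port] = value }
    opts.on('-d', '--docroot DIR', 'Directory to serve') { |value| options[:docroot] = value }
  end.parse(argv)
  options
end
```

Calling `parse_options(ARGV)` at startup is the typical usage; OptionParser also generates `--help` output from the descriptions.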
Setup
Configuration
As with command line options, Ruby offers a diverse set of simple tools for handling configuration files.
Ruby |
class AppConfiguration
  LISTEN_ON = ['0.0.0.0:80','127.0.0.1:5000']
end
require_relative 'app_configuration'
JSON File |
require 'json'
AppConfiguration = JSON.parse( File.read( config_file_path ) )
YAML File |
require 'yaml'
AppConfiguration = YAML.load( File.read( config_file_path ) )
Configatron | https://github.com/markbates/configatron |
Settingslogic | https://github.com/settingslogic/settingslogic |
Listen on Sockets
AKA Network Communications
A server needs a way to receive requests and to return responses.
The two most common options:
- Native Ruby Networking Support (TCPServer and friends)
- EventMachine
Listen on Sockets
Native Ruby Networking Support
Ruby has a rich set of networking libraries, making it easy to write TCP clients and servers.
require 'socket'

class SimpleServer < TCPServer
  # The address and port are given once, when the listening socket is created.
  def initialize( address = '127.0.0.1', port = 80 )
    super( address, port )
  end

  def run
    loop do
      socket = accept # self is the listening socket
      handle_request( socket.gets )
      socket.close
    end
  end

  def handle_request( req )
    # Do Stuff with req
  end
end
Main Loop
i.e. Concurrency Options
How a server handles concurrency is fundamental to its design.
The prior basic server template was a blocking server.
Handles a single request at a time.
Long running requests cause the OS to queue waiting requests on the socket.
Main Loop
Blocking Server
Short Tangent
Simplest Ruby Web Server....
ruby -run -e httpd -- -p 8080 .
OK....That's kind of cheating.
# frozen_string_literal: false
#
# = un.rb
#
# Copyright (c) 2003 WATANABE Hirofumi <eban@ruby-lang.org>
#
# This program is free software.
# You can distribute/modify this program under the same terms of Ruby.
#
# == Utilities to replace common UNIX commands in Makefiles etc
#
# == SYNOPSIS
#
# ruby -run -e cp -- [OPTION] SOURCE DEST
# ruby -run -e ln -- [OPTION] TARGET LINK_NAME
# ruby -run -e mv -- [OPTION] SOURCE DEST
# ruby -run -e rm -- [OPTION] FILE
# ruby -run -e mkdir -- [OPTION] DIRS
# ruby -run -e rmdir -- [OPTION] DIRS
# ruby -run -e install -- [OPTION] SOURCE DEST
# ruby -run -e chmod -- [OPTION] OCTAL-MODE FILE
# ruby -run -e touch -- [OPTION] FILE
# ruby -run -e wait_writable -- [OPTION] FILE
# ruby -run -e mkmf -- [OPTION] EXTNAME [OPTION]
# ruby -run -e httpd -- [OPTION] DocumentRoot
# ruby -run -e help [COMMAND]
Short Tangent
def httpd
  setup("", "BindAddress=ADDR", "Port=PORT", "MaxClients=NUM", "TempDir=DIR",
        "DoNotReverseLookup", "RequestTimeout=SECOND", "HTTPVersion=VERSION") do
    |argv, options|
    require 'webrick'
    opt = options[:RequestTimeout] and options[:RequestTimeout] = opt.to_i
    [:Port, :MaxClients].each do |name|
      opt = options[name] and (options[name] = Integer(opt)) rescue nil
    end
    options[:Port] ||= 8080 # HTTP Alternate
    options[:DocumentRoot] = argv.shift || '.'
    s = WEBrick::HTTPServer.new(options)
    shut = proc {s.shutdown}
    siglist = %w"TERM QUIT"
    siglist.concat(%w"HUP INT") if STDIN.tty?
    siglist &= Signal.list.keys
    siglist.each do |sig|
      Signal.trap(sig, shut)
    end
    s.start
  end
end
A Real (simple) Server
require 'socket'
require 'mime-types'
require 'time'
trap 'INT' do; exit end # in a real server, you want more cleanup than this
DOCROOT = Dir.pwd
CANNED_OK = "HTTP/1.0 200 OK\r\n"
CANNED_NOT_FOUND = "HTTP/1.0 404 Not Found\r\n"
CANNED_BAD_REQUEST = "HTTP/1.0 400 Bad Request\r\n"
def run( host = '0.0.0.0', port = '8080' )
  server = TCPServer.new( host, port )
  while connection = server.accept
    request = get_request connection
    response = handle request
    connection.write response
    connection.close
  end
end
def get_request connection
  r = ''
  while line = connection.gets
    r << line
    break if r =~ /\r\n\r\n/m # Request headers terminate with \r\n\r\n
  end

  if r =~ /^(\w+) +(?:\w+:\/\/([^ \/]+))?(([^ \?\#]*)\S*) +HTTP\/(\d\.\d)/
    request_method = $1
    unparsed_uri = $3
    uri = $4.empty? ? nil : $4
    http_version = $5
    name = $2 ? $2.intern : nil

    # Decode percent-escapes only when a path is present and contains any.
    if uri && uri.include?('%')
      uri = uri.tr( '+', ' ' ).
        gsub( /((?:%[0-9a-fA-F]{2})+)/n ) { [$1.delete( '%' ) ].pack( 'H*' ) }
    end

    [ request_method, http_version, name, unparsed_uri, uri ]
  else
    nil
  end
end
def handle request
  if request
    process request
  else
    CANNED_BAD_REQUEST + final_headers
  end
end
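def process request
  # request.last is the decoded path; it is nil when the request line had no path.
  path = File.join( DOCROOT, request.last.to_s )
  if FileTest.exist?( path ) and FileTest.file?( path ) and File.expand_path( path ).index( DOCROOT ) == 0
    # type_for returns an array of matches; use the first, with a generic fallback.
    CANNED_OK +
      "Content-Type: #{MIME::Types.type_for( path ).first || 'application/octet-stream'}\r\n" +
      "Content-Length: #{File.size( path )}\r\n" +
      "Last-Modified: #{File.mtime( path ).httpdate}\r\n" +
      final_headers +
      File.read( path )
  else
    CANNED_NOT_FOUND + final_headers
  end
end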
def final_headers
  "Date: #{Time.now.httpdate}\r\nConnection: close\r\n\r\n"
end
run
A little over 70 lines, and it's enough to build on....
A Real (simple) Server
- HTTP 1.0 (barely)
- Doesn't actually parse HTTP
- Simple, and easily improved
- Blocking server; no concurrency
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /simple_blocking_server.rb
Document Length: 1844 bytes
Concurrency Level: 1
Time taken for tests: 2.635 seconds
Complete requests: 10000
Failed requests: 0
Total transferred: 20340000 bytes
HTML transferred: 18440000 bytes
Requests per second: 3795.08 [#/sec] (mean)
Time per request: 0.263 [ms] (mean)
Time per request: 0.263 [ms] (mean, across all concurrent requests)
Transfer rate: 7538.27 [Kbytes/sec] received
- Pretty fast? 3795/sec...
- localhost
- small static assets
- easy to be fast when you don't do much...
All examples ran with Ruby 2.3.1
A Real (simple) Server
Slow Responses....
Imagine that it takes some time to generate those responses and return them across the internet to the client.
Make each one take at least one second....
--- simple_blocking_server.rb 2016-05-01 11:16:13.750705766 -0400
+++ simple_blocking_server_slow.rb 2016-05-01 16:08:31.422044736 -0400
@@ -56,6 +56,7 @@
# This server is stupid. For any request method, and http version, it just tries to serve a static file.
path = File.join( DOCROOT, request.last )
if FileTest.exist?( path ) and FileTest.file?( path ) and File.expand_path( path ).index( DOCROOT ) == 0
+ sleep 1
CANNED_OK +
"Content-Type: #{MIME::Types.type_for( path )}\r\n" +
"Content-Length: #{File.size( path )}\r\n" +
A Real (simple) Server
Server Software:
Server Hostname: 127.0.0.1
Server Port: 8080
Document Path: /simple_blocking_server.rb
Document Length: 1844 bytes
Concurrency Level: 1
Time taken for tests: 20.022 seconds
Complete requests: 20
Failed requests: 0
Total transferred: 40680 bytes
HTML transferred: 36880 bytes
Requests per second: 1.00 [#/sec] (mean)
Time per request: 1001.093 [ms] (mean)
Time per request: 1001.093 [ms] (mean, across all concurrent requests)
Transfer rate: 1.98 [Kbytes/sec] received
Almost 3800 requests/second down to 1 request/second. Ouch.
How To Address This?
Concurrency!
Multiprocessing
Multithreading
Event Based
Main Loop
Multiprocessing
Just run a bunch of blocking servers, and have something else distribute and balance the load to them.
Load Balancer
Blocking Server
Blocking Server
Blocking Server
CONCURRENCY! WINNING!
Pros
- Simple to implement.
- Performance can still be quite good.
Cons
- Managing processes can be complex.
- Limited sharing of resources can be expensive.
--- kiss_slow.rb 2016-05-01 16:08:31.422044736 -0400
+++ kiss_multiprocessing.rb 2016-05-01 16:45:13.238012734 -0400
@@ -11,6 +11,8 @@
def run( host = '0.0.0.0', port = '8080' )
server = TCPServer.new( host, port )
+ fork_it
+
while connection = server.accept
request = get_request connection
response = handle request
@@ -20,6 +22,18 @@
end
end
+def fork_it( process_count = 9 )
+ pid = nil
+ process_count.times do
+ if pid = fork
+ Process.detach( pid )
+ else
+ break
+ end
+ end
+
+end
+
def get_request connection
r = ''
while line = connection.gets
Listen on a port, then fork.
Child processes share the opened port, and the OS load balances connections between them.
YMMV depending on OS.
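The listen-then-fork idea can be tried in isolation. This is a minimal sketch, not the workshop's code: `prefork` is a hypothetical helper, the port is chosen by the OS, and the demo kills its children after a few requests so that it terminates.

```ruby
require 'socket'

# Hypothetical helper: bind first, then fork, so every child inherits the
# same listening socket and the kernel decides which child gets a connection.
def prefork(server, workers)
  workers.times.map do
    fork do
      # Each child runs its own blocking accept loop on the shared socket.
      while (connection = server.accept)
        loop do                                  # drain the request headers
          line = connection.gets
          break if line.nil? || line == "\r\n"
        end
        body = "served by pid #{Process.pid}\n"
        connection.write "HTTP/1.0 200 OK\r\nContent-Length: #{body.bytesize}\r\n\r\n#{body}"
        connection.close
      end
    end
  end
end

server = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
port   = server.addr[1]
pids   = prefork(server, 3)

# Make a few requests and record which child answered each one.
answers = 6.times.map do
  TCPSocket.open('127.0.0.1', port) do |sock|
    sock.write "GET / HTTP/1.0\r\n\r\n"
    sock.read.lines.last.strip
  end
end
puts answers.uniq

# Clean up the children so the demo terminates.
pids.each { |pid| Process.kill('TERM', pid) }
pids.each { |pid| Process.wait(pid) }
server.close
```

How evenly the answers spread across the children depends on the OS, which is the YMMV part.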
Multiprocessing Simple Blocking Server
Document Path: /simple_blocking_server.rb
Document Length: 1844 bytes
Concurrency Level: 10
Time taken for tests: 20.113 seconds
Complete requests: 200
Failed requests: 0
Total transferred: 406800 bytes
HTML transferred: 368800 bytes
Requests per second: 9.94 [#/sec] (mean)
Time per request: 1005.662 [ms] (mean)
Time per request: 100.566 [ms] (mean, across all concurrent requests)
Transfer rate: 19.75 [Kbytes/sec] received
Multiprocessing Simple Blocking Server
Ruby pre 2.0 was very copy-on-write unfriendly.
Multiprocessing consumed large amounts of RAM.
Modern Rubies are more resource friendly when forking.
VSZ RSS
------ -----
596208 14888
53256 12892
120984 12956
188568 12988
256180 12992
323788 13000
391368 13040
458952 13044
526544 13056
594152 13092
-----
131948K
Multiprocessing Simple Blocking Server
With slow requests, multiprocessing with blocking servers still often feels like this.
Multithreading
A thread is the smallest sequence of instructions that can be managed independently by the scheduler. Multiple threads will share one process's memory.
Pros
- Easier to manage/load-balance in a single piece of software.
- Threads are lightweight, so resource usage is generally better.
- Can be very performant.
Cons
- Threading implementations vary a lot across Ruby implementations.
- Locking issues on shared resources can be complicated, i.e. threads are hard and it's easy to screw it up.
Multithreaded Server
Programming with threads can easily be a talk all by itself. A few quick guides and tutorials:
Multithreaded Server
--- server_slow.rb 2016-05-01 16:08:31.422044736 -0400
+++ server_multithreaded.rb 2016-05-01 21:56:47.997815573 -0400
@@ -11,12 +11,14 @@
def run( host = '0.0.0.0', port = '8080' )
server = TCPServer.new( host, port )
- while connection = server.accept
- request = get_request connection
- response = handle request
+ while con = server.accept
+ Thread.new( con ) do |connection|
+ request = get_request connection
+ response = handle request
- connection.write response
- connection.close
+ connection.write response
+ connection.close
+ end
end
end
Simple, naive implementation - a new thread for every request, and assume everything else just works with this.
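To see the thread-per-connection pattern work without the rest of the server, here is a self-contained sketch. The names `slow_response` and `serve_threaded` are made up for this example, and the delay is 0.2 seconds instead of one second so the demo finishes quickly.

```ruby
require 'socket'

# Hypothetical stand-in for the slow handler: 0.2s of "work" per request.
def slow_response
  sleep 0.2
  body = "hello\n"
  "HTTP/1.0 200 OK\r\nContent-Length: #{body.bytesize}\r\n\r\n#{body}"
end

# Accept loop: each accepted connection is handed to a fresh thread.
def serve_threaded(server)
  while (con = server.accept)
    Thread.new(con) do |connection|
      begin
        loop do                                  # drain the request headers
          line = connection.gets
          break if line.nil? || line == "\r\n"
        end
        connection.write slow_response
      ensure
        connection.close
      end
    end
  end
end

server   = TCPServer.new('127.0.0.1', 0)   # port 0: let the OS pick a free port
port     = server.addr[1]
acceptor = Thread.new { serve_threaded(server) }

# Five clients at once; served serially this would take about a second.
started = Time.now
clients = 5.times.map do
  Thread.new do
    TCPSocket.open('127.0.0.1', port) do |sock|
      sock.write "GET / HTTP/1.0\r\n\r\n"
      sock.read
    end
  end
end
responses = clients.map(&:value)
elapsed   = Time.now - started
puts "5 requests in #{elapsed.round(2)}s"

acceptor.kill
server.close
```

Nothing here limits the number of threads, so a burst of connections creates a burst of threads. That is the naive part.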
Multithreaded Server
Concurrent Requests | Requests per Second |
---|---|
10 | 9.95 |
20 | 19.80 |
50 | 48.98 |
100 | 96.26 |
200 | 184.45 |
1000 | 628.33 |
Slow requests scale pretty well with threads. Diminishing returns when thread count gets high, but not bad for such a trivial implementation.
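The diminishing returns can be put in numbers. This sketch assumes the ideal case is one request per second per concurrent client (each response takes one second), so ideal throughput equals the concurrency level; the figures are the ones from the table above.

```ruby
# Concurrency => measured requests/second, copied from the table above.
measured = {
  10 => 9.95, 20 => 19.80, 50 => 48.98,
  100 => 96.26, 200 => 184.45, 1000 => 628.33
}

# Assumption: with one-second responses, ideal throughput == concurrency.
efficiency = measured.map do |concurrency, rps|
  [concurrency, (100.0 * rps / concurrency).round(1)]
end.to_h

efficiency.each do |concurrency, percent|
  puts "#{concurrency} concurrent: #{percent}% of ideal"
end
```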
Event Driven Server
"Event Driven" is a vague label, encompassing numerous patterns and feature sets. One of the most common of these patterns is the Reactor pattern.
The Reactor pattern describes a system that handles asynchronous events, but that does so with synchronous event callbacks.
Event Driven Server
Client/Server interactions are often slow, but most of that time is spent waiting on latencies. CPUs are fast. The rest of the world is pretty slow.
Event Driven Server
An event reactor just spins in a loop, waiting for something to happen, such as a new network connection, data to read, or a socket becoming ready to write.
When it does, an event -- a callback -- is triggered to deal with it.
Callbacks block the reactor.
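The loop described above fits in a few lines on top of `IO.select`. `TinyReactor` is a hypothetical class written for this illustration; it is not SimpleReactor or EventMachine, and it only watches for readable IO.

```ruby
# Hypothetical TinyReactor: a read-only reactor built on IO.select.
class TinyReactor
  def initialize
    @readers = {}      # IO object => callback to run when it is readable
    @running = false
  end

  def attach(io, &callback)
    @readers[io] = callback
  end

  def detach(io)
    @readers.delete(io)
  end

  def stop
    @running = false
  end

  # Spin until stopped or until nothing is left to watch.
  def run
    @running = true
    while @running && !@readers.empty?
      ready, = IO.select(@readers.keys, nil, nil, 0.1)
      next unless ready                 # timed out, nothing happened
      ready.each do |io|
        callback = @readers[io]
        # The callback runs synchronously: while it runs, the loop is blocked.
        callback.call(io) if callback
      end
    end
  end
end

reader, writer = IO.pipe
received = []
reactor  = TinyReactor.new

reactor.attach(reader) do |io|
  begin
    received << io.read_nonblock(1024)
  rescue EOFError
    reactor.detach(io)                  # writer closed: stop watching
    io.close
  end
end

writer.write "hello"
writer.close
reactor.run
p received
```

A `sleep 1` inside the callback would stall every other attached IO for that second, which is what "callbacks block the reactor" means in practice.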
Event Driven Server
Pros
- Can be very fast and resource friendly.
- With an appropriate underlying event notification facility, can scale to thousands of simultaneous connections.
Cons
- Slow callbacks block the reactor.
- Callback structured programming can be confusing.
Event Driven Server
Like Threading, this could easily be a talk all by itself. A few resources for further reading:
Event Driven Server
Many ways to do it, including EventMachine, Celluloid.io, or even a simple pure ruby event framework (SimpleReactor).
Event/reactor code gets complicated, so see the example code for a couple of simple variants. I didn't do a simple diff version of the slow server.
However...
Event Driven Server
require 'simplereactor'
require 'tinytypes'
require 'getoptlong'
require 'socket'
require 'time'
class SimpleWebServer
attr_reader :threaded
EXE = File.basename __FILE__
VERSION = "1.0"
def self.parse_cmdline
initialize_defaults
opts = GetoptLong.new(
[ '--help', '-h', GetoptLong::NO_ARGUMENT],
[ '--threaded', '-t', GetoptLong::NO_ARGUMENT],
[ '--processes', '-n', GetoptLong::REQUIRED_ARGUMENT],
[ '--engine', '-e', GetoptLong::REQUIRED_ARGUMENT],
[ '--port', '-p', GetoptLong::REQUIRED_ARGUMENT],
[ '--docroot', '-d', GetoptLong::REQUIRED_ARGUMENT],
[ '--bind', '-b', GetoptLong::REQUIRED_ARGUMENT]
)
opts.each do |opt, arg|
case opt
when '--help'
puts <<-EHELP
#{EXE} [OPTIONS]
#{EXE} is a very simple web server. It only serves static files. It does very
little parsing of the HTTP request, only fetching the small amount of
information necessary to determine what resource is being requested. The server
defaults to serving files from the current directory where it was invoked.
-h, --help:
Show this help.
-d DIR, --docroot DIR:
Provide a specific directory for the docroot for this server.
-e ENGINE, --engine ENGINE:
Tell the webserver which IO engine to use. This is passed to SimpleReactor,
and will be one of 'select' or 'nio'. If not specified, it will attempt to
use nio, and fall back on select.
-p PORT, --port PORT:
The port for the web server to listen on. If this flag is not used, the web
server defaults to port 80.
-b HOSTNAME, --bind HOSTNAME:
The hostname/IP to bind to. This defaults to 127.0.0.1 if it is not provided.
-n COUNT, --processes COUNT:
The number of processes to create of this web server. This defaults to a single process.
-t, --threaded:
Wrap content delivery in a thread to hedge against slow content delivery.
EHELP
exit
when '--docroot'
@docroot = arg
when '--engine'
@engine = arg
when '--port'
@port = arg.to_i != 0 ? arg.to_i : @port
when '--bind'
@host = arg
when '--processes'
@processes = arg.to_i != 0 ? arg.to_i : @processes
when '--threaded'
@threaded = true
end
end
end
def self.initialize_defaults
@docroot = '.'
@engine = 'nio'
@port = 80
@host = '127.0.0.1'
@processes = 1
@threaded = false
end
def self.docroot
@docroot
end
def self.engine
@engine
end
def self.port
@port
end
def self.host
@host
end
def self.processes
@processes
end
def self.threaded
@threaded
end
def self.run
parse_cmdline
SimpleReactor.use_engine @engine.to_sym
webserver = SimpleWebServer.new
webserver.run
end
def initialize
@children = nil
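# Expand to an absolute path so the docroot prefix check in handle_response_for can match.
@docroot = File.expand_path( self.class.docroot )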
@threaded = self.class.threaded
end
def run
@server = TCPServer.new self.class.host, self.class.port
handle_processes
SimpleReactor.Reactor.run do |reactor|
@reactor = reactor
@reactor.attach @server, :read do |monitor|
connection = monitor.io.accept
handle_request '',connection, monitor
end
end
end
def handle_request buffer, connection, monitor = nil
eof = false
buffer << connection.read_nonblock(16384)
rescue EOFError
eof = true
rescue IO::WaitReadable
# This is actually handled in the logic below. We just need to survive it.
ensure
request = parse_request buffer, connection
if !request && monitor
@reactor.next_tick do
@reactor.attach connection, :read do |mon|
handle_request buffer, connection
end
end
elsif eof && !request
deliver_400 connection
elsif request
handle_response_for request, connection
end
if eof
queue_detach connection
end
end
def queue_detach connection
@reactor.next_tick do
@reactor.detach(connection)
connection.close
end
end
def handle_response_for request, connection
path = File.join( @docroot, request[:uri] )
if FileTest.exist?( path ) and FileTest.file?( path ) and File.expand_path( path ).index( @docroot ) == 0
deliver path, connection
else
deliver_404 path, connection
end
end
def parse_request buffer, connection
if buffer =~ /^(\w+) +(?:\w+:\/\/([^ \/]+))?([^ \?\#]*)\S* +HTTP\/(\d\.\d)/
request_method = $1
uri = $3
http_version = $4
if $2
name = $2.intern
uri = '/' if uri.empty?
# Rewrite the request to get rid of the http://foo portion.
buffer.sub!(/^\w+ +\w+:\/\/[^ \/]+([^ \?]*)/,"#{request_method} #{uri}")
buffer =~ /^(\w+) +(?:\w+:\/\/([^ \/]+))?([^ \?\#]*)\S* +HTTP\/(\d\.\d)/
request_method = $1
uri = $3
http_version = $4
end
uri = uri.tr('+', ' ').gsub(/((?:%[0-9a-fA-F]{2})+)/n) {[$1.delete('%')].pack('H*')} if uri.include?('%')
unless name
if buffer =~ /^Host: *([^\r\0:]+)/
name = $1.intern
end
end
{ :uri => uri, :request_method => request_method, :http_version => http_version, :name => name }
end
end
def deliver uri, connection
if FileTest.directory? uri
deliver_directory connection
else
if threaded
Thread.new { _deliver uri, connection }
else
_deliver uri, connection
end
end
rescue Errno::EPIPE
rescue Exception => e
deliver_500 connection, e
end
def _deliver uri, connection
data = File.read(uri)
last_modified = File.mtime(uri).httpdate
sleep 1
connection.write "HTTP/1.1 200 OK\r\nContent-Length:#{data.length}\r\nContent-Type: #{content_type_for uri}\r\nLast-Modified: #{last_modified}\r\nConnection:close\r\n\r\n#{data}"
queue_detach connection
end
def deliver_directory connection
deliver_403 connection
end
def deliver_404 uri, connection
buffer = "The requested resource (#{uri}) could not be found."
connection.write "HTTP/1.1 404 Not Found\r\nContent-Length:#{buffer.length}\r\nContent-Type:text/plain\r\nConnection:close\r\n\r\n#{buffer}"
rescue Errno::EPIPE
ensure
queue_detach connection
end
def deliver_400 connection
buffer = "The request was malformed and could not be completed."
connection.write "HTTP/1.1 400 Bad Request\r\nContent-Length:#{buffer.length}\r\nContent-Type:text/plain\r\nConnection:close\r\n\r\n#{buffer}"
rescue Errno::EPIPE
ensure
queue_detach connection
end
def deliver_403 connection
buffer = "Forbidden. The requested resource can not be accessed."
connection.write "HTTP/1.1 403 Forbidden\r\nContent-Length:#{buffer.length}\r\nContent-Type:text/plain\r\nConnection:close\r\n\r\n#{buffer}"
rescue Errno::EPIPE
ensure
queue_detach connection
end
def deliver_500 connection, error
buffer = "There was an internal server error -- #{error}"
puts buffer
connection.write "HTTP/1.1 500 Internal Server Error\r\nContent-Length:#{buffer.length}\r\nContent-Type:text/plain\r\nConnection:close\r\n\r\n#{buffer}"
rescue Errno::EPIPE
ensure
queue_detach connection
end
def content_type_for path
MIME::TinyTypes.types.simple_type_for( path ) || 'application/octet-stream'
end
def data_for path_info
path = File.join(@docroot,path_info)
path if FileTest.exist?(path) and FileTest.file?(path) and File.expand_path(path).index(@docroot) == 0
end
def handle_processes
if self.class.processes > 1
@children = []
(self.class.processes - 1).times do |thread_count|
pid = fork()
if pid
@children << pid
else
break
end
end
Thread.new { Process.waitall } if @children
end
end
end
SimpleWebServer.run
Event Driven Server
With slow requests, this is no better than the simple blocking server.
Document Path: /server.rb
Document Length: 1844 bytes
Concurrency Level: 1
Time taken for tests: 10.014 seconds
Complete requests: 10
Failed requests: 0
Total transferred: 19880 bytes
HTML transferred: 18440 bytes
Requests per second: 1.00 [#/sec] (mean)
Time per request: 1001.423 [ms] (mean)
Time per request: 1001.423 [ms] (mean, across all concurrent requests)
Transfer rate: 1.94 [Kbytes/sec] received
Event Driven Server
However, where events shine is in dealing with high communications latencies.
They don't block on IO, so an efficient reactor can service large numbers of high latency connections efficiently.
Event Driven Server
So, while performance is obviously good when testing from localhost:
Document Path: /server.rb
Document Length: 1844 bytes
Concurrency Level: 50
Time taken for tests: 5.866 seconds
Complete requests: 20000
Failed requests: 0
Total transferred: 39760000 bytes
HTML transferred: 36880000 bytes
Requests per second: 3409.40 [#/sec] (mean)
Time per request: 14.665 [ms] (mean)
Time per request: 0.293 [ms] (mean, across all concurrent requests)
Transfer rate: 6619.03 [Kbytes/sec] received
Event Driven Server
Evented IO handling keeps things happy at high concurrencies, even across the country. For example, this is from an Engine Yard AWS Oregon instance talking to a non-AWS VM on the east coast.
Document Path: /server.rb
Document Length: 1844 bytes
Concurrency Level: 1000
Time taken for tests: 3.394 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 19907832 bytes
HTML transferred: 18465816 bytes
Requests per second: 2946.48 [#/sec] (mean)
Time per request: 339.387 [ms] (mean)
Time per request: 0.339 [ms] (mean, across all concurrent requests)
Transfer rate: 5728.33 [Kbytes/sec] received
Event Driven Server
Concurrent Requests | Requests per Second |
---|---|
10 | 3888 |
50 | 3814 |
250 | 3651 |
1000 | 3609 |
2000 | 3261 |
10000 | 2740 |
And if your event notification framework is up to it, concurrencies can scale nicely.
Fast Responses
Event Driven Server - Hybridization!
Evented IO can also mix well with threading, letting the slow stuff be slow, while not blocking the reactor from spinning and servicing things that are ready to be serviced.
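One way to picture the hybrid: the select loop does all of the accepting and non-blocking reading, and only the slow part is pushed into a thread. This is a rough sketch under assumptions, not the workshop server. `run_hybrid` and `slow_response` are invented names, the delay is 0.2 seconds, and the loop stops after a fixed number of requests so the demo ends.

```ruby
require 'socket'

# Hypothetical slow step, standing in for a slow content delivery.
def slow_response
  sleep 0.2
  "HTTP/1.0 200 OK\r\nContent-Length: 3\r\n\r\nok\n"
end

# Evented accept/read loop; slow responses are written from threads.
# Stops after request_count complete requests (demo-only behaviour).
def run_hybrid(server, request_count)
  buffers = {}          # connection => request bytes read so far
  workers = []
  while workers.size < request_count
    ready, = IO.select([server] + buffers.keys, nil, nil, 0.05)
    next unless ready
    ready.each do |io|
      if io == server
        connection = io.accept_nonblock(exception: false)
        buffers[connection] = String.new unless connection == :wait_readable
      else
        chunk = io.read_nonblock(4096, exception: false)
        next if chunk == :wait_readable
        if chunk.nil?                    # client hung up early
          buffers.delete(io)
          io.close
          next
        end
        buffers[io] << chunk
        next unless buffers[io].include?("\r\n\r\n")
        buffers.delete(io)               # full request seen: stop watching it
        workers << Thread.new(io) do |connection|
          connection.write slow_response # slow work happens off the loop
          connection.close
        end
      end
    end
  end
  workers.each(&:join)
end

server  = TCPServer.new('127.0.0.1', 0)
port    = server.addr[1]
started = Time.now
clients = 5.times.map do
  Thread.new do
    TCPSocket.open('127.0.0.1', port) do |sock|
      sock.write "GET / HTTP/1.0\r\n\r\n"
      sock.read
    end
  end
end
run_hybrid(server, 5)
responses = clients.map(&:value)
elapsed   = Time.now - started
server.close
puts "5 slow requests in #{elapsed.round(2)}s"
```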
Event Driven Server
Concurrent Requests | Requests per Second |
---|---|
10 | 8 |
50 | 30 |
250 | 181 |
500 | 340 |
1000 | 511 |
2000 | 854 |
10000 | 713 |
Slow (1 second delayed) Responses
Event Driven Server
Reactor/Event IO combined with threading is a great combination, if you are willing to deal with the complexity of implementation.
Even pure Ruby + Ruby 2.3 is pretty fast
EventMachine
- 10 years of Ruby history
- C++ core
- Quite fast
- Code can get complicated and confusing because of the callback based structure.
EventMachine - How Fast?
Document Path: /smallfile1462361058
Document Length: 1021 bytes
Concurrency Level: 25
Time taken for tests: 6.474 seconds
Complete requests: 100000
Failed requests: 0
Keep-Alive requests: 100000
Total transferred: 120300000 bytes
HTML transferred: 102100000 bytes
Requests per second: 15447.11 [#/sec] (mean)
Time per request: 1.618 [ms] (mean)
Time per request: 0.065 [ms] (mean, across all concurrent requests)
Transfer rate: 18147.34 [Kbytes/sec] received
Parsing HTTP
It all comes back to the HTTP.
HTTP is a text grammar. To know what a client wants, the server has to be able to make sense of this grammar.
Parsing HTTP
Most of the very simple examples so far have cheated with this.
They use regular expressions. Going too far down that path will drive you to madness and sorrow.
Parsing HTTP
You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML. Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the nerves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chi͡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of regex parsers for HTML will instantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection will devour your HTML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fight he com̡e̶s, ̕h̵is un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo͟ur eye͢s̸ ̛l̕ik͏e liquid pain, the song of re̸gular expression parsing will extinguish the voices of mortal man from the sphere I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful the final snuf
fing of the lies of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL IS LOST the pon̷y he comes he c̶̮omes he comes the ichor permeates all MY FACE MY FACE ᵒh god no NO NOO̼OO NΘ stop the an*̶͑̾̾̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e
not rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ
With Regular Expressions
Parsing HTTP
With Regular Expressions
For limited, specific pieces of information, you can get away with it. People may hate you, but you can do it. Future sanity is not guaranteed, however.
HTTP is so complicated that this approach is madness for most purposes. Use a real parser.
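For the "limited, specific pieces of information" case, pulling apart just the request line is about as far as a regex comfortably goes. The pattern and the `parse_request_line` helper below are illustrative assumptions. They only recognize a well-formed request line and make no attempt at headers, bodies, or encodings.

```ruby
# Hypothetical pattern: method, request target, and HTTP version only.
REQUEST_LINE = /\A(?<method>[A-Z]+) +(?<target>\S+) +HTTP\/(?<version>\d\.\d)\r?\n/

# Returns a small hash for a recognizable request line, or nil otherwise.
def parse_request_line(raw)
  match = REQUEST_LINE.match(raw)
  return nil unless match
  { method: match[:method], target: match[:target], version: match[:version] }
end

good = parse_request_line("GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
bad  = parse_request_line("not http at all")

p good
p bad
```

Everything after the first line is ignored here, which is exactly where the madness starts if you keep going with regexes.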
Parsing HTTP
Most People Don't Like Writing Parsers
Fortunately (unless that is the sort of thing that floats your boat) you don't have to. There are several viable options readily available.
Parsing HTTP
Mongrel's Parser
Zed Shaw wrote Mongrel, a Ruby web and application server, about 10 years ago.
The HTTP parser that it used was built using the Ragel state machine compiler. It is robust and pretty fast. In some version or other, it is used in many Ruby web server implementations.
Parsing HTTP
def process_client(client)
begin
parser = HttpParser.new
params = HttpParams.new
request = nil
data = client.readpartial(Const::CHUNK_SIZE)
nparsed = 0
# Assumption: nparsed will always be less since data will get filled with more
# after each parsing. If it doesn't get more then there was a problem
# with the read operation on the client socket. Effect is to stop processing when the
# socket can't fill the buffer for further parsing.
while nparsed < data.length
nparsed = parser.execute(params, data, nparsed)
if parser.finished?
if not params[Const::REQUEST_PATH]
# it might be a dumbass full host request header
uri = URI.parse(params[Const::REQUEST_URI])
params[Const::REQUEST_PATH] = uri.path
end
raise "No REQUEST PATH" if not params[Const::REQUEST_PATH]
script_name, path_info, handlers = @classifier.resolve(params[Const::REQUEST_PATH])
if handlers
params[Const::PATH_INFO] = path_info
params[Const::SCRIPT_NAME] = script_name
# From http://www.ietf.org/rfc/rfc3875 :
# "Script authors should be aware that the REMOTE_ADDR and REMOTE_HOST
# meta-variables (see sections 4.1.8 and 4.1.9) may not identify the
# ultimate source of the request. They identify the client for the
# immediate request to the server; that client may be a proxy, gateway,
# or other intermediary acting on behalf of the actual source client."
params[Const::REMOTE_ADDR] = client.peeraddr.last
# select handlers that want more detailed request notification
notifiers = handlers.select { |h| h.request_notify }
request = HttpRequest.new(params, client, notifiers)
# in the case of large file uploads the user could close the socket, so skip those requests
break if request.body == nil # nil signals from HttpRequest::initialize that the request was aborted
# request is good so far, continue processing the response
response = HttpResponse.new(client)
# Process each handler in registered order until we run out or one finalizes the response.
handlers.each do |handler|
handler.process(request, response)
break if response.done or client.closed?
end
# And finally, if nobody closed the response off, we finalize it.
unless response.done or client.closed?
response.finished
end
else
# Didn't find it, return a stock 404 response.
client.write(Const::ERROR_404_RESPONSE)
end
break #done
else
# Parser is not done, queue up more data to read and continue parsing
chunk = client.readpartial(Const::CHUNK_SIZE)
break if !chunk or chunk.length == 0 # read failed, stop processing
data << chunk
if data.length >= Const::MAX_HEADER
raise HttpParserError.new("HEADER is longer than allowed, aborting client early.")
end
end
end
rescue EOFError,Errno::ECONNRESET,Errno::EPIPE,Errno::EINVAL,Errno::EBADF
client.close rescue nil
rescue HttpParserError => e
Mongrel.log(:error, "#{Time.now.httpdate}: HTTP parse error, malformed request (#{params[Const::HTTP_X_FORWARDED_FOR] || client.peeraddr.last}): #{e.inspect}")
Mongrel.log(:error, "#{Time.now.httpdate}: REQUEST DATA: #{data.inspect}\n---\nPARAMS: #{params.inspect}\n---\n")
# http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4
client.write(Const::ERROR_400_RESPONSE)
rescue Errno::EMFILE
reap_dead_workers('too many files')
rescue Object => e
Mongrel.log(:error, "#{Time.now.httpdate}: Read error: #{e.inspect}")
Mongrel.log(:error, e.backtrace.join("\n"))
ensure
begin
client.close
rescue IOError
# Already closed
rescue Object => e
Mongrel.log(:error, "#{Time.now.httpdate}: Client error: #{e.inspect}")
Mongrel.log(:error, e.backtrace.join("\n"))
end
request.body.delete if request and request.body.class == Tempfile
end
end
Parsing HTTP
WEBrick has a serviceable, if not gloriously fast, parser.
Just pass it an IO object, and it'll return a structure built from the parsed HTTP.
request = WEBrick::HTTPRequest.new( WEBrick::Config::HTTP )
request.parse( request_io )
Parsing HTTP
The EVMA_HTTPServer web server implementation in the EventMachine project includes an HTTP parser. It could use some love, but it works.
require 'mime-types'
require 'eventmachine'
require 'evma_httpserver'
trap 'INT' do; exit end # in a real server, you want more cleanup than this
class MyHttpServer < EM::Connection
include EM::HttpServer
DOCROOT = Dir.pwd
def post_init
super
no_environment_strings
end
def process_http_request
response = EM::DelegatedHttpResponse.new(self)
path = File.join( DOCROOT, @http_request_uri )
if FileTest.exist?( path ) and FileTest.file?( path ) and File.expand_path( path ).index( DOCROOT ) == 0
response.status = 200
response.content_type MIME::Types.type_for( path ).last.to_s
response.content = File.read( path )
response.send_response
else
response.status = 404
response.content = "The resource #{path} could not be found."
response.send_response
end
end
def final_headers
"Date: #{Time.now.httpdate}\r\nConnection: close\r\n\r\n"
end
end
EM.run {
EM.start_server '0.0.0.0', 80, MyHttpServer
}
Parsing HTTP
There is a Ruby library, http_parser.rb, that is built on top of the Joyent (NodeJS) parser. Like the Mongrel parser, it is fast, battle-proven, and widely used.
Signals and Miscellany
The Other Details To Be a Well Behaved Daemon
- Handle signals so that you die gracefully
- Cleanup after yourself
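Both bullets can be sketched together: a signal handler that only sets a flag, an accept loop that checks the flag, and an `ensure` that does the cleanup. This is one possible shape, not the workshop's implementation. The demo sends `TERM` to itself so that it terminates without outside help.

```ruby
require 'socket'

server   = TCPServer.new('127.0.0.1', 0)
shutdown = false
cleaned  = false

# Keep the handler tiny: just record that we were asked to stop.
trap('TERM') { shutdown = true }

# Demo only: signal ourselves instead of waiting for an operator to do it.
Process.kill('TERM', Process.pid)

begin
  until shutdown
    # Wake up regularly so the flag is noticed even when no one connects.
    ready, = IO.select([server], nil, nil, 0.1)
    next unless ready
    connection = server.accept
    connection.close
  end
ensure
  # Cleanup runs whether the loop ended normally or with an exception.
  server.close
  cleaned = true
end

trap('TERM', 'DEFAULT')   # restore the default handler
puts 'stopped cleanly' if cleaned
```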
Web Server
vs
Application Server
• A web server speaks HTTP with clients, answering requests for resources.
• An application server contains an application, and answers requests for application resources.
• Generally, application servers are a specific type of web server.
Rack
Rack provides an interface between a web server and a Ruby application.
Rack has a simple interface.
A Rack application is an object that responds to the call method, which takes a hash of environment variables.
The call method returns an array containing the HTTP response code, a hash of HTTP headers, and the response body, which is contained in an object that responds to each.
Rack
Canonical example
# my_rack_app.rb
require 'rack'
app = Proc.new do |env|
['200', {'Content-Type' => 'text/html'}, ['A barebones rack app.']]
end
Rack::Handler::WEBrick.run app
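Because the interface is just `call(env)` returning a three-element array, it can be exercised without any server or even the rack gem. `HelloApp` and the hand-built `env` below are assumptions for illustration; a real server fills in many more environment keys.

```ruby
# A Rack-style application: any object with a call(env) method.
class HelloApp
  def call(env)
    body = "Hello from #{env['PATH_INFO']}\n"
    headers = {
      'Content-Type'   => 'text/plain',
      'Content-Length' => body.bytesize.to_s
    }
    [200, headers, [body]]   # status, headers, body that responds to each
  end
end

# Hand-built environment with only the keys this app looks at.
env = { 'REQUEST_METHOD' => 'GET', 'PATH_INFO' => '/greeting' }
status, headers, body = HelloApp.new.call(env)

puts status
headers.each { |name, value| puts "#{name}: #{value}" }
body.each { |chunk| puts chunk }
```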
Rack
Rails and most other Ruby web frameworks utilize Rack. That is, a Rails application is a Rack application.
Rack
Rack Handlers
A Rack handler is the glue between the rack application and the web server that is responsible for handling it.
Rack itself ships with handlers for CGI, FCGI, SCGI, the LiteSpeed Web Server (LSWS), Thin, and WEBrick.
You can write your own.
See https://github.com/rack/rack/blob/master/lib/rack/handler/cgi.rb for an example.
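To give a feel for what that glue involves, here is a deliberately tiny sketch. `TinyHandler` is hypothetical and is not how Rack's bundled handlers are written: it serves a single request, builds only three environment keys, and skips everything the Rack specification additionally requires.

```ruby
require 'socket'

# Hypothetical TinyHandler: accepts one connection, calls the app, replies.
module TinyHandler
  REASONS = { 200 => 'OK', 404 => 'Not Found' }.freeze

  def self.serve_one(app, server)
    connection   = server.accept
    request_line = connection.gets.to_s
    loop do                                    # skip the remaining headers
      line = connection.gets
      break if line.nil? || line == "\r\n"
    end

    # Translate the request line into a (very partial) Rack environment.
    method, target, = request_line.split(' ')
    path, query = target.to_s.split('?', 2)
    env = { 'REQUEST_METHOD' => method, 'PATH_INFO' => path, 'QUERY_STRING' => query.to_s }

    # Call the application and write its answer back as HTTP.
    status, headers, body = app.call(env)
    connection.write "HTTP/1.1 #{status} #{REASONS.fetch(status.to_i, 'Unknown')}\r\n"
    headers.each { |name, value| connection.write "#{name}: #{value}\r\n" }
    connection.write "Connection: close\r\n\r\n"
    body.each { |chunk| connection.write chunk }
  ensure
    connection.close if connection
  end
end

app = proc do |env|
  [200, { 'Content-Type' => 'text/plain' }, ["You asked for #{env['PATH_INFO']}\n"]]
end

server = TCPServer.new('127.0.0.1', 0)
port   = server.addr[1]
client = Thread.new do
  TCPSocket.open('127.0.0.1', port) do |sock|
    sock.write "GET /hello?x=1 HTTP/1.1\r\nHost: localhost\r\n\r\n"
    sock.read
  end
end

TinyHandler.serve_one(app, server)
response = client.value
server.close
puts response
```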
Ruby Web Servers
A Quick Survey
- WEBrick
- Mongrel
- Thin
- Puma
- Passenger
- Unicorn
- Rainbows!
- Yahns
- Goliath
- Swiftiply
WEBrick
• Pure Ruby
• Thread based design
• Written in 2000 by Masayoshi Takahashi and Yuuzou Gotou.
• Ubiquitous, as it is bundled with Ruby itself.
• Flexible, fairly featureful, and easy to use.
• Fairly well documented.
• Fairly slow.
Mongrel
• Zed Shaw, 2006
• Ruby plus a C extension for parsing HTTP, built with Ragel
• Moderately fast for its age
• EOL at 1.1.5 in a version that doesn't work with modern Rubies
• There is a 1.2.x version (gem install --pre mongrel) that does work with modern Rubies
• Completely unmaintained, but it is interesting code to look at and learn from.
Swiftiply
• My baby. May 2007
• Built on top of EventMachine
• Originally used regex for HTTP parsing....
• Structurally has support for real parsing, but needs work
• Intended to be a load balancing reverse proxy with a twist, with light web serving capabilities (static files, mostly)
• Very fast
• AFAIK, at least 100 production sites still use it
Thin
• Marc-André Cournoyer, 2008
• Mongrel HTTP Parser
• Rack interface
• EventMachine
• Pretty fast. Still pretty commonly used
Goliath
• PostRank Labs, 2011
• Built on top of EventMachine
• Completely asynchronous design, leveraging Ruby fibers to unwind callback complexity
• Performance and resource focused
• Niche usage
Puma
• Evan Phoenix, 2011
• Built on the bones of Mongrel
• Built from the ground up with concurrency in mind
• Rack
• Runs on all major Ruby implementations (MRI, JRuby, Rubinius)
• If you use Thin, take a look at Puma
Passenger
• Phusion Passenger
• Heavy Rails world usage
• Integrates directly with Apache or Nginx
• New version includes a fast purpose built web server, Raptor
• Rack
• Commercial version has more multithreading/concurrency options than the single open source version does
Unicorn
• Built on Mongrel's bones by Eric Wong circa 2009
• fork() (i.e. built in multiprocessing) oriented
• Not the fastest, but it deals well with slow/variable requests
• Very mature, and very heavy utilization in the Rails world
• With modern copy-on-write friendly Rubies, can have nice resource utilization
Rainbows!
• Unicorn specifically tuned for those very big, very slow requests
Yahns
• Another in the Unicorn family
• Tuned for apps that receive very little traffic
• When idle, it is truly idle
• Very sensitive to app failures
Thank You!
Let's build some stuff. Ask me questions. Tell me what you want.
Thanks to Engine Yard
Let's Build a Webserver!
By wyhaines
Railsconf 2016 Workshop -- Building web servers with Ruby