Storage Systems I

INFO 153B/253B: Backend Web Architecture

Kay Ashaolu

Why Data Storage?

  • When we make a web request, where do we get the data from?
  • When we create data, where do we put it?
  • Where are "resources" stored?

Example: bit.ly

  • Lots of data to store
    • Shortcut to url mapping
    • Statistics about links
    • Information about users

Example: bit.ly

long url http://news.google.com
short url http://bit.ly/awekl
hit count 482240

long url http://facebook.com/user/profile
short url http://bit.ly/czasw
hit count 11023

long url http://msnbc.com/news/article/
short url http://bit.ly/olkjpl
hit count 1232

Data Storage Design

  • What is the storage format?
  • How do we lay out data?
  • How do we access data?

Why use a file?

  • What are the pros and cons?
http://news.google.com, http://bit.ly/awekl, 482240
http://facebook.com/user/profile, http://bit.ly/czasw, 11023
http://msnbc.com/news/article, http://bit.ly/olkjpl, 1232

Problems with files

  • What if we want to add another field?
  • What if we want to query different parts of data? How efficient is this?
  • What if we have concurrent accesses?
  • What data structures should we use?

Data Independence

  • Databases: apps shouldn’t have to worry about these problems!
  • Underlying storage format independent of application-level logic

Relational Data Stores

  • RDBMS: Relational Database Management System
  • Invented in the 1970s
  • e.g., Oracle, MySQL, Postgres, IBM DB2, Microsoft SQL Server
  • We are going to use Postgres in this class

Relational Model

  • Reason about sets of facts, or "tables"
  • Each fact is a "row"
  • Attributes are "columns" of row

For example

For example

SQL Query Language

  • High-level query language over tables
  • Declarative: say "what" you want computed, not "how"
  • Why is this good?

RDBMS uses SQL

  • Postgres is one of the most popular and versatile databases that are widely used
  • If you ever used shared hosting, you probably had the option of creating a Postgres databases
  • Traditionally a RDBMS, but it's object relational/extensible nature makes it great for surprising non-relational use cases

We are going to use Docker to run Postgres!

  • One of the great benefits of containerization is that you can easily spin up an instance of any service (database, queue, rest api, etc.) as long as the developers maintain a docker image of it
  • And Postgres has done that: most open source applications have some sort of docker image available (typically at hub.docker.com)

Before we start, a word about persistence

  • Right now anything that we save in our docker containers are lost once we stop the container
  • That is because containers are inherently empherial: they primary use CPU (processing) and RAM (temporary storage) to run
  • They can save files to disk (e.g. your code files), but those files are deleted when the container is removed

Before we start, a word about persistence

  • This isn't going to work with most use cases of databases 😀 
  • The entire point of a database is to store data and make it available to be retrieved and edited at a later time
  • Depending on a container to remain running in order to keep data is a very dangerous proposition
  • Imagine if all of the data on your computer was deleted every time you shut it down?

Docker's solution: volumes

  • A volume is an independent concept in Docker (just like images and containers)
  • "A Docker volume is an independent file system entirely managed by Docker and exists as a normal file or directory on the host, where data is persisted." (from https://earthly.dev/blog/docker-volumes/)
  • In essence you are telling docker to create an entity that refers to a folder on your local computer
  • This folder can then be mapped to a folder inside a running container
  • Anything saved in that folder inside the container is saved in the mapped local folder
  • Now we can run SQL commands:

Let's use volumes to start up a Postgres container

# Create a volume managed by docker. By default it is stored in a 
# subdirectory where docker is stored on your local computer
docker volume create demo-postgres-vol

# Create a container from the image postgres
docker run --name demo-postgres \
-e POSTGRES_USERNAME=postgres \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=demo_db \
-v demo-postgres-vol:/var/lib/postgresql/data \
-dit postgres

# See the running logs of your database container
docker logs -f demo-postgres

What's happening?

What's happening?

What's happening?

What's happening?

Awesome! Now how do I access the database

docker exec -it demo-postgres /bin/bash

Awesome! Now how do I access the database

psql -U postgres demo_db

Create Example

CREATE TABLE links (
    long_url VARCHAR(1000) NOT NULL,
    short_url VARCHAR(20) NOT NULL,
    hit_count BIGINT DEFAULT 0,
    created DATE
);

INSERT INTO links (long_url, short_url, hit_count, created) VALUES ('http://www.google.com', 'qwelmw', 2, '2016-04-05'), 
('http://www.twitter.com', 'adfer', 45, '2016-08-05');

Select Example

SELECT * FROM links;
long_url
 
short_url hit_count created
http://www.google.com
 
qwelmw 2 2016-4-5
http://www.facebook.com
 
adfer 45 2016-8-5

Select Example

SELECT * FROM links WHERE hit_count < 20;
long_url
 
short_url hit_count created
http://www.google.com
 
qwelmw 2 2016-4-5

Insert Example

SELECT * FROM links WHERE hit_count < 20;
long_url short_url hit_count created
http://www.google.com qwelmw 2 2016-4-5
http://www.twitter.com eovle 9 2016-10-28
INSERT INTO links VALUES 
	('http://www.twitter.com', 'eovle', 9, '2016-10-28');

Update Example

SELECT long_url, short_url, hit_count FROM links;
long_url short_url hit_count
http://www.google.com qwelmw 0
http://www.facebook.com adfer 0
http://www.twitter.com eovle 9
UPDATE links SET hit_count = '0' WHERE created < '2016-10-22';

Questions?