Under The Hood

What drives Azure Storage

About Me

Blog: http://blog.codenova.pl/

Twitter: @Kamil_Mrzyglod

GitHub: https://github.com/kamil-mrzyglod

StackOverflow: https://stackoverflow.com/users/1874991/kamo

LinkedIn: www.linkedin.com/in/kamil-mrzygłód-31470376

Introduction

Cloud storage system for storing limitless amounts of data for any duration of time

Data stored durably using both local and geographic replication

Blobs, tables, queues

In production since 2008

Features

Strong consistency

Global and Scalable Namespace/Storage

Disaster recovery

Multi-tenancy and cost

Example

Blobs, Tables, Queues:

  • 350TB of data
  • 40k transactions/sec
  • 3B transactions/day

Global namespace

http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName

  • AccountName: resolved via DNS to the cluster (stamp) that holds the account's data
  • PartitionName: locates the data within that cluster
  • ObjectName: identifies an individual object within the partition
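A minimal Python sketch of how such a URI decomposes into the three namespace components (the example account and path are made up, and each service applies its own rules for what the PartitionName is):

```python
from urllib.parse import urlparse

def parse_storage_uri(uri: str) -> dict:
    parsed = urlparse(uri)
    # AccountName.<service>.core.windows.net
    account, service = parsed.netloc.split(".")[0:2]
    partition, _, obj = parsed.path.lstrip("/").partition("/")
    return {
        "account": account,      # resolved via DNS to the stamp holding the account
        "service": service,      # blob, table or queue
        "partition": partition,  # PartitionName: locates the data within the stamp
        "object": obj,           # ObjectName: the individual object within the partition
    }

print(parse_storage_uri("https://myaccount.queue.core.windows.net/PartitionName/ObjectName"))
```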

Architecture - Fabric Controller

Node management

Network configuration

Health monitoring

Starting/stopping services

Service deployment

Architecture - components

Architecture - stamp

[Diagram: a data center hosting multiple storage stamps (Stamp 1, Stamp 2, Stamp 3)]

Architecture - stamp #2

A cluster of N racks

Each rack built out as a separate fault domain

Typically from 10 to 20 racks, 18 disk-heavy storage nodes per rack

Holds from 2PB to 30PB of data

Utilization ~70%

When a stamp reaches 70% utilization, inter-stamp replication is used to migrate accounts to other stamps

Architecture - Location Service(LS)

Location Service

Manages stamps

Manages account namespace

Chooses the primary stamp

Updates DNS to route requests from the account's URL to the stamp's virtual IP

Architecture - Intra-Stamp replication

  • happens on Stream Layer
  • synchronous
  • makes sure all data written into a stamp is kept durable
  • on the critical path of the customer's write requests
  • required to actually return a success to a customer
  • provides durability against hardware failures
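A toy Python sketch of this synchronous write path (class and method names are invented): success is returned only once every replica within the stamp has acknowledged the append.

```python
# Conceptual sketch only: a write is acknowledged to the client only after
# every replica of the extent (3 within a stamp) has durably appended the block.
class ExtentNode:
    def __init__(self, name: str):
        self.name = name
        self.blocks = []

    def append(self, block: bytes) -> bool:
        self.blocks.append(block)   # stand-in for writing the block to disk
        return True

def replicated_append(replicas, block: bytes) -> bool:
    # Synchronous: this append sits on the critical path of the client's request.
    acks = [en.append(block) for en in replicas]
    return all(acks)                # success only if all replicas acknowledged

replicas = [ExtentNode(f"EN{i}") for i in range(3)]
print("success" if replicated_append(replicas, b"some data") else "retry")
```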

Architecture - Inter-Stamp replication

  • happens on Partition Layer
  • asynchronous
  • replicates data across stamps
  • focused on replicating objects
  • used for keeping a copy of account's data
  • used also for migrating data between stamps
  • provides durability against geo-disasters
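For contrast, a toy sketch of the asynchronous path (all names invented): the write is acknowledged as soon as the primary stamp has it durably, and a background worker replays the object-level change on the secondary stamp.

```python
# Conceptual sketch only: inter-stamp replication is off the critical path.
import queue, threading, time

geo_queue = queue.Queue()            # object changes waiting to be geo-replicated
primary_stamp, secondary_stamp = {}, {}

def write(key, value):
    primary_stamp[key] = value       # durable in the primary stamp (intra-stamp replication)
    geo_queue.put((key, value))      # queued for asynchronous inter-stamp replication
    return "success"                 # returned before the secondary stamp has the data

def geo_replicator():
    while True:
        key, value = geo_queue.get()
        secondary_stamp[key] = value # replay the object change on the secondary stamp

threading.Thread(target=geo_replicator, daemon=True).start()
print(write("account1/container/blob1", b"payload"))
time.sleep(0.1)                      # give the background replicator a moment
print(secondary_stamp)
```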

Layers - Front-End Layer

  • stateless servers that take incoming requests
  • route requests to a Partition Server in the Partition Layer
  • cache the Partition Map and use it to determine which Partition Server to route a request to
  • stream large objects directly from the Stream Layer
  • cache frequently accessed data
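A rough sketch of the Partition Map lookup (the key ranges and server names below are invented): the Front-End picks the Partition Server whose Range Partition contains the request's PartitionName.

```python
# Partition Map: list of (low_key, high_key, partition_server) ranges.
PARTITION_MAP = [
    ("",        "harry",   "PS1"),
    ("harry",   "richard", "PS2"),
    ("richard", "\uffff",  "PS3"),
]

def route(partition_name: str) -> str:
    # Find the Range Partition whose key range contains the PartitionName.
    for low, high, server in PARTITION_MAP:
        if low <= partition_name < high:
            return server
    raise KeyError(partition_name)

print(route("kamil"))   # -> PS2
```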

Layers - Stream Layer

Block

  • minimum unit of data
  • up to N bytes
  • appended to an extent
  • blocks don't have to be the same size

Extent

  • unit of replication (3 replicas within a stamp)
  • consists of blocks
  • 1GB target size

Stream

  • looks like a big file
  • can be randomly read from
  • an ordered list of pointers to extents
  • only the last extent in a stream can be appended to (all prior extents in a stream are immutable)
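A toy model of the block/extent/stream hierarchy (sizes scaled down, names invented): appends go to the last extent, and a sealed extent never changes.

```python
TARGET_EXTENT_SIZE = 1024          # stand-in for the ~1GB target size

class Extent:
    def __init__(self):
        self.blocks = []           # blocks do not have to be the same size
        self.sealed = False
        self.size = 0

    def append(self, block: bytes):
        assert not self.sealed
        self.blocks.append(block)
        self.size += len(block)
        if self.size >= TARGET_EXTENT_SIZE:
            self.sealed = True     # sealed extents are immutable

class Stream:
    def __init__(self):
        self.extents = [Extent()]  # ordered list of pointers to extents

    def append(self, block: bytes):
        if self.extents[-1].sealed:
            self.extents.append(Extent())   # only the last extent accepts appends
        self.extents[-1].append(block)

s = Stream()
for _ in range(5):
    s.append(b"x" * 300)
print(len(s.extents), [e.sealed for e in s.extents])   # 2 extents, first one sealed
```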

Layers - Stream Layer #2

Stream Manager

  • keeps track of which extents are in which stream
  • a standard Paxos cluster
  • maintains the stream namespace
  • monitors the health of ENs
  • creates and assigns extents to ENs
  • performs the lazy re-replication of extents
  • garbage collects extents
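A sketch of what lazy re-replication could look like (data structures invented): when an EN fails, the SM creates new replicas for every extent that dropped below three copies.

```python
import random

extent_replicas = {                 # extent id -> set of ENs holding a replica
    "extent-1": {"EN1", "EN2", "EN3"},
    "extent-2": {"EN2", "EN3", "EN4"},
}
healthy_ens = {"EN1", "EN2", "EN3", "EN4", "EN5"}

def on_en_failure(failed_en: str):
    healthy_ens.discard(failed_en)
    for extent, ens in extent_replicas.items():
        ens.discard(failed_en)
        while len(ens) < 3:                          # restore the replication factor
            ens.add(random.choice(sorted(healthy_ens - ens)))

on_en_failure("EN3")
print(extent_replicas)
```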

Layers - Stream Layer #3

Extent Node

  • maintains the storage for a set of extent replicas
  • N disks attached, completely under control of EN
  • knows nothing about streams, deals only with extents and blocks
  • internally each extent is a file, which holds data blocks and their checksums + an index, which maps extent offsets to blocks
  • each EN contains a view about extents it owns
  • ENs talk only to other ENs to replicate block writes
  • when an extent is no longer referenced, the SM garbage collects it and notifies the ENs
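A toy sketch of an extent laid out as a single file (the on-disk format here is invented): data blocks with checksums, plus an index that maps extent offsets to blocks.

```python
import zlib

class ExtentFile:
    def __init__(self):
        self.data = bytearray()
        self.index = []                      # list of (offset, length, checksum)

    def append_block(self, block: bytes):
        offset = len(self.data)
        self.index.append((offset, len(block), zlib.crc32(block)))
        self.data += block

    def read(self, offset: int) -> bytes:
        for start, length, checksum in self.index:    # index maps offsets to blocks
            if start <= offset < start + length:
                block = bytes(self.data[start:start + length])
                assert zlib.crc32(block) == checksum  # verify the checksum on read
                return block
        raise IndexError(offset)

f = ExtentFile()
f.append_block(b"hello")
f.append_block(b"world")
print(f.read(5))   # -> b'world'
```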

Layers - Partition Layer

  • stores different types of objects
  • understands what a transaction means
  • provides a data model for the different types of objects
  • provides the processing logic for those objects
  • load balances access to objects
  • provides massively scalable namespace

Partition Layer - Object Tables

[Diagram: an Object Table is split into Range Partitions, each served by a Partition Server]

Object tables

Examples

Account Table

Stores metadata and configuration for each account assigned to a stamp

Blob Table

Stores all blobs for all accounts within a stamp

Entity Table

Stores all entity rows for all accounts within a stamp

Message Table

Stores all messages for all queues

Schema Table

Keeps track of the schema of all OTs

Partition Table

Keeps track of the current Range Partitions for all OTs and what Partition Server is serving each Range Partition

Partition Layer

Range partitions load balancing

Partition Manager

Load Balance: too much traffic on a PS, so re-assign Range Partitions to less loaded PSs

Split: a Range Partition has too much load, so split it into smaller ones and load balance them across different PSs

Merge: merge cold or lightly loaded Range Partitions
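A toy decision procedure for these three operations (all thresholds and names are invented): split hot Range Partitions, flag cold ones for merging, and move load off overloaded Partition Servers.

```python
SPLIT_LOAD, MERGE_LOAD, SERVER_LOAD = 1000, 50, 2500   # invented thresholds (req/sec)

def balance(partitions):
    """partitions: list of dicts with 'name', 'server' and 'load'."""
    actions = []
    for p in partitions:
        if p["load"] > SPLIT_LOAD:
            actions.append(("split", p["name"]))             # split, then re-assign the halves
        elif p["load"] < MERGE_LOAD:
            actions.append(("merge-candidate", p["name"]))    # merge with an adjacent cold partition
    per_server = {}
    for p in partitions:
        per_server[p["server"]] = per_server.get(p["server"], 0) + p["load"]
    for server, load in per_server.items():
        if load > SERVER_LOAD:
            actions.append(("reassign-from", server))         # move Range Partitions to other PSs
    return actions

print(balance([
    {"name": "A-K", "server": "PS1", "load": 1800},
    {"name": "K-R", "server": "PS1", "load": 900},
    {"name": "R-Z", "server": "PS2", "load": 20},
]))
```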

Layers - Partition Layer #2

Partition Manager

  • responsible for keeping track of and splitting the massive Object Tables into Range Partitions
  • assigns Range Partitions to Partition Servers
  • ensures that each Range Partition is assigned to only one Partition Server
  • each stamp has multiple instances of the Partition Manager running

Layers - Partition Layer #3

Partition Server

  • responsible for serving requests to a set of Range Partitions assigned to it by the PM
  • stores all the persistent state of the partitions
  • can concurrently serve multiple Range Partitions from different Object Tables

Layers - Partition Layer #4

Lock Service

  • used for leader election for the Partition Manager
  • each PS also maintains a lease with the Lock Service in order to serve partitions
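A minimal in-process stand-in for lease-based leader election (the real Lock Service is a separate Paxos-backed service, so this is only a sketch): only the PM instance holding the lease acts as the leader, and it must keep renewing the lease.

```python
import time

class LockService:
    def __init__(self, lease_seconds: float = 10.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, who: str) -> bool:
        now = time.monotonic()
        if self.holder is None or now > self.expires_at:   # lease free or expired
            self.holder, self.expires_at = who, now + self.lease_seconds
        return self.holder == who    # only one PM instance is the active leader

lock = LockService()
print(lock.try_acquire("PM-1"))   # True  -> PM-1 is the active Partition Manager
print(lock.try_acquire("PM-2"))   # False -> PM-2 stays passive until the lease expires
```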

Range Partition

Data Structure

Range Partition

Data Structure #2

Metadata Stream

  • root stream for a Range Partition
  • PM assigns a partition to a PS by providing the name of the Range Partition's stream
  • includes the names of the commit log stream and the data streams, and pointers to where to start operating in those streams

Range Partition

Data Structure #3

Commit Log Stream

  • stores the recent operations (insert, update, delete) applied to a Range Partition since the last checkpoint was generated

Range Partition

Data Structure #4

Row Data Stream

  • stores the checkpoint row data and the index for the Range Partition

Range Partition

Data Structure #5

Blob Data Stream

  • only for Blob Table
  • stores the blob data bits

Range Partition

Data Structure #6

Memory Table

  • in-memory version of the Commit Log
  • contains all the recent operations which have not yet been checkpointed to the Row Data Stream

Range Partition

Data Structure #7

Index Cache

  • stores the checkpoint indexes of the Row Data Stream
  • it is kept separate from the Row Data Cache to make sure as much of the main index as possible stays cached in memory

Range Partition

Data Structure #8

Row Data Cache

  • memory cache of the checkpoint row data pages
  • it's read-only
  • when a lookup occurs, both the Row Data Cache and the Memory Table are checked, giving preference to the Memory Table

Range Partition

Data Structure #9

Bloom Filters

  • if the data is not found in the Memory Table or the Row Data Cache, then the indexes/checkpoints in the data stream need to be searched
  • a bloom filter is kept for each checkpoint, which indicates if the row being accessed may be in the checkpoint
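A toy version of this read path (the bloom filter is faked with a plain set; real filters are probabilistic bit arrays): check the Memory Table, then the Row Data Cache, and only consult a checkpoint when its bloom filter says the row may be there.

```python
def lookup(key, memory_table, row_data_cache, checkpoints):
    if key in memory_table:                   # the Memory Table wins over the cache
        return memory_table[key]
    if key in row_data_cache:
        return row_data_cache[key]
    for bloom, checkpoint in checkpoints:     # newest checkpoint first
        if key in bloom:                      # "may be in the checkpoint"
            if key in checkpoint:             # confirm by reading the checkpoint itself
                return checkpoint[key]
    return None

memory_table = {"row1": "fresh value"}
row_data_cache = {"row2": "cached value"}
checkpoints = [({"row3"}, {"row3": "checkpointed value"})]
print(lookup("row3", memory_table, row_data_cache, checkpoints))
```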

Range Partition

Data Flow

  1. On a write request, data is appended to the Commit Log
  2. The newly changed row is put into the Memory Table
  3. Success can be returned to the client (Front-End)
  4. When the size of the Memory Table/Commit Log reaches a threshold, the PS writes the contents of the Memory Table into a checkpoint stored persistently in the Row Data Stream
  5. The corresponding portion of the Commit Log can then be removed
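A toy end-to-end sketch of this flow (threshold and structures invented): append to the commit log, apply to the memory table, checkpoint once the threshold is reached, then truncate the covered part of the commit log.

```python
CHECKPOINT_THRESHOLD = 3   # invented: number of rows before a checkpoint is taken

commit_log = []            # Commit Log Stream
memory_table = {}          # Memory Table
row_data_stream = []       # list of checkpoints (each a dict of rows)

def write(key, value):
    commit_log.append((key, value))              # 1. append the operation to the commit log
    memory_table[key] = value                    # 2. apply it to the memory table
    if len(memory_table) >= CHECKPOINT_THRESHOLD:
        checkpoint()                             # 4. checkpoint once the threshold is reached
    return "success"                             # 3. success returned to the Front-End

def checkpoint():
    row_data_stream.append(dict(memory_table))   # persist the memory table as a checkpoint
    memory_table.clear()
    commit_log.clear()                           # 5. the covered commit-log portion is removed

for i in range(4):
    write(f"row{i}", i)
print(row_data_stream, memory_table, commit_log)
```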

References

http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf

https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction

Questions?
