Under The Hood

What drives Azure Storage

About Me

Blog: http://blog.codenova.pl/

Twitter: @Kamil_Mrzyglod

GitHub: https://github.com/kamil-mrzyglod

StackOverflow: https://stackoverflow.com/users/1874991/kamo

LinkedIn: www.linkedin.com/in/kamil-mrzygłód-31470376

Introduction

Cloud storage system for storing limitless amounts of data for any duration of time

Data stored durably using both local and geographic replication

Blobs, tables, queues

In production since 2008

Features

Strong consistency

Global and Scalable Namespace/Storage

Disaster recovery

Multi-tenancy and cost

Example

Blobs, Tables, Queues:

  • 350TB of data
  • 40k transactions/sec
  • 3B transactions/day

Global namespace

http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName

  • AccountName: resolved via DNS to the cluster (stamp) that holds the account's data
  • PartitionName: locates the data within that cluster
  • ObjectName: identifies an individual object within the partition
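A minimal Python sketch of how such a URI decomposes into the three namespace components (the example account and path are made up, and each service applies its own rules for what the PartitionName is):

```python
from urllib.parse import urlparse

def parse_storage_uri(uri: str) -> dict:
    parsed = urlparse(uri)
    # AccountName.<service>.core.windows.net
    account, service = parsed.netloc.split(".")[0:2]
    partition, _, obj = parsed.path.lstrip("/").partition("/")
    return {
        "account": account,      # resolved via DNS to the stamp holding the account
        "service": service,      # blob, table or queue
        "partition": partition,  # PartitionName: locates the data within the stamp
        "object": obj,           # ObjectName: the individual object within the partition
    }

print(parse_storage_uri("https://myaccount.queue.core.windows.net/PartitionName/ObjectName"))
```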

Architecture - Fabric Controller

Node management

Network configuration

Health monitoring

Starting/stopping services

Service deployment

Architecture - components

Architecture - stamp

[Diagram: a data center hosting multiple storage stamps (Stamp 1, Stamp 2, Stamp 3)]

Architecture - stamp #2

A cluster of N racks

Each rack built out as a separate fault domain

Typically from 10 to 20 racks, 18 disk-heavy storage nodes per rack

Holds from 2PB to 30PB of data

Utilization ~70%

When a stamp reaches 70% utilization, inter-stamp replication is used to migrate accounts to other stamps

Architecture - Location Service(LS)

Location Service

Manages stamps

Manages account namespace

Chooses the primary stamp

Updates DNS to route requests from the account's URL to the stamp's virtual IP

Architecture - Intra-Stamp replication

  • happens on Stream Layer
  • synchronous
  • makes sure all data written into a stamp is kept durable
  • on the critical path of the customer's write requests
  • required to actually return a success to a customer
  • provides durability against hardware failures
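A toy Python sketch of this synchronous write path (class and method names are invented): success is returned only once every replica within the stamp has acknowledged the append.

```python
# Conceptual sketch only: a write is acknowledged to the client only after
# every replica of the extent (3 within a stamp) has durably appended the block.
class ExtentNode:
    def __init__(self, name: str):
        self.name = name
        self.blocks = []

    def append(self, block: bytes) -> bool:
        self.blocks.append(block)   # stand-in for writing the block to disk
        return True

def replicated_append(replicas, block: bytes) -> bool:
    # Synchronous: this append sits on the critical path of the client's request.
    acks = [en.append(block) for en in replicas]
    return all(acks)                # success only if all replicas acknowledged

replicas = [ExtentNode(f"EN{i}") for i in range(3)]
print("success" if replicated_append(replicas, b"some data") else "retry")
```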

Architecture - Inter-Stamp replication

  • happens on Partition Layer
  • asynchronous
  • replicates data across stamps
  • focused on replicating objects
  • used for keeping a copy of account's data
  • used also for migrating data between stamps
  • provides durability against geo-disasters
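For contrast, a toy sketch of the asynchronous path (all names invented): the write is acknowledged as soon as the primary stamp has it durably, and a background worker replays the object-level change on the secondary stamp.

```python
# Conceptual sketch only: inter-stamp replication is off the critical path.
import queue, threading, time

geo_queue = queue.Queue()            # object changes waiting to be geo-replicated
primary_stamp, secondary_stamp = {}, {}

def write(key, value):
    primary_stamp[key] = value       # durable in the primary stamp (intra-stamp replication)
    geo_queue.put((key, value))      # queued for asynchronous inter-stamp replication
    return "success"                 # returned before the secondary stamp has the data

def geo_replicator():
    while True:
        key, value = geo_queue.get()
        secondary_stamp[key] = value # replay the object change on the secondary stamp

threading.Thread(target=geo_replicator, daemon=True).start()
print(write("account1/container/blob1", b"payload"))
time.sleep(0.1)                      # give the background replicator a moment
print(secondary_stamp)
```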

Layers - Front-End Layer

  • stateless servers that take incoming requests
  • route requests to a Partition Server in the Partition Layer
  • cache the Partition Map and use it to determine which Partition Server to route a request to
  • stream large objects directly from the Stream Layer
  • cache frequently accessed data
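A rough sketch of the Partition Map lookup (the key ranges and server names below are invented): the Front-End picks the Partition Server whose Range Partition contains the request's PartitionName.

```python
# Partition Map: list of (low_key, high_key, partition_server) ranges.
PARTITION_MAP = [
    ("",        "harry",   "PS1"),
    ("harry",   "richard", "PS2"),
    ("richard", "\uffff",  "PS3"),
]

def route(partition_name: str) -> str:
    # Find the Range Partition whose key range contains the PartitionName.
    for low, high, server in PARTITION_MAP:
        if low <= partition_name < high:
            return server
    raise KeyError(partition_name)

print(route("kamil"))   # -> PS2
```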

Layers - Stream Layer

Block

  • minimum unit of data
  • up to N bytes
  • appended to an extent
  • blocks don't have to be the same size

Extent

  • unit of replication (3 replicas within a stamp)
  • consists of blocks
  • 1GB target size

Stream

  • looks like a big file
  • can be randomly read from
  • an ordered list of pointers to extents
  • only the last extent in a stream can be appended to (all prior extents in a stream are immutable)
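A toy model of the block/extent/stream hierarchy (sizes scaled down, names invented): appends go to the last extent, and a sealed extent never changes.

```python
TARGET_EXTENT_SIZE = 1024          # stand-in for the ~1GB target size

class Extent:
    def __init__(self):
        self.blocks = []           # blocks do not have to be the same size
        self.sealed = False
        self.size = 0

    def append(self, block: bytes):
        assert not self.sealed
        self.blocks.append(block)
        self.size += len(block)
        if self.size >= TARGET_EXTENT_SIZE:
            self.sealed = True     # sealed extents are immutable

class Stream:
    def __init__(self):
        self.extents = [Extent()]  # ordered list of pointers to extents

    def append(self, block: bytes):
        if self.extents[-1].sealed:
            self.extents.append(Extent())   # only the last extent accepts appends
        self.extents[-1].append(block)

s = Stream()
for _ in range(5):
    s.append(b"x" * 300)
print(len(s.extents), [e.sealed for e in s.extents])   # 2 extents, first one sealed
```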

Layers - Stream Layer #2

Stream Manager

  • keeps track of which extents are in which stream
  • a standard Paxos cluster
  • maintains the stream namespace
  • monitors the health of ENs
  • creates and assigns extents to ENs
  • performs the lazy re-replication of extents
  • garbage collects extents
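A sketch of what lazy re-replication could look like (data structures invented): when an EN fails, the SM creates new replicas for every extent that dropped below three copies.

```python
import random

extent_replicas = {                 # extent id -> set of ENs holding a replica
    "extent-1": {"EN1", "EN2", "EN3"},
    "extent-2": {"EN2", "EN3", "EN4"},
}
healthy_ens = {"EN1", "EN2", "EN3", "EN4", "EN5"}

def on_en_failure(failed_en: str):
    healthy_ens.discard(failed_en)
    for extent, ens in extent_replicas.items():
        ens.discard(failed_en)
        while len(ens) < 3:                          # restore the replication factor
            ens.add(random.choice(sorted(healthy_ens - ens)))

on_en_failure("EN3")
print(extent_replicas)
```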

Layers - Stream Layer #3

Extent Node

  • maintains the storage for a set of extent replicas
  • N disks attached, completely under control of EN
  • knows nothing about streams, deals only with extents and blocks
  • internally each extent is a file, which holds data blocks and their checksums + an index, which maps extent offsets to blocks
  • each EN contains a view about extents it owns
  • ENs talk only to other ENs to replicate block writes
  • when an extent is no longer referenced, the SM garbage collects it and notifies the ENs
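A toy sketch of an extent laid out as a single file (the on-disk format here is invented): data blocks with checksums, plus an index that maps extent offsets to blocks.

```python
import zlib

class ExtentFile:
    def __init__(self):
        self.data = bytearray()
        self.index = []                      # list of (offset, length, checksum)

    def append_block(self, block: bytes):
        offset = len(self.data)
        self.index.append((offset, len(block), zlib.crc32(block)))
        self.data += block

    def read(self, offset: int) -> bytes:
        for start, length, checksum in self.index:    # index maps offsets to blocks
            if start <= offset < start + length:
                block = bytes(self.data[start:start + length])
                assert zlib.crc32(block) == checksum  # verify the checksum on read
                return block
        raise IndexError(offset)

f = ExtentFile()
f.append_block(b"hello")
f.append_block(b"world")
print(f.read(5))   # -> b'world'
```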

Layers - Partition Layer

  • stores different types of objects
  • understands what a transaction means
  • provides a data model for the different types of objects
  • provides the processing logic for those objects
  • load balances access to objects
  • provides massively scalable namespace

Partition Layer - Object Tables

[Diagram: an Object Table is split into Range Partitions, each served by a Partition Server]

Object tables

Examples

Account Table

Stores metadata and configuration for each account assigned to a stamp

Blob Table

Stores all blobs for all accounts within a stamp

Entity Table

Stores all entity rows for all accounts within a stamp

Message Table

Stores all messages for all queues

Schema Table

Keeps track of the schema of all OTs

Partition Table

Keeps track of the current Range Partitions for all OTs and what Partition Server is serving each Range Partition

Partition Layer

Range partitions load balancing

Partition Manager

Load Balance: too much traffic on a PS, so re-assign Range Partitions to less loaded PSs

Split: a Range Partition has too much load, so split it into smaller ones and load balance them across different PSs

Merge: merge cold or lightly loaded Range Partitions
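A toy decision procedure for these three operations (all thresholds and names are invented): split hot Range Partitions, flag cold ones for merging, and move load off overloaded Partition Servers.

```python
SPLIT_LOAD, MERGE_LOAD, SERVER_LOAD = 1000, 50, 2500   # invented thresholds (req/sec)

def balance(partitions):
    """partitions: list of dicts with 'name', 'server' and 'load'."""
    actions = []
    for p in partitions:
        if p["load"] > SPLIT_LOAD:
            actions.append(("split", p["name"]))             # split, then re-assign the halves
        elif p["load"] < MERGE_LOAD:
            actions.append(("merge-candidate", p["name"]))    # merge with an adjacent cold partition
    per_server = {}
    for p in partitions:
        per_server[p["server"]] = per_server.get(p["server"], 0) + p["load"]
    for server, load in per_server.items():
        if load > SERVER_LOAD:
            actions.append(("reassign-from", server))         # move Range Partitions to other PSs
    return actions

print(balance([
    {"name": "A-K", "server": "PS1", "load": 1800},
    {"name": "K-R", "server": "PS1", "load": 900},
    {"name": "R-Z", "server": "PS2", "load": 20},
]))
```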

Layers - Partition Layer #2

Partition Manager

  • responsible for keeping track of and splitting the massive Object Tables into Range Partitions
  • assigns Range Partitions to Partition Servers
  • ensures that each Range Partition is assigned to only one Partition Server
  • each stamp has multiple instances of the Partition Manager running

Layers - Partition Layer #3

Partition Server

  • responsible for serving requests to a set of Range Partitions assigned to it by the PM
  • stores all the persistent state of the partitions
  • can concurrently serve multiple Range Partitions from different Object Tables

Layers - Partition Layer #4

Lock Service

  • used for leader election for the Partition Manager
  • each PS also maintains a lease with the Lock Service in order to serve partitions
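A minimal in-process stand-in for lease-based leader election (the real Lock Service is a separate Paxos-backed service, so this is only a sketch): only the PM instance holding the lease acts as the leader, and it must keep renewing the lease.

```python
import time

class LockService:
    def __init__(self, lease_seconds: float = 10.0):
        self.lease_seconds = lease_seconds
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, who: str) -> bool:
        now = time.monotonic()
        if self.holder is None or now > self.expires_at:   # lease free or expired
            self.holder, self.expires_at = who, now + self.lease_seconds
        return self.holder == who    # only one PM instance is the active leader

lock = LockService()
print(lock.try_acquire("PM-1"))   # True  -> PM-1 is the active Partition Manager
print(lock.try_acquire("PM-2"))   # False -> PM-2 stays passive until the lease expires
```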

Range Partition

Data Structure

Range Partition

Data Structure #2

Metadata Stream

  • root stream for a Range Partition
  • PM assigns a partition to a PS by providing the name of the Range Partition's stream
  • includes the names of the commit log stream and the data streams, and pointers to where to start operating in those streams

Range Partition

Data Structure #3

Commit Log Stream

  • stores the recent operations (insert, update, delete) applied to a Range Partition since the last checkpoint was generated

Range Partition

Data Structure #4

Row Data Stream

  • stores the checkpoint row data and the index for the Range Partition

Range Partition

Data Structure #5

Blob Data Stream

  • only for Blob Table
  • stores the blob data bits

Range Partition

Data Structure #6

Memory Table

  • in-memory version of the Commit Log
  • contains all the recent operations which have not yet been checkpointed to the Row Data Stream

Range Partition

Data Structure #7

Index Cache

  • stores the checkpoint indexes of the Row Data Stream
  • it is kept separate from the Row Data Cache to make sure as much of the main index as possible stays cached in memory

Range Partition

Data Structure #8

Row Data Cache

  • memory cache of the checkpoint row data pages
  • it's read-only
  • when a lookup occurs, both the Row Data Cache and the Memory Table are checked, giving preference to the Memory Table

Range Partition

Data Structure #9

Bloom Filters

  • if the data is not found in the Memory Table or the Row Data Cache, then the indexes/checkpoints in the data stream need to be searched
  • a bloom filter is kept for each checkpoint, which indicates if the row being accessed may be in the checkpoint
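A toy version of this read path (the bloom filter is faked with a plain set; real filters are probabilistic bit arrays): check the Memory Table, then the Row Data Cache, and only consult a checkpoint when its bloom filter says the row may be there.

```python
def lookup(key, memory_table, row_data_cache, checkpoints):
    if key in memory_table:                   # the Memory Table wins over the cache
        return memory_table[key]
    if key in row_data_cache:
        return row_data_cache[key]
    for bloom, checkpoint in checkpoints:     # newest checkpoint first
        if key in bloom:                      # "may be in the checkpoint"
            if key in checkpoint:             # confirm by reading the checkpoint itself
                return checkpoint[key]
    return None

memory_table = {"row1": "fresh value"}
row_data_cache = {"row2": "cached value"}
checkpoints = [({"row3"}, {"row3": "checkpointed value"})]
print(lookup("row3", memory_table, row_data_cache, checkpoints))
```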

Range Partition

Data Flow

  1. On a write request, data is appended to the Commit Log
  2. The newly changed row is put into the Memory Table
  3. Success can be returned to the client (Front-End)
  4. When the size of the Memory Table/Commit Log reaches a threshold, the PS writes the contents of the Memory Table into a checkpoint stored persistently in the Row Data Stream
  5. The corresponding portion of the Commit Log can then be removed
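A toy end-to-end sketch of this flow (threshold and structures invented): append to the commit log, apply to the memory table, checkpoint once the threshold is reached, then truncate the covered part of the commit log.

```python
CHECKPOINT_THRESHOLD = 3   # invented: number of rows before a checkpoint is taken

commit_log = []            # Commit Log Stream
memory_table = {}          # Memory Table
row_data_stream = []       # list of checkpoints (each a dict of rows)

def write(key, value):
    commit_log.append((key, value))              # 1. append the operation to the commit log
    memory_table[key] = value                    # 2. apply it to the memory table
    if len(memory_table) >= CHECKPOINT_THRESHOLD:
        checkpoint()                             # 4. checkpoint once the threshold is reached
    return "success"                             # 3. success returned to the Front-End

def checkpoint():
    row_data_stream.append(dict(memory_table))   # persist the memory table as a checkpoint
    memory_table.clear()
    commit_log.clear()                           # 5. the covered commit-log portion is removed

for i in range(4):
    write(f"row{i}", i)
print(row_data_stream, memory_table, commit_log)
```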

References

http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf

https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction

Questions?
