Under The Hood
What drives Azure Storage
About Me

Blog: http://blog.codenova.pl/
Twitter: @Kamil_Mrzyglod
GitHub: https://github.com/kamil-mrzyglod
StackOverflow: https://stackoverflow.com/users/1874991/kamo
LinkedIn: www.linkedin.com/in/kamil-mrzygłód-31470376
Introduction
Cloud storage system for storing limitless amounts of data for any duration of time
Data stored durably using both local and geographic replication
Blobs, tables, queues
In production since 2008

Features
Strong consistency
Global and Scalable Namespace/Storage
Disaster recovery
Multi-tenancy and cost

Example
One production workload from the referenced SOSP paper (an engine ingesting Facebook and Twitter data for Bing) uses Blobs, Tables and Queues together:
- ~350TB of data
- up to 40k transactions/sec
- ~3B transactions/day
Global namespace
http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
- AccountName: used by DNS to locate the storage cluster (stamp) where the data lives
- PartitionName: locates the data within the cluster once the request arrives there
- ObjectName: identifies an individual object within the partition (see the sketch below)
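A minimal Python sketch of how the namespace decomposes; the account, service, partition and object names below are made up for illustration:

def storage_url(account, service, partition, obj, https=True):
    """Compose an Azure Storage URL from its namespace parts."""
    scheme = "https" if https else "http"
    return f"{scheme}://{account}.{service}.core.windows.net/{partition}/{obj}"

# DNS resolves the host part to a stamp's VIP, the stamp uses the
# partition name to locate the data, and the object name selects
# an individual object within that partition.
print(storage_url("myaccount", "blob", "mycontainer", "photo.jpg"))
# -> https://myaccount.blob.core.windows.net/mycontainer/photo.jpg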
Architecture - Fabric Controller
Node management
Network configuration
Health monitoring
Starting/stopping services
Service deployment
Architecture - components
Architecture - stamp
(Diagram: a data center hosting multiple independent storage stamps - Stamp 1, Stamp 2, Stamp 3)
Architecture - stamp #2
A cluster of N racks
Each rack built out as a separate fault domain
Typically from 10 to 20 racks, 18 disk-heavy storage nodes per rack
Holds from 2PB to 30PB of data
Kept at ~70% utilization (capacity, transactions, bandwidth)
When a stamp reaches 70% utilization, the Location Service migrates accounts to other stamps via inter-stamp replication
Architecture - Location Service (LS)
Location Service
Manages stamps
Manages account namespace
Chooses the primary stamp
Updates DNS to route requests from the account's URL to the stamp's virtual IP (VIP)
Architecture - Intra-Stamp replication
- happens on the Stream Layer
- synchronous
- ensures that all data written into a stamp is kept durable within that stamp
- on the critical path of the customer's write requests
- must complete before success is returned to the customer (see the sketch below)
- provides durability against hardware failures
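A toy sketch of that synchronous contract; the Replica class and the sequential calls are stand-ins (the paper describes a chain of three extent replicas):

class Replica:
    """Stand-in for one extent replica on an Extent Node."""
    def __init__(self):
        self.blocks = []

    def store(self, data: bytes) -> bool:
        self.blocks.append(data)   # pretend this is a durable write
        return True

def append_block(replicas, data: bytes) -> bool:
    # Success is returned to the customer only after ALL replicas
    # (three per extent within a stamp) have stored the block.
    return all(replica.store(data) for replica in replicas)

replicas = [Replica(), Replica(), Replica()]
assert append_block(replicas, b"payload")   # now, and only now, ack the client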
Architecture - Inter-Stamp replication
- happens on the Partition Layer
- asynchronous, off the critical path of customer requests
- replicates data across stamps
- focused on replicating objects and the transactions applied to them
- used for keeping a secondary copy of an account's data
- also used for migrating accounts between stamps
- provides durability against geo-disasters
Layers - Front-End Layer
- stateless servers that take incoming requests
- route each request to a Partition Server in the Partition Layer
- cache the Partition Map and use it to determine which Partition Server to route each request to
- stream large objects directly from the Stream Layer
- cache frequently accessed data
Layers - Stream Layer
Block
- minimum unit of data for writing/reading
- up to N bytes (4MB in the referenced paper)
- appended to an extent
- blocks don't have to be the same size
Extent
- unit of replication (three replicas within a stamp)
- consists of a sequence of blocks
- 1GB target size
Stream
- looks like a big file to the Partition Layer
- can be randomly read from
- an ordered list of pointers to extents
- only the last extent in a stream can be appended to; all prior extents are immutable (see the sketch below)
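A rough Python sketch of the block/extent/stream relationship; the sizes and the sealing logic are simplified assumptions, not the real implementation:

from dataclasses import dataclass, field
from typing import List

BLOCK_LIMIT = 4 * 1024 * 1024   # assumed per-block cap (4MB in the paper)
EXTENT_TARGET = 1024 ** 3       # ~1GB target extent size

@dataclass
class Block:
    data: bytes                 # variable-sized: blocks need not be equal

@dataclass
class Extent:
    blocks: List[Block] = field(default_factory=list)
    sealed: bool = False        # sealed extents are immutable

    def size(self) -> int:
        return sum(len(b.data) for b in self.blocks)

@dataclass
class Stream:
    extents: List[Extent] = field(default_factory=list)  # ordered extent pointers

    def append(self, data: bytes) -> None:
        """Only the last, unsealed extent of a stream accepts appends."""
        assert len(data) <= BLOCK_LIMIT
        if not self.extents or self.extents[-1].sealed:
            self.extents.append(Extent())
        last = self.extents[-1]
        last.blocks.append(Block(data))
        if last.size() >= EXTENT_TARGET:
            last.sealed = True  # seal it; further appends open a new extent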
Layers - Stream Layer #2
Stream Manager
- keeps track of which extents are in which stream
- a standard Paxos cluster
- maintains the stream namespace
- monitors the health of the ENs
- creates and assigns extents to ENs
- performs the lazy re-replication of extents
- garbage collects extents
Layers - Stream Layer #3
Extent Node
- maintains the storage for a set of extent replicas assigned to it by the SM
- N disks attached, completely under the EN's control
- knows nothing about streams, deals only with extents and blocks
- internally, each extent is a file holding data blocks and their checksums, plus an index that maps extent offsets to blocks (see the sketch below)
- each EN keeps a view of the extents it owns
- ENs talk only to other ENs to replicate block writes
- when an extent is no longer referenced, the SM garbage collects it and notifies the ENs
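A toy model of the per-extent file layout an EN maintains; the class and its on-disk layout are assumptions for illustration:

import zlib

class ExtentFile:
    """One extent on disk: data blocks with checksums plus an offset index."""
    def __init__(self):
        self.blocks = []   # list of (data, checksum) pairs
        self.index = {}    # extent offset -> position in self.blocks
        self.length = 0

    def append_block(self, data: bytes) -> None:
        self.index[self.length] = len(self.blocks)
        self.blocks.append((data, zlib.crc32(data)))
        self.length += len(data)

    def read_block(self, offset: int) -> bytes:
        data, checksum = self.blocks[self.index[offset]]
        if zlib.crc32(data) != checksum:   # verify integrity on every read
            raise IOError("checksum mismatch: corrupt block")
        return data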
Layers - Partition Layer
- stores the different types of objects (blobs, tables, queues)
- understands what a transaction means for each object type
- provides the data model for the different object types
- provides the logic and semantics to process them
- load balances access to objects
- provides a massively scalable namespace
Partition Layer - Object Tables
An Object Table is broken into contiguous Range Partitions, and each Range Partition is served by exactly one Partition Server.
Object tables
Examples
Account Table: stores metadata and configuration for each account assigned to a stamp
Blob Table: stores all blobs for all accounts within a stamp
Entity Table: stores all entity rows for all accounts within a stamp
Message Table: stores all messages for all queues within a stamp
Schema Table: keeps track of the schema of all Object Tables (OTs)
Partition Table: keeps track of the current Range Partitions for all OTs and which Partition Server is serving each Range Partition (see the routing sketch below)
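A minimal sketch of the routing the Partition Table enables; the map contents and server names are invented, and real keys are (AccountName, PartitionName) pairs:

import bisect

# Hypothetical partition map: sorted, contiguous, non-overlapping key ranges.
PARTITION_MAP = [
    ("",  "m",      "PS-01"),
    ("m", "t",      "PS-02"),
    ("t", "\uffff", "PS-03"),
]
_LOWS = [low for low, _, _ in PARTITION_MAP]

def find_partition_server(key: str) -> str:
    """Return the Partition Server serving the Range Partition owning `key`."""
    i = bisect.bisect_right(_LOWS, key) - 1
    low, high, server = PARTITION_MAP[i]
    assert low <= key < high
    return server

print(find_partition_server("account1/container/blob"))   # -> PS-01

The Partition Map cached by the Front-End Layer serves exactly this routing purpose.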
Partition Layer
Range partitions load balancing
The Partition Manager performs three operations (sketched below):
- Load Balance: too much traffic on a PS, so re-assign Range Partitions to less loaded PSs
- Split: a Range Partition has too much load, so split it into smaller ones and load balance them across PSs
- Merge: merge cold or lightly loaded Range Partitions together
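A schematic of how a PM might pick among the three operations; the thresholds and the load metric are invented for the sketch:

def pm_decide(range_partitions, hot=100.0, cold=1.0):
    """Toy policy over (name, load) pairs; load is an abstract tx/sec figure."""
    decisions = []
    for name, load in range_partitions:
        if load > hot:
            decisions.append(("split", name))   # too hot: split, then spread out
        elif load < cold:
            decisions.append(("merge", name))   # cold: merge with a neighbour
        # anything in between may still be re-assigned for plain load balancing
    return decisions

print(pm_decide([("RP-a", 250.0), ("RP-b", 0.2), ("RP-c", 40.0)]))
# [('split', 'RP-a'), ('merge', 'RP-b')]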
Layers - Partition Layer #2
Partition Manager
- responsible for keeping track of and splitting the massive Object Tables into Range Partitions
- assigns Range Partitions to Partition Servers
- ensures that each Range Partition is assigned to exactly one Partition Server
- each stamp has multiple instances of the Partition Manager running
Layers - Partition Layer #3
Partition Server
- responsible for serving requests to the set of Range Partitions assigned to it by the PM
- stores all the persistent state of its partitions in streams
- can concurrently serve multiple Range Partitions from different Object Tables
Layers - Partition Layer #4
Lock Service
- used for leader election of the Partition Manager
- each PS also maintains a lease with the Lock Service in order to serve partitions
Range Partition
Data Structure
Range Partition
Data Structure #2
Metadata Stream
- the root stream for a Range Partition
- the PM assigns a partition to a PS by providing the name of the Range Partition's metadata stream
- contains the names of the commit log stream and the data streams, plus pointers for where to start operating in those streams
Range Partition
Data Structure #3
Commit Log Stream
- stores the recent insert, update and delete operations applied to the Range Partition since the last checkpoint was generated
Range Partition
Data Structure #4
Row Data Stream
- stores the checkpoint row data and index for the Range Partition
Range Partition
Data Structure #5
Blob Data Stream
- only for Blob Table
- stores the blob data bits
Range Partition
Data Structure #6
Memory Table
- the in-memory version of the Commit Log
- contains all the recent operations that have not yet been checkpointed to the Row Data Stream
Range Partition
Data Structure #7
Index cache
- stores the checkpoint indexes of the Row Data Stream
- kept separate from the Row Data cache, to keep as much of the main index cached in memory as possible
Range Partition
Data Structure #8
Row Data cache
- memory cache of the checkpoint row data pages
- it's read-only
- when a lookup occurs, both the Row Data cache and the Memory Table are checked, giving preference to the Memory Table (see the sketch below)
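A sketch of that read preference, with plain dicts standing in for the real structures:

def lookup_row(key, memory_table, row_data_cache, checkpoints):
    # 1. The Memory Table holds the newest, not-yet-checkpointed writes,
    #    so it always wins over cached checkpoint pages.
    if key in memory_table:
        return memory_table[key]
    # 2. Otherwise try the read-only Row Data cache of checkpoint pages.
    if key in row_data_cache:
        return row_data_cache[key]
    # 3. Fall back to searching the persisted checkpoints (bloom filters,
    #    next slide, let most of these be skipped).
    for checkpoint in checkpoints:
        if key in checkpoint:
            return checkpoint[key]
    return None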
Range Partition
Data Structure #9
Bloom Filters
- if the data is not found in the Memory Table or the Row Data cache, the indexes/checkpoints in the data stream need to be searched
- a bloom filter is kept for each checkpoint; it indicates whether the row being accessed may be in that checkpoint, so checkpoints that definitely don't contain it can be skipped (see the sketch below)
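A minimal bloom filter sketch showing why a checkpoint can be skipped safely; the bit count and hash count are arbitrary:

import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size, self.hashes, self.bits = size_bits, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent": skip this checkpoint entirely.
        # True means only "maybe present": the checkpoint must be searched.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
assert bf.might_contain("row-42")   # always true for added rows
# bf.might_contain("row-999") is almost certainly False -> checkpoint skipped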
Range Partition
Data Flow
- On a write request, the operation is appended to the Commit Log
- The newly changed row is put into the Memory Table
- At that point, success can be returned to the client (via the Front-End)
- When the size of the Memory Table or Commit Log reaches a threshold, the PS writes the contents of the Memory Table into a checkpoint stored persistently in the Row Data Stream
- The corresponding portion of the Commit Log can then be removed (a toy version of the whole flow is sketched below)
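Putting the pieces together, a toy version of this write path; the count-based threshold is invented (the real one is size-based):

class RangePartition:
    CHECKPOINT_THRESHOLD = 4        # toy limit for the sketch

    def __init__(self):
        self.commit_log = []        # stand-in for the Commit Log Stream
        self.memory_table = {}      # in-memory view of the commit log
        self.checkpoints = []       # stand-in for checkpoints in the Row Data Stream

    def write(self, key, value) -> str:
        self.commit_log.append((key, value))   # 1. log the operation durably
        self.memory_table[key] = value         # 2. apply it to the Memory Table
        result = "success"                     # 3. success can now be returned
        if len(self.memory_table) >= self.CHECKPOINT_THRESHOLD:
            self._checkpoint()
        return result

    def _checkpoint(self) -> None:
        self.checkpoints.append(dict(self.memory_table))  # persist a checkpoint
        self.memory_table.clear()
        self.commit_log.clear()     # the logged portion is now covered above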
References
Calder et al., "Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency", SOSP 2011: http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf
"Introduction to Azure Storage", Microsoft Docs: https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction
Questions?