If possible, please have Instapoll open in another window/tab/app so that we don't lose too much time for the quizzes.
Just like last time, feel free to leave questions in the chat, use the "raise hand" button, or just unmute yourself and interrupt me.
We've added persistent storage to our computers to make sure that power-offs don't completely destroy all our data.
Disks are great!
Then we added glue code to make sure we could interact with this storage in a reasonable manner. This resulted in a filesystem.
We evaluated the filesystem based on three goals:
(made of many little pictures)
CPU
We start knowing the i# of the root directory (usually 2).
We have enough space in memory to store two blocks worth of data
Everything else has to be requested from disk.
The request must be in the form of a block#. E.g. we can request "read block 27", but we cannot request "read next block" or "read next file"
You may assume that we already know the i# of the file header.
CPU
Instead of forcing users to refer to files by the inumbers, we created directories, which were mappings from file names to inumbers.
File Name | inode number |
---|---|
.bashrc | 27 |
Documents | 30 |
Pictures | 3392 |
.ssh | 7 |
Users can refer to files using paths, which are a series of directory names.
The root is a special directory, and its name is "/".
Traversing paths from the root can require many, many disk lookups. We optimized this by maintaining a current working directory.
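The walk above can be sketched in a few lines. This is a toy model (the dict-based "disk", inumbers, and file names are all invented for illustration), but it shows why each path component costs a directory lookup, and how a current working directory lets us skip the walk from the root:

```python
# Sketch: resolving a path by repeated directory lookups.
# The "disk" and all inumbers here are invented for illustration.

ROOT_INUM = 2  # we start knowing only the root directory's i#

# Mock disk: maps inumber -> directory contents (name -> inumber).
directories = {
    2: {"Documents": 30, "Pictures": 3392},  # root "/"
    30: {"notes.txt": 101},
}

def resolve(path, cwd_inum=ROOT_INUM):
    """Walk each component of `path`; every step is a directory lookup."""
    inum = ROOT_INUM if path.startswith("/") else cwd_inum
    for name in path.strip("/").split("/"):
        if name:
            inum = directories[inum][name]  # in reality: several disk reads
    return inum

print(resolve("/Documents/notes.txt"))       # absolute path: walk from root
print(resolve("notes.txt", cwd_inum=30))     # relative path: start at cwd
```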
NTFS, the "New Technology File System," was released by Microsoft in July of 1993.
It remains the default filesystem for all Windows PCs--if you've ever used Windows, you've used an NTFS filesystem.
NTFS uses two new (to us!) ideas to track its files: extents and flexible trees.
Track a range of contiguous blocks instead of a single block.
Example: a direct-allocated file uses blocks 192,193,194,657,658,659
Using extents, we could store this as (192,3), (657, 3)
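The compression step can be sketched directly (a minimal illustration, not NTFS's actual on-disk encoding): scan the block list and extend the current run when the next block is contiguous, otherwise start a new extent.

```python
# Sketch: compressing a block list into (start, length) extents.
def to_extents(blocks):
    extents = []
    for b in blocks:
        if extents and b == extents[-1][0] + extents[-1][1]:
            # contiguous with the current run: extend it
            extents[-1] = (extents[-1][0], extents[-1][1] + 1)
        else:
            # gap: start a new extent
            extents.append((b, 1))
    return extents

print(to_extents([192, 193, 194, 657, 658, 659]))  # -> [(192, 3), (657, 3)]
```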
Files are represented by a variable-depth tree.
A Master File Table (MFT) stores the trees' roots
Not enough space for data!
Small file
Medium file
Large file
For really, really large files, even the attribute list might become nonresident!
$MFT (file 0) stores the Master File Table
MFT can start small and grow dynamically--to avoid fragmentation, NTFS reserves part of the volume for MFT expansion.
$Secure (file 9) stores access controls for every file (note: not stored directly in the file record!)
File is indexed by a fixed-length key, stored in the Std. Info field of the file record.
NTFS stores most metadata in ordinary files with well-known numbers
Last lecture, we made somewhat oblique references to some on-disk filesystem information.
On create(), the OS will allocate disk space, check quotas, permissions, etc.
There's also some fundamental information that we just need to keep in the filesystem!
We also need to handle data like file headers, free space, etc.
Partitions separate a disk into multiple filesystems.
Partitions can (effectively) be treated as independent filesystems occupying the same disk.
Not to scale
The Partition Table (stored as a part of the GPT Header) says where the various partitions are located.
Within the partition (filesystem), the superblock records information like where the inode arrays start, block sizes, and how to manage free disk space.
$ sudo e2fsck /dev/sdb1
Password:
e2fsck 1.42 (29-Nov-2011)
e2fsck: Superblock invalid, trying backup blocks...
e2fsck: Bad magic number in super-block while trying to open /dev/sdb1
One of the things we need to track is which blocks on the disk are free/in use.
What techniques do we know of to do this?
Where else have we seen this technique?
It shows up everywhere!
But there's another allocation technique...
Let's say that we know ahead of time that we're going to be getting a lot of memory allocation requests between 4000 bytes and 4080 bytes.
Let's just lay out a bunch of regions in memory that are all 4096 bytes large! We'll call these fixed-size memory regions chunks.
When a request comes in, we can just grab the first free chunk and give it to whoever's making the request.
How much data do we need to record that a given chunk has been used?
Can we store this data without needing additional space?
i-th bit represents whether i-th chunk is used.
Our chunk size was 4096 bytes. Suppose we want 32GB (2^35 bytes) to allocate with chunks. This is a lot of memory already!
How much memory do we need to dedicate to the bitmap?
This is a ratio of 32768:1.
To achieve this ratio with malloc(), the average allocation needs to be over 390,000 bytes.
It is much faster than linked list tracking (chasing linked lists is slow).
However, it cannot always allocate memory larger than one chunk (why?), and it can suffer from internal fragmentation.
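A minimal bitmap-backed slab allocator might look like the sketch below (sizes and names are invented for illustration; a real allocator would track the scan position and handle concurrency). One bit per chunk is all the bookkeeping we need:

```python
# Sketch of a bitmap slab allocator: one bit per fixed-size chunk.
CHUNK = 4096
NCHUNKS = 1024                      # tiny, for illustration

bitmap = bytearray(NCHUNKS // 8)    # 128 bytes tracks 4 MiB of chunks

def alloc():
    """Scan for the first free chunk, mark it used, return its address."""
    for i in range(NCHUNKS):
        if not bitmap[i // 8] & (1 << (i % 8)):
            bitmap[i // 8] |= 1 << (i % 8)
            return i * CHUNK
    raise MemoryError("no free chunks")

def free(addr):
    """Clear the bit for the chunk containing addr."""
    i = addr // CHUNK
    bitmap[i // 8] &= ~(1 << (i % 8))

a = alloc()
b = alloc()
free(a)
assert alloc() == a  # first-free scan immediately reuses the freed chunk
```

Note how the overhead matches the slide's arithmetic: 4096-byte chunks need 1 bit each, so 32 GiB (2^35 bytes) needs 2^35 / 2^12 / 8 = 1 MiB of bitmap, a 32768:1 ratio.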
It turns out in this case (as in many cases), Pintos does exactly what it should!
[1] e.g. palloc_get_page(), palloc_free_page()
Slab Allocator Property | Filesystem Goal |
---|---|
Very little extra space used to track free space | Minimize amount of space used for control structures |
Can only allocate in chunks | Can only access filesystem in blocks |
Cannot always allocate contiguous chunks | Does not require contiguous blocks (they're nice for extents, but not required) |
Can rapidly scan many chunks with a single read operation | Minimize the number of block reads needed to find free space |
Can track two different slabs easily -- just create two different bitmaps | Want to track free inodes and free data blocks separately |
Suffers from internal fragmentation | Was already suffering from internal fragmentation, so it doesn't really matter |
To create a file, we need to allocate one inode and one data block.
The superblock contains information about the filesystem: the type, the block size, and where the other important areas start (e.g. inode array)
A series of bitmaps are used to track free blocks separately for inodes and file data, a so-called zoned slab allocator.
Inode arrays contain important file metadata.
There may be backup superblocks scattered around the disk.
* many modern filesystems reserve about 10% of data blocks as "wiggle room" to optimize file locality.
Our primary measure of resilience will be consistency.
Consistency: Does my data agree with...itself?
There's a lot of work out there for guaranteeing the "correctness" of a file system after a "failure", for many different values of correctness and failure.
But if we can't even guarantee the data is consistent, what hope do we have of stronger resilience guarantees?
append(file, buf, 4096)
What changes do we need to make to the filesystem to service this syscall?
append(file, buf, 4096)
Block size is 4KB
So...write the data block first?
Disk caches are essentially the reason we can interact with the filesystem in (what you think of as) a reasonable amount of time.
I am an OS programmer who needs to deal with writes. Should I tell the user their write succeeded even if it only made it to a cache in RAM?
Write now. Immediately write changes back to disk, while keeping a copy in cache for future reads.
Guarantees consistency, but is slow. We have to wait until the disk confirms that the data is written.
Write later. Keep the changed copy in memory (future requests to read the file use the changed in-memory copy), but defer the writing until some later point (e.g. file close, page eviction, too many write requests).
Much better performance, but modified data can be lost in a crash, causing inconsistencies.
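The two policies can be contrasted with a toy block cache (the class and its fields are invented for illustration). The key difference is where the data lives when a crash strikes:

```python
# Sketch: write-through vs. write-back for a block cache.
class Cache:
    def __init__(self, write_through):
        self.write_through = write_through
        self.mem = {}     # in-memory cached copies
        self.disk = {}    # what has actually reached the disk
        self.dirty = set()

    def write(self, block, data):
        self.mem[block] = data
        if self.write_through:
            self.disk[block] = data   # slow: wait for the disk every time
        else:
            self.dirty.add(block)     # fast: just remember to write it later

    def flush(self):
        for b in self.dirty:
            self.disk[b] = self.mem[b]
        self.dirty.clear()

wb = Cache(write_through=False)
wb.write(7, b"hello")
assert 7 not in wb.disk   # a crash here would lose block 7 entirely
wb.flush()
assert wb.disk[7] == b"hello"
```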
We've seen that a single append operation will not always result in a consistent filesystem. No ordering of write operations can prevent this.
This issue is made worse by the omnipresence of caching in the filesystem layers. Nearly every access is cached:
(no, it's not a naughty word!)
Does not guarantee that blocks are written to disk in any particular order--the filesystem can appear to reorder writes. User applications that need internal consistency have to rely on other tricks to achieve it.
To assist with consistency, UNIX uses write-through caching for metadata. If multiple metadata updates are needed, they are performed in a predefined, global order.
This should remind you of another synchronization technique that we've seen in this class. Where else have we seen predefined global orders before?
If a crash occurs:
Suppose we're creating a file in a system using direct allocation. What do we need to write?
Data block for new file
inode for new file
inode bitmap
data bitmap
directory data
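Assuming the five updates above happen in that fixed order, we can sketch what a crash after each step leaves behind (the classification mirrors the breakout exercises that follow; the step names are just labels):

```python
# Sketch: simulate a crash after each prefix of a fixed update order.
steps = ["data block", "inode", "inode bitmap", "data bitmap", "directory"]

def after_crash(completed):
    """Classify the filesystem state if the first `completed` steps finished."""
    done = set(steps[:completed])
    if "inode bitmap" not in done:
        # no bitmap update means the inode looks free: as if nothing happened
        return "looks like the file was never created"
    if "directory" not in done:
        # data and inode are intact, but no name -> inumber mapping exists
        return "recoverable, but the file's name is lost"
    return "fully consistent"

for k in range(len(steps) + 1):
    print(f"crash after {k} step(s): {after_crash(k)}")
```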
I have snapshots of several filesystems which were...unexpectedly terminated.
We think they were in the middle of creating a file when the interruption occurred. Some of them may have finished file creation. Others might not have. Fortunately, all these filesystems signed a pact to always update things in the same order.
The system is a direct-allocated filesystem using zoned bitmap tracking and a single global directory.
Ø is a null pointer. Uncolored inodes have all their pointers set to Ø.
Update order:
Blocks are zero-indexed (careful with orange/purple block!)
https://slides.com/chipbuster/18-fs-consistency/live
Work on System (Breakout Room # % 4) + 1
Update order:
Question: did it feel easy?
Update order:
...but its other data/metadata is intact! This must have failed after the data bitmap update and before the directory update.
We can recover everything about the file except for its name.
Update order:
The inodes are marked as used in the inode bitmap, so this must have failed between the inode and data bitmap updates.
We can recover everything about the file except for its name.
Update order:
What happened here?
Two possibilities:
Conclusion: failing before inode bitmap update is the same as failing in first two steps: looks like file creation never happened.
Not a very principled approach (what if there's some edge case we haven't thought about that breaks everything?)
Tricky to get correct: even minor bugs can have catastrophic consequences
Write-through caching on metadata leads to poor performance
Recovery is slow. At bare minimum, need to scan:
inode bitmap
data block bitmap
every inode (!!)
every directory (!!!)
Possibly even more work if something is actually inconsistent.
What are these things called?
Transactions group actions together so that they are:
To achieve these goals, we tentatively apply each action in the transaction.
Undoing writes on disk (rollback) is hard!
Exhibit A: All y'all that have really wished you could revert to a previous working version of Pintos at some point.
Instead, use transaction log!
TxBegin
Write Block32
Write Block33
Commit
TxFinalize
<modify blocks>
Step 1: Write each operation we intend to apply into the log, without actually applying the operation (write-ahead log)
Step 2: Write "Commit." The transaction is now made permanent.
Step 3: At some point in the future (perhaps as part of a cache flush or an fsync() call), actually make the changes.
Once the transaction is committed, the new data is the "correct" data.
So....somehow we need to present the new data even though it hasn't been written to disk?
Recovery of partial transactions can be completed using similar reasoning to fsck.
TxBegin
Write Block32
Write Block33
Commit
TxFinalize
<modify blocks>
TxBegin
Write Block32
Write Block33
Commit
TxBegin
Write Block32
Write Block33
Transaction was completed and all disk blocks were modified.
No need to do anything.
Transaction was committed, but disk blocks were not modified.
Replay transaction to ensure data is correct.
Transaction was aborted--data changes were never made visible, so we can pretend this never happened.
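The three recovery cases can be sketched as a single pass over the log (the record format here is invented; real journals use checksummed binary records): finalized transactions need nothing, committed-but-unfinalized ones are replayed, and an uncommitted tail is discarded.

```python
# Sketch of journal recovery: scan the log, classify the last transaction.
def recover(log, disk):
    txn, committed = [], False
    for rec in log:
        if rec == "TxBegin":
            txn, committed = [], False       # start a fresh transaction
        elif rec == "Commit":
            committed = True                 # transaction is now permanent
        elif rec == "TxFinalize":
            txn, committed = [], False       # blocks already written: no-op
        elif rec[0] == "write":              # ("write", block, data)
            txn.append(rec)
    if committed:
        # committed but never finalized: replay the writes to be safe
        for _, block, data in txn:
            disk[block] = data
    # uncommitted tail: aborted -- pretend it never happened

disk = {}
recover(["TxBegin", ("write", 32, "A"), ("write", 33, "B"), "Commit"], disk)
assert disk == {32: "A", 33: "B"}   # replayed

disk = {}
recover(["TxBegin", ("write", 32, "A")], disk)
assert disk == {}                   # aborted: no changes made visible
```

Replaying an already-applied transaction is harmless here, which is why "committed but maybe applied" is safe to replay unconditionally.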
From the early 90s onwards, a standard technique in filesystem design.
All metadata changes that can result in inconsistencies are written to the transaction log (the journal) before eventually being written to the appropriate blocks on disk.
Eliminates need for whole-filesystem fsck. On failure, just check the journal! Each transaction recorded must fall into one of three categories:
We want to add a new block to file Iv1.
TxBegin | Iv2 | B2 | D2 | Commit | TxFinalize
Problem: Issuing 5 sequential writes ( TxBegin | Iv2 | B2 | D2 | Commit ) is kinda slow!
Solution: Issue all five writes at once!
Problem: Disk can schedule writes out-of-order!
First write TxBegin, Iv2, B2, Commit
Then write D2!
TxBegin | Iv2 | B2 | D2 | Commit
Solution: Force disk to write all updates by issuing a sync command between "D2" and "Commit". "Commit" must not be written until all other data has made it to disk.
TxBegin | Iv2 | B2 | D2 | Commit
Use the transaction technique we saw in the Advanced Synchronization lecture for all metadata updates (regular data still written with writeback caching).
This guarantees us reliability, and avoids the overhead of an fsck scan on the entire disk (only need to scan journal!)
But now we have to write all the metadata twice! Once for the journal and then a second time to get it in the right location.
COW ("Copy-on-Write") systems are the hot new thing in production filesystems right now.
Copy-on-write is a rule for how data is written. Instead of overwriting the old value, a copy is made and the write is applied to the copy. Why would you want to do this? Well...
Unfortunately, Kevin ran out of time to make pretty pictures while making these slides, so you're getting a storybook straight out of "Kevin's Budget Animation Studio"
I have a thing!
Can I see the thing?
Sure!
Can I see the thing?
Hmm, it's pretty important...better send a copy.
Thanks! ....This looks boring.
It...didn't even use the thing! I did all that copy work for nothing...
Hey, I see you have a thing! Can I see?
Sure, I guess....
Hmmm.....this isn't quite what I needed. Thanks for helping me though!
Why am I doing all this work if I don't need to?
Hey, the others said you had a thing, can I see the thing?
You know what, just take the original.
Thanks! I wonder how it would look in red?
The left smiley had something that a lot of smileys wanted copies of.
Most of the copies were read-only copies...but it couldn't make other smileys promise that, so it started off by just making copies: if the other smiley modified the copy, the original was still intact.
Making copies was a lot of work, and most smileys weren't modifying the thing...so why not just send over the original?
Murphy's Law happened, and now the thing is changed.
I have a thing!
Can I see the thing?
Sure, but you gotta promise to ask me if you want to change anything.
Okay!
Wow! Can I try painting a stripe?
Yeah, use that copy I made for you so you don't change the original!
Cool!
Thanks!
No problem!
Only works if we can stop user from directly writing the data.
In practice, there's no "asking permission" or "making promises"--the OS just intercepts the write and makes it do the right thing, similar to a page fault.
Allows us to make lots of "copies" very quickly, since there's actually no copy made, at a slight sacrifice to write speed.
Yay! Have we changed the tree?
Let's try this again.
Almost done...what else needs to change?
Finally, declare that the new root node is the "real root."
Note: at any point before this very last change, it would look like the tree hadn't updated at all!
A BTRFS filesystem is just a bunch of trees, where both the metadata and the data are stored in blocks at the leaves, and the superblock contains the pointer to the root.
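The update-then-swap dance can be sketched with a toy tree (the `Node` class and `cow_set` helper are invented for illustration; real BTRFS trees are B-trees with checksums). No node is ever modified in place: changing a leaf copies the whole path up to a new root, and the only in-place write is the single root-pointer swap:

```python
# Sketch of a copy-on-write tree update: copy the path to the changed
# leaf, share every untouched subtree, then atomically swap the root.
class Node:
    def __init__(self, children=None, data=None):
        self.children = children or {}
        self.data = data

def cow_set(node, path, data):
    """Return a NEW root reflecting the write; `node` is never mutated."""
    if not path:
        return Node(data=data)                 # fresh copy of the leaf
    new_children = dict(node.children)         # shallow copy: share siblings
    new_children[path[0]] = cow_set(node.children[path[0]], path[1:], data)
    return Node(children=new_children)

old_root = Node(children={"a": Node(data=1), "b": Node(data=2)})
superblock = {"root": old_root}

new_root = cow_set(old_root, ["a"], 99)
superblock["root"] = new_root                  # the one atomic switch

assert new_root.children["a"].data == 99
assert old_root.children["a"].data == 1        # old root = a free snapshot
assert new_root.children["b"] is old_root.children["b"]  # subtree shared
```

Keeping `old_root` around costs nothing extra, which is exactly why COW filesystems get cheap snapshots.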
Superblock
Metadata?
Data?
Does this need to be deleted?
No! Can retain it to see what data looked like on disk previously: COW enables easy snapshots of data without having to make explicit copies!
The New Technology File System (NTFS) uses two new ideas to manage its data: extents and flexible trees.
The filesystem superblock points to the inode array, data arrays, and free space management on the filesystem.
The free space tracking usually uses zoned slab allocation, because its properties mesh very well with those of the filesystem.
The fact that disks can fail at any time leads us to potential data loss issues. The simplest of these is consistency, i.e. the metadata on a disk should agree with itself.
The in-memory disk cache makes this worse: we need this cache for reasonable performance, but it also makes consistency harder!
Three general solutions for this problem:
Updates are made in an agreed-upon order with write-through caching.
On failure, scan the entire stinking partition to see if any inconsistencies exist.
Slow to write (because of write-through caching), slow to recover, and easy to break!
Use transactions to record all metadata changes. Transactions are recorded in the journal.
On failure, scan the journal and fix up any partially completed transactions.
Fast recovery, but still requires that all metadata be written to disk twice (once for the journal, once for realsies)
Never change any data in-place, so that inconsistencies cannot arise.
To change files, make a copy of their data blocks and update the COW tree structure.
To switch to the new copy, simply change the root node in the superblock. Atomic!
Can retain old root nodes as snapshots.
Project 3 groups are due today, 4/6
Project 3 is due 4/21, and the data structures are due Friday