Live presentation link: https://slides.com/chipbuster/deck-e91d0b/live
Instapoll
If you intend to claim synchronous lecture credit, please have Instapoll open in a browser tab so that we don't have to wait for the service to start up.
These slides are written using the reveal.js editor at slid.es. You can follow along with a live presentation at https://slides.com/chipbuster/deck-e91d0b/live
(At this point, Kevin should make sure to post the link in the Zoom chat).
This link may be easier to use/more fluid if you have a low-bandwidth connection, since it needs to transmit less data. The downside is that it might be very slightly out of sync with the audio.
You will be able to follow along with my mouse movements at this link.
I will also be sharing a standard video feed in Zoom.
I will be attempting to monitor chat while giving this lecture (I have no idea how successfully). If you have a question, you can ask it there and I'll try to respond.
You can also use the raise-hand and other reactions to communicate with me, or just interrupt me if you can't use those.
(Almost) everything we've discussed so far occurs in main memory (RAM):
RAM is nice! It's relatively speedy, and you can store a lot of stuff in there.
....but it's got a major drawback.
RAM is not persistent! If the power gets cut off, all data in main memory is lost.
To solve this problem, we introduce stable storage (disks). These devices retain data even after the power to them has been shutoff (i.e. they are persistent).
Now we can turn off the computer without losing important data!
Yay!
But now we have to deal with communication between the CPU and disk, which is very different from communication with main memory!
Memory | Disk | |
---|---|---|
How to Access | ||
Time for Request | ||
While Waiting |
|
LOAD/STORE instructions
Special I/O Access Instructions
A VERY LONG TIME
Program Blocks (can be scheduled)
Program stalls (still on CPU)
A long time
(Almost) everything we've discussed so far occurs in main memory (RAM):
Now we've added persistent storage which is large, very slow, and order-dependent.
For a relatively fast hard drive, the average data access has a latency (request made to data available) of 12.0 ms.
Action | Latency | 1ns = 1s | Conversation Action |
---|---|---|---|
Values taken from "Numbers Every Programmer Should Know", 2020 edition
1s
4s
1m 40s
138d 21hr+
1ns
4ns
100ns
12ms
L1 Cache Hit
L2 Cache Hit
Cache Miss*
* a.k.a. Main Memory Reference
Disk Seek
Talking Normally
Long Pause
Go upstairs to ask Dr. Norman
Walk to Washington D.C. and back
DISK ACCESS
Other Destinations:
Unfortunately, we cannot always avoid paying the disk access cost (if we could, we wouldn't need the disk!)
DISK ACCESS
But we can use lots of common systems design tricks to make sure this doesn't affect us too badly!
Read account A from disk
Read account B from disk
Add $100 to account A
Subtract $100 from account B
Write new account B to disk
Write new account A to disk
Read account A from disk
Read account B from disk
Add $100 to account A
Subtract $100 from account B
Write new account B to disk
Write new account A to disk
The filesystem consists of raw block numbers. User programs are responsible for keeping track of which blocks they use, and for making sure they don't overwrite other program's blocks.
For example, the following filesystem is easy to implement, and is as fast and as consistent as the user chooses to make it:
A file is named by the a hash of its contents. All files are in the root directory. Filenames cannot be changed.
How fast is this design? How many disk accesses do we need in order to perform common operations?
Will this design cause application programmers to tear their hair out? Does it require some deep knowledge of the system or is that abstracted away?
Can this design become corrupted if the computer fails (e.g. through sudden power loss), or if small pieces of data are damaged? Can it be recovered? How fast is the recovery procedure?
Data is stored on the disk as a bunch of blocks. A block is the smallest unit the filesystem can read/write. Blocks are identified by their order on the disk (e.g. #3124 is the block after #3123)
Data is stored on the disk as a bunch of blocks. A block is the smallest unit the filesystem can read/write. Blocks are identified by their order on the disk (e.g. #3124 is the block after #3123)
We need some way to describe how the data is organized. These are the metadata blocks.
Once you have the file metadata, you know everything you need to access the file.
We need some way to describe how the data is organized. These are the metadata blocks.
Once you have the file metadata, you know everything you need to access the file.
There are some in-memory structures that the OS uses to track what's happening with the filesystem (e.g. which files are open, synchronization tools).
There are some in-memory structures that the OS uses to track what's happening with the filesystem (e.g. which files are open, synchronization tools).
There are also some per-process pieces of information that need to be tracked in-memory.
Where does this information live? What's an example that you've worked with before?
There are also some per-process pieces of information that need to be tracked in-memory.
Where does this information live? What's an example that you've worked with before?
Finally, we need to worry about how a user program accesses all of this!
Finally, we need to worry about how a user program accesses all of this!
So...how does a user program access system services?
?
So...how does a user program access system services?
Syscalls! create(), read(), write(), etc.
Since the application thinks in terms of filenames and open, the filesystem is responsible for translating user requests into something the lower levels can use, e.g. "the 700th byte of ~/.bashrc" might translate to "block #52730"
YOU ARE HERE
Creates a new file with some metatdata and a name.
Creates a new file with some metatdata and a name.
On create(), the OS will:
create(const char* filename);
Creates a hard link--a user-friendly name for some underlying file.
On link(), the OS will:
link(const char* name, struct inode* inode);
This new name points to the same underlying file!
Removes an existing hard link.
To delete() a file, the OS needs to:
unlink(const char* name);
The OS decrements the number of links in the file metadata. If the link count is zero after unlink, the OS can delete the file and all its resources.
Creates in-memory data structures used to manage open files. Returns integer to the caller.
open(const char* name, enum mode);
On open(), the OS needs to:
struct open_file {
struct file_header* metadata;
file_offset pos;
int file_mode; //e.g. "r" or "rw"
};
open(const char* name, enum mode);
On close(), the OS needs to:
read(file_id, file_pos, num_bytes, bufAddress)
On read(), the OS needs to:
Filesystem can also provide a read call that uses the file_pos in the open file structure.
A. Yes, Yes
B. Yes, No
C. No, Yes
D. No, No
YOU ARE HERE
YOU ARE HERE
Metadata: the file header contains information that the operating system cares about: where the file is on the disk, and attributes of the file.
Metadata for all files is stored at a fixed location (something known by the OS) so that they can be accessed easily.
Data is the stuff the user actually cares about. It consists of sectors of data placed on disk.
Examples: file owner, file size, file permissions, creation time, last modified time, location of data blocks.
META
DATA
DATA
Logical layout of a file (not necessarily how it's placed on disk!)
META
DATA
DATA
Logical layout of a file (not necessarily how it's placed on disk!)
Assume we already know where the metadata is.
Most files on a computer are small!
So we should have good support for lots of small files!
The user probably cares about accessing large files (they might be saved videos, or databases), so large file access shouldn't be too slow!
Most disk space is used by large files.
How many disk reads do we need to access a particular block?
CPU
We start knowing the block # of the appropriate file header
We have enough space in memory to store two blocks worth of data
Everything else has to be requested from disk.
The request must be in the form of a block#. E.g. we can request "read block 27", but we cannot request "read next block" or "read next file"
How many disk reads do we need to access a particular block?
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
How many disk reads do we need to access a particular block?
CPU
How many disk reads to access the first data block?
(We always start with the block# of the file header)
How many disk reads do we need to access a particular block?
CPU
How many disk reads to access the first data block?
(We always start with the block# of the file header)
How many disk reads do we need to access a particular block?
CPU
How many disk reads to access the first data block?
(We always start with the block# of the file header)
How many disk reads do we need to access a particular block?
Let's say I want to read only the third block.
This method is very simple (this is good!)
How fast is sequential access?
How fast is random access?
What if we want to grow a file?
How bad is fragmentation?
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
How fast is sequential access? Is it always good?
How bad is fragmentation?
What if we want to grow the file?
How fast is random access?
What happens if a disk block becomes corrupted? (what sort of problem is this?)
File Allocation Table (FAT)
Started with MS-DOS (Microsoft, late 70s)
Descendants include FATX and exFAT
Simple!
File header points to each data block directly (that's it!)
How fast is sequential access? How about random access?
How bad is fragmentation?
What if we want to grow the file?
Does this support small files? How about large files?
What if other file metadata takes up most of the space in the header?
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
IB
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
IB
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
IB
CPU
How many disk reads to access the first block?
(We always start with the block# of the file header)
IB
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
IB
CPU
How many disk reads to access the third block?
(We always start with the block# of the file header)
IB
How fast is sequential access? How about random access?
How bad is fragmentation?
What if we want to grow the file?
Does this support small files? How about large files?
Pointer Type | # Ref Blocks | Total Size At Level |
---|---|---|
Direct | 10 * 1 = 10 | 40 KB |
Single Indirect | 1 * 512 = 512 | 2 MB |
Double Indirect | 512 * 512 = 2^18 | 1 GB |
Triple Indirect | 512*512*512 = 2^24 | 512 GB |
Total | ~513 GB |
FFS: assuming 4KB blocks, 8-byte pointers = 512 pointers/block
YOU ARE HERE
YOU ARE HERE
Or: All About Directories
We know how to get the data associated with a file if we know where its metadata (file header) is. We also know how to identify file headers (by their index in the file header array).
To edit your shell configuration, open file 229601, unless you have Microsoft Word installed, in which case you need to edit file 92135113
To edit your shell configuration, open file 229601, unless you have Microsoft Word installed, in which case you need to edit file 92135113
Use one name space for the entire disk.
File Name | inode number |
---|---|
.user1_bashrc | 27 |
.user2_bashrc | 30 |
firefox | 3392 |
.bob_bashrc | 7 |
(Yeah, it's not that great of an improvement)
File Name | inode number |
---|---|
.bashrc |
30 |
Documents | 173 |
File Name | inode number |
---|---|
.bashrc | 391 |
failed_projects | 8930 |
zsh |
3392 |
Note: the i# in a directory entry may refer to another directory!
The OS keeps a special bit in the inode to determine if the file is a directory or a normal file.
There is a special root directory (usually inumber 0, 1, or 2).
i# | Filename |
---|---|
3226 | .bashrc |
251 | Documents |
7193 | pintos |
2086 | todo.txt |
1793 | Pictures |
2B
Example directory with 16B entries
14B
To find the data blocks of a file, we need to know where its inode (file header) is.
To find an inode (file header), we need to know its inumber.
To find a file's inumber, read the directory that contains the file.
The directory is just a file, so we need to find its data blocks.
We can break the loop here by agreeing on a fixed inumber for a special directory.
It should be possible to reach every other file in the filesystem from this directory.
On most UNIX systems, the root directory is inumber 2
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
struct open_file {
struct file_header* metadata;
file_offset pos;
int file_mode; //e.g. "r" or "rw"
};
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
CPU
But we do have....what?
CPU
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
CPU
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
CPU
2713 | tmp |
2011 | bin |
3301 | usr |
99 | etc |
11 | home |
426 | var |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 1214
CPU
2713 | tmp |
2011 | bin |
3301 | usr |
99 | etc |
11 | home |
426 | var |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 1214
CPU
2713 | tmp |
2011 | bin |
3301 | usr |
99 | etc |
11 | home |
426 | var |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 1214
CPU
2713 | tmp |
2011 | bin |
3301 | usr |
99 | etc |
11 | home |
426 | var |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 1214
CPU
6 | user1 |
394 | user2 |
2201 | admin |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 2772
CPU
6 | user1 |
394 | user2 |
2201 | admin |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 2772
CPU
6 | user1 |
394 | user2 |
2201 | admin |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 2772
CPU
6 | user1 |
394 | user2 |
2201 | admin |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 2772
CPU
273 | Documents |
94 | .ssh |
2201 | .bash_profile |
4 | .bashrc |
61 | .vimrc |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 537
CPU
273 | Documents |
94 | .ssh |
2201 | .bash_profile |
23 | .bashrc |
61 | .vimrc |
int config_fd = open("/home/user1/.bashrc", O_RDONLY);
B 537
We didn't even try to read anything out of the file--that was just an open() call!
Maintain the notion of a per-process current working directory.
Users can specify files relative to the CWD
We can't avoid this disk access...
OS caches the data blocks of CWD in the disk cache
Store CWD data block here so that we don't have to go to disk 6x to get it.
Without persistent storage, computers are very annoying to use.
Persistent storage requires a different approach to organizing and storing data, due to differences in its behavior (speed, resilience, request ordering). This leads naturally to the idea of a file system.
When designing filesystems, we care about three properties:
We should use these three properties to guide our design choices.
Use of the filesystem involves the filesystem API, in-memory bookkeeping structures, and the structure of data on disk. All three need to be considered when designing a filesystem.
File headers describe how file data can be found. Part of this is the choice for file layout. Different layouts give different tradeoffs in speed, extensibility, etc.
Finally, the filesystem gives users an option for organizing files using directories. Directories are just files containing mappings from names to other files. Traversing directories can be expensive.
The filesystem exposes various syscalls for applications to work with files, e.g. create(), open(), read(). These syscalls manipulate both the state of bits on the disk and some in-memory data structures (open file table, file records).