Version: 3

System programming

Education

Lecture 3:

Memory. Virtual and physical. Cache levels, cache line. User space and kernel space memory. False sharing.

Lecture plan

  • CPU registers
  • CPU cache
    • Types
    • False sharing
    • Attacking
  • Main memory (RAM)
  • Virtual memory
    • Mapping to physical memory
    • Hardware part
    • Kernel part
  • On top of virtual memory
  • Homework

Process

.text

.data

.stack

.heap

.stack

.stack

File descriptors

Signal queue

IPC

Memory

Memory levels

Registers

Cache L1

Cache LN

Main memory

Flash memory

Magnetic memory

Speed

Volume

Registers [1]

  • The fastest memory
  • Access speed - one processor tick
  • No addresses = no virtualization = direct access = speed
  • Expensive - occupies physical space on processor, intricate microcircuit
  • Fixed in size, hardwired on processor

Registers [2]

Registers [3]

Trigger to save a value

Value to save

Saved value

NAND - Not And

Static Random Access Memory - SRAM

pc

mar

ir

asid

eax

ebx

ecx

edx

esi

...

<=128 bits

Cache [1]

Cache [2]

Temporal locality

Spatial locality

for (int i = 0; i < count; ++i)
{
    /* ... cache 'i' */
}
char buffer[128];
/* ... cache buffer */
read(buffer, 0, 64);
/* ... */
read(buffer, 64, 128);
/* ... */
read(buffer, 32, 96);
/* ... */

Cache [3]

Cache   Size          Access speed
L1      tens of KB    ~1 tick, < 1 nanosecond
L2      a few MB      tens of ticks, a few nanoseconds
L3      tens of MB    hundreds of ticks, tens of nanoseconds

Cache [4]

Why is there such a big difference in access speed between cache levels?

A bigger cache takes longer to search. A cache farther from the processor core takes the electric signal notably longer to reach.

1 point

Cache [5]

1 - in a big cache search is longer, many comparisons

c = 299\,792\,458\ m/s;\\ v = 3\ GHz;\\ 1\ tick = 1/3\ ns;\\ 1\ tick \cdot c \approx 10\ cm

2 - going to a place in the cache and back really costs time

Cache [6]

Main memory

Bus

Cache L3

Cache L2

Cache L1

Processor core

Processor chip

Cache inclusion policies

Inclusive

L3

L2

L1

Exclusive

L3

L2

L1

NINE

L3

L2

L1

Inclusive cache

L1

L2

Х

A processor reads Х ...

Read from L1

L1

L2

Х

Copy into L1

Х

L1

L2

Х

Load into L1 and L2

Х

Y

Y

Read from L1

Fix inclusion

Read from L1

Y

Х

Exclusive cache [1]

L1

L2

Х

Read from L1

L1

L2

Х

Move to L1

Х

L1

L2

Х

Load into L1

Read from L1

Read from L1

A processor reads Х ...

Exclusive cache [2]

How are the L2 and L3 levels of an exclusive cache filled?

With lines evicted from the previous level

L1

L2

Х

Evict old into L2, load new into L1

Read from L1

Y

Y

1 point

NINE cache

Not Inclusive Not Exclusive

L1

L2

Х

Read from L1

L1

L2

Х

Load into L1 and L2

Read from L1

L1

L2

Х

Copy into L1

Х

Read from L1

Y

Y

X

Same data in L1 and L2 - not exclusive

No eviction from L1 to satisfy inclusion with L2 - not inclusive

Y

Cache write policies

L3

L2

L1

Main memory

Write through

Need to wait until the write is done

L3

L2

L1

Main memory

Write back

Set dirty bit

Need to sync caches between processor cores

Cache schema [1]

64 B

64 B

64 B

64 B

64 B

...

addr1

addr2

addr3

addr4

addrN

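char *
read_from_cache(unsigned long addr)
{
    /* Pseudocode for what the hardware does, 64-bit addresses. */
    unsigned long offset = addr & 63;
    /* Clear the low 6 bits to get the address of the line start. */
    unsigned long cache_addr = addr & ~63UL;
    /* lookup_line() stands for the hardware lookup of a 64-byte line. */
    char *line = lookup_line(cache_addr);
    return line + offset;
}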

Address mapping policy

Storage and reading in blocks

Cache schema [2]

64 bytes

Cache line:

Address layout:

tag

tag

offset

6 bits

policy helper

  • Fully-associative cache
  • Direct mapping cache
  • N-associative cache

user data

Fully-associative cache

Addresses

Cache

...

Address

tag

offset

6 bits

58 bits

Hardware schema

Direct mapping cache

Addresses

Cache

Address

tag

offset

6 bits

26 bits

Hardware schema

index

32 bits

N-associative cache [1]

Addresses

Cache

Address

tag

offset

6 bits

Hardware schema

index

...

tag: 58 - \log_{2}{r} bits

index: \log_{2}{r} bits

r - cache line count
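
The address layout above can be sketched in C. This is only an illustration, not how any particular processor does it: `split_address`, `struct cache_addr` and the `index_bits` parameter are hypothetical names, `index_bits` plays the role of log2(r) from the slide, and 64-byte lines with 64-bit addresses are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, so the offset takes 6 bits. */
enum { LINE_SIZE = 64, OFFSET_BITS = 6 };

struct cache_addr {
    uint64_t tag;
    uint64_t index;
    uint64_t offset;
};

/*
 * Split a 64-bit address into tag, index and offset.
 * index_bits = 0 gives the fully-associative layout (58-bit tag).
 */
struct cache_addr
split_address(uint64_t addr, unsigned index_bits)
{
    assert(index_bits < 58);
    struct cache_addr res;
    /* Low 6 bits: position of the byte inside the line. */
    res.offset = addr & (LINE_SIZE - 1);
    /* Next index_bits bits: which slot (or set) to look into. */
    res.index = (addr >> OFFSET_BITS) & ((UINT64_C(1) << index_bits) - 1);
    /* Everything else is stored in the line and compared on lookup. */
    res.tag = addr >> (OFFSET_BITS + index_bits);
    return res;
}
```

Two addresses with the same index but different tags compete for the same place in the cache, which is why the direct mapping scheme suffers from conflicts more than the associative ones.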

N-associative cache [2]

Gap declines and almost disappears

Instruction cache [1]

<opcode> := <argcount> <argtypes> <indirect_args>
            <ret_type> <target_dev>

Argument count

Argument types

Implicit arguments

Return type

Where to execute?

Instruction cache [2]

CPU Pipeline:

  1. Instruction fetch
  2. Instruction decode
  3. Execute
  4. Memory access
  5. Register write back

Multiprocessor cache [1]

Shared cache?

  • need to move out from chip
  • contention for access

Separate cache?

  • need to sync access - provide cache coherency

Multiprocessor cache [2]

  • Modified - the owning processor has modified the line, and it is not loaded into any other cache;
  • Exclusive - the line is not modified and is not loaded into any other cache;
  • Shared - the line is not modified and may be loaded into other caches;
  • Invalid - the line is free, it holds no valid data.
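
These four states form the MESI protocol. A simplified sketch of how one cache line changes state in one core's cache is below. It ignores bus messages and write-backs, and the names `mesi_next`, `mesi_event` and `others_have_copy` are made up for the illustration.

```c
#include <assert.h>

enum mesi_state { MODIFIED, EXCLUSIVE, SHARED, INVALID };

enum mesi_event {
    LOCAL_READ,   /* this core reads the line */
    LOCAL_WRITE,  /* this core writes the line */
    REMOTE_READ,  /* another core reads the line */
    REMOTE_WRITE, /* another core writes the line */
};

/*
 * Next state of a line in this core's cache. others_have_copy
 * matters only when an invalid line is loaded for reading.
 */
enum mesi_state
mesi_next(enum mesi_state state, enum mesi_event event, int others_have_copy)
{
    switch (event) {
    case LOCAL_READ:
        if (state != INVALID)
            return state;
        return others_have_copy ? SHARED : EXCLUSIVE;
    case LOCAL_WRITE:
        /* Copies in the other caches get invalidated. */
        return MODIFIED;
    case REMOTE_READ:
        /* A modified line is written back first, then shared. */
        return state == INVALID ? INVALID : SHARED;
    case REMOTE_WRITE:
        return INVALID;
    }
    assert(0 && "unknown event");
    return INVALID;
}
```

The REMOTE_WRITE transition is the costly one for false sharing: a write by one core invalidates the whole line in every other cache, even if the other cores use different bytes of it.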

False sharing [1]

struct my_object {
    int32_t a;
    int32_t b;
    int32_t c;
    int32_t d;
};

struct my_object object;

void
thread1_func()
{
    while (true) {
        do_something(object.a);
        do_something(object.b);
    }
}

void
thread2_func()
{
    while (true) {
        write_into(&object.c);
        write_into(&object.d);
    }
}

a, b

c, d

Cache line

8 bytes

8 bytes

48 bytes

Reading thread

Writing thread, invalidates the cache

False sharing [2]

struct my_object {
    int32_t a;
    int32_t b;
    char padding[56];
    int32_t c;
    int32_t d;
};

a, b

padding

Cache lines

8 bytes

56 bytes

Reading thread

Writing thread

c, d

8 bytes

56 bytes

Since C++17 you can use std::hardware_destructive_interference_size instead of a hardcoded cache line size.
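
In plain C11 a similar effect can be reached with alignment specifiers instead of manual padding. This is a sketch under the assumption of 64-byte cache lines; `CACHE_LINE_SIZE` and `struct my_object_aligned` are names invented for the example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE_SIZE 64

struct my_object_aligned {
    /* Read by one thread. */
    alignas(CACHE_LINE_SIZE) int32_t a;
    int32_t b;
    /* Written by another thread, starts a new cache line. */
    alignas(CACHE_LINE_SIZE) int32_t c;
    int32_t d;
};

/* The compiler inserts the padding itself; check it at compile time. */
static_assert(offsetof(struct my_object_aligned, c) % CACHE_LINE_SIZE == 0,
              "c must start a new cache line");
```

The alignment holds for static and automatic objects. For heap objects plain malloc() does not guarantee 64-byte alignment, so something like aligned_alloc() would be needed.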

Meltdown [1]

Now here

Execution in advance, speculation, branch prediction

Privileges check after speculation

Meltdown [2]

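#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf jmp;

void
process_signal(int code)
{
	/* printf() is not async-signal-safe; tolerable in a demo only. */
	printf("Process SIGSEGV, code = %d, "
	       "SIGSEGV = %d\n", code, SIGSEGV);
	/* Jump back to sigsetjmp() and restore the signal mask. */
	siglongjmp(jmp, 1);
}

int
main(void)
{
	volatile char *p = NULL;
	signal(SIGSEGV, process_signal);
	printf("Before SIGSEGV\n");
	/* Returns 0 on the direct call, 1 after siglongjmp(). */
	if (sigsetjmp(jmp, 1) == 0)
		*p = 100;
	printf("After SIGSEGV\n");
	return 0;
}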

It is not hard to get a segfault - just access invalid memory

vladislav$> gcc 2_catch_sigsegv.c

vladislav$> ./a.out
Before SIGSEGV
Process SIGSEGV, code = 11, SIGSEGV = 11
After SIGSEGV

vladislav$>

Set a handler on SIGSEGV, which jumps past the faulting code instead of crashing

Meltdown [3]

char userspace_array[256 * 4096];
char kernel_byte_value;

char
get_kernel_byte(const char *kernel_addr)
{
    clear_cache_for(userspace_array);
    register_exception_handler(process_exception);

    unsigned char index = *kernel_addr;
    /* Next code is for speculation. */
    char unused = userspace_array[index * 4096];
    /* Next code is after exception handler. */
    return kernel_byte_value;
}


void
process_exception()
{
    uint min_time = UINT_MAX;
    for (int i = 0; i < 256; ++i) {
        uint start = time();
        char unused = userspace_array[i * 4096];
        uint duration = time() - start;
        if (duration < min_time) {
            min_time = duration;
            kernel_byte_value = i;
        }
    }
}

Location to read kernel memory into

Flush the caches, because the attack needs the array to be uncached

Read a forbidden address into a variable, which then is used to read another memory location

The kernel throws a segfault, but the needed element is already loaded into a cache. Its index is a value from the kernel memory

PROFIT

Cache profit

AAT = hit\_time + miss\_rate * miss\_penalty

Average Access Time

AAT = hit_{cache} + miss_{cache} * ram\_time

Assume these pessimistic values:

ram\_time = 100ns\\ miss_{L1\_cache} = 10\%\\ hit_{L1\_cache} = 1ns\\ miss_{L2\_cache} = 5\%\\ hit_{L2\_cache} = 10ns\\ miss_{L3\_cache} = 1\%\\ hit_{L3\_cache} = 50ns

Result average access time:

no cache: AAT = 100ns\\ cache L1: AAT = 1ns + 10\% * 100ns = 11ns\\ cache L2: AAT = 1ns + 10\% * (10ns + 5\% * 100ns) = 2.5ns\\ cache L3: AAT = 1ns + 10\% * (10ns + 5\% * (50ns + 1\% * 100ns)) = 2.255ns

x44 faster than with no cache
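
The nested formula can be checked with a few lines of C. This is a sketch: `struct cache_level` and `average_access_time` are hypothetical names, and the numbers plugged in are the pessimistic values assumed above.

```c
#include <assert.h>
#include <stddef.h>

struct cache_level {
    double hit_time_ns;
    /* Fraction of accesses that miss this level, in [0, 1]. */
    double miss_rate;
};

/*
 * AAT = hit_time + miss_rate * miss_penalty, where the penalty
 * of a level is the AAT of everything below it. levels[0] is L1.
 */
double
average_access_time(const struct cache_level *levels, size_t count,
                    double ram_time_ns)
{
    double penalty = ram_time_ns;
    /* Fold from the last level up to L1. */
    for (size_t i = count; i > 0; --i) {
        const struct cache_level *l = &levels[i - 1];
        assert(l->miss_rate >= 0 && l->miss_rate <= 1);
        penalty = l->hit_time_ns + l->miss_rate * penalty;
    }
    return penalty;
}
```

Calling it with `{{1, 0.10}, {10, 0.05}, {50, 0.01}}` and a RAM time of 100 walks through the same steps as the slide: 51 ns below L2, 12.55 ns below L1, and the final average on top.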

Memory access cost

Latency Comparison Numbers (~2012)
+ "World constants 2022" from Andrey Aksenov
----------------------------------
L1 cache reference                         0.3 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns
Main memory reference                       25 ns
Mutex lock/unlock                          100 ns
Compress 1K bytes with Zippy             3,000 ns
Send 1K bytes over 1 Gbps network       10,000 ns
Read 1 MB sequentially from memory      66,000 ns
Read 4K randomly from SSD*             150,000 ns
Read 1 MB sequentially from SSD*       333,000 ns
Round trip within same datacenter      500,000 ns
Disk seek                           10,000,000 ns
Read 1 MB sequentially from disk    20,000,000 ns
Send packet CA->Netherlands->CA    150,000,000 ns

Help to cache

#include <stdlib.h>

struct complex_struct {
	int id;
	double a;
	long d;
	char buf[10];
	char *long_buf;
};

struct complex_struct *
complex_struct_bad_new(int long_buf_len)
{
	struct complex_struct *ret =
		(struct complex_struct *) malloc(sizeof(*ret));
	ret->long_buf = (char *) malloc(long_buf_len);
	return ret;
}

struct complex_struct *
complex_struct_good_new(int long_buf_len)
{
	struct complex_struct *ret;
	int size = sizeof(*ret) + long_buf_len;
	ret = (struct complex_struct *) malloc(size);
	ret->long_buf = (char *) ret + sizeof(*ret);
	return ret;
}

int main()
{
	return 0;
}

This is bad: the structure and its buffer are allocated separately and can end up far apart in memory

This is good: the structure and its buffer are sequential in memory and can share a cache line
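
The same single-allocation layout can also be expressed with a C99 flexible array member, which removes the pointer from the structure altogether. This is an alternative sketch, not the code from the slide; `struct complex_struct_fam` and `complex_struct_fam_new` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct complex_struct_fam {
    int id;
    double a;
    long d;
    char buf[10];
    size_t long_buf_len;
    /* Flexible array member: lives right after the fixed fields. */
    char long_buf[];
};

struct complex_struct_fam *
complex_struct_fam_new(size_t long_buf_len)
{
    struct complex_struct_fam *ret;
    /* Guard against overflow of the total size. */
    assert(long_buf_len <= SIZE_MAX - sizeof(*ret));
    /* One malloc() for the header and the buffer together. */
    ret = malloc(sizeof(*ret) + long_buf_len);
    if (ret == NULL)
        return NULL;
    memset(ret, 0, sizeof(*ret));
    ret->long_buf_len = long_buf_len;
    return ret;
}
```

One free() releases everything, and there is no pointer to chase before touching the buffer.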

Main memory [1]

Dynamic Random Access Memory - DRAM

DRAM bit

SRAM bit

  • capacitance - femtofarads (10^-15 F)
  • resistance - teraohms (10^12 Ohm)
  • needs a refresh once per 50-100 ms

Main memory [2]

Main memory - main problem. It is slow

What next?

Disks

External devices, transfer data through main memory, accessed via kernel always

Next is virtual memory

Virtual memory [1]

Memory Management Unit - a chip to translate virtual addresses into physical ones

Virtual memory [2]

Physical and virtual memory consist of pages of a fixed size (like 4-8 KB)

virt_page

offset

log_{2}{(page\_count)}

MMU:

virtual\_page \Rightarrow physical\_page

Virtual address

// Legacy BSD call, dropped from POSIX.
size = getpagesize();
// Portable.
size = sysconf(_SC_PAGESIZE);
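
With the page size known, any pointer can be split into a virtual page number and an offset inside the page. A small sketch, where `struct page_addr` and `split_by_page` are names made up for the example:

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

struct page_addr {
    /* Virtual page number. */
    uintptr_t page_number;
    /* Offset of the byte inside its page. */
    uintptr_t offset;
};

struct page_addr
split_by_page(const void *ptr)
{
    long page_size = sysconf(_SC_PAGESIZE);
    /* The bit trick below relies on a power-of-two page size. */
    assert(page_size > 0 && (page_size & (page_size - 1)) == 0);
    uintptr_t addr = (uintptr_t)ptr;
    struct page_addr res;
    res.page_number = addr / (uintptr_t)page_size;
    res.offset = addr & ((uintptr_t)page_size - 1);
    return res;
}
```

Only the page number goes through the translation, the offset is copied into the physical address unchanged.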

Virtual memory [3]

1.

2.

3.

4.

5.

6.

7.

123

54

68

90

230

170

13

Physical page number

Array index - virtual page number

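void *
translate(void *virt)
{
    /* Pseudocode for the MMU; page_table and offset_bits are given. */
    uintptr_t addr = (uintptr_t)virt;
    uintptr_t virt_page = addr >> offset_bits;
    uintptr_t phys_page = page_table[virt_page];
    uintptr_t offset = addr & (((uintptr_t)1 << offset_bits) - 1);
    return (void *)((phys_page << offset_bits) | offset);
}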

One huge hardware table? - too expensive

Everything in main memory? - too slow

Page table

Virtual memory [4]

How to solve the speed problem of the MMU and page table lookup?

Don't know what to do? - add a cache

1 point

Virtual memory [5]

1.

2.

3.

4.

5.

6.

7.

123

54

68

90

230

170

13

Page table

187.

232.

34.

48.

519.

94.

58

123

54

68

90

230

170

13

TLB

Translation Lookaside Buffer

std::vector<unsigned>
std::map<unsigned, unsigned>

Hardware, < 100 records. It is a core part of the MMU.

Software, thousands and millions of records

The processor loads needed memory in advance, before it is accessed. It can be helped with __builtin_prefetch

Content Addressable Memory - CAM

Virtual memory [6]

Page global directory

Page middle directory

Page table entry

L1_idx

Virtual address

L2_idx

offset

L3_idx

Virtual memory [7]

How to share TLB between processes?

Address Space Identifier, ASID. It is an implicit prefix of each address in the TLB, and each process has its own ASID.
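
A toy software model shows why the ASID lets several processes keep their records in one TLB without flushing it on a context switch. Everything here is an assumption made for illustration: a real TLB is content-addressable hardware, while this sketch uses a small direct-mapped array, and `struct tlb`, `tlb_insert` and `tlb_lookup` are invented names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TLB_SIZE = 64 };

struct tlb_entry {
    bool valid;
    uint16_t asid;
    uint64_t virt_page;
    uint64_t phys_page;
};

struct tlb {
    struct tlb_entry entries[TLB_SIZE];
};

/* Mix the ASID into the slot so equal pages of 2 processes differ. */
static size_t
tlb_slot(uint16_t asid, uint64_t virt_page)
{
    return (size_t)((virt_page ^ asid) % TLB_SIZE);
}

void
tlb_insert(struct tlb *tlb, uint16_t asid, uint64_t virt_page,
           uint64_t phys_page)
{
    struct tlb_entry *e = &tlb->entries[tlb_slot(asid, virt_page)];
    e->valid = true;
    e->asid = asid;
    e->virt_page = virt_page;
    e->phys_page = phys_page;
}

/* A hit requires both the page and the ASID to match. */
bool
tlb_lookup(const struct tlb *tlb, uint16_t asid, uint64_t virt_page,
           uint64_t *phys_page)
{
    assert(phys_page != NULL);
    const struct tlb_entry *e = &tlb->entries[tlb_slot(asid, virt_page)];
    if (!e->valid || e->asid != asid || e->virt_page != virt_page)
        return false;
    *phys_page = e->phys_page;
    return true;
}
```

The same virtual page number inserted under two ASIDs yields two independent records, and a lookup under a third ASID misses.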

1 point

Virtual memory [8] in the kernel

/**
 * 16.10.2018
 * 138 lines.
 */
struct page {
	unsigned long flags;
	void *virtual;
	struct list_head lru;
};

Page tables are maintained by the kernel

Virtual memory [9]

0x0

0xffffffff

.text

.data

.bss

.heap

.stack

.env

0xc0000000

.kernel

Virtual memory [10]

KASLR - Kernel Address Space Layout Randomization

KAISER - Kernel Address Isolation to have Side-channels Efficiently Removed

Makes bugs in the kernel code harder to exploit

Protects against attacks like Meltdown, which leak information through a side channel

Virtual memory [11]

#define __builtin_prefetch
#define __builtin_expect

int
madvise(void *addr, size_t len, int advice);

Cache and physical memory concepts are hidden from user, except for some small bits of info

But virtual memory management is fully accessible in user space

void *
mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);

int
brk(void *addr);

void *
alloca(size_t size);

void *
malloc(size_t size);
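
As a minimal illustration of managing virtual memory from user space, the sketch below asks the kernel for anonymous pages with mmap() and returns them with munmap(). The wrappers `pages_alloc` and `pages_free` are hypothetical helpers, not a standard API.

```c
/* Makes MAP_ANONYMOUS visible under strict -std= modes on glibc. */
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of zero-filled memory not backed by any file. */
void *
pages_alloc(size_t len)
{
    assert(len > 0);
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return mem == MAP_FAILED ? NULL : mem;
}

/* Unmap a region obtained from pages_alloc(). Returns 0 on success. */
int
pages_free(void *mem, size_t len)
{
    return munmap(mem, len);
}
```

The kernel rounds the length up to whole pages, and physical pages are usually attached lazily, on the first access to each page.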

Malloc

32MB

32MB

16MB

16MB

16MB

16MB

8MB

8MB

8MB

8MB

8MB

8MB

8MB

8MB

malloc(size);

Rounds up to the nearest block size, fills headers, returns
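
The rounding step could look like the sketch below. It assumes power-of-two block sizes, as in the diagram, and a made-up minimal block of 16 bytes; real allocators choose their size classes differently. `block_size_for` and `MIN_BLOCK_SIZE` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the smallest block the allocator hands out. */
enum { MIN_BLOCK_SIZE = 16 };

/* Smallest power-of-two block that fits the requested size. */
size_t
block_size_for(size_t size)
{
    /* Keeps the doubling below from overflowing. */
    assert(size <= SIZE_MAX / 2);
    size_t block = MIN_BLOCK_SIZE;
    while (block < size)
        block *= 2;
    return block;
}
```

The price of this simplicity is internal fragmentation: a request one byte above 8 MB gets a 16 MB block.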

Alternatives to malloc

  • jemalloc
  • tcmalloc

Summary

Memory has many layers - CPU registers, CPU cache, main memory (RAM), virtual addresses, abstractions like the heap

Virtual addresses are important. All processes see the same address space, but have different mappings to the physical memory. The same virtual address in 2 processes almost always points at 2 different physical bytes.

The kernel manages memory in pages, both physical and virtual. For each process it stores a map like {virtual page address -> physical page address}. A page address points at the first byte of the page.

Hardware translates the addresses on each memory access. A special device, the TLB, caches a subset of the page mapping. Sometimes it falls back to the kernel for pages missing in the cache.

Heap, stack, mmap also operate on virtual addresses and whole pages. They are just abstractions on top of the virtual address space.

Processes, memory. Practice

Shell

Write a simplified version of a command line console. It should read lines like this:

 

    > cmd1 arg arg | cmd2 arg | cmd3 arg arg arg ...

 

and execute them, just like a normal console, like 'bash'. Use pipe() + dup() + fork() + exec().

Points: 15 - 25.

Deadline: 2 weeks.

Penalty: -1 for each day after deadline, max -10

Publish your solution on Github and give me the link. Assessment: any way you want - messengers, calls, emails.

Conclusion

Next time:

Signals. Hardware and programmatic interrupts, their nature. Top and bottom halves. Signals and system calls, signal context.



System programming 3

By Vladislav Shpilevoy


Virtual and physical memory. Cache, cache line, cache levels, cache coherence, false sharing. High and low memory. Page tables. User space and kernel space memory, layout. Functions brk, madvise, mmap. Malloc and alternative libraries.
