ENPM809V

Linux Kernel Internals - Part 1

Some Resources to Look At

  • Bootlin Elixir - Contains the source code
    • Will be referenced in the slides
  • Userspace Documentation 
    • Some concepts are very similar (especially during synchronization)

What we will be learning

  • Linux Kernel Fundamentals
  • Linux Kernel Modules
  • System Calls in the Kernel
  • Interrupt Handling
  • Kernel Threads

Linux Kernel Fundamentals

What is the Kernel?

  • Code in the operating system that interfaces between hardware and higher-level applications.
  • The Linux kernel is a free and open-source operating system kernel used by Linux distributions
    • Modular, monolithic, multitasking, Unix-Like

[Diagram: several Applications sit on top of the System Call Interface/Interrupt Handling layer, which sits on the Kernel Subsystem and the Device Drivers]

x86 Protection Rings

  • A protection mechanism in x86_64 CPUs that prevents unauthorized access to the kernel.
  • 4 protection rings (0 through 3), but in practice mostly levels 0 and 3 are used
    • Level 0 = Kernel and Drivers
    • Level 3 = Applications

x86 Protection Rings

  • At ring 3, the CPU can
    • Use most x86 instructions
    • Access unprivileged memory
  • At ring 0, the CPU can
    • Do almost everything that ring 3 can do
    • Access Privileged memory
    • Use Special instructions

Switching Protection Rings

  • Userspace programs can ask the kernel to execute something through a few vectors:
    • System calls - invoked directly by userspace applications
    • Interrupts - triggered indirectly by instructions that cause exceptional conditions

System Calls

  • A userspace program executes the syscall instruction
  • How does this happen?
    1. The address of the instruction following the syscall is placed into RCX (RFLAGS is saved in R11)
    2. RIP is set to the address of the kernel's system call handler
      • Provided by the OS at boot time
      • Stored in the IA32_LSTAR model-specific register on x86_64 machines
    3. Ring level (CPL) is set to 0
  • After the kernel finishes, the sysret instruction sets RIP to whatever is in RCX and transfers back to ring 3
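
A minimal userspace sketch of crossing into the kernel, assuming a Linux system with glibc. The libc `syscall()` wrapper loads the system call number into a register and executes the architecture's system call instruction (`syscall` on x86_64). The helper name `raw_getpid` is hypothetical and only used for illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid */
#include <unistd.h>        /* syscall(), getpid() */

/* Hypothetical helper: ask the kernel for our PID through the generic
 * syscall() entry point instead of the dedicated getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Calling `raw_getpid()` should return the same value as `getpid()`, since both end up in the same kernel handler.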

Privileged Instructions

  • Ring 0 Code has access to privileged instructions
    • Controls how the system reacts to interrupts/exceptions
      • LIDT - Load Interrupt Descriptor Table Register
      • LLDT - Load Local Descriptor Table Register
      • LGDT - Load Global Descriptor Table Register
      • LTR - Load Task Register
    • Reading/writing model-specific registers (MSRs)
      • RDMSR, WRMSR
    • Virtual machine opcodes
      • VMCALL, VMLAUNCH, VMRESUME, VMXON, VMXOFF
    • Others too...

Kernel Data Structures

Many Many Structures

  • Structures hold the majority of kernel data
    • Tasks
    • Kthreads 
    • Audit 
    • Files

Many Many Structures

  • Tend to be generalized so that they can be applied anywhere without sacrificing performance
    • Linked lists - /include/linux/list.h
    • Queues - /include/linux/kfifo.h
    • Hash maps - /include/linux/hashtable.h
    • Radix trees - /include/linux/generic-radix-tree.h
    • RB trees - /include/linux/rbtree.h

Slightly Different Than Traditional Datastructures

[Diagram: a traditional doubly linked list of three nodes; each node stores its DATA together with its Prev and Next pointers]

Slightly Different Than Traditional Datastructures

[Diagram: two list nodes, each with DATA, Prev, and Next fields]

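/* The list node carries only the links, no payload. */
struct list_head
{
    struct list_head *prev;
    struct list_head *next;
};

/* The payload structure embeds a list node by value. */
struct some_other_struct
{
    char *data1;
    int data2;
    struct list_head list;
};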

https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch10s05.html
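
A hedged userspace sketch of the intrusive list idea, assuming the list node is embedded by value in the payload structure, as the kernel's list.h does. The `demo_` functions and `struct item` are hypothetical stand-ins, not the kernel API. The point is that the list code only touches the links and never needs to know the payload type.

```c
#include <stddef.h>

/* Same shape as the kernel's struct list_head: links only, no payload. */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical payload structure with the list node embedded by value. */
struct item {
    int value;
    struct list_head node;
};

/* An empty circular list is a head that points to itself. */
void demo_list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert new_node just before head, i.e. at the tail of the list. */
void demo_list_add_tail(struct list_head *new_node, struct list_head *head)
{
    new_node->prev = head->prev;
    new_node->next = head;
    head->prev->next = new_node;
    head->prev = new_node;
}

/* Walk the links without knowing anything about the payload type. */
size_t demo_list_length(const struct list_head *head)
{
    size_t n = 0;
    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

Usage: initialize a standalone `struct list_head` as the list head, then pass `&some_item.node` to `demo_list_add_tail`.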

Embedding Structures

struct example_struct
{
    struct example_struct *prev;
    struct example_struct *next;
};

struct some_other_struct
{
    char *data1;
    int data2;
    struct example_struct *head;
};
  • Embedding structures is quite common in the Linux Kernel
    • task  -> file
    • task -> audit
    • task -> another task
  • Structures can also be randomized
    • Security Feature __randomize_layout

Structure Randomization

struct example_struct
{
    struct example_struct *prev;
    struct example_struct *next;
};

struct some_other_struct
{
    char *data1;
    int data2;
    struct example_struct *head;
} __randomize_layout;
  • Many structures are randomized at compile time
    • Difficult to attack based on offset
    • This is where macros come in

offsetof()

  • Finds the byte offset of a member within a structure type
  • Part of standard C (declared in <stddef.h>); a classic definition looks like this:
#define offsetof(a,b) ((size_t)(&(((a*)(0))->b)))

container_of

#define container_of(ptr, type, member) ({ \
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    (type *)( (char *)__mptr - offsetof(type,member) ); })
  • Kernel macro that recovers a pointer to the parent (containing) structure
  • Takes a pointer to the member (child) structure
    • Subtracts the member's offset within the parent type from the member's address
    • End result = address of the parent
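
A small userspace sketch of the pointer arithmetic, assuming a simplified macro without the GNU `typeof` type check (dropped here for portability). `demo_container_of`, `struct item`, and `item_from_node` are hypothetical names for illustration.

```c
#include <stddef.h>   /* offsetof */

/* Simplified container_of: member address minus member offset. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Hypothetical parent structure with an embedded list node. */
struct item {
    char *name;
    int value;
    struct list_head node;
};

/* Given only a pointer to the embedded node, recover the enclosing item. */
struct item *item_from_node(struct list_head *node)
{
    return demo_container_of(node, struct item, node);
}
```

Because the offset is computed by the compiler for the actual layout, this keeps working even if the members are reordered.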

task_struct

  • The task_struct is used to manage tasks
    • A task is the kernel's way of managing processes/execution context
  • Contains MANY data fields covering memory, CPU usage, and other data structures (such as the audit_context)
  • Also contains information like UID and EUID
  • Access the task_struct of the currently running process through the macro current
  • Located in /include/linux/sched.h

Scheduler Classes

What is it?

  • A way for the Linux Kernel to manage task execution
  • Modular - allows different scheduling algorithms to coexist
  • Each scheduler class runs a different type of process/task
  • Base implementation: /kernel/sched/core.c
  • Tracked in the task_struct: the sched_class pointer selects the class, and the embedded sched_entity (se field) holds the per-task scheduling state

Completely Fair Scheduler

  • Responsible for scheduling processes of normal priority
  • Provides processes with a proportion of CPU time
  • Aims to maximize overall CPU utilization
  • Implemented based on per-CPU run queues
    • Nodes are ordered in a time-based manner 
    • Kept sorted by red-black trees

Red-Black Trees in Completely Fair Scheduler

  • Objective: Keep track of how long a process has been running (part of the completely-fair algorithm)
    • Red-black trees are self-balancing binary search trees (approximately, not perfectly, balanced)
    • Run time is tracked in nanoseconds by the vruntime field
  • How it is performed
    • Insert tasks into the tree based on vruntime
    • Pick the one with the smallest vruntime
    • During context switching, update the vruntime (increasing it by the time elapsed)
      • Put it back into the tree
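
The pick/run/update cycle can be sketched in userspace as follows. This is a simplification under stated assumptions: it scans a plain array for the smallest vruntime, where the kernel takes the leftmost node of a red-black tree. `struct demo_task` and the `demo_` functions are hypothetical, and `count` must be at least 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scheduling fields of a task. */
struct demo_task {
    int id;
    uint64_t vruntime_ns;
};

/* Return the index of the task with the smallest vruntime (linear scan). */
size_t demo_pick_next(const struct demo_task *tasks, size_t count)
{
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (tasks[i].vruntime_ns < tasks[best].vruntime_ns)
            best = i;
    return best;
}

/* Pick a task, "run" it for slice_ns, and charge the elapsed time to it.
 * Returns the id of the task that ran. */
int demo_run_once(struct demo_task *tasks, size_t count, uint64_t slice_ns)
{
    size_t next = demo_pick_next(tasks, count);
    tasks[next].vruntime_ns += slice_ns;
    return tasks[next].id;
}
```

Calling `demo_run_once` repeatedly shows the fairness property: a task that has just run moves "to the right" and the others catch up before it runs again.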

https://www.geeksforgeeks.org/introduction-to-red-black-tree/

Red-Black Trees

[Diagram: example red-black tree containing the keys 2, 6, 8, 9, 13, 15, 19, and 22]

How is vruntime calculated?

  • New tasks - vruntime = minimum_vruntime
  • After execution
    • vruntime += time_elapsed * (NICE_0_LOAD / weight)
    • Weight is derived from niceness (priority): a higher priority means a larger weight, so vruntime grows more slowly
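
A hedged sketch of how elapsed time can be scaled by priority, assuming the CFS convention that a nice-0 task has a load weight of 1024 and that vruntime advances by the elapsed time multiplied by 1024 and divided by the task's weight. The kernel uses a fixed-point inverse instead of a plain division; `demo_vruntime_delta` and `DEMO_NICE_0_WEIGHT` are hypothetical names.

```c
#include <stdint.h>

#define DEMO_NICE_0_WEIGHT 1024u  /* load weight of a nice-0 task */

/* Scale elapsed run time by the task's load weight.
 * Larger weight (higher priority) => smaller vruntime increase. */
uint64_t demo_vruntime_delta(uint64_t delta_exec_ns, uint32_t weight)
{
    return delta_exec_ns * DEMO_NICE_0_WEIGHT / weight;
}
```

With the kernel's weight table, nice 5 maps to weight 335 and nice -5 to weight 3121, so after the same 1 ms of real run time the nice-5 task's vruntime advances roughly three times as far as a nice-0 task's, and the nice -5 task's advances about a third as far.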

How is it invoked?

  • /kernel/sched/core.c
  • schedule - the main function
    • Chooses what task to run and performs context switching
    • Also updates vruntime
  • Can be invoked in a few ways
    • update_process_times
    • Kernel Threads/Drivers calling the schedule function
    • Preemptively by the kernel 
    • Being called explicitly

Kernel Threads

What are they?

  • Kernel threads are tasks. As such they run in their own context
  • API can be found in /include/linux/kthread.h
    • Has functions like kthread_create
  • Kernel threads can only be created by other kernel threads
  • We can track kernel threads through the task_struct
    • Can you figure out how/why? 

API Calls

  • kthread_create - creates a new kernel thread
  • wake_up_process - start a kernel thread (or other task)
  • do_exit - terminate a kernel thread
  • kthread_stop - Flag the kernel thread that it should stop
    • It will wake up a sleeping kthread if necessary to set the flag
  • kthread_should_stop - check to see if the kernel thread should stop
  • allow_signal - indicates that the particular kthread can receive the indicated signal
  • set_current_state - sets the task state (e.g. TASK_INTERRUPTIBLE makes it interruptible)
  • schedule/ssleep - give up the CPU 

Synchronization

What the Kernel Provides

  • Wait Queues - FIFO queues of tasks sleeping until an event occurs
  • Completion Variables - Sleep until a certain condition is met
  • Spinlocks - Very similar to POSIX Spinlocks 
    • If you don't know what they are, see man pthread_spin_lock
  • Semaphores - Similar to POSIX Semaphores
    • man sem_overview
  • Atomic Operations
  • Mutexes - Similar to POSIX Mutexes

What the Kernel Provides

  • Wait Queues - /include/linux/wait.h
  • Completion Variables - /include/linux/completion.h
  • Spinlocks - /include/linux/spinlock.h
  • Semaphores - /include/linux/semaphore.h
  • Atomic Operations 
    • /include/linux/types.h (for types)
    • /include/asm-generic/atomic-instrumented (operations)
  • Mutexes - /include/linux/mutex.h
  • We are not going to go over these in depth, you need to do your homework on this. 
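
As a starting point for that homework, here is a userspace sketch of the spinlock idea built on C11 atomics. It is an analogy under stated assumptions, not the kernel's spinlock_t: the `demo_spin_*` names are hypothetical, and the kernel versions also deal with preemption and interrupts.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace spinlock: one atomic flag, set while held. */
struct demo_spinlock {
    atomic_flag locked;
};

void demo_spin_init(struct demo_spinlock *lock)
{
    atomic_flag_clear(&lock->locked);
}

/* Try once: returns true if the flag was clear and we now hold the lock. */
bool demo_spin_trylock(struct demo_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->locked,
                                              memory_order_acquire);
}

/* Busy-wait (spin) until the lock is acquired; never sleeps. */
void demo_spin_lock(struct demo_spinlock *lock)
{
    while (!demo_spin_trylock(lock))
        ;  /* spin */
}

void demo_spin_unlock(struct demo_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}
```

Because the waiter burns CPU instead of sleeping, this style of lock only makes sense for very short critical sections, which is why it suits contexts that cannot block.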

Interrupts

x86 Interrupt Handling

  • Interrupt: A "signal" that stops the current process where it is so the CPU can do something else.
    • Identified by an interrupt vector number (0 through 255)
    • Can be software or hardware based
  • Hardware interrupts are managed by the Advanced Programmable Interrupt Controller (APIC)
    • Programmable interrupt controllers developed by Intel
    • Receives a signal from a hardware device saying that something needs to be done
    • Redirects it to the correct system interrupt (the programmable piece)

The Basics

  • Asynchronous/hardware interrupts 
    • CPU Timer Expires
    • User presses key on keyboard
    • Network Card Receives data
  • Synchronous/software interrupts
    • Errors (Divide By Zero, etc). 
    • Page Faults
    • Interrupt instruction (like int 3)
      • What is int 3? 

Types of Synchronous Interrupts

  • What kind of interrupt is an int 3 instruction?
  • Traps - Pause execution of a program. Reported after the trapping instruction has executed.
    • Preserves program continuity (breakpoint)
  • Faults - An error happens, but can possibly be corrected
    • State is saved, and after the interrupt handler runs the processor restores the state from before the fault and retries the instruction
  • Aborts - Unrecoverable error - program exits after interrupt handler runs. 

x86 Interrupt Handling

  • Once the APIC receives the signal, it raises the interrupt line for a CPU
    • This CPU must not be masking the interrupt
  • The CPU then stops what it is doing and handles the interrupt
    • Checks the interrupt vector number
    • Executes the interrupt handling code based on the interrupt descriptor table
    • After execution is completed, the handler signals end-of-interrupt (EOI) to the interrupt controller (an out instruction for the legacy PIC, a register write for the APIC)

x86 Interrupt Handling

Some things to note:

  • When an interrupt occurs, the CPU saves the state of the running program on the stack
  • Sets RIP to the handler address found in the interrupt descriptor table (the entry is selected by the interrupt vector number)

What is an Interrupt Descriptor Table?

  • A table of descriptors that point to the code handling the various interrupts
    • Indexed by interrupt vector number
  • Set at kernel boot time via the lidt instruction.
    • Takes one operand: a structure containing the size and starting address of the IDT
    • Informs the CPU how big the IDT is and where it is located
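
A minimal sketch of the lidt operand, assuming 64-bit mode, where the operand is a 2-byte limit followed by an 8-byte base address and each gate descriptor is 16 bytes. The names (`demo_idt_ptr`, `demo_make_idt_ptr`) are hypothetical; the code only builds the structure and does not execute lidt, which requires ring 0.

```c
#include <stdint.h>

/* Operand of lidt in 64-bit mode: 2-byte limit followed by 8-byte base. */
struct demo_idt_ptr {
    uint16_t limit;  /* size of the table in bytes, minus one */
    uint64_t base;   /* linear address of the first descriptor */
} __attribute__((packed));

#define DEMO_IDT_ENTRIES 256u  /* one entry per interrupt vector */
#define DEMO_GATE_SIZE   16u   /* bytes per 64-bit gate descriptor */

/* Describe a full 256-entry IDT located at table_address. */
struct demo_idt_ptr demo_make_idt_ptr(uint64_t table_address)
{
    struct demo_idt_ptr p;
    p.limit = (uint16_t)(DEMO_IDT_ENTRIES * DEMO_GATE_SIZE - 1);
    p.base = table_address;
    return p;
}
```

The `packed` attribute matters here: without it the compiler would pad after `limit`, and the CPU would read the base from the wrong bytes.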

Linux Interrupt Handling

  • On bootup, the kernel initializes a global variable called idt_table with the proper gates 
  • During cpu_init, the kernel calls load_current_idt, which calls load_idt, which in turn executes the lidt instruction
  • When the kernel's interrupt handlers are invoked they run in the ring level specified in the given interrupt entry in the IDT
  • After an interrupt handler runs, it terminates in an iret instruction, which restores the state of the code that was interrupted
  • Will continue a little more later...

Into the weeds of Interrupts

Interrupt Descriptor Table

  • CPU reads an interrupt descriptor table to determine how to handle interrupts
  • Reference: /arch/x86/kernel/idt.c
    • Look at def_idts, apic_idts, idt_table
    • Entries are of type idt_data - Not what the CPU Uses
  • Linux converts idt_data into correct format for the CPU
    • idt_init_desc converts a single idt_data to a gate_desc
    • gate_desc is the format x86 CPU Wants
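
To illustrate what "the format the CPU wants" means, here is a hedged userspace sketch that packs a handler address into a 16-byte 64-bit interrupt gate. The structure and function names are hypothetical and only mirror the hardware layout; the kernel's own gate_desc and idt_init_desc differ in detail.

```c
#include <stdint.h>

/* Layout of a 64-bit interrupt gate as the CPU reads it (16 bytes). */
struct demo_gate {
    uint16_t offset_low;     /* handler address bits 0..15 */
    uint16_t segment;        /* code segment selector */
    uint8_t  ist;            /* bits 0..2: interrupt stack table index */
    uint8_t  type_attr;      /* gate type, DPL, present bit */
    uint16_t offset_middle;  /* handler address bits 16..31 */
    uint32_t offset_high;    /* handler address bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

#define DEMO_INTERRUPT_GATE 0x8Eu  /* present, DPL 0, interrupt gate type */

/* Split a 64-bit handler address across the three offset fields. */
struct demo_gate demo_pack_gate(uint64_t handler, uint16_t segment)
{
    struct demo_gate g = {0};
    g.offset_low = (uint16_t)(handler & 0xFFFF);
    g.segment = segment;
    g.type_attr = DEMO_INTERRUPT_GATE;
    g.offset_middle = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}

/* Reassemble the handler address, as the CPU does on an interrupt. */
uint64_t demo_gate_handler(const struct demo_gate *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_middle << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

The split offset fields are a legacy of older descriptor formats, which is one reason the kernel keeps a friendlier internal representation and converts it at setup time.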

Interrupt Descriptor Table

  • First 32 entries are reserved for exceptions
  • The other interrupt vectors are usable by external IRQs
    • Can be mapped to any interrupt vector greater than 31

Interrupt Handlers

  • Often known as Interrupt Service Routines (ISRs)
  • Functions invoked from receiving an interrupt
    • Perform any computation or processing needed to handle the interrupt
    • Can you think of any examples?
      • Handling a keystroke
    • They shouldn't block or do a lot of processing

Interrupt Handlers

  • There is a common handler called common_interrupt
    • Shared by all IRQ interrupts
  • common_interrupt calls do_IRQ, which finds the right Interrupt handler on the vector and calls it 
  • Some important notes:
    • Interrupt vectors 0-31 share some macro code
      • All are distinct handlers
    • Actual interrupt handlers referenced in the IDT are defined in /arch/x86/entry/entry_64.S

How do interrupts work?

  • The task switches context to the interrupt context
    • This is where interrupts and their respective handlers can operate
    • In the interrupt context, all other interrupts are still enabled (can have two interrupts happen at once).
      • This can be disabled by the programmer
  • Interrupt handlers operate in their own context
    • Have their own stack (very small - one page)

How do interrupts work?

  • For asynchronous interrupts: the device sends a signal to the interrupt controller on the CPU
    • Lookup signals in the Linux manual
    • This is called an Interrupt Request (IRQ)
  • Interrupt Controller (in the kernel) monitors IRQ lines
  • Interrupt controller sends a signal to the processor 

How do interrupts work?

  • Based on the IRQ, runs the kernel function defined in the IDT
  • The interrupt handler routine (function) runs
  • Handler exits, kernel resumes normal execution 
    • Also executes ret_from_intr

Programming the APIC

  • The APIC routes IRQs to vectors
    • Helps to tell the CPU which vector to run
    • The APIC must be told which CPU each interrupt request should be routed to
  • APIC is programmed at boot time
    • Done by reading/writing various memory-mapped registers
  • References:
    • /arch/x86/kernel/apic/io_apic.c 
    • /arch/x86/include/asm/io_apic.h (structure sent to APIC)
      • struct IO_APIC_route_entry

Programming the APIC

  • Some IRQ numbers are legacy or from standards
    • IRQ 1 is the keyboard

Let's See This Visually

  1. User Presses Key
  2. Raise IRQ Line (Raise an interrupt)
  3. Map IRQ to interrupt Vector
  4. Send vector to local APIC
  5. Save State, Switch stacks, put interrupt vector on stack
  6. Call the interrupt handler
  7. Kernel tells APIC that the interrupt is handled

[Diagram: the numbered steps above, flowing from the IO APIC to the CPU/IDT]

Registering and Handling an interrupt

  • Think of it in two phases - Top half and Bottom Half
  • Top half refers to the handler & APIC - and it cannot block. 
    • Must execute briefly so that it doesn't stall the CPU
  • Any more processing must be deferred to the bottom half
    • These are scheduled

Registering and Handling an interrupt

  • What is the bottom half? 
    • It is where deferred work is handled
  • Three ways this is handled
    • Softirq
    • Tasklets
    • Workqueues

 

Softirq

  • Determined statically at compile-time - kernel/softirq.c
  • An array that contains NR_SOFTIRQS (10) softirqs, and each one has a particular action 
    • Can also be observed via /proc/softirqs
    • Softirqs need to be re-entrant
enum
{
        HI_SOFTIRQ=0,
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        BLOCK_IOPOLL_SOFTIRQ,
        TASKLET_SOFTIRQ,
        SCHED_SOFTIRQ,
        HRTIMER_SOFTIRQ,
        RCU_SOFTIRQ,
        NR_SOFTIRQS
};

Softirq

  • Can only be executed if raised - raise_softirq(TIMER_SOFTIRQ)
  • How is it executed
    • Returning from a hardware interrupt
    • Explicitly called by some subsystem or kernel thread
  • Extremely time-sensitive processing

From: https://www.oreilly.com/library/view/understanding-the-linux/0596005652/ch04s07.html
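
The raise-then-run behaviour can be sketched in userspace with a pending bitmask. This is a simplified model under stated assumptions: the `demo_` names, the four-entry table, and the counter action are hypothetical, and the real implementation is per-CPU and interrupt-aware.

```c
#include <stdint.h>

enum { DEMO_HI, DEMO_TIMER, DEMO_NET_TX, DEMO_NET_RX, DEMO_NR };

typedef void (*demo_action_t)(void);

static demo_action_t demo_vec[DEMO_NR];  /* one action per softirq number */
static uint32_t demo_pending;            /* bit n set => softirq n raised */

/* Example action used for illustration: counts how often it ran. */
int demo_timer_runs;
void demo_timer_action(void) { demo_timer_runs++; }

/* Register the action for a softirq number. */
void demo_open_softirq(int nr, demo_action_t action) { demo_vec[nr] = action; }

/* Mark a softirq as pending; raising twice before it runs is one run. */
void demo_raise_softirq(int nr) { demo_pending |= 1u << nr; }

/* Run every raised softirq, lowest number first; returns how many ran. */
int demo_do_softirq(void)
{
    uint32_t pending = demo_pending;
    int ran = 0;

    demo_pending = 0;  /* actions may raise softirqs again while running */
    for (int nr = 0; nr < DEMO_NR; nr++) {
        if ((pending & (1u << nr)) && demo_vec[nr]) {
            demo_vec[nr]();
            ran++;
        }
    }
    return ran;
}
```

Note that nothing runs unless it was raised first, and that the index order doubles as a priority order.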

Tasklet

  • An implementation on top of softirq (particularly HI_SOFTIRQ and TASKLET_SOFTIRQ)
  • Operate on a list of tasklets that are initialized and allocated at runtime
  • A given tasklet can run on only one CPU at a time
  • Important functions:
    • tasklet_schedule and tasklet_hi_schedule
    • tasklet_init - initialize a tasklet
    • tasklet_disable - disables a tasklet
    • tasklet_enable - enables a tasklet
    • tasklet_kill - deletes a tasklet from the queue
    • Tasklet handler definition: void tasklet_handler(unsigned long data);

Work Queues

  • Defer interrupt work to a kernel thread by operating in process context
    • Can handle synchronization better than softirq and tasklet (waiting on a semaphore, block I/O, etc.) 
  • To handle work queues, you can create your own kernel worker thread or use the generic worker threads already created
    • worker_thread function is used for the kernel worker thread
      • Puts a thread to sleep until it is woken up to perform work 
      • Operate on a linked list of work_struct

Work Queues Functions

  • See /include/linux/workqueue.h
  • DECLARE_WORK or INIT_WORK - initialize a work item (work_struct)
  • schedule_work - queue a work item on the default (system) workqueue
  • flush_scheduled_work - wait for work to be done
  • Others....
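
A userspace sketch of the deferred-work pattern, assuming a single FIFO of work items drained by one worker. The `demo_` types and functions are hypothetical and only model the idea of queueing a function to run later; they are not the workqueue.h API.

```c
#include <stddef.h>

/* Hypothetical work item: a function to call later plus list linkage. */
struct demo_work {
    void (*func)(struct demo_work *work);
    struct demo_work *next;
    int pending;  /* non-zero while the item sits in a queue */
};

struct demo_workqueue {
    struct demo_work *head;
    struct demo_work *tail;
};

/* Example work function used for illustration. */
int demo_work_runs;
void demo_count_work(struct demo_work *work) { (void)work; demo_work_runs++; }

/* Queue an item at the tail; returns 0 if it was already queued. */
int demo_queue_work(struct demo_workqueue *wq, struct demo_work *work)
{
    if (work->pending)
        return 0;
    work->pending = 1;
    work->next = NULL;
    if (wq->tail)
        wq->tail->next = work;
    else
        wq->head = work;
    wq->tail = work;
    return 1;
}

/* What a worker does when woken: drain the queue in FIFO order. */
int demo_run_pending(struct demo_workqueue *wq)
{
    int ran = 0;
    while (wq->head) {
        struct demo_work *work = wq->head;
        wq->head = work->next;
        if (!wq->head)
            wq->tail = NULL;
        work->pending = 0;  /* cleared first so func may requeue itself */
        work->func(work);
        ran++;
    }
    return ran;
}
```

In the kernel the draining happens in a worker thread in process context, which is why work functions are allowed to sleep.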

The Interrupt Handler Interface

  • request_irq/free_irq - register and unregister an interrupt handler
    • Flags used to register a handler
      • IRQF_DISABLED - Disable all interrupts when this handler executes (removed from recent kernels)
      • IRQF_SAMPLE_RANDOM - Use this handler as an entropy source (removed from recent kernels)
      • IRQF_TIMER - Processes system timer interrupts
      • IRQF_SHARED - The IRQ line can be shared by multiple handlers
  • local_irq_disable/local_irq_enable
  • in_interrupt
  • in_irq
  • local_irq_save/local_irq_restore
  • See more details in /include/linux/interrupt.h

Working with Linux Kernel

What do you need?

  • Linux Kernel Source
    • Obtainable from apt
    • Get it from kernel.org

Compiling the Kernel

  • Why would one want to compile the kernel themselves?
    • Enabling debugging Features
    • Add functionality
    • Change Functionality
    • Building for a new architecture
  • The virtual machine has a customized kernel
    • We will compile it once, but not more than that because it takes a long time to do

Creating our debugging Environment

  • We will be spending some time creating our debugging environment
    • We will be using VMware Workstation/Fusion to do this
    • We will create a virtual serial port to communicate over 
    • We will also use dmesg for print statements (quickest way to debug)
    • You might need to continue this at home 
  • Alternatively, you can use pwn.college in practice mode

Creating our debugging Environment

  • The quick and dirty way of doing it - just use dmesg and printk
  • The not-so-quick way - compiling a kernel and enabling kernel gdb

Kernel Modules

  • The primary way for extending kernel functionality
  • Allows for various different functionality within the Linux kernel
    • Support a new filesystem
    • Implement a device/driver
    • Implement a new protocol
    • New Scheduling algorithm 

Kernel Modules

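#include <linux/init.h>    /* __init, __exit */
#include <linux/kernel.h>  /* KERN_INFO */
#include <linux/module.h>  /* module_init, module_exit, MODULE_* */

/* Runs once when the module is loaded; return 0 on success. */
static int __init start(void)
{
    printk(KERN_INFO "Hello World!\n");
    return 0;
}

/* Runs once when the module is removed. */
static void __exit mod_stop(void)
{
    printk(KERN_INFO "Goodbye World\n");
}

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Michael Wittner");
MODULE_DESCRIPTION("Simple Demo.");
module_init(start);
module_exit(mod_stop);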

Defines which functions are called on load/removal of a kernel module

Macros for licensing and defining init and exit

Where can you find printk messages?

Kernel Modules

# Basic Makefile for Kernel Modules - Kernel module with one C file

# The object name must match your C file name (example.c -> example.o)
obj-m := example.o

all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

# Inserting kernel modules

insmod example.ko optparam1="param" optparam2=2

# Removing modules

rmmod example

#If on pwn.college practice mode, do this instead
vm build /path/to/.c/file
vm start
vm connect
#Look at vm --help and vm <command> --help for more details

When a kernel module is inserted...

  • The system call sys_init_module is invoked
  • The code is copied into memory
  • The license is checked
  • The symbols used by the module are checked in the kernel symbol table (and resolved if found)
    • That symbol must be exported, can find it in /proc/kallsyms
  • The module's init function is invoked

Note: to export symbols, use macro EXPORT_SYMBOL

Character Devices

  • A kernel I/O method that uses a stream of data
  • All operations (reading, writing, etc.) are performed on a per-byte/character basis.
  • Accessed through the Linux FS (/dev/ttyXX)
  • Acts like a file (have to implement open, read, write  for interaction)
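
From userspace, one way to see that a path under /dev is backed by a character device is to check its file type with stat(). A minimal sketch, assuming a Linux system; the helper name `demo_is_char_device` is hypothetical.

```c
#include <sys/stat.h>

/* Returns 1 for a character device, 0 for any other file type,
 * and -1 if stat() fails (for example, the path does not exist). */
int demo_is_char_device(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    return S_ISCHR(st.st_mode) ? 1 : 0;
}
```

For example, `demo_is_char_device("/dev/null")` reports a character device on a typical Linux system. The same pattern with `S_ISBLK` identifies block devices.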

Block Device

  • Similar to a character device, but performs operations on chunks of data
    • Block sizes are typically powers of two (512, 4096, etc.)
  • Linux allows block devices to be accessed as a stream of bytes by applications; thus, very similar to character devices
    • The kernel-side interface must transfer full blocks
  • Accessed through /dev (e.g. /dev/sda)
  • Examples: Disk drive

Network Devices

  • Not accessible through the file system - provides interfaces to various networks instead
  • Facilitates the transmission and reception of data packets
  • Implement a backend for kernel requests for sending and receiving data

Time to build your own Kernel Module!

Homework

Kernel Internals Homework 1

 

For this homework, you will be creating a kernel module that implements an interrupt handler.

You will need to create an interrupt handler for an IRQ number and share it with another handler. Every time it gets interrupted, a kernel thread should be created where it increases a counter by 5. After completing it, it should print out the value using a deferred work mechanism.

Things you need to keep in mind for this homework:

  • How many times it is counting (make sure to remember to add necessary protections)
  • Print out the current value of the counter after deferring work.
  • How do you see which interrupt handlers are already registered?