ENPM809V
Linux Kernel Internals - Part 1
Some Resources to Look At
- Bootlin Elixir - Contains the source code
- Will be referenced in the slides
- Userspace Documentation
- Some concepts are very similar (especially for synchronization)
What we will be learning
- Linux Kernel Fundamentals
- Linux Kernel Modules
- System Calls in the Kernel
- Interrupt Handling
- Kernel Threads
Linux Kernel Fundamentals
What is the Kernel?
- Code in the operating system that interfaces between hardware and higher-level applications.
- The Linux kernel is a free, open-source operating system kernel used in Linux distributions
- Modular, monolithic, multitasking, Unix-Like
[Diagram: Applications -> System Call Interface/Interrupt Handling -> Kernel Subsystem -> Device Drivers]
x86 Protection Rings
- A protection mechanism in x86_64 CPUs to prevent unauthorized access to the kernel.
- 4 protection rings (levels 0 through 3), but mostly levels 0 and 3 are used
- Level 0 = Kernel and Drivers
- Level 3 = Applications
x86 Protection Rings
- At ring 3, the CPU can
- Use most x86 instructions
- Access unprivileged memory
- At ring 0, the CPU can
- Do almost everything ring 3 can do
- Access Privileged memory
- Use Special instructions
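The ring a program runs in can be observed directly: on x86 the low two bits of the CS segment selector hold the current privilege level (CPL). The following is a minimal sketch, assuming an x86 or x86_64 target and GCC/Clang inline assembly. The function name `current_privilege_level` is hypothetical, and it returns -1 on other architectures.

```c
/* Returns the current privilege level (0-3) on x86, or -1 elsewhere.
 * The function name is illustrative, not a kernel or libc API. */
int current_privilege_level(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned short cs;
    /* Read the code segment selector; its low two bits are the CPL. */
    __asm__ volatile("mov %%cs, %0" : "=r"(cs));
    return cs & 0x3;
#else
    return -1; /* not an x86 target */
#endif
}
```

Called from an ordinary userspace program on x86, this should report ring 3. The same read performed inside kernel code would report ring 0.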
Switching Protection Rings
- Userspace programs can ask the kernel to execute something through a few vectors:
- System calls - invoked directly by userspace applications
- Interrupts - triggered indirectly by instructions that cause exceptional conditions
System Calls
- A userspace program executes the syscall instruction
- How does this happen?
- The address of the instruction following the syscall is placed into RCX
- RIP is set to the address of the kernel's system call handler
- Provided by the OS at boot time
- Stored in the LSTAR model-specific register (IA32_LSTAR) on x86_64 machines
- Ring level (CPL) is set to 0
- After the kernel finishes, the sysret instruction sets RIP to whatever is in RCX and transfers back to ring 3
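From userspace, the path described above can be exercised without the usual libc helper by going through the generic syscall(2) wrapper. This is a minimal sketch, assuming Linux with glibc. The function name `raw_getpid` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Invoke getpid through the generic syscall(2) wrapper instead of the
 * libc getpid() helper. On x86_64 the wrapper loads the system call
 * number into RAX and executes the syscall instruction. */
long raw_getpid(void)
{
    long pid = syscall(SYS_getpid);
    assert(pid > 0); /* getpid cannot fail */
    return pid;
}
```

The value returned should match what `getpid()` reports, since both end up in the same kernel handler.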
Privileged Instructions
- Ring 0 Code has access to privileged instructions
- Controls how the system reacts to interrupts/exceptions
- LIDT - Load Interrupt Descriptor Table Register
- LLDT - Load Local Descriptor Table
- LGDT - Load Global Descriptor Table Register
- LTR - Load Task Register
- Reading/Writing Model-specific registers (MSRs)
- RDMSR, WRMSR
- Virtual machine opcodes
- VMCALL, VMLAUNCH, VMRESUME, VMXON, VMXOFF
- Others too...
Kernel Data Structures
Many Many Structures
- Structures hold the majority of kernel data
- Tasks
- Kthreads
- Audit
- Files
Many Many Structures
- Tend to be generalized so that they can be applied anywhere without sacrificing performance
- Linked lists - /include/linux/list.h
- Queues - /include/linux/kfifo.h
- Hash maps - /include/linux/hashtable.h
- Radix trees - /include/linux/generic-radix-tree.h
- RB trees - /include/linux/rbtree.h
Slightly Different Than Traditional Datastructures
[Figure: doubly linked list nodes, each with DATA, Prev, and Next fields]
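/* Generic list node: it carries no data of its own */
struct list_head
{
    struct list_head *next;
    struct list_head *prev;
};
struct some_other_struct
{
    char *data1;
    int data2;
    struct list_head list; /* list node embedded in the containing structure */
};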
https://www.oreilly.com/library/view/linux-device-drivers/0596000081/ch10s05.html
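A minimal userspace sketch of how such an intrusive list behaves, assuming the list node is embedded directly in the containing structure (the convention used with /include/linux/list.h). The helper names mirror the kernel's, but the bodies are simplified re-creations and `list_length` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-creation of the kernel's list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list is a head that points at itself. */
void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_head *node, struct list_head *head)
{
    assert(node != NULL && head != NULL);
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Count the nodes by walking until we are back at the head. */
size_t list_length(const struct list_head *head)
{
    size_t n = 0;
    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

struct some_other_struct {
    char *data1;
    int data2;
    struct list_head list; /* embedded node links this struct into a list */
};
```

Because the list code only ever touches `struct list_head`, the same helpers work for any structure that embeds a node.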
Embedding Structures
struct example_struct
{
    struct example_struct *prev;
    struct example_struct *next;
};
struct some_other_struct
{
char *data1;
int data2;
struct example_struct *head;
};
- Embedding structures is quite common in the Linux Kernel
- task -> file
- task -> audit
- task -> another task
- Structures can also be randomized
- Security Feature __randomize_layout
Structure Randomization
struct example_struct
{
    struct example_struct *prev;
    struct example_struct *next;
};
struct some_other_struct
{
    char *data1;
    int data2;
    struct example_struct *head;
} __randomize_layout;
- Many structures are randomized at compile time
- Difficult to attack based on offset
- This is where macros come in
offsetof()
- Finds the byte offset of a member within a structure type
- Part of standard C (defined in <stddef.h>); a classic definition is:
#define offsetof(type, member) ((size_t)&(((type *)0)->member))
container_of
#define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
(type *)( (char *)__mptr - offsetof(type,member) ); })
- Kernel macro that finds the parent (containing) structure of a member
- Takes a pointer to the member (child) structure
- Subtracts the member's offset within the parent type from the member's address
- End result = address of the parent
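The pointer arithmetic can be tried in userspace. This is a minimal sketch, assuming a portable variant of the macro without the GNU `typeof` type check. The struct layout and the function name `entry_from_node` are hypothetical.

```c
#include <stddef.h> /* offsetof */

/* Portable variant of container_of without the GNU typeof check. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct list_head {
    struct list_head *next, *prev;
};

struct some_other_struct {
    char *data1;
    int data2;
    struct list_head list; /* embedded member */
};

/* Given only a pointer to the embedded list node, recover the parent. */
struct some_other_struct *entry_from_node(struct list_head *node)
{
    return container_of(node, struct some_other_struct, list);
}
```

Because the offset is computed by the compiler from the structure definition, this keeps working even when the field order changes.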
task_struct
- The task_struct is used to manage tasks
- A task is the kernel's way of managing processes/execution context
- Contains MANY data fields, ranging from memory and CPU usage to other data structures (such as the audit_context)
- Also contains information like UID and EUID
- Access the task_struct of the currently running process by using the macro - current
- Located in /include/linux/sched.h
Scheduler Classes
What is it?
- A way for the Linux Kernel to manage task execution
- Modular - allows different algorithms to drive the scheduler
- Each scheduler class runs a different type of process/task
- Base implementation: /kernel/sched/core.c
- Tracked in the task_struct through the sched_class pointer and the se field (a struct sched_entity)
Completely Fair Scheduler
- Responsible for scheduling processes of normal priority
- Provides processes with a proportion of CPU time
- Aims to maximize overall CPU utilization
- Implemented based on per-CPU run queues
- Nodes are ordered in a time-based manner
- Kept sorted by red-black trees
Red-Black Trees in Completely Fair Scheduler
- Objective: Keep track of how long a process has been running (part of the completely-fair algorithm)
- Red-Black Trees are self-balancing binary search trees (approximately, not perfectly, balanced)
- Runtime is tracked in nanoseconds by the vruntime field
- How it is performed
- Insert tasks into the tree based on vruntime
- Pick the one with the smallest vruntime
- During context switching, update the vruntime (increasing it by the time elapsed)
- Put it back into the tree
https://www.geeksforgeeks.org/introduction-to-red-black-tree/
Red-Black Trees
[Figure: example red-black tree holding the keys 2, 6, 8, 9, 13, 15, 19, and 22]
How is vruntime calculated?
- New tasks
newvruntime = minimum_vruntime
- After execution
newvruntime = oldvruntime + time_elapsed * niceness_factor
- The niceness factor is based on priority (lower-priority tasks accumulate vruntime faster)
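A minimal sketch of this accounting, assuming the niceness factor is the ratio of the nice-0 load weight (1024) to the task's own load weight. `struct demo_task`, `update_vruntime`, and `pick_next` are hypothetical names. A linear scan stands in for the red-black tree, and plain integer division stands in for the kernel's fixed-point arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024u /* load weight of a nice-0 task */

/* Hypothetical, simplified stand-in for the scheduling entity. */
struct demo_task {
    uint64_t vruntime; /* virtual runtime in nanoseconds */
    uint32_t weight;   /* load weight derived from the nice value */
};

/* Charge delta_ns of real runtime to the task. Heavier (higher priority)
 * tasks accumulate vruntime more slowly, so they get picked more often. */
void update_vruntime(struct demo_task *t, uint64_t delta_ns)
{
    assert(t != NULL && t->weight > 0);
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* Pick the task with the smallest vruntime. CFS finds it as the leftmost
 * node of the red-black tree; a linear scan is used here for brevity. */
struct demo_task *pick_next(struct demo_task *tasks, size_t n)
{
    assert(tasks != NULL && n > 0);
    struct demo_task *best = &tasks[0];
    for (size_t i = 1; i < n; i++)
        if (tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    return best;
}
```

With two tasks that each run for the same wall-clock time, the one with double the weight ends up with half the vruntime and is therefore chosen next.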
How is it invoked?
- /kernel/sched/core.c
- schedule - the main function
- Chooses what task to run and performs context switching
- Also updates vruntime
- Can be invoked in a few ways
- update_process_times
- Kernel Threads/Drivers calling the schedule function
- Preemptively by the kernel
- Being called explicitly
Kernel Threads
What are they?
- Kernel threads are tasks. As such they run in their own context
- API can be found in /include/linux/kthread.h
- Has functions like kthread_create
- Kernel threads can only be created by other kernel threads
- We can track kernel threads through the task_struct
- Can you figure out how/why?
API Calls
- kthread_create - creates a new kernel thread
- wake_up_process - start a kernel thread (or other task)
- do_exit - terminate a kernel thread
- kthread_stop - Flag the kernel thread that it should stop
- It will wake up a sleeping kthread if necessary to set the flag
- kthread_should_stop - check to see if the kernel thread should stop
- allow_signal - indicates that the particular kthread can receive the indicated signal
- set_current_state - sets the task state (e.g., TASK_INTERRUPTIBLE makes it interruptible)
- schedule/ssleep - give up the CPU
Synchronization
What the Kernel Provides
- Wait Queues - FIFO based on sleep
- Completion Variables - Sleep until a certain condition is met
- Spinlocks - Very similar to POSIX Spinlocks
- If you don't know what these are, see man pthread_spin_lock
- Semaphores - Similar to POSIX Semaphores
- man sem_overview
- Atomic Operations
- Mutexes - Similar to POSIX Mutexes
What the Kernel Provides
- Wait Queues - /include/linux/wait.h
- Completion Variables - /include/linux/completion.h
- Spinlocks - /include/linux/spinlock.h
- Semaphores - /include/linux/semaphore.h
- Atomic Operations
- /include/linux/types.h (for types)
- /include/asm-generic/atomic-instrumented.h (operations)
- Mutexes - /include/linux/mutex.h
- We are not going to go over these in depth; you need to do your homework on this.
Interrupts
x86 Interrupt Handling
- Interrupt: A "signal" that stops the current process as it is and does something else.
- Identified by an interrupt vector number (0 to 255)
- Can be software and hardware based
- Hardware Interrupts managed by the Advanced Programmable Interrupt Controller (APIC)
- Programmable interrupt controllers developed by Intel
- Receives a signal from hardware device - says something needs to be done through a signal
- Redirects it to the correct system interrupt (Programmable piece)
The Basics
- Asynchronous/hardware interrupts
- CPU Timer Expires
- User presses key on keyboard
- Network Card Receives data
- Synchronous/software interrupts
- Errors (Divide By Zero, etc).
- Page Faults
- Interrupt instruction (like int 3)
- What is int 3?
Types of Synchronous Interrupts
- What kind of interrupt is an int 3 instruction?
- Traps - Pause execution of a program. Reported after the trapping instruction executes.
- Preserves program continuity (breakpoint)
- Faults - An error happens, but can possibly be corrected
- State is saved and processor restores state to where it was before faulting via the interrupt handler
- Aborts - Unrecoverable error - program exits after interrupt handler runs.
x86 Interrupt Handling
- Once the APIC receives an interrupt, it raises the interrupt line for a CPU
- This CPU must not be masking the interrupt
- The CPU then stops what it is doing and handles the interrupt
- Checks the interrupt vector number
- Executes the interrupt handling code based on the interrupt descriptor table
- After execution is completed, it signals end-of-interrupt to the interrupt controller (an out instruction for the legacy PIC; a write to the EOI register for the local APIC)
x86 Interrupt Handling
Some things to note:
- When an interrupt occurs, the CPU saves the state of the running program on the stack
- Sets RIP to the handler address found in the interrupt descriptor table (selected by interrupt vector number)
What is an Interrupt Descriptor Table?
- A table of gate descriptors that point to the code handling each interrupt
- Indexed by interrupt vector number
- Set at kernel boot time via the lidt instruction.
- Takes one operand: a structure containing the size and starting address of the IDT
- Informs the CPU how big the IDT is and where it is located
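The lidt operand can be modeled as a small packed structure. This is a minimal sketch, assuming 64-bit mode (16-byte gates, 256 vectors) and GCC/Clang's packed attribute. `struct idt_pointer`, `make_idt_pointer`, and `gate_address` are hypothetical names, not the kernel's definitions.

```c
#include <stdint.h>

#define IDT_ENTRIES    256 /* one gate per interrupt vector */
#define IDT_ENTRY_SIZE 16  /* bytes per gate in 64-bit mode */

/* Layout of the lidt memory operand in 64-bit mode: a 16-bit limit
 * followed by a 64-bit base address (10 bytes, no padding). */
struct idt_pointer {
    uint16_t limit; /* size of the table in bytes, minus one */
    uint64_t base;  /* address of the first gate */
} __attribute__((packed));

/* Build the operand for a full 256-entry table located at base. */
struct idt_pointer make_idt_pointer(uint64_t base)
{
    struct idt_pointer p;
    p.limit = IDT_ENTRIES * IDT_ENTRY_SIZE - 1;
    p.base = base;
    return p;
}

/* Address of the gate the CPU consults for a given vector. */
uint64_t gate_address(const struct idt_pointer *p, uint8_t vector)
{
    return p->base + (uint64_t)vector * IDT_ENTRY_SIZE;
}
```

The limit field is why the CPU can reject vectors that fall outside the table: any gate address past `base + limit` is invalid.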
Linux Interrupt Handling
- On bootup, the kernel initializes a global variable called idt_table with the proper gates
- During cpu_init, the kernel calls load_current_idt, which calls load_idt, which in turn executes the lidt instruction
- When the kernel's interrupt handlers are invoked they run in the ring level specified in the given interrupt entry in the IDT
- After an interrupt handler runs, it terminates in an iret instruction, which restores state for the code that was interrupted
- Will continue a little more later...
Into the weeds of Interrupts
Interrupt Descriptor Table
- CPU reads an interrupt descriptor table to determine how to handle interrupts
- Reference: /arch/x86/kernel/idt.c
- Look at def_idts, apic_idts, idt_table
- Entries are of type idt_data
- Not what the CPU uses
- Linux converts idt_data into the correct format for the CPU
- idt_init_desc converts a single idt_data to a gate_desc
- gate_desc is the format the x86 CPU wants
Interrupt Descriptor Table
- First 32 entries are reserved for exceptions
- The other interrupt vectors are usable by external IRQs
- Can be mapped to any interrupt vector greater than 31
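To make the CPU-facing format concrete, here is a minimal sketch of a 64-bit interrupt gate in which the handler address is split across three fields. It assumes GCC/Clang's packed attribute. `struct demo_gate` and the helper names are hypothetical, and the kernel's real definition is gate_desc.

```c
#include <stdint.h>

/* 64-bit mode gate descriptor (16 bytes). Field names are illustrative. */
struct demo_gate {
    uint16_t offset_low;  /* handler address bits 0-15 */
    uint16_t selector;    /* code segment selector */
    uint8_t  ist;         /* interrupt stack table index (low 3 bits) */
    uint8_t  type_attr;   /* present bit, DPL, gate type */
    uint16_t offset_mid;  /* handler address bits 16-31 */
    uint32_t offset_high; /* handler address bits 32-63 */
    uint32_t reserved;
} __attribute__((packed));

#define GATE_PRESENT   0x80
#define GATE_INTERRUPT 0x0E /* 64-bit interrupt gate type */

/* Pack a handler address, segment selector, and DPL into a gate. */
struct demo_gate make_gate(uint64_t handler, uint16_t selector, uint8_t dpl)
{
    struct demo_gate g = { 0 };
    g.offset_low  = handler & 0xFFFF;
    g.selector    = selector;
    g.type_attr   = GATE_PRESENT | ((dpl & 0x3) << 5) | GATE_INTERRUPT;
    g.offset_mid  = (handler >> 16) & 0xFFFF;
    g.offset_high = handler >> 32;
    return g;
}

/* Reassemble the handler address the CPU would load into RIP. */
uint64_t gate_handler(const struct demo_gate *g)
{
    return (uint64_t)g->offset_high << 32 |
           (uint64_t)g->offset_mid << 16 | g->offset_low;
}

/* Vectors 0-31 are reserved for CPU exceptions. */
int is_exception_vector(unsigned vector)
{
    return vector < 32;
}
```

The DPL stored in the gate decides which ring may trigger that vector with a software interrupt instruction, which is how a breakpoint gate can be made reachable from ring 3 while others stay kernel-only.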
Interrupt Handlers
- Often known as Interrupt Service Routines (ISRs)
- Functions invoked from receiving an interrupt
- Perform any computation or processing needed to handle the interrupt
- Can you think of any examples?
- Handling a keystroke
- They shouldn't block or do a lot of processing
Interrupt Handlers
- There is a common handler called common_interrupt
- Shared by all IRQ interrupts
- common_interrupt calls do_IRQ, which finds the right Interrupt handler on the vector and calls it
- Some important notes:
- Interrupt vectors 0-31 share some macro code
- All are distinct handlers
- Actual interrupt handlers referenced in the IDT are defined in /arch/x86/entry/entry_64.S
How do interrupts work?
- The task switches context to the interrupt context
- This is where interrupts and their respective handlers can operate
- In the interrupt context, all other interrupts are still enabled (can have two interrupts happen at once).
- This can be disabled by the programmer
- Interrupt handlers operate in their own context
- Have their own stack (very small - one page)
How do interrupts work?
- For asynchronous interrupts: the device sends a signal to the interrupt controller on the CPU
- Lookup signals in the Linux manual
- This is called an Interrupt Request (IRQ)
- Interrupt Controller (in the kernel) monitors IRQ lines
- Interrupt controller sends a signal to the processor
How do interrupts work?
- Based on the IRQ, runs the kernel function defined in the IDT
- The interrupt handler routine (function) runs
- Handler exits, kernel resumes normal execution
- Also executes ret_from_intr
Programming the APIC
- The APIC routes IRQs to vectors
- Helps to tell the CPU which vector to run
- The APIC needs to be told which CPU the interrupt request should be routed to
- APIC is programmed at boot time
- Done by reading/writing various memory-mapped registers
- References:
- /arch/x86/kernel/apic/io_apic.c
- /arch/x86/include/asm/io_apic.h (structure sent to APIC)
- struct IO_APIC_route_entry
Programming the APIC
- Some IRQ numbers are legacy or from standards
- IRQ 1 is the keyboard
Let's See This Visually
- User Presses Key
- Raise IRQ Line (Raise an interrupt)
- Map IRQ to interrupt Vector
- Send vector to local APIC
- Save State, Switch stacks, put interrupt vector on stack
- Call the interrupt handler
- Kernel tells APIC that interrupt is handled
[Diagram: IO APIC -> CPU/IDT]
Registering and Handling an interrupt
- Think of it in two phases - Top half and Bottom Half
- Top half refers to the handler & APIC - and it cannot block.
- Must execute briefly so that it doesn't stall the CPU
- Any more processing must be deferred to the bottom half
- These are scheduled
Registering and Handling an interrupt
- What is the bottom half?
- It is where deferred work is handled
- Three ways this is handled
- Softirq
- Tasklets
- Workqueues
Softirq
- Determined statically at compile-time - kernel/softirq.c
- An array that contains NR_SOFTIRQS (10) softirqs, and each one has a particular action
- Can also be observed via /proc/softirqs
- Softirqs need to be re-entrant
enum
{
HI_SOFTIRQ=0,
TIMER_SOFTIRQ,
NET_TX_SOFTIRQ,
NET_RX_SOFTIRQ,
BLOCK_SOFTIRQ,
BLOCK_IOPOLL_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
RCU_SOFTIRQ,
NR_SOFTIRQS
};
Softirq
- Can only be executed if raised - raise_softirq(TIMER_SOFTIRQ)
- How is it executed?
- Returning from a hardware interrupt
- Explicitly called by some subsystem or kernel thread
- Extremely time-sensitive processing
From: https://www.oreilly.com/library/view/understanding-the-linux/0596005652/ch04s07.html
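The raise-then-run behavior can be modeled in userspace with a static handler table and a pending bitmask. This is a minimal sketch under stated assumptions. Every `demo_`-prefixed name is hypothetical, the enum is abbreviated, and the real logic in kernel/softirq.c is per-CPU and considerably more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Abbreviated version of the softirq enum. */
enum { HI_SOFTIRQ = 0, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ,
       DEMO_NR_SOFTIRQS };

typedef void (*softirq_action_fn)(void);

static softirq_action_fn softirq_vec[DEMO_NR_SOFTIRQS]; /* static table */
static uint32_t pending;                                /* one bit each */

/* Register the action for one softirq number. */
void demo_open_softirq(int nr, softirq_action_fn fn)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    softirq_vec[nr] = fn;
}

/* Raising only marks the softirq pending; nothing runs yet. */
void demo_raise_softirq(int nr)
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    pending |= 1u << nr;
}

/* Run every pending action, lowest number (highest priority) first.
 * Returns how many actions ran. */
int demo_do_softirq(void)
{
    uint32_t snapshot = pending;
    int ran = 0;
    pending = 0; /* actions may raise softirqs again for a later pass */
    for (int nr = 0; nr < DEMO_NR_SOFTIRQS; nr++) {
        if ((snapshot & (1u << nr)) && softirq_vec[nr]) {
            softirq_vec[nr]();
            ran++;
        }
    }
    return ran;
}

/* Example action: count how many times the timer softirq ran. */
int demo_timer_runs;
void demo_timer_action(void)
{
    demo_timer_runs++;
}
```

Registering `demo_timer_action` for `TIMER_SOFTIRQ` and calling `demo_do_softirq` does nothing until `demo_raise_softirq(TIMER_SOFTIRQ)` has been called, which mirrors the "can only be executed if raised" rule.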
Tasklet
- An implementation on top of softirq (particularly HI_SOFTIRQ and TASKLET_SOFTIRQ)
- Operate on a list of tasklets that are initialized and allocated at runtime
- Tasklet can be run only on one CPU at a time
- Important functions:
- tasklet_schedule and tasklet_hi_schedule
- tasklet_init - initialize a tasklet
- tasklet_disable - disables a tasklet
- tasklet_enable - enables a tasklet
- tasklet_kill - deletes a tasklet from the queue
- Tasklet handler definition:
void tasklet_handler(unsigned long data);
Work Queues
- Defer interrupt work to a kernel thread by operating in process context
- Can handle synchronization better than softirq and tasklet (waiting on a semaphore, block I/O, etc.)
- To handle work queues, you can create your own kernel thread or use the generic worker threads already created
- The worker_thread function is used for the kernel worker thread
- Puts a thread to sleep until it is woken up to perform work
- Operates on a linked list of work_struct
Work Queues Functions
- See /include/linux/workqueue.h
- DECLARE_WORK or INIT_WORK - initialize a work item (work_struct)
- schedule_work - schedule the work item to run (self-explanatory)
- flush_scheduled_work - wait for work to be done
- Others....
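The deferral pattern can be modeled in userspace: a handler queues a work item, and a worker later drains the queue and recovers the owning structure from the embedded item. This is a minimal sketch under stated assumptions. All `demo_` names are hypothetical stand-ins for work_struct, schedule_work, and the worker thread loop, and there is no real threading or locking here.

```c
#include <assert.h>
#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified stand-in for the kernel's work_struct. */
struct demo_work {
    void (*func)(struct demo_work *work);
    struct demo_work *next;
};

static struct demo_work *queue_head, *queue_tail;

/* Queue the work item; it runs later, outside the "interrupt handler". */
void demo_schedule_work(struct demo_work *work)
{
    assert(work != NULL && work->func != NULL);
    work->next = NULL;
    if (queue_tail)
        queue_tail->next = work;
    else
        queue_head = work;
    queue_tail = work;
}

/* What a worker does when woken: drain the queue in FIFO order. */
int demo_run_pending(void)
{
    int ran = 0;
    while (queue_head) {
        struct demo_work *work = queue_head;
        queue_head = work->next;
        if (!queue_head)
            queue_tail = NULL;
        work->func(work);
        ran++;
    }
    return ran;
}

/* Driver-private data with the work item embedded inside it. */
struct demo_device {
    int events;
    struct demo_work work;
};

/* The work function only gets the work pointer; container_of finds
 * the device that owns it. */
void demo_report(struct demo_work *work)
{
    struct demo_device *dev = container_of(work, struct demo_device, work);
    dev->events++;
}
```

Nothing happens to the device at the moment `demo_schedule_work` is called. The update only lands when `demo_run_pending` runs, which is the property that lets a real top half return quickly.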
The Interrupt Handler Interface
- request_irq/free_irq - register and unregister an interrupt handler
- Flags used to register a handler
- IRQF_DISABLED - Disable all interrupts when this handler executes
- IRQF_SAMPLE_RANDOM - Use this handler as an entropy source
- IRQF_TIMER - Processes system timer interrupts
- IRQF_SHARED - Can be shared by multiple handlers
- local_irq_disable/local_irq_enable
- in_interrupt
- in_irq
- local_irq_save/local_irq_restore
- See more details in /linux/interrupt.h
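Sharing one IRQ line between several handlers can be modeled in userspace. Each handler is registered with its own dev_id cookie and reports whether its device actually raised the interrupt. This is a minimal sketch under stated assumptions. All `demo_`/`DEMO_` names are hypothetical, and the real request_irq additionally takes the IRQ number, flags such as IRQF_SHARED, and a name.

```c
#include <assert.h>
#include <stddef.h>

enum demo_irqreturn { DEMO_IRQ_NONE = 0, DEMO_IRQ_HANDLED = 1 };
typedef enum demo_irqreturn (*demo_handler_t)(int irq, void *dev_id);

#define DEMO_MAX_SHARED 4
struct demo_action { demo_handler_t handler; void *dev_id; };
static struct demo_action actions[DEMO_MAX_SHARED]; /* one shared line */
static size_t nr_actions;

/* Register a handler on the shared line; dev_id identifies it later. */
int demo_request_irq(demo_handler_t handler, void *dev_id)
{
    assert(handler != NULL && dev_id != NULL);
    if (nr_actions == DEMO_MAX_SHARED)
        return -1;
    actions[nr_actions].handler = handler;
    actions[nr_actions].dev_id = dev_id;
    nr_actions++;
    return 0;
}

/* Unregister the handler that was registered with this dev_id. */
int demo_free_irq(void *dev_id)
{
    for (size_t i = 0; i < nr_actions; i++) {
        if (actions[i].dev_id == dev_id) {
            actions[i] = actions[--nr_actions]; /* order not preserved */
            return 0;
        }
    }
    return -1;
}

/* Call every handler on the line; return how many claimed the interrupt. */
int demo_dispatch(int irq)
{
    int handled = 0;
    for (size_t i = 0; i < nr_actions; i++)
        if (actions[i].handler(irq, actions[i].dev_id) == DEMO_IRQ_HANDLED)
            handled++;
    return handled;
}

struct demo_dev { int has_data; int handled; };

/* Example handler: claim the interrupt only if our device raised it. */
enum demo_irqreturn demo_isr(int irq, void *dev_id)
{
    struct demo_dev *dev = dev_id;
    (void)irq;
    if (!dev->has_data)
        return DEMO_IRQ_NONE; /* not our device: let other handlers look */
    dev->has_data = 0;
    dev->handled++;
    return DEMO_IRQ_HANDLED;
}
```

The dev_id cookie is what makes sharing workable: it tells each handler which device to check, and it tells the unregister call which of the handlers on the line to remove.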
Working with Linux Kernel
What do you need?
- Linux Kernel Source
- Obtainable from apt
- Get it from kernel.org
Compiling the Kernel
- Why would one want to compile the kernel themselves?
- Enabling debugging Features
- Add functionality
- Change Functionality
- Building for a new architecture
- The virtual machine has a customized kernel
- We will compile it once, but not more than that because it takes a long time to do
Creating our debugging Environment
- We will be spending some time creating our debugging environment
- We will be using VMWare Workstation/Fusion to do this
- We will create a virtual serial port to communicate over
- We will also use dmesg for print statements (quickest way to debug)
- You might need to continue this at home
- Alternatively, you can use pwn.college in practice mode
Creating our debugging Environment
- The quick and dirty way of doing it - just use dmesg and printk
- The not-so-quick way - compiling a kernel and enabling kernel gdb
Kernel Modules
- The primary way for extending kernel functionality
- Allows for various different functionality within the Linux kernel
- Support a new filesystem
- Implement a device/driver
- Implement a new protocol
- New Scheduling algorithm
Kernel Modules
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/module.h>

/* Called when the module is loaded (insmod) */
static int __init start(void)
{
    printk(KERN_INFO "Hello World!\n");
    return 0;
}

/* Called when the module is removed (rmmod) */
static void __exit mod_stop(void)
{
    printk(KERN_INFO "Goodbye World\n");
}

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Michael Wittner");
MODULE_DESCRIPTION("Simple Demo.");
module_init(start);
module_exit(mod_stop);
Defines which functions are called on load/removal of a kernel module
Macros for licensing and defining init and exit
Where can you find printk messages?
Kernel Modules
# Basic Makefile for Kernel Modules - Kernel module with one C file
# The object name should match your C file (example.o is built from example.c)
obj-m := example.o
all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
# Inserting kernel modules (shell command, not part of the Makefile)
insmod example.ko optparam1="param" optparam2=2
# Removing modules
rmmod example
# If on pwn.college practice mode, do this instead
vm build /path/to/.c/file
vm start
vm connect
# Look at vm --help and vm <command> --help for more details
When a kernel module is inserted...
- The system call sys_init_module is invoked
- The code is copied into memory
- The license is checked
- The symbols used by the module are checked in the kernel symbol table (and resolved if found)
- Each symbol must be exported; exported symbols can be found in /proc/kallsyms
- The module's init function is invoked
Note: to export symbols, use macro EXPORT_SYMBOL
Character Devices
- A kernel I/O method that uses a stream of data
- All operations (reading, writing, etc.) are performed on a per-byte/character basis.
- Accessed through the Linux FS (/dev/ttyXX)
- Acts like a file (have to implement open, read, write for interaction)
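Because character devices appear as files, ordinary file APIs can identify them. This is a minimal sketch, assuming a Linux system where /dev/null exists. The function name `is_char_device` is hypothetical.

```c
#include <sys/stat.h>

/* Returns 1 if path is a character device, 0 if it is something else,
 * and -1 if stat() fails. */
int is_char_device(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return S_ISCHR(st.st_mode) ? 1 : 0;
}
```

On a typical system, `is_char_device("/dev/null")` reports a character device while a directory such as `/` does not. The analogous `S_ISBLK` macro identifies block devices.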
Block Device
- Similar to a character device, but performs operations on chunks of data
- Typically powers of two (128, 256, etc.)
- Linux allows block devices to be accessed as a stream of bytes by applications; thus, very similar to character devices
- The kernel interface must be a full block
- Accessed through /dev (e.g. /dev/sda)
- Examples: Disk drive
Network Devices
- Not accessible by the file system - provides interfaces to various networks instead
- Facilitates the transmission and reception of data packets
- Implement a backend for kernel requests for sending and receiving data
Time to build your own Kernel Module!
Homework
Kernel Internals Homework 1
For this homework, you will be creating a kernel module that implements an interrupt handler.
You will need to create an interrupt handler for an IRQ number and share it with another handler. Every time the interrupt fires, a kernel thread should be created that increases a counter by 5. After it completes, the value should be printed using a deferred work mechanism.
Things you need to keep in mind for this homework:
- How many times it is counting (make sure to remember to add necessary protections)
- Print out the current value of the counter after deferring work.
- How do you see which interrupt handlers are already taken?
ENPM809V - Kernel Internals Part 1
By Ragnar Security