Reading Large Codebases

Your Programming Experience So Far

We give you

  • A program specification (i.e. what should this program do?)
  • A test suite/test cases
  • Some skeleton code

You give us

A few hundred lines of code that pass the tests. You write almost all of this code.

In Practice

Code is written once but read many times.

A lot of programming time is spent understanding existing code and how to interface new code with it.

Example: How do I get the standard library HashMap to do (_____)?

Example: Is it possible to get this programming framework to do (________), or do I have to do it myself?

Reading (existing) code is hard

Why?

Existing code has context.

  • Some goal it's trying to accomplish
  • Things that it assumes are true
  • An implicit choice of algorithm (e.g. linear vs. binary search)
  • A decomposition of the problem into smaller pieces
  • Not all of this is evident in the code!
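As a rough sketch of context that the code itself does not show, take the linear vs. binary search choice mentioned above. The function names and data here are made up for illustration. Both functions have the same signature, but binary_search quietly assumes its input is sorted, and nothing at the call site says so.

```c
#include <stddef.h>

/* Linear search: works on any array, O(n). Returns the index or -1. */
int linear_search(const int *a, size_t n, int key) {
    for (size_t i = 0; i < n; ++i) {
        if (a[i] == key) return (int)i;
    }
    return -1;
}

/* Binary search: O(log n), but ASSUMES a[] is sorted in ascending order.
 * That assumption lives in the author's head, not in the signature. */
int binary_search(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;               /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key) return (int)mid;
        if (a[mid] < key) lo = mid + 1;  /* key can only be to the right */
        else hi = mid;                   /* key can only be to the left */
    }
    return -1;
}
```

On a sorted array the two agree. On an unsorted array such as {9, 1, 7, 3, 5}, linear_search still finds 9 at index 0, while binary_search reports that 9 is missing, with no error or warning.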

Reading code is hard

void traverse(Tree* root, Function* fptr){
  for(int i = 0; i < root->numChildren; ++i){
    traverse(root->children[i], fptr);
  }
  (*fptr)(root);
}

Software is

BIG

UTCSH is a big project

Most submissions came in between 400 and 1000 lines of non-comment, non-blank C.


Pintos is a Big project

Pintos has 11.5k lines of C (13.5k if you count headers)

 

More than 10x as much code as UTCSH!

(Figure: relative sizes of UTCSH and Pintos)

GCC is a BIG project

About 6 million lines of code (roughly 500x more than Pintos)

 

(Figure: relative sizes of UTCSH, Pintos, and GCC; not to scale)

Linux is...

28 million lines of code and counting

(Figure: relative sizes of UTCSH, Pintos, GCC, and Linux)

The techniques you can use to understand and work with a 1k LOC project will not scale to 1M LOC!

In other words, they won't scale to real software projects!

Big Ideas


  • You cannot hold every piece of the codebase in your head at once.

You cannot hold the entire codebase in your head at once


Corollary: Knowing how to find something can be almost as valuable as knowing it

Understand your system


void traverse(Tree* root, Function* fptr){
  for(int i = 0; i < root->numChildren; ++i){
    traverse(root->children[i], fptr);
  }
  (*fptr)(root);
}
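The snippet above only compiles once Tree and Function are defined somewhere else. Here is a sketch of what that missing context might look like. The struct layout, the Function typedef, and the record_node visitor are assumptions made for illustration, not definitions from a real codebase.

```c
#include <stddef.h>

/* Hypothetical definitions: the original snippet does not show these. */
typedef struct Tree {
    int value;
    int numChildren;
    struct Tree **children;   /* array of numChildren child pointers */
} Tree;

typedef void Function(Tree *node);   /* a visitor applied to each node */

/* Post-order traversal: visit every child subtree, then the node itself. */
void traverse(Tree *root, Function *fptr) {
    for (int i = 0; i < root->numChildren; ++i) {
        traverse(root->children[i], fptr);
    }
    (*fptr)(root);
}

/* Example visitor: records the order in which node values are visited. */
#define MAX_VISITS 16
static int visit_order[MAX_VISITS];
static int visit_count = 0;

void record_node(Tree *node) {
    if (visit_count < MAX_VISITS) {
        visit_order[visit_count++] = node->value;
    }
}
```

Even with the types filled in, this version still makes assumptions: root is never NULL, and children really holds numChildren valid pointers. Calling traverse(&root, record_node) on a root with value 1 and two leaf children with values 2 and 3 records the order 2, 3, 1, which tells you the traversal is post-order.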

Always have a goal (or better, a question)

A good question usually has a few key properties:

  • The answer can be written down in just a few sentences
  • Gives you some insight into the context of the system

What's in this code?

Do I need to rewrite this function?

How does this function relate to these data structures?

What does this function assume about its inputs? Are these assumptions valid?

I want to know more about struct Block

I want to know how each of struct Block's members is used in various functions

Other Good Questions

  • What are the primary data structures of this program? What do they represent?
  • I want to make a summary version of this documentation that I can use later (cheatsheet)
  • What are the main functions that I can call in order to work with <thing>?
  • Under what conditions is it safe/unsafe to do X?
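One way to record the answers to questions like "what does this function assume about its inputs?" or "when is it unsafe to do X?" is to turn each assumption into an assert once you have found it. The average function below is a made-up example, not part of any project mentioned here.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the average of n ints.
 * Assumptions, written down as checks rather than left implicit:
 *   - values is not NULL
 *   - n > 0 (otherwise we would divide by zero)
 */
double average(const int *values, size_t n) {
    assert(values != NULL);
    assert(n > 0);

    long long sum = 0;               /* wider type to reduce overflow risk */
    for (size_t i = 0; i < n; ++i) {
        sum += values[i];
    }
    return (double)sum / (double)n;
}
```

A caller who breaks one of these assumptions now gets an immediate failure at the line that names the assumption, instead of a wrong answer somewhere far away.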

Jump In

(Once you're ready)

Reading code is hard

This means it's easy to fool yourself into thinking it's much trickier than it is

Example: Linked Lists