Best practices for research code

Gregor Lenz

20.7.2022

Research code

Method development
Experiment setups
Libraries
Datasets
Paper implementations

I don't have time for this...

~~I'm the only one working on this and I know my way around this mess!~~
You are never the only one...
~~No one else is going to use my code.~~
You might make someone else's life easier.
~~I'm just exploring this thing quickly that will take 2 days MAXIMUM...~~
2 weeks later...
I've got a deadline coming up!
Fine, but clean it up afterwards

Points covered today

Reproducible environments
Clean code
Testing your code
Documenting your code

Reproducible environments

Working with Git
Set up a virtual environment
Create a project skeleton
Install project package

Git history

Commit often
Commit atomic units of work with meaningful commit messages
Makes your work trackable, so that you can
- generate reports
- write release notes
- undo changes

[Two example commit histories were shown here: one to avoid, one to prefer.]

Git history

Use .gitignore to exclude unwanted files
Avoid pushing large files ('blobs')
Every new user will download the entire history of all committed files

https://github.com/github/gitignore
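
One way to catch large files before they enter the history is to scan the working tree before committing. The sketch below is a hypothetical helper: the name `find_large_files` and the 1 MB default threshold are assumptions made for illustration, not part of Git. It only lists candidates that you may want to add to .gitignore.

```python
from pathlib import Path
from typing import List, Tuple


def find_large_files(root: str, max_bytes: int = 1_000_000) -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for files larger than max_bytes.

    Hypothetical helper for illustration; the .git folder itself is skipped.
    """
    root_path = Path(root)
    large_files = []
    for path in root_path.rglob("*"):
        relative = path.relative_to(root_path)
        # Skip Git's own object store and anything that is not a regular file
        if ".git" in relative.parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > max_bytes:
            large_files.append((relative.as_posix(), size))
    # Largest files first
    return sorted(large_files, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the project root returns the offending paths, largest first.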

Set up virtual environment

Many available: conda, pipenv, venv, virtualenv, docker, ...

For Conda:
1. Create a new environment for the project
2. Install as many packages as possible via Conda
3. Then switch to pip for the remaining packages

Export your environment.yml whenever the environment changes

Project structure

├── data              Datasets, do not include in git
├── docs              Your documentation
├── results           Plots, tables, reports
├── scripts           Notebooks, experiments
├── src               Your implementation
├── tests             Yes, you need them
├── .gitignore        Files that git should ignore
├── environment.yml   For reproducible results
├── README.md         Your project description
└── setup.py          For local development

Install project package

├── data
├── docs
├── results
├── scripts
│   └── my_experiment.py
├── src
│   └── awesome_layer.py
├── tests
├── .gitignore
├── environment.yml
└── README.md

from ..src.awesome_layer import AwesomeLayer

layer = AwesomeLayer()

$ python scripts/my_experiment.py

  Traceback:
  tests/test_layer.py:2: in <module>
      from ..src.awesome_layer import AwesomeLayer
  E   ImportError: attempted relative import with no known parent package

# Brittle workaround: a hard-coded path that only works on one machine
import sys
sys.path.append('/home/me/Documents/codebook/src')

Install project package

├── data
├── docs
├── results
├── scripts
│   └── my_experiment.py
├── src
│   ├── __init__.py
│   └── awesome_layer.py
├── tests
├── .gitignore
├── environment.yml
├── README.md
└── setup.py

# scripts/my_experiment.py
from src import AwesomeLayer

# setup.py
from setuptools import find_packages, setup

setup(
    name="src",
    packages=find_packages(),
)

# src/__init__.py
from .awesome_layer import AwesomeLayer

Install project package

$ pip install .

Installs a copy of the current package

$ pip install -e .

Installs a link to the current package (editable mode), so changes to the source take effect without reinstalling

Install project package

No brittle sys.path.append('../src') imports
No repetitive reloads needed
Your package can now be imported from anywhere within the active environment

Questions?

Clean code

Code formatting
Code smells

“Does it spark joy?”

- Marie Kondo

“Any color you like.”

Code formatting

from seven_dwwarfs import Grumpy, Happy, Sleepy, Bashful, Sneezy, Dopey, Doc
x = {  'a':37,'b':42,

'c':927}

x = 123456789.123456789E123456789

if very_long_variable_name is not None and \
 very_long_variable_name.field > 0 or \
 very_long_variable_name.is_debug:
 z = 'hello '+'world'
else:
 world = 'world'
 a = 'hello {}'.format(world)
 f = rf'hello {world}'
if (this
and that): y = 'hello ''world'#FIXME: https://github.com/psf/black/issues/26
class Foo  (     object  ):
  def f    (self   ):
    return       37*-2
  def g(self, x,y=42):
      return y
def f  (   a: List[ int ]) :
  return      37-a[42-u :  y**3]
def very_important_function(template: str,*variables,file: os.PathLike,debug:bool=False,):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
     ...

from seven_dwwarfs import Grumpy, Happy, Sleepy, Bashful, Sneezy, Dopey, Doc

x = {"a": 37, "b": 42, "c": 927}

x = 123456789.123456789e123456789

if (
    very_long_variable_name is not None
    and very_long_variable_name.field > 0
    or very_long_variable_name.is_debug
):
    z = "hello " + "world"
else:
    world = "world"
    a = "hello {}".format(world)
    f = rf"hello {world}"
if this and that:
    y = "hello " "world"  # FIXME: https://github.com/psf/black/issues/26


class Foo(object):
    def f(self):
        return 37 * -2

    def g(self, x, y=42):
        return y


def f(a: List[int]):
    return 37 - a[42 - u : y**3]


def very_important_function(
    template: str,
    *variables,
    file: os.PathLike,
    debug: bool = False,
):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
        ...

Before (first listing above)

After (second listing above, as formatted by Black)

Run Black manually
pre-commit hooks
Use built-in IDE (PyCharm etc.) support

Code formatting

$ pip install black
$ black ./my_source_folder

$ pip install pre-commit
...
$ pre-commit sample-config
...
$ pre-commit install
...
$ git add --all; git commit -m "my incremental work done"
black..............................................Passed

Code smells

Mysterious variable names
Magic numbers
Duplicated code
Contrived complexity
Uncontrolled side effects
Variable mutations
Large classes
Commented out code
Nested if / for statements

https://testdriven.io/blog/clean-code-python/
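
The last smell in the list, nested if statements, can often be flattened with early returns (guard clauses). The following is a minimal sketch; the function names and the "spikes" key are made up for illustration and do not come from any library.

```python
from typing import Dict, List, Optional


# Nested version: the actual work is buried three levels deep
def count_valid_spikes_nested(recording: Optional[Dict[str, List[int]]]) -> int:
    if recording is not None:
        if "spikes" in recording:
            if len(recording["spikes"]) > 0:
                return sum(1 for spike in recording["spikes"] if spike >= 0)
    return 0


# Guard clauses: handle the special cases first, then do the work
def count_valid_spikes(recording: Optional[Dict[str, List[int]]]) -> int:
    if recording is None:
        return 0
    if "spikes" not in recording:
        return 0
    return sum(1 for spike in recording["spikes"] if spike >= 0)
```

Both functions return the same result; the second one reads top to bottom without the reader having to track the nesting.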

Code smells

# Avoid ambiguous variable names
c = 5
d = 12

# Prefer descriptive variable names
city_counter = 5
elapsed_time_in_days = 12


# Avoid arbitrary shortening of words
def clc_mem_ptl(self, spks: torch.Tensor): ...

# Spell it out
def calculate_membrane_potential(self, input_spikes: torch.Tensor): ...

Ambiguous variable / function names

Code smells

# Avoid opaque indexing of variables
def training_step(self, batch: torch.Tensor):
  y_hat = self.network(batch[0])
  loss = criterion(y_hat, batch[1])
  return loss

# Make things explicit
def training_step(self, batch: torch.Tensor):
  input_data, targets = batch
  y_hat = self.network(input_data)
  loss = criterion(y_hat, targets)
  return loss

Magic numbers
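
Magic numbers also show up as unexplained literals in calculations. A small sketch of giving such a literal a name follows; the constant and function names are hypothetical.

```python
# Avoid: what does 86400 mean?
def elapsed_days_opaque(elapsed_seconds: float) -> float:
    return elapsed_seconds / 86400


# Prefer: a named constant documents the intent
SECONDS_PER_DAY = 24 * 60 * 60


def elapsed_days(elapsed_seconds: float) -> float:
    return elapsed_seconds / SECONDS_PER_DAY
```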

Code smells

# one function does two things
def fetch_and_display_personnel():
    data = # ...

    for person in data:
        print(person)


# Split it
def fetch_personnel():
    return # ...

def display_personnel(data):
    for person in data:
        print(person)

personnel_data = fetch_personnel()
display_personnel(personnel_data)

Strongly coupled code

Code smells

# isolated mega function
def render_blog_post(title, author, created_timestamp, updated_timestamp, content):
    ...

render_blog_post("Clean code", "Nik Tomazic", 1622148362, 1622148362, "...")


# See if you can abstract away some things into a separate class
class BlogPost:
    def __init__(self, title, author, created_timestamp, updated_timestamp, content):
        self.title = title
        self.author = author
        self.created_timestamp = created_timestamp
        self.updated_timestamp = updated_timestamp
        self.content = content

blog_post1 = BlogPost("Clean code", "Nik Tomazic", 1622148362, 1622148362, "...")

def render_blog_post(blog_post):
    ...

render_blog_post(blog_post1)

Many arguments in function
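
If the class only bundles data, the dataclasses module from the standard library can generate the constructor. This is an optional variation on the example above, not something the original slide uses, and the body of `render_blog_post` is a made-up placeholder.

```python
from dataclasses import dataclass


@dataclass
class BlogPost:
    title: str
    author: str
    created_timestamp: int
    updated_timestamp: int
    content: str


def render_blog_post(blog_post: BlogPost) -> str:
    # Hypothetical rendering, for illustration only
    return f"{blog_post.title} by {blog_post.author}\n\n{blog_post.content}"


blog_post1 = BlogPost("Clean code", "Nik Tomazic", 1622148362, 1622148362, "...")
rendered = render_blog_post(blog_post1)
```

The decorator also provides a readable `repr` and an equality check, which helps when debugging and when writing tests.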

Code smells

# unnecessary complexity
names = ["Fang", "Debra", "Pascal"]

full_names = []
for i in range(len(names)):
  name = names[i] + " Wang"
  full_names.append(name)
  
  
# instead use built-in iterators and list comprehensions
names_list = ["Fang", "Debra", "Pascal"]

full_names_list = [name + " Wang" for name in names_list]

Un-Pythonic code

Questions?

Test your code

The Fibonacci sequence is defined as:

F(x) \equiv F(x-1) + F(x-2), \quad F(0) \equiv 0, \quad F(1) \equiv 1

The first few items in the sequence are:

F = 0,1,1,2,3,5,8,13,21,...

We implement the following function:

Test your code

def fibonacci(x):
    if x <= 2:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)

├── data
├── docs
├── results
├── scripts
├── src
│   ├── __init__.py
│   └── fibonacci.py
├── tests
├── .gitignore
├── environment.yml
├── README.md
└── setup.py

>>> from src.fibonacci import fibonacci
>>> fibonacci(1)
1
>>> fibonacci(3)
2
>>> fibonacci(0)
1

Manual testing

Test your code

from src.fibonacci import fibonacci


def test_fibonacci_0():
    assert fibonacci(0) == 0

def test_fibonacci_1():
    assert fibonacci(1) == 1

def test_fibonacci_2():
    assert fibonacci(2) == 1

def test_fibonacci_6():
    assert fibonacci(6) == 8

def test_fibonacci_40():
    assert fibonacci(40) == 102334155

├── data
├── docs
├── results
├── scripts
├── src
│   ├── __init__.py
│   └── fibonacci.py
├── tests
│   └── test_fibonacci.py
├── .gitignore
├── environment.yml
├── README.md
└── setup.py

Test name must begin with test_
Use asserts to check validity

Automated unit tests

Test your code

$ pytest tests/test_fibonacci.py

...
	def test_fibonacci_0():
>       assert fibonacci(0) == 0
E       assert 1 == 0
E        +  where 1 = fibonacci(0)

tests/test_layer.py:6: AssertionError
========= short test summary info ============
FAILED tests::test_fibonacci_0 - assert 1 == 0
====== 1 failed, 4 passed in 0.09s ===========

Should not take more than a few seconds
https://docs.pytest.org

We use pytest to run all tests at once

Test your code

Tests provide peace of mind that code still does what it's supposed to do

def fibonacci(x):
    if x <= 2:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)

Test your code

Tests provide peace of mind that code still does what it's supposed to do

def fibonacci(x):
    if x == 0:
        return 0
    if x == 1:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)

$ pytest tests
======= test session starts ========
...
tests/test_fibonacci.py .....  [100%]

======= 5 passed in 0.01s ==========
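
The plain recursive implementation recomputes the same values many times, so a test for a larger input such as fibonacci(40) can dominate the run time of the test suite. One possible remedy, sketched here with functools.lru_cache from the standard library (an assumption on my part, not the approach taken in the talk), is to cache intermediate results:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci(x: int) -> int:
    # Same base cases as the corrected version above
    if x == 0:
        return 0
    if x == 1:
        return 1
    # Each value is computed once and then served from the cache
    return fibonacci(x - 1) + fibonacci(x - 2)
```

The existing tests stay unchanged, which is the point: they confirm that the faster version still behaves like the slow one.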

Test your code

from src.fibonacci import fibonacci


def test_fibonacci_0():
    assert fibonacci(0) == 0

def test_fibonacci_1():
    assert fibonacci(1) == 1

def test_fibonacci_2():
    assert fibonacci(2) == 1

def test_fibonacci_6():
    assert fibonacci(6) == 8

def test_fibonacci_8():
    assert fibonacci(8) == 21

Refactoring the tests themselves

Test your code

from src.fibonacci import fibonacci
import pytest


@pytest.mark.parametrize(
  "n, correct_output", 
  [(0, 0), (1, 1), (2, 1), (6, 8), (8, 21)]
)
def test_fibonacci_output_is_correct(n, correct_output):
    assert fibonacci(n) == correct_output

Refactoring the tests themselves

$ pytest tests
======= test session starts ========
...
tests/test_fibonacci.py .....  [100%]

======= 5 passed in 0.01s ==========

Test your code

Automated testing helps to maintain your sanity
Refactoring code becomes easier
When you find a bug -> add a test to make sure it doesn't appear again
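
As a sketch of the last point: suppose someone calls fibonacci(-1) and the recursion never terminates. After fixing the function (here by raising a ValueError, which is one possible choice and not taken from the talk), a regression test keeps the bug from coming back. The test name is hypothetical.

```python
def fibonacci(x):
    # Fix for the hypothetical bug: negative input used to recurse forever
    if x < 0:
        raise ValueError("x must be a non-negative integer")
    if x == 0:
        return 0
    if x == 1:
        return 1
    return fibonacci(x - 1) + fibonacci(x - 2)


def test_fibonacci_rejects_negative_input():
    # Regression test: fails if the guard above is ever removed
    try:
        fibonacci(-1)
    except ValueError:
        return
    raise AssertionError("fibonacci(-1) should raise ValueError")
```

The try/except keeps the snippet free of dependencies; in a pytest suite, `with pytest.raises(ValueError):` is the more idiomatic way to express the same check.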

Questions?

Document your code

The code is your documentation - don't rely on comments only
Add helpful docstrings to classes and methods
Design for readability
Code is read many more times than it's written!

Document your code

Many ways to comment your code

# synaptic currents are initialised to zero
i_syn = torch.zeros((batch_size, n_neurons))

# we add the input
i_syn = i_syn + input_data[time_step]

# decaying the synaptic currents
i_syn = i_syn * alpha_syn

Document your code

"Don’t comment bad code - rewrite it. " – Robert Martin in Clean Code

# iterate over all the lines
for l in L:
  # split the line along hyphens
  l.split("-")

for line in lines:
  line.split("-")

Document your code

class CropTime:

  def __init__(self, min=0, max=None):
    self.min = min
    self.max = max

Document your code

Type hints help the user to pass the right parameters

from typing import Optional


class CropTime:

  def __init__(self, min: int = 0, max: Optional[int] = None):
    self.min = min
    self.max = max

https://blog.logrocket.com/understanding-type-annotation-python/

Document your code

Docstrings explain what your class / method does

class CropTime:
  """Drops events with timestamps below min and above max.
  
  Parameters:
    min (int): The minimum timestamp below which all events are dropped. 
               Zero by default.
    max (int): The maximum timestamp above which all events are dropped.
    
  Example:
    >>> transform = tonic.transforms.CropTime(min=1000, max=20000)
  """

  def __init__(self, min: int = 0, max: Optional[int] = None):
    self.min = min
    self.max = max
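
An Example section written in doctest style can also be executed, so the documentation is checked like a test. The sketch below uses a hypothetical stand-in class, `CropTimestamps`, that works on a plain list of timestamps; it is not the tonic implementation, and the helper `check_docstring_example` is likewise made up for illustration.

```python
import doctest
from typing import List, Optional


class CropTimestamps:
    """Drops timestamps below min and above max.

    Hypothetical stand-in for illustration; not the tonic implementation.

    Parameters:
        min (int): Timestamps below this value are dropped. Zero by default.
        max (Optional[int]): Timestamps above this value are dropped.

    Example:
        >>> crop = CropTimestamps(min=1000, max=20000)
        >>> crop([500, 1000, 15000, 25000])
        [1000, 15000]
    """

    def __init__(self, min: int = 0, max: Optional[int] = None):
        self.min = min
        self.max = max

    def __call__(self, timestamps: List[int]) -> List[int]:
        # Keep timestamps inside the closed interval [min, max]
        return [
            t
            for t in timestamps
            if t >= self.min and (self.max is None or t <= self.max)
        ]


def check_docstring_example() -> int:
    """Run the Example section of the class docstring, return the failure count."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        CropTimestamps.__doc__,
        {"CropTimestamps": CropTimestamps},
        "CropTimestamps",
        "<docstring>",
        0,
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test).failed
```

In a project that already uses pytest, running `pytest --doctest-modules` collects such examples automatically, so a helper like the one above is not needed.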

Document your code

Docstrings explain what your class / method does

def calc(self, x, tau):
  """
  This method normalises synaptic input currents  
  by the neuron's respective membrane time constant.
  
  Parameters:
    x (torch.Tensor): the synaptic input currents
    tau (torch.Tensor): the tau_mem for each neuron.
    
  Returns:
    torch.Tensor: normalised input currents
  """
  return x * (1-tau)

def calc(self, x, tau):
  # normalise synaptic currents x by tau_mem
  return x * (1-tau)

Document your code

def normalise_i_syn_by_tau(self, i_syn: torch.Tensor, tau_mem: torch.Tensor):
    return i_syn * (1-tau_mem)

def normalise_i_syn_by_tau(self, i_syn: torch.Tensor, tau_mem: torch.Tensor):
    """
    Normalising synaptic input current by the neuron's membrane time constant
    helps when training time constants, as the amount of current injected
    over time stays the same.
    """
    return i_syn * (1-tau_mem)

Document your code

Questions?

Some resources

ArjanCodes on YouTube

Good research practices book

Things not covered

Getting good estimates
Code reviews
Issue trackers, git flow development cycle
