Best practices for research code

Gregor Lenz

20.7.2022

Research code

  • Method development
  • Experiment setups
  • Libraries
  • Datasets
  • Paper implementations

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!

     

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!
    You are never the only one...

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!
    You are never the only one...
     
  • No one else is going to use my code.
     

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!
    You are never the only one...
     
  • No one else is going to use my code.
    You might make someone else's life easier.

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!
    You are never the only one...
     
  • No one else is going to use my code.
    You might make someone else's life easier.
     
  • I'm just exploring this thing quickly that will take 2 days MAXIMUM...
     

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!
    You are never the only one...
     
  • No one else is going to use my code.
    You might make someone else's life easier.
     
  • I'm just exploring this thing quickly that will take 2 days MAXIMUM...
    2 weeks later...

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!
    You are never the only one...
     
  • No one else is going to use my code.
    You might make someone else's life easier.
     
  • I'm just exploring this thing quickly that will take 2 days MAXIMUM...
    2 weeks later...
     
  • I've got a deadline coming up!

I don't have time for this...

  • I'm the only one working on this and I know my way around this mess!
    You are never the only one...
     
  • No one else is going to use my code.
    You might make someone else's life easier.
     
  • I'm just exploring this thing quickly that will take 2 days MAXIMUM...
    2 weeks later...
     
  • I've got a deadline coming up!
    Fine, but clean it up afterwards

Points covered today

  • Reproducible environments
     
  • Clean code
     
  • Testing your code
     
  • Documenting your code

Reproducible environments

  • Working with Git
     
  • Set up a virtual environment
     
  • Create a project skeleton
     
  • Install project package

Git history

  • Commit often
  • Commit atomic work + meaningful commit message
  • Makes your work trackable to
    • generate reports
    • release notes
    • undo changes

avoid

prefer

Git history

  • Use .gitignore to exclude unwanted files
     
  • Avoid pushing large files ('blobs')
     
  • Every new user will download entire history of all commited files
     

https://github.com/github/gitignore

Set up virtual environment

Many available: conda, pipenv, venv, virtualenv, docker, ...

For Conda:
1. Create new environment for project

2. Install as many packages as possible via Conda
3. Then switch to pip



Export your environment.yml regularly
(if it changes)


 

Project structure

├── data
├── docs
├── results
├── scripts
├── src
├── tests
├── .gitignore
├── environment.yml
├── README.md
└── setup.py

please avoid

Project structure

├── data
├── docs
├── results
├── scripts
├── src
├── tests
├── .gitignore
├── environment.yml
├── README.md
└── setup.py

Datasets, do not include in git
Your documentation
Plots, tables, reports
Notebooks, experiments
Your implementation
Yes you need them
Files that git should ignore
For reproducible results
Your project description
For local development

 

Install project package

├── data
├── docs
├── results
├── scripts
    ├── my_experiment.py
├── src
    ├── awesome_layer.py
├── tests

├── .gitignore
├── environment.yml
└── README.md

from ..src.awesome_layer import AwesomeLayer

layer = AwesomeLayer()
$ python scripts/my_experiment.py

  Traceback:
  tests/test_layer.py:2: in <module>
      from ..src.awesome_layer import AwesomeLayer
  E   ImportError: attempted relative import with no known parent package
import sys
sys.path.append('/home/me/Documents/codebook/src')

Install project package

├── data
├── docs
├── results
├── scripts
    ├── my_experiment.py
├── src
    ├── awesome_layer.py
├── tests

├── .gitignore
├── environment.yml
└── README.md

from ..src.awesome_layer import AwesomeLayer

layer = AwesomeLayer()
$ python scripts/my_experiment.py

  Traceback:
  tests/test_layer.py:2: in <module>
      from ..src.awesome_layer import AwesomeLayer
  E   ImportError: attempted relative import with no known parent package
import sys
sys.path.append('/home/me/Documents/codebook/src')

Install project package

├── data
├── docs
├── results
├── scripts
    ├── my_experiment.py
├── src
    ├── __init__.py
    ├── awesome_layer.py
├── tests

├── .gitignore
├── environment.yml
├── README.md
└── setup.py

from src import AwesomeLayer
from setuptools import find_packages, setup

setup(
    name="src",
    packages=find_packages(),
)
from .awesome_layer import AwesomeLayer

Install project package

$ pip install .

Install copy of current package

$ pip install -e .

Install link to current package

Install project package

  • No brittle sys.path('../src') imports
     
  • No repetitive re-loads needed
     
  • Your package can now be used system-wide

Questions?

Clean code

  • Code formating
     
  • Code smells

 “Does it spark joy?”

- Marie Kondo

 “Any color you like.”

Code formatting

from seven_dwwarfs import Grumpy, Happy, Sleepy, Bashful, Sneezy, Dopey, Doc
x = {  'a':37,'b':42,

'c':927}

x = 123456789.123456789E123456789

if very_long_variable_name is not None and \
 very_long_variable_name.field > 0 or \
 very_long_variable_name.is_debug:
 z = 'hello '+'world'
else:
 world = 'world'
 a = 'hello {}'.format(world)
 f = rf'hello {world}'
if (this
and that): y = 'hello ''world'#FIXME: https://github.com/psf/black/issues/26
class Foo  (     object  ):
  def f    (self   ):
    return       37*-2
  def g(self, x,y=42):
      return y
def f  (   a: List[ int ]) :
  return      37-a[42-u :  y**3]
def very_important_function(template: str,*variables,file: os.PathLike,debug:bool=False,):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
     ...
from seven_dwwarfs import Grumpy, Happy, Sleepy, Bashful, Sneezy, Dopey, Doc

x = {"a": 37, "b": 42, "c": 927}

x = 123456789.123456789e123456789

if (
    very_long_variable_name is not None
    and very_long_variable_name.field > 0
    or very_long_variable_name.is_debug
):
    z = "hello " + "world"
else:
    world = "world"
    a = "hello {}".format(world)
    f = rf"hello {world}"
if this and that:
    y = "hello " "world"  # FIXME: https://github.com/psf/black/issues/26


class Foo(object):
    def f(self):
        return 37 * -2

    def g(self, x, y=42):
        return y


def f(a: List[int]):
    return 37 - a[42 - u : y**3]


def very_important_function(
    template: str,
    *variables,
    file: os.PathLike,
    debug: bool = False,
):
    """Applies `variables` to the `template` and writes to `file`."""
    with open(file, "w") as f:
        ...

Before

After

  • Run Black manually



     
  • pre-commit hooks



     


  •  
  • Use built-in IDE (PyCharm etc.) support

Code formatting

$ pip install black
$ black ./my_source_folder
$ pip install pre-commit
...
$ pre-commit sample-config
...
$ pre-commit install
...
$ git add --all; git commit -m "my incremental work done"
black..............................................Passed

Code smells

  • Mysterious variable names
  • Magic numbers
  • Duplicated code
  • Contrived complexity
  • Uncontrolled side effects
  • Variable mutations
  • Large classes
  • Commented out code
  • Nested if / for statements

https://testdriven.io/blog/clean-code-python/

Code smells

# Avoid ambiguous variable names
c = 5
d = 12

# Prefer longer variable names
city_counter = 5
elapsed_time_in_days = 12


# Avoid arbitrary shortening of words
self.clc_mem_ptl(spks: torch.Tensor)

# Spell it out
self.calculate_membrane_potential(input_spikes: torch.Tensor)

Ambiguous variable / function names

Code smells

# Avoid indexing of variables that is opaque
def training_step(self, batch: torch.Tensor):
  y_hat = self.network(batch[0])
  loss = criterion(y_hat, batch[1])
  
# Make things explicit
def training_step(self, batch: torch.Tensor):
  input_data, targets = batch
  y_hat = self.network(input_data)
  loss = criterion(y_hat, targets)

Magic numbers

Code smells

# one function does two things
def fetch_and_display_personnel():
    data = # ...

    for person in data:
        print(person)


# Split it
def fetch_personnel():
    return # ...

def display_personnel(data):
    for person in data:
        print(person)

personnel_data = fetch_personnel()
display_personnel(personell_data)

Strongly coupled code

Code smells

# isolated mega function
def render_blog_post(title, author, created_timestamp, updated_timestamp, content):
    # ...

render_blog_post("Clean code", "Nik Tomazic", 1622148362, 1622148362, "...")


# See if you can abstract away some things into a separate class
class BlogPost:
    def __init__(self, title, author, created_timestamp, updated_timestamp, content):
        self.title = title
        self.author = author
        self.created_timestamp = created_timestamp
        self.updated_timestamp = updated_timestamp
        self.content = content

blog_post1 = BlogPost("Clean code", "Nik Tomazic", 1622148362, 1622148362, "...")

def render_blog_post(blog_post):
    # ...

render_blog_post(blog_post1)

Many arguments in function

Code smells

# isolated mega function
def render_blog_post(title, author, created_timestamp, updated_timestamp, content):
    # ...

render_blog_post("Clean code", "Nik Tomazic", 1622148362, 1622148362, "...")


# See if you can abstract away some things into a separate class
class BlogPost:
    def __init__(self, title, author, created_timestamp, updated_timestamp, content):
        self.title = title
        self.author = author
        self.created_timestamp = created_timestamp
        self.updated_timestamp = updated_timestamp
        self.content = content

blog_post1 = BlogPost("Clean code", "Nik Tomazic", 1622148362, 1622148362, "...")

def render_blog_post(blog_post):
    # ...

render_blog_post(blog_post1)

Many arguments in function

Code smells

# unnecessary complexity
names = ["Fang", "Debra", "Pascal"]

full_names = []
for i in range(len(names)):
  name = names[i] + " Wang"
  full_names.append(name)
  
  
# instead use built-in iterators and list comprehensions
names_list = ["Fang", "Debra", "Pascal"]

full_names_list = [name + " Wang" for name in names_list]

Un - Pythonic code

Questions?

Test your code

The first few items in the Fibonacci sequence are:

F(x) \equiv F(x-1) + F(x - 2) \newline F(0) \equiv 0 \newline F(1) \equiv 1
F = 0,1,1,2,3,5,8,13,21,...

We implement the following functions:

Test your code

def fibonacci(x):
    if x <= 2:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)

├── data
├── docs
├── results
├── scripts
├── src
    ├── __init__.py
    ├── fibonacci.py
├── tests

├── .gitignore
├── environment.yml
├── README.md
└── setup.py

>>> from src import fibonacci
>>> fibonacci(1)
1
>>> fibonacci(3)
2
>>> fibonacci(0)
1

Manual testing

Test your code

def fibonacci(x):
    if x <= 2:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)

├── data
├── docs
├── results
├── scripts
├── src
    ├── __init__.py
    ├── fibonacci.py
├── tests

├── .gitignore
├── environment.yml
├── README.md
└── setup.py

>>> from src import fibonacci
>>> fibonacci(1)
1
>>> fibonacci(3)
2
>>> fibonacci(0)
1

Manual testing

Test your code

from src.fibonacci import fibonacci


def test_fibonacci_0():
    assert fibonacci(0) == 0

def test_fibonacci_1():
    assert fibonacci(1) == 1

def test_fibonacci_2():
    assert fibonacci(2) == 1

def test_fibonacci_6():
    assert fibonacci(6) == 8

def test_fibonacci_40():
    assert fibonacci(40) == 102334155

├── data
├── docs
├── results
├── scripts
├── src
    ├── __init__.py
    ├── fibonacci.py
├── tests
    ├── test_fibonacci.py

├── .gitignore
├── environment.yml
├── README.md
└── setup.py

  • Test name must begin with test_
  • Use asserts to check validity

Automated unit tests

Test your code

$ pytest tests/test_fibonacci.py

...
	def test_fibonacci_0():
>       assert fibonacci(0) == 0
E       assert 1 == 0
E        +  where 1 = fibonacci(0)

tests/test_layer.py:6: AssertionError
========= short test summary info ============
FAILED tests::test_fibonacci_0 - assert 1 == 0
====== 1 failed, 4 passed in 0.09s ===========

We use pytest to run all tests at once

Test your code

Tests provide peace of mind that code still does what it's supposed to do

def fibonacci(x):
    if x <= 2:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)

Test your code

Tests provide peace of mind that code still does what it's supposed to do

def fibonacci(x):
    if x == 0:
        return 0
    if x == 1:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)

Test your code

Tests provide peace of mind that code still does what it's supposed to do

def fibonacci(x):
    if x == 0:
        return 0
    if x == 1:
        return 1
    else:
        return fibonacci(x - 1) + fibonacci(x - 2)
$ pytest tests
======= test session starts ========
...
tests/test_fibonacci.py .....  [100%]

======= 5 passed in 0.01s ==========

Test your code

from src.fibonacci import fibonacci


def test_fibonacci_0():
    assert fibonacci(0) == 0

def test_fibonacci_1():
    assert fibonacci(1) == 1

def test_fibonacci_2():
    assert fibonacci(2) == 1

def test_fibonacci_6():
    assert fibonacci(6) == 8

def test_fibonacci_8():
    assert fibonacci(8) == 21

Refactoring the tests themselves

Test your code

from src.fibonacci import fibonacci
import pytest


@pytest.mark.parametrize(
  "n, correct_output", 
  [(0, 0), (1, 1), (2, 1), (6, 8), (8, 21)]
)
def test_fibonacci_output_is_correct(n, correct_output):
    assert fibonacci(n) == correct_output

Refactoring the tests themselves

$ pytest tests
======= test session starts ========
...
tests/test_fibonacci.py .....  [100%]

======= 5 passed in 0.01s ==========

Test your code

$ pytest tests
======= test session starts ========
...
tests/test_fibonacci.py .....  [100%]

======= 5 passed in 0.01s ==========
>>> from src import fibonacci
>>> fibonacci(1)
1
>>> fibonacci(3)
2
>>> fibonacci(0)
1

Manual testing

Automated test suite

Test your code

  • Automated testing helps to maintain your sanity
     
  • Refactoring code becomes easier
     
  • When you find a bug -> add a test to make sure it doesn't appear again

Questions?

Document your code

  • The code is your documentation - don't rely on comments only
     
  • Add helpful docstrings to classes and methods
     
  • Design for readability
     
  • Code is read many more times than it's written!

Document your code

  • Many ways to comment your code
# synaptic currents are initialised to zero
i_syn = torch.zeros((batch_size, n_neurons))

# we add the input
i_syn = i_syn + input_data[time_step]

# decaying the synaptic currents
i_syn = i_syn * alpha_syn

Document your code

"Don’t comment bad code - rewrite it. " –  Robert Martin in Clean Code

# iterate over all the lines
for l in L:
  # split the line along hyphens
  l.split("-")
for line in lines:
  line.split("-")

Document your code

class CropTime:

  def __init__(self, min=0, max=None):
    self.min = min
    self.max = max

Document your code

Type hints help the user to pass the right parameters

class CropTime:

  def __init__(self, min: int = 0, max: Optional[int] = None):
    self.min = min
    self.max = max

https://blog.logrocket.com/understanding-type-annotation-python/

Document your code

Docstrings explain what your class / method does

class CropTime:
  """Drops events with timestamps below min and above max.
  
  Parameters:
    min (int): The minimum timestamp below which all events are dropped. 
               Zero by default.
    max (int): The maximum timestamp above which all events are dropped.
    
  Example:
    >>> transform = tonic.transforms.CropTime(min=1000, max=20000)
  """

  def __init__(self, min: int = 0, max: Optional[int] = None):
    self.min = min
    self.max = max

Document your code

Docstrings explain what your class / method does

def calc(self, x, tau):
  """
  This method normalises synaptic input currents  
  by the neuron's respective membrane time constant.
  
  Parameters:
    x (torch.Tensor): the synaptic input currents
    tau (torch.Tensor): the tau_mem for each neuron.
    
  Returns:
    torch.Tensor: normalised input currents
  """
  return x * (1-tau)
def calc(self, x, tau):
  # normalise synaptic currents x by tau_mem
  return x * (1-tau)

Document your code

def normalise_i_syn_by_tau(self, i_syn: torch.Tensor, tau_mem: torch.Tensor):
    return i_syn * (1-tau_mem)
def normalise_i_syn_by_tau(self, i_syn: torch.Tensor, tau_mem: torch.Tensor):
    """
    Normalising synaptic input current by the neuron's membrane potential
    helps when training time constants, as the amount of current injected 
    over time is the same.
    """
    return i_syn * (1-tau_mem)

Document your code

Questions?

Some resources

Things not covered

  • Getting good estimates
     
  • Code reviews
     
  • Issue trackers, git flow development cycle

Best practices for research code

By Gregor Lenz

Best practices for research code

  • 236