Managing Data Projects Like a Software Engineer

Michael Jalkio

ODSC East

April 17th, 2020

You're a Data Scientist!

When things get hard...

  • Your coworker is out, and you have to run their code
  • You start a new job and inherit the previous team's work
  • You have to refer back to code that you wrote over a year ago
  • Someone asks you to review their code

Reproducible & Understandable

Dependency Management

Definitions

  • Dependency / Package - software published by someone else that you use in your data project, usually language specific
  • Environment - where you run your code. Includes your operating system, dependencies, and other tools
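
To make "environment" concrete, here is a minimal sketch that reports a few of the things the definition mentions. The function name `describe_environment` is made up for this example, and the virtual environment check assumes Python 3 with `venv` or a recent `virtualenv` release.

```python
import platform
import sys


def describe_environment():
    """Return a few facts that identify where this code is running."""
    return {
        "operating_system": platform.system(),
        "python_version": platform.python_version(),
        "interpreter_path": sys.executable,
        # Inside a virtual environment, sys.prefix points at the environment
        # while sys.base_prefix points at the Python it was created from.
        "in_virtual_environment": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this before and after activating an environment should show the interpreter path change, much like `which python` does in the terminal session later in the talk.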

Python

  • virtualenv
  • venv
  • pyenv
  • pipenv

R

  • Packrat
  • renv

Language Agnostic

  • conda

Getting started with virtualenv

(how all projects should start!)

mjalkio@Michaels-MacBook-Pro code % pwd
/Users/mjalkio/code
mjalkio@Michaels-MacBook-Pro code % mkdir odsc-east
mjalkio@Michaels-MacBook-Pro code % cd odsc-east 
mjalkio@Michaels-MacBook-Pro odsc-east % python --version
Python 2.7.16
mjalkio@Michaels-MacBook-Pro odsc-east % which python
/usr/bin/python
mjalkio@Michaels-MacBook-Pro odsc-east % virtualenv venv
created virtual environment CPython3.8.2.final.0-64 in 267ms
  creator CPython3Posix(dest=/Users/mjalkio/code/odsc-east/venv, clear=False, global=False)
  seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/Users/mjalkio/Library/Application Support/virtualenv/seed-app-data/v1.0.1)
  activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
mjalkio@Michaels-MacBook-Pro odsc-east % ls
venv
mjalkio@Michaels-MacBook-Pro odsc-east % source venv/bin/activate
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % python --version
Python 3.8.2
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % which python
/Users/mjalkio/code/odsc-east/venv/bin/python
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip list
Package    Version
---------- -------
pip        20.0.2 
setuptools 46.1.3 
wheel      0.34.2 
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip install numpy
Collecting numpy
  Downloading numpy-1.18.2-cp38-cp38-macosx_10_9_x86_64.whl (15.2 MB)
     |████████████████████████████████| 15.2 MB 709 kB/s 
Installing collected packages: numpy
Successfully installed numpy-1.18.2
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip list
Package    Version
---------- -------
numpy      1.18.2 
pip        20.0.2 
setuptools 46.1.3 
wheel      0.34.2 
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % deactivate
mjalkio@Michaels-MacBook-Pro odsc-east % pip list
zsh: command not found: pip
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % echo 'numpy' > requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % cat requirements.txt 
numpy
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip install -r requirements.txt 
Requirement already satisfied: numpy in ./venv/lib/python3.8/site-packages (from -r requirements.txt (line 1)) (1.18.2)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % echo 'tensorflow' >> requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % cat requirements.txt 
numpy
tensorflow
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip install -r requirements.txt 
...OUTPUT OMITTED
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip list
Package                Version    
---------------------- -----------
absl-py                0.9.0      
astunparse             1.6.3      
cachetools             4.1.0      
certifi                2020.4.5.1 
chardet                3.0.4      
gast                   0.3.3      
google-auth            1.14.0     
google-auth-oauthlib   0.4.1      
google-pasta           0.2.0      
grpcio                 1.28.1     
h5py                   2.10.0     
idna                   2.9        
Keras-Preprocessing    1.1.0      
Markdown               3.2.1      
numpy                  1.18.2     
oauthlib               3.1.0      
opt-einsum             3.2.1      
pip                    20.0.2     
protobuf               3.11.3     
pyasn1                 0.4.8      
pyasn1-modules         0.2.8      
requests               2.23.0     
requests-oauthlib      1.3.0      
rsa                    4.0        
scipy                  1.4.1      
setuptools             46.1.3     
six                    1.14.0     
tensorboard            2.2.1      
tensorboard-plugin-wit 1.6.0.post3
tensorflow             2.2.0rc3   
tensorflow-estimator   2.2.0rc0   
termcolor              1.1.0      
urllib3                1.25.8     
Werkzeug               1.0.1      
wheel                  0.34.2     
wrapt                  1.12.1     
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip freeze > frozen_requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % cat frozen_requirements.txt 
absl-py==0.9.0
astunparse==1.6.3
cachetools==4.1.0
certifi==2020.4.5.1
chardet==3.0.4
gast==0.3.3
google-auth==1.14.0
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.28.1
h5py==2.10.0
idna==2.9
Keras-Preprocessing==1.1.0
Markdown==3.2.1
numpy==1.18.2
oauthlib==3.1.0
opt-einsum==3.2.1
protobuf==3.11.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
requests==2.23.0
requests-oauthlib==1.3.0
rsa==4.0
scipy==1.4.1
six==1.14.0
tensorboard==2.2.1
tensorboard-plugin-wit==1.6.0.post3
tensorflow==2.2.0rc3
tensorflow-estimator==2.2.0rc0
termcolor==1.1.0
urllib3==1.25.8
Werkzeug==1.0.1
wrapt==1.12.1
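
The frozen file is much longer than requirements.txt because it also lists everything that numpy and tensorflow pulled in. A rough sketch of that distinction, assuming requirements.txt holds bare package names as it does here (the helper name `split_direct_and_transitive` is hypothetical):

```python
def split_direct_and_transitive(requirements, frozen):
    """Separate frozen pins into packages you asked for and packages pulled in."""
    requested = {line.strip().lower() for line in requirements if line.strip()}
    direct, transitive = [], []
    for line in frozen:
        # "numpy==1.18.2" -> "numpy"
        name = line.split("==", 1)[0].strip()
        (direct if name.lower() in requested else transitive).append(name)
    return direct, transitive


requirements = ["numpy", "tensorflow"]
# A few lines copied from frozen_requirements.txt above.
frozen = ["absl-py==0.9.0", "numpy==1.18.2", "scipy==1.4.1", "tensorflow==2.2.0rc3"]

direct, transitive = split_direct_and_transitive(requirements, frozen)
print("direct:", direct)          # direct: ['numpy', 'tensorflow']
print("transitive:", transitive)  # transitive: ['absl-py', 'scipy']
```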
Pros and Cons

Pinned Versions

  • Less likely to break
  • Good for code you won't run often or that doesn't need updates

Open Versions

  • More descriptive of what you actually plan to use
  • Encourages upgrades
  • Good for well-tested and frequently updated code

Can also consider a hybrid!
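
As an illustration of pinned, open, and hybrid requirements, the sketch below labels requirement lines by the version operators they contain. The labels and the function `classify_requirement` are invented for this example, the `numpy>=1.18,<2` range is hypothetical, and the parsing is deliberately naive (it ignores environment markers and URLs).

```python
import re

# Comparison operators that pip accepts in a requirement line.
_OPERATORS = re.compile(r"(===|==|~=|!=|<=|>=|<|>)")


def classify_requirement(line):
    """Label a requirement line as 'pinned', 'bounded', or 'open'."""
    spec = line.split("#", 1)[0].strip()  # drop trailing comments
    operators = _OPERATORS.findall(spec)
    if not operators:
        return "open"      # e.g. "numpy": any version is acceptable
    if operators == ["=="] and "*" not in spec:
        return "pinned"    # e.g. "numpy==1.18.2": exactly one version
    return "bounded"       # e.g. "numpy>=1.18,<2": a hybrid of the two


hybrid_requirements = [
    "numpy>=1.18,<2",        # hypothetical range, not from the talk
    "tensorflow==2.2.0rc3",
    "scipy",
]
for line in hybrid_requirements:
    print(f"{line:24} -> {classify_requirement(line)}")
```

A hybrid file like this one pins the packages most likely to break the project and leaves the rest free to upgrade.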

Version Control

Definitions

  • Version control - tracks changes to a set of files over time.
  • Commit - captures a set of changes to a set of files. Also known as a revision.
  • Branch - a way to track sequences of commits that may share a common root.
  • Merge - a way to apply the changes on one branch to another branch.
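
These definitions can be pictured as a small data structure. The toy model below is not how Git is implemented; it assumes each commit has at most one parent (real merge commits have two), and the two branch-tip commits are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    message: str
    parent: Optional["Commit"] = None  # the root commit has no parent


def history(tip):
    """Walk from a branch tip back to the root, newest commit first."""
    commits = []
    while tip is not None:
        commits.append(tip)
        tip = tip.parent
    return commits


def common_root(tip_a, tip_b):
    """Return the most recent commit that both branches share, if any."""
    seen = {id(c) for c in history(tip_a)}
    for commit in history(tip_b):
        if id(commit) in seen:
            return commit
    return None


# A branch is just a name pointing at its latest commit.
root = Commit("Add project requirements")
shared = Commit("Add .gitignore", parent=root)
branches = {
    "master": Commit("Pin numpy", parent=shared),            # hypothetical commit
    "experiment": Commit("Try a new model", parent=shared),  # hypothetical commit
}

base = common_root(branches["master"], branches["experiment"])
print(base.message)  # Add .gitignore
```

A merge would apply the changes made on one branch since that shared commit onto the other branch.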

Do version control for you!

(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git init
Initialized empty Git repository in /Users/mjalkio/code/odsc-east/.git/
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	frozen_requirements.txt
	requirements.txt
	venv/

nothing added to commit but untracked files present (use "git add" to track)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % echo 'venv/**' > .gitignore
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git status
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitignore
	frozen_requirements.txt
	requirements.txt

nothing added to commit but untracked files present (use "git add" to track)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git add requirements.txt frozen_requirements.txt 
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git commit -m "Add project requirements"
[master (root-commit) 9e03a0f] Add project requirements
 2 files changed, 36 insertions(+)
 create mode 100644 frozen_requirements.txt
 create mode 100644 requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.gitignore

nothing added to commit but untracked files present (use "git add" to track)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git add .gitignore 
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git commit -m "Add .gitignore"
[master 29794f1] Add .gitignore
 1 file changed, 1 insertion(+)
 create mode 100644 .gitignore

Tips for version control

  • Make small, frequent commits
  • Using a GUI can improve the experience
  • Learn the basics and embrace Google

How to Write a Git Commit Message

(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git log
commit 29794f1e4e34480b5b12350703ed1e4dbc816182 (HEAD -> master)
Author: Michael Jalkio <mjalkio@gmail.com>
Date:   Wed Apr 15 20:36:35 2020 -0700

    Add .gitignore

commit 9e03a0fac040d74a9848925eb0332c68d27a788a
Author: Michael Jalkio <mjalkio@gmail.com>
Date:   Wed Apr 15 20:33:19 2020 -0700

    Add project requirements
mjalkio@Michaels-MacBook-Pro odsc-east % git blame requirements.txt 
^9e03a0f (Michael Jalkio 2020-04-15 20:33:19 -0700 1) numpy
^9e03a0f (Michael Jalkio 2020-04-15 20:33:19 -0700 2) tensorflow

You should love these commands

  • When was a change made?
  • Why was a change made?
  • Can I look up additional context?
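
The `git blame` output answers the "when" question line by line. As a rough sketch, the default line format shown above can be split into fields with a regular expression; the pattern and the function `parse_blame_line` are assumptions made for this example and only cover the simple format from the slide.

```python
import re
from datetime import datetime

# Matches the default `git blame` line format shown above. The leading "^"
# marks a boundary (root) commit and is optional.
BLAME_LINE = re.compile(
    r"^\^?(?P<commit>[0-9a-f]+) "
    r"\((?P<author>.+?)\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\s+"
    r"(?P<line_number>\d+)\) ?"
    r"(?P<content>.*)$"
)


def parse_blame_line(line):
    """Split one line of `git blame` output into its fields."""
    match = BLAME_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized blame line: {line!r}")
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y-%m-%d %H:%M:%S %z")
    fields["line_number"] = int(fields["line_number"])
    return fields


entry = parse_blame_line("^9e03a0f (Michael Jalkio 2020-04-15 20:33:19 -0700 1) numpy")
print(entry["commit"], entry["author"], entry["content"])  # 9e03a0f Michael Jalkio numpy
```

The commit hash is the link to the "why": passing it to `git log` or `git show` brings up the commit message.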

Coding Standards

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e .`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
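
The notebook naming convention in the tree is easy to check automatically. The sketch below is one possible reading of it: the regular expression, the lowercase-only rule, and the 2 to 4 letter limit on initials are assumptions for this example, not part of the template.

```python
import re

# <number>-<initials>-<short-description>, e.g. 1.0-jqp-initial-data-exploration
NOTEBOOK_NAME = re.compile(
    r"^(?P<order>\d+(\.\d+)?)-"
    r"(?P<initials>[a-z]{2,4})-"
    r"(?P<description>[a-z0-9]+(-[a-z0-9]+)*)$"
)


def follows_convention(name):
    """Check a notebook name (without the .ipynb extension) against the pattern."""
    return NOTEBOOK_NAME.match(name) is not None


print(follows_convention("1.0-jqp-initial-data-exploration"))  # True
print(follows_convention("Untitled3"))                          # False
```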

My favorite parts of Cookiecutter Data Science

  • Easily jump into a new project
  • Encourages immutable data
  • Separates exploratory notebooks from code that produces final results
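
One way to hold yourself to immutable raw data is to record checksums and compare them later. This is only a sketch under assumptions: the functions `checksum`, `snapshot`, and `changed_files` and the file `survey.csv` are hypothetical and not part of the template.

```python
import hashlib
import tempfile
from pathlib import Path


def checksum(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def snapshot(raw_dir):
    """Map each file under raw_dir to its checksum."""
    raw_dir = Path(raw_dir)
    return {
        str(p.relative_to(raw_dir)): checksum(p)
        for p in sorted(raw_dir.rglob("*"))
        if p.is_file()
    }


def changed_files(before, after):
    """List files that were added, removed, or modified between snapshots."""
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw"
    raw.mkdir(parents=True)
    (raw / "survey.csv").write_text("id,score\n1,0.5\n")  # hypothetical file
    before = snapshot(raw)
    (raw / "survey.csv").write_text("id,score\n1,0.9\n")  # someone edits raw data
    print(changed_files(before, snapshot(raw)))            # ['survey.csv']
```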

Other standards to set

  • Use a linter for consistent coding styles (PEP 8)
  • Please establish a SQL standard for your team
  • Git commits...for your team!
  • Artifact naming
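
Standards stick better when they can be checked by a script. As an example for commit messages, the sketch below tests a subject line against three hypothetical team rules (a length limit, a capital first letter, no trailing period); your team's rules may differ.

```python
def check_commit_subject(subject, max_length=50):
    """Return a list of problems with a commit subject line (empty if none)."""
    if not subject.strip():
        return ["subject is empty"]
    problems = []
    if len(subject) > max_length:
        problems.append(f"longer than {max_length} characters")
    if not subject[0].isupper():
        problems.append("does not start with a capital letter")
    if subject.endswith("."):
        problems.append("ends with a period")
    return problems


print(check_commit_subject("Add project requirements"))  # []
print(check_commit_subject("fixed stuff."))
# ['does not start with a capital letter', 'ends with a period']
```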

Takeaways

  • Use a dependency manager so that your environment is reproducible
  • Use version control so that your project's history is traceable
  • Use coding standards to make it easy to navigate your projects

Thank you!

Want to continue the conversation? 
You can find me on LinkedIn.

Presentation for ODSC East 2020 https://odsc.com/speakers/managing-data-projects-like-a-software-engineer/
