Managing Data Projects Like a Software Engineer
Michael Jalkio
ODSC East
April 17th, 2020





You're a Data Scientist!
When things get hard...
- Your coworker is out, and you have to run their code
- You start a new job and inherit the previous team's work
- You have to refer back to code that you wrote over a year ago
- Someone asks you to review their code

Reproducible & Understandable
Dependency Management
Definitions
- Dependency / Package - software published by someone else that you use in your data project, usually language specific
- Environment - where you run your code. Includes your operating system, dependencies, and other tools
Python
- virtualenv
- venv
- pyenv
- pipenv
R
- Packrat
- renv
Language Agnostic
- conda
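To make the definition of an environment concrete, here is a minimal sketch that reports which interpreter is running and whether it lives inside a virtual environment. The function names `in_virtual_environment` and `describe_environment` are hypothetical. The check assumes the usual convention that `venv` and recent `virtualenv` releases make `sys.prefix` differ from `sys.base_prefix`, while older `virtualenv` releases set `sys.real_prefix` instead.

```python
import platform
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base_prefix


def describe_environment() -> dict:
    """Collect a few facts that identify the current environment."""
    return {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        "operating_system": platform.system(),
        "virtual_environment": in_virtual_environment(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this before and after `source venv/bin/activate` should show the executable path and the `virtual_environment` flag change, much like `which python` does in a terminal.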

Getting started with virtualenv
(how all projects should start!)
mjalkio@Michaels-MacBook-Pro code % pwd
/Users/mjalkio/code
mjalkio@Michaels-MacBook-Pro code % mkdir odsc-east
mjalkio@Michaels-MacBook-Pro code % cd odsc-east
mjalkio@Michaels-MacBook-Pro odsc-east % python --version
Python 2.7.16
mjalkio@Michaels-MacBook-Pro odsc-east % which python
/usr/bin/python
mjalkio@Michaels-MacBook-Pro odsc-east % virtualenv venv
created virtual environment CPython3.8.2.final.0-64 in 267ms
creator CPython3Posix(dest=/Users/mjalkio/code/odsc-east/venv, clear=False, global=False)
seeder FromAppData(download=False, pip=latest, setuptools=latest, wheel=latest, via=copy, app_data_dir=/Users/mjalkio/Library/Application Support/virtualenv/seed-app-data/v1.0.1)
activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
mjalkio@Michaels-MacBook-Pro odsc-east % ls
venv
mjalkio@Michaels-MacBook-Pro odsc-east % source venv/bin/activate
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % python --version
Python 3.8.2
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % which python
/Users/mjalkio/code/odsc-east/venv/bin/python
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip list
Package Version
---------- -------
pip 20.0.2
setuptools 46.1.3
wheel 0.34.2
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip install numpy
Collecting numpy
Downloading numpy-1.18.2-cp38-cp38-macosx_10_9_x86_64.whl (15.2 MB)
|████████████████████████████████| 15.2 MB 709 kB/s
Installing collected packages: numpy
Successfully installed numpy-1.18.2
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip list
Package Version
---------- -------
numpy 1.18.2
pip 20.0.2
setuptools 46.1.3
wheel 0.34.2
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % deactivate
mjalkio@Michaels-MacBook-Pro odsc-east % pip list
zsh: command not found: pip
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % echo 'numpy' > requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % cat requirements.txt
numpy
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip install -r requirements.txt
Requirement already satisfied: numpy in ./venv/lib/python3.8/site-packages (from -r requirements.txt (line 1)) (1.18.2)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % echo 'tensorflow' >> requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % cat requirements.txt
numpy
tensorflow
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip install -r requirements.txt
...OUTPUT OMITTED
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip list
Package Version
---------------------- -----------
absl-py 0.9.0
astunparse 1.6.3
cachetools 4.1.0
certifi 2020.4.5.1
chardet 3.0.4
gast 0.3.3
google-auth 1.14.0
google-auth-oauthlib 0.4.1
google-pasta 0.2.0
grpcio 1.28.1
h5py 2.10.0
idna 2.9
Keras-Preprocessing 1.1.0
Markdown 3.2.1
numpy 1.18.2
oauthlib 3.1.0
opt-einsum 3.2.1
pip 20.0.2
protobuf 3.11.3
pyasn1 0.4.8
pyasn1-modules 0.2.8
requests 2.23.0
requests-oauthlib 1.3.0
rsa 4.0
scipy 1.4.1
setuptools 46.1.3
six 1.14.0
tensorboard 2.2.1
tensorboard-plugin-wit 1.6.0.post3
tensorflow 2.2.0rc3
tensorflow-estimator 2.2.0rc0
termcolor 1.1.0
urllib3 1.25.8
Werkzeug 1.0.1
wheel 0.34.2
wrapt 1.12.1
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % pip freeze > frozen_requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % cat frozen_requirements.txt
absl-py==0.9.0
astunparse==1.6.3
cachetools==4.1.0
certifi==2020.4.5.1
chardet==3.0.4
gast==0.3.3
google-auth==1.14.0
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.28.1
h5py==2.10.0
idna==2.9
Keras-Preprocessing==1.1.0
Markdown==3.2.1
numpy==1.18.2
oauthlib==3.1.0
opt-einsum==3.2.1
protobuf==3.11.3
pyasn1==0.4.8
pyasn1-modules==0.2.8
requests==2.23.0
requests-oauthlib==1.3.0
rsa==4.0
scipy==1.4.1
six==1.14.0
tensorboard==2.2.1
tensorboard-plugin-wit==1.6.0.post3
tensorflow==2.2.0rc3
tensorflow-estimator==2.2.0rc0
termcolor==1.1.0
urllib3==1.25.8
Werkzeug==1.0.1
wrapt==1.12.1
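A frozen file only helps if the environment actually matches it. The sketch below, which assumes Python 3.8+ for `importlib.metadata`, parses plain `name==version` lines and reports packages whose installed version differs. The helper names (`parse_pinned`, `find_mismatches`, `installed_versions`) are hypothetical, and the parser deliberately ignores anything that is not a simple exact pin, such as environment markers or URLs.

```python
from importlib import metadata


def parse_pinned(text: str) -> dict:
    """Map package name -> pinned version for every `name==version` line."""
    pinned = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pinned[name.strip().lower()] = version.strip()
    return pinned


def find_mismatches(pinned: dict, installed: dict) -> dict:
    """Return {name: (pinned_version, installed_version_or_None)} for differences."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


def installed_versions(names) -> dict:
    """Look up installed versions; packages that are not installed are left out."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return versions


# Illustrative data only: the "installed" side is made up for the demo.
sample = "numpy==1.18.2\nscipy==1.4.1\n"
print(find_mismatches(parse_pinned(sample), {"numpy": "1.18.2", "scipy": "1.5.0"}))
# prints {'scipy': ('1.4.1', '1.5.0')}
```

In a real project you would pass the contents of `frozen_requirements.txt` to `parse_pinned` and compare against `installed_versions(pinned)`.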
Pros and Cons
Pinned Versions
- Less likely to break
- Good for code you won't run often or that doesn't need updates
Open Versions
- More descriptive of what you actually plan to use
- Encourages upgrades
- Good for well-tested and frequently updated code
Can also consider a hybrid!
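One possible reading of a hybrid is a requirements file that mixes exact pins, bounded ranges, and open entries. The sketch below labels each line accordingly so you can see at a glance how strict a file is. The function name `classify_requirement` is hypothetical, the version bounds in the sample are illustrative assumptions, and the regular expression only covers the standard comparison operators rather than every form pip accepts.

```python
import re

# Comparison operators that can appear in a version specifier.
_OPERATOR = re.compile(r"(===|==|~=|!=|<=|>=|<|>)")


def classify_requirement(line: str) -> str:
    """Label a requirements line as 'pinned', 'open', or 'ranged'."""
    # Drop trailing comments and environment markers before inspecting the line.
    spec = line.split("#", 1)[0].split(";", 1)[0].strip()
    operators = _OPERATOR.findall(spec)
    if not operators:
        return "open"  # bare package name, any version is accepted
    if operators == ["=="] and "*" not in spec:
        return "pinned"  # exactly one version is accepted
    return "ranged"  # bounds, wildcards, or compatible-release specifiers


# Hypothetical hybrid file: the bounds below are examples, not recommendations.
hybrid = """\
numpy>=1.18,<2.0
tensorflow==2.2.0rc3
requests
"""

for requirement in hybrid.splitlines():
    print(f"{requirement:<24} {classify_requirement(requirement)}")
```

Keeping an open or ranged `requirements.txt` next to a fully pinned `frozen_requirements.txt`, as in the walkthrough above, is another way to get some of both.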
Version Control
Definitions
- Version control - tracks changes to a set of files over time.
- Commit - captures a set of changes to a set of files. Also known as a revision.
- Branch - a way to track sequences of commits that may share a common root.
- Merge - a way to apply the changes on one branch to another branch.
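The definitions above can be pictured with a toy model: each commit points at the commit(s) it was built on, a branch is just a name that points at a commit, and a merge is a commit with two parents. This is a simplified sketch for intuition, not a description of how any particular tool stores its data. The class and function names, the `experiment` branch, and its commit message are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(eq=False)  # compare commits by identity, not by content
class Commit:
    message: str
    parents: Tuple["Commit", ...] = ()


def ancestors(commit: Commit) -> List[Commit]:
    """Return the commit and everything reachable through parent links."""
    seen: List[Commit] = []
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(current.parents)
    return seen


def common_root(a: Commit, b: Commit) -> Optional[Commit]:
    """Return the first commit in a's history that b's history also contains."""
    b_history = ancestors(b)
    for candidate in ancestors(a):
        if candidate in b_history:
            return candidate
    return None


def merge(ours: Commit, theirs: Commit, message: str) -> Commit:
    """Create a merge commit whose parents are the tips of both branches."""
    return Commit(message, parents=(ours, theirs))


# Two branches that share a common root, then a merge back into master.
root = Commit("Add project requirements")
branches = {
    "master": Commit("Add .gitignore", (root,)),
    "experiment": Commit("Try a new model", (root,)),  # hypothetical branch
}
print(common_root(branches["master"], branches["experiment"]).message)
branches["master"] = merge(branches["master"], branches["experiment"], "Merge experiment")
print(sorted(c.message for c in ancestors(branches["master"])))
```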



Do version control for you!
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git init
Initialized empty Git repository in /Users/mjalkio/code/odsc-east/.git/
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git status
On branch master
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
frozen_requirements.txt
requirements.txt
venv/
nothing added to commit but untracked files present (use "git add" to track)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % echo 'venv/**' > .gitignore
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git status
On branch master
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
frozen_requirements.txt
requirements.txt
nothing added to commit but untracked files present (use "git add" to track)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git add requirements.txt frozen_requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git commit -m "Add project requirements"
[master (root-commit) 9e03a0f] Add project requirements
2 files changed, 36 insertions(+)
create mode 100644 frozen_requirements.txt
create mode 100644 requirements.txt
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git status
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
nothing added to commit but untracked files present (use "git add" to track)
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git add .gitignore
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git commit -m "Add .gitignore"
[master 29794f1] Add .gitignore
1 file changed, 1 insertion(+)
create mode 100644 .gitignore
Tips for version control
- Make small, frequent commits
- Using a GUI can improve the experience
- Learn the basics and embrace Google
How to Write a Git Commit Message
(venv) mjalkio@Michaels-MacBook-Pro odsc-east % git log
commit 29794f1e4e34480b5b12350703ed1e4dbc816182 (HEAD -> master)
Author: Michael Jalkio <mjalkio@gmail.com>
Date: Wed Apr 15 20:36:35 2020 -0700
Add .gitignore
commit 9e03a0fac040d74a9848925eb0332c68d27a788a
Author: Michael Jalkio <mjalkio@gmail.com>
Date: Wed Apr 15 20:33:19 2020 -0700
Add project requirements
mjalkio@Michaels-MacBook-Pro odsc-east % git blame requirements.txt
^9e03a0f (Michael Jalkio 2020-04-15 20:33:19 -0700 1) numpy
^9e03a0f (Michael Jalkio 2020-04-15 20:33:19 -0700 2) tensorflow
You should love these commands
- When was a change made?
- Why was a change made?
- Can I look up additional context?
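The questions above can also be answered programmatically. The following sketch asks `git log` for tab-separated output and turns it into structured records. It assumes `git` is on your `PATH`; the names `LogEntry`, `parse_log`, and `read_history` are hypothetical, and the sample text is the log shown earlier, rewritten by hand into the tab-separated format.

```python
import subprocess
from typing import List, NamedTuple

# Tab-separated fields: full hash, author name, ISO 8601 author date, subject.
LOG_FORMAT = "%H%x09%an%x09%aI%x09%s"


class LogEntry(NamedTuple):
    commit: str
    author: str
    date: str
    subject: str


def parse_log(text: str) -> List[LogEntry]:
    """Turn tab-separated `git log` output into structured entries."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # maxsplit=3 keeps any tabs inside the subject intact
        commit, author, date, subject = line.split("\t", 3)
        entries.append(LogEntry(commit, author, date, subject))
    return entries


def read_history(path: str, repo_dir: str = ".") -> List[LogEntry]:
    """Ask Git for the commits that touched `path` (requires git on PATH)."""
    result = subprocess.run(
        ["git", "log", f"--pretty=format:{LOG_FORMAT}", "--", path],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return parse_log(result.stdout)


# Sample input standing in for real `git log` output.
sample = (
    "29794f1e4e34480b5b12350703ed1e4dbc816182\tMichael Jalkio\t"
    "2020-04-15T20:36:35-07:00\tAdd .gitignore\n"
    "9e03a0fac040d74a9848925eb0332c68d27a788a\tMichael Jalkio\t"
    "2020-04-15T20:33:19-07:00\tAdd project requirements\n"
)
for entry in parse_log(sample):
    print(entry.date, entry.subject)
```

Inside a repository, `read_history("requirements.txt")` would return the "when" (date) and the "why" (subject) for every commit that changed that file.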

Coding Standards

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e .`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org
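A project template normally generates this layout for you. Purely to illustrate the structure, the sketch below creates a subset of the directories with `pathlib`. The function name `create_skeleton` and the choice of which directories and files to include are assumptions, and this is not how any template tool works internally.

```python
from pathlib import Path

# A subset of the layout above; extend the lists to match your own template.
DIRECTORIES = [
    "data/external", "data/interim", "data/processed", "data/raw",
    "docs", "models", "notebooks", "references", "reports/figures",
    "src/data", "src/features", "src/models", "src/visualization",
]
FILES = ["README.md", "requirements.txt", "src/__init__.py"]


def create_skeleton(root) -> Path:
    """Create the directory layout under `root` without overwriting files."""
    root = Path(root)
    for directory in DIRECTORIES:
        # parents=True also creates `root` itself on the first call
        (root / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() leaves the contents of an existing file untouched
        (root / name).touch(exist_ok=True)
    return root
```

Calling `create_skeleton("my-project")` twice is safe: existing directories and file contents are preserved.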
My favorite parts of cookiecutter
- Easily jump into a new project
- Encourages immutable data
- Separates exploratory notebooks from code that produces final results
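One lightweight way to back up the "immutable data" convention is to strip write permission from everything under `data/raw` once it has been downloaded. The sketch below is a guard against accidental edits rather than a security measure, and the function name `make_read_only` is hypothetical. On Windows, `chmod` only toggles the read-only flag, so behavior there is an assumption you should verify.

```python
import stat
from pathlib import Path


def make_read_only(directory) -> int:
    """Clear the write bits on every file under `directory`; return the count."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    count = 0
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # keep the existing read/execute bits, drop only the write bits
            path.chmod(path.stat().st_mode & ~write_bits)
            count += 1
    return count
```

For example, `make_read_only("data/raw")` could run as the last step of whatever script fetches the raw data.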
Other standards to set
- Use a linter for consistent coding styles (PEP 8)
- Please establish a SQL standard for your team
- Git commits...for your team!
- Artifact naming
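A team standard is easier to follow when it can be checked automatically. As one example, the sketch below checks a commit message against a few commonly cited conventions (short, capitalized subject with no trailing period, and a blank line before the body). The function name `check_commit_message`, the exact rules, and the 50-character limit are assumptions that a team would agree on for itself.

```python
from typing import List

MAX_SUBJECT_LENGTH = 50  # assumed team limit, adjust to taste


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["subject line is empty"]
    subject = lines[0]
    problems = []
    if len(subject) > MAX_SUBJECT_LENGTH:
        problems.append(f"subject is longer than {MAX_SUBJECT_LENGTH} characters")
    if not subject[0].isupper():
        problems.append("subject should start with a capital letter")
    if subject.endswith("."):
        problems.append("subject should not end with a period")
    if len(lines) > 1 and lines[1].strip():
        problems.append("leave a blank line between subject and body")
    return problems


print(check_commit_message("Add project requirements"))  # prints []
print(check_commit_message("fix bug."))
```

A check like this could be wired into a commit hook or a code review checklist, whichever fits the team better.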
Takeaways
- Use a dependency manager so that your environment is reproducible
- Use version control so that your project's history is traceable
- Use coding standards to make it easy to navigate your projects
Thank you!
Want to continue the conversation? You can find me on LinkedIn.
Slides available at https://slides.com/mjalkio/managing-data-projects-like-a-software-engineer
Presentation for ODSC East 2020 https://odsc.com/speakers/managing-data-projects-like-a-software-engineer/