Command-Line Tools for Data Processing
MLDM 2015.3.16
Hi, I am Chen-Yu

Python, bash, SQL, MongoDB, d3.js, no Hadoop yet :)
Data Science is OSEMN
Mason, Wiggins
http://bit.ly/1lS1N8z
Obtaining data


Scrubbing data

Exploring data

Modeling data
iNterpreting data
Data Science is OSEMN

datascienceatthecommandline.com/
What is
Command-Line?



Example
Task
Count and print the 10 most common words in the file 'alice.txt'
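import re
from collections import Counter

# Read the whole book into memory.
with open("alice.txt", "r") as alice:
    text = alice.read()
# Lowercase, turn every non-letter into a space, then split into words.
words = re.sub(r"[^a-z ]", " ", text.lower()).split()
# Print the 10 most frequent words with their counts.
for word, count in Counter(words).most_common(10):
    print(count, word)

$ python word_count.py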
1818 the
940 and
809 to
690 a
631 of
610 it
553 she
545 i
481 you
462 said

cat alice.txt |
  tr -cs A-Za-z '\n' |
  tr A-Z a-z |
  sort |
  uniq -c |
  sort -nr |
  head

$ cat alice.txt
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: Alice's Adventures in Wonderland
...

$ cat alice.txt | tr -cs A-Za-z '\n'
Project
Gutenberg
s
Alice
s
Adventures
in
Wonderland
by
...

$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z
project
gutenberg
s
alice
s
adventures
in
wonderland
by
...

$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
  sort
a
a
a
a
a
a
a
a
a
...

$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
  sort | uniq -c
1
690 a
2 abide
1 able
102 about
3 above
1 absence
2 absurd
1 accept
1 acceptance
...

$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
  sort | uniq -c | sort -nr
1818 the
940 and
809 to
690 a
631 of
610 it
553 she
545 i
481 you
462 said
...

$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
  sort | uniq -c | sort -nr | head
1818 the
940 and
809 to
690 a
631 of
610 it
553 she
545 i
481 you
462 said

Why
Command-Line?
Agile
Augmenting
Scalable
Extensible
Ubiquitous
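
One way to read "Extensible" and "Augmenting": the word-count pipeline from the example is a chain of small stages, and each stage can be swapped or reimplemented in another language. Below is a rough Python sketch of that idea. The function names and the sample lines are made up for illustration, and the ordering of ties is an assumption here (alphabetical), so it may differ from what `sort -nr` prints.

```python
import re
from collections import Counter
from itertools import islice


def split_words(lines):
    # Roughly: tr -cs A-Za-z '\n' (emit each run of letters as one word)
    for line in lines:
        yield from re.findall(r"[A-Za-z]+", line)


def lowercase(words):
    # Roughly: tr A-Z a-z
    return (word.lower() for word in words)


def count_sorted(words):
    # Roughly: sort | uniq -c | sort -nr
    # Ties are broken alphabetically, which is an assumption of this sketch.
    counts = Counter(words)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


def top(pairs, n=10):
    # Roughly: head (keep only the first n entries)
    return list(islice(pairs, n))


# Hypothetical sample input standing in for alice.txt.
sample = ["The cat and the hat.", "The end."]
result = top(count_sorted(lowercase(split_words(sample))), 2)
print(result)
```

Each function takes an iterable and returns one, so stages compose the same way the `|` operator composes commands.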
Tools

$ echo Hello World!
Hello World!

$ cat alice.txt
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll
This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever. You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org
Title: Alice's Adventures in Wonderland
...

$ wc alice.txt
3735 29461 167518 alice.txt
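
The three numbers printed by `wc` are, in order, the line count, the word count, and the byte count. A minimal Python sketch of the same three counts follows. The function name `wc` and the sample text are only illustrative, and the sketch assumes UTF-8 text and whitespace-separated words.

```python
def wc(text):
    # Count bytes on the encoded text, as wc does by default.
    data = text.encode("utf-8")
    lines = data.count(b"\n")   # number of newline characters
    words = len(data.split())   # whitespace-separated tokens
    return lines, words, len(data)


# Hypothetical two-line sample, not the real alice.txt.
sample = "Alice was beginning\nto get very tired\n"
print(wc(sample))
```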
$ head -3 alice.txt
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll
This eBook is for the use of anyone anywhere at no cost and with

$ tail -3 alice.txt
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.

$ grep Alice alice.txt | head
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll
Title: Alice's Adventures in Wonderland
Alice was beginning to get very tired of sitting by her sister on the
it, 'and what is the use of a book,' thought Alice 'without pictures or
There was nothing so VERY remarkable in that; nor did Alice think it so
Alice started to her feet, for it flashed across her mind that she had
In another moment down went Alice after it, never once considering how
dipped suddenly down, so suddenly that Alice had not a moment to think
'Well!' thought Alice to herself, 'after such a fall as this, I shall
thousand miles down, I think--' (for, you see, Alice had learnt several

$ grep -c Alice alice.txt
396

$ grep Alice alice.txt | wc -l
396

$ awk

$ sed
Pattern or Stream Editor
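
For readers more at home in Python: `sed` applies an editing command such as a substitution to every line of a stream, and `awk` splits every line into fields and runs an action on the lines that match a condition. The sketch below imitates both ideas. The function names and sample lines are hypothetical, and Python's `re` syntax is not identical to the regular expressions `sed` and `awk` use.

```python
import re


def sed_substitute(lines, pattern, replacement):
    # Roughly: sed 's/pattern/replacement/g'
    for line in lines:
        yield re.sub(pattern, replacement, line)


def awk_fields(lines, condition, action):
    # Roughly: awk 'condition { action }' with whitespace-separated fields
    for line in lines:
        fields = line.split()
        if condition(fields):
            yield action(fields)


# Hypothetical input lines.
edited = list(sed_substitute(["Alice was here", "no match"], r"Alice", "ALICE"))

# Similar in spirit to: awk '$1 > 900 { print $2 }'
counts = ["1818 the", "940 and", "809 to"]
frequent = list(awk_fields(counts, lambda f: int(f[0]) > 900, lambda f: f[1]))

print(edited)
print(frequent)
```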
$ jq
JSON Filter
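
A `jq` filter is essentially a path into a JSON document. As a rough illustration in Python, assuming a made-up document and a hypothetical helper `jq_path`, following a path looks like this:

```python
import json


def jq_path(document, *keys):
    # Follow a sequence of object keys or array indexes into the document.
    value = document
    for key in keys:
        value = value[key]
    return value


# Hypothetical JSON input, invented for this sketch.
raw = '{"book": {"title": "Alice", "chapters": [{"n": 1}, {"n": 2}]}}'
doc = json.loads(raw)

# Similar in spirit to: jq '.book.title'
title = jq_path(doc, "book", "title")

# Similar in spirit to: jq '.book.chapters[].n'
numbers = [chapter["n"] for chapter in jq_path(doc, "book", "chapters")]

print(title, numbers)
```

The sketch covers only simple paths. `jq` itself also supports pipes, selection, and construction of new JSON values.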
DEMO!
CSVkit
$ csvclean
$ csvcut
$ csvformat
$ csvgrep
$ csvjoin
$ csvjson
$ csvlook
$ csvpy
$ csvsort
$ csvsql
$ csvstack
$ csvstat

Thank You!
Command-Line Tools for Data Processing
By Chen-Yu Yang