Command-Line Tools for Data Processing

MLDM 2015.3.16

Hi, I am Chen-Yu

Python, bash, SQL, MongoDB, d3.js, no Hadoop yet :)

Data Science is OSEMN

Mason, Wiggins

http://bit.ly/1lS1N8z

Obtaining data

Scrubbing data

Exploring data

Modeling data

iNterpreting data

Data Science is OSEMN

datascienceatthecommandline.com/

What is the Command Line?

Example

Task

Count and print the 10 most common words in the file 'alice.txt'

import re
from collections import Counter

with open("alice.txt", "r") as alice:
    text = alice.read()
    words = re.sub(r"[^a-z ]", " ", text.lower()).split()

for word, count in Counter(words).most_common(10):
    print(count, word)

$ python word_count.py
1818 the
940 and
809 to
690 a
631 of
610 it
553 she
545 i
481 you
462 said

cat alice.txt |
tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -nr |
head

$ cat alice.txt
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Alice's Adventures in Wonderland
...
$ cat alice.txt | tr -cs A-Za-z '\n'
Project
Gutenberg
s
Alice
s
Adventures
in
Wonderland
by

...
$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z
project
gutenberg
s
alice
s
adventures
in
wonderland
by

...
$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
 sort
a
a
a
a
a
a
a
a
a

...
$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
 sort | uniq -c
      1 
    690 a
      2 abide
      1 able
    102 about
      3 above
      1 absence
      2 absurd
      1 accept
      1 acceptance

...
$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
 sort | uniq -c | sort -nr
   1818 the
    940 and
    809 to
    690 a
    631 of
    610 it
    553 she
    545 i
    481 you
    462 said

...
$ cat alice.txt | tr -cs A-Za-z '\n' | tr A-Z a-z |
 sort | uniq -c | sort -nr | head
   1818 the
    940 and
    809 to
    690 a
    631 of
    610 it
    553 she
    545 i
    481 you
    462 said
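
The finished pipeline can be wrapped in a small shell function so that the file and the number of words become parameters. This is a minimal sketch: the function name top_words and its argument order are illustrative assumptions, not part of the original slides.

```shell
# top_words FILE [N]: print the N most common words in FILE (default 10).
# The name and argument order are hypothetical, chosen for illustration.
top_words() {
    file=$1
    n=${2:-10}
    tr -cs 'A-Za-z' '\n' < "$file" |   # squeeze every run of non-letters into one newline
        tr 'A-Z' 'a-z' |               # fold to lower case
        sort |                         # bring identical words together
        uniq -c |                      # count each run of identical words
        sort -nr |                     # largest counts first
        head -n "$n"                   # keep the top N
}
```

With this defined, `top_words alice.txt 3` would run the same steps as above and keep only the first three lines of the result.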

Why the Command Line?

Agile

Augmenting

Scalable

Extensible

Ubiquitous

Tools

$ echo Hello World!
Hello World!
$ cat alice.txt
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Alice's Adventures in Wonderland

...
$ wc alice.txt 
  3735  29461 167518 alice.txt
$ head -3 alice.txt 
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
$ tail -3 alice.txt 
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.
$ grep Alice alice.txt | head
Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll
Title: Alice's Adventures in Wonderland
Alice was beginning to get very tired of sitting by her sister on the
it, 'and what is the use of a book,' thought Alice 'without pictures or
There was nothing so VERY remarkable in that; nor did Alice think it so
Alice started to her feet, for it flashed across her mind that she had
In another moment down went Alice after it, never once considering how
dipped suddenly down, so suddenly that Alice had not a moment to think
'Well!' thought Alice to herself, 'after such a fall as this, I shall
thousand miles down, I think--' (for, you see, Alice had learnt several
$ grep -c Alice alice.txt
396
$ grep Alice alice.txt | wc -l
396
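
Both commands above count matching lines, not occurrences: a line that mentions Alice twice is still counted once. The sketch below shows the difference on a made-up three-line input rather than alice.txt. It relies on grep -o, which POSIX does not require, although GNU and BSD grep both provide it.

```shell
# Made-up sample: "Alice" appears 3 times on 2 lines.
sample() {
    printf '%s\n' 'Alice met Alice' 'the Rabbit ran' 'Alice slept'
}

# -c counts the lines that contain at least one match.
lines=$(sample | grep -c Alice)

# -o prints every match on its own line, so wc -l counts occurrences.
# tr strips the padding that some wc implementations add.
matches=$(sample | grep -o Alice | wc -l | tr -d ' ')

echo "lines=$lines matches=$matches"   # prints: lines=2 matches=3
```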
$ awk
$ sed

Pattern scanning (awk) or stream editing (sed)
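
As a rough illustration of what each tool is for: sed applies edit commands to every line of a stream, while awk splits each line into fields and runs pattern { action } rules on them. The sketch below uses a made-up two-column input shaped like the output of uniq -c. The helper name counts and the variable names are assumptions for illustration only.

```shell
# Made-up input: a count and a word per line.
counts() {
    printf '%s\n' '1818 the' '940 and' '809 to'
}

# sed: substitute on every line (replace the first space with a comma).
as_csv=$(counts | sed 's/ /,/')

# awk: run the action only on lines whose first field exceeds 900,
# and print the second field of those lines.
frequent=$(counts | awk '$1 > 900 { print $2 }')

# awk: accumulate the first field and print the total after the last line.
total=$(counts | awk '{ sum += $1 } END { print sum }')

printf '%s\n' "$as_csv" "$frequent" "$total"
```

The same pattern { action } form covers most everyday awk use: filter with the pattern, transform with the action, summarize in END.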

$ jq

JSON Filter

DEMO!

csvkit

$ csvclean
$ csvcut
$ csvformat
$ csvgrep
$ csvjoin
$ csvjson
$ csvlook
$ csvpy
$ csvsort
$ csvsql
$ csvstack
$ csvstat

Thank You!

By Chen-Yu Yang