Working with Data in the Unix shell

What is the shell?

The shell is a text based interface to the inner workings of a Unix-based operating system, such as Linux, OS X, and the BSD family, on which OS X itself is based.

Tools we're using today

wget - A HTTP client.

tabula - A PDF data extraction tool.

grep - A powerful search tool.

sed - A text transformation tool.

paste - A tool for matching corresponding lines in text files

gnuplot - A charting tool for the shell.

PIPES! - Easier to demonstrate than to explain.

 

What we hope to learn

Automating repetitive data-related tasks

Getting data out of PDFs (argh)
 

wget

wget comes installed by default on most Unix systems, and if not, is usually available quite easily from the package manager.

 

 

wget http://goo.gl/GYuyXF -O kenya-county-budgets-2013-2014.pdf

tabula

tabula is a pdf data extraction tool. It has a web based graphical interface, as well as a command line interface available as a separate download called tabula-extractor.

We're going to use both.

 

http://tabula.technology

tabula

tabula -a y1,x1,y2,x2 -p 16 kenya-county-budgets-2013-2014.pdf

tabula

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf

So now you have a really messy CSV.

So what?

¯\_(ツ)_/¯

National Revenue ,"",""
a)Equitable Share ,"5,652,000",""
b)Equalization Fund ,"240,000",""
Sub-Totals ,"5,892,000 ",98%
Own Revenue Sources ,"",""
a)Revenue from Local Sources /Subcounties ,"119,031",""
b)Others ,-,""
Sub-Totals ,"119,031 ",2%
"",-,""
Conditional Grants ,-,""
a)Loans/Borrowing ,-,""
b)Contribution in Lieu of Rates ,-,""
c)Donor Funds ,-,""
d)Road Maintenance Levy Fund ,-,""
e)LATF ,-,""
f)Level five hospital ,-,""
Sub-Totals ,- ,0%
Total Revenue ,"6,011,031 ",100%
"",-,""
EXPENDITURE ,-,""
Recurrent Expenditure ,-,""
Recurrent Expenditure ,-,""
a) County  Personnel ,"1,274,706",""
b)County Operations & Maintenance ,"1,482,636",""
Sub-Totals ,"2,757,342 ",40%
National Revenue ,-,""
a)Equitable Share ,"3,037,075",""
b)Equalization Fund ,-,""
Sub-Totals ,"3,037,075 ",100%
Own Revenue Sources ,-,""
a)Revenue from Local Sources /Subcounties ,-,""
b)Others ,-,""
Sub-Totals ,- ,0%
Conditional Grants ,-,""
a)Loans/Borrowing ,-,""
b)Contribution in Lieu of Rates ,-,""
c)Donor Funds ,-,""
d)Road Maintenance Levy Fund ,-,""
e)LATF ,-,""
f)Level five hospital ,-,""
Sub-Totals ,- ,0%
Total Revenue ,"3,037,075 ",100%
EXPENDITURE ,"",""
Recurrent Expenditure,"",""
a)County Assembly ,"362,123",""
a)County Executive ,"1,503,557",""
Sub-Totals ,"1,865,680 ",62%
Development Expenditure ,"",""

grep

whatis grep

> grep(1) - file pattern searcher

grep

whatis grep

> grep(1) - file pattern searcher

whatis whatis

> whatis(1) - search the whatis database for complete words

grep

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total'

grep

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total'


Total Revenue ,"4,443,231 ",100%
Total Revenue ,"4,607,000 ",100%
Total Revenue ,"8,702,780 ",100%
Total Revenue ,"4,429,799 ",100%
Total Revenue ,"3,518,123 ",100%
Total Revenue ,"3,246,694 ",100%
Total Revenue ,"4,847,000 ",100%
Total Revenue ,"5,304,685 ",100%
Total Revenue ,"3,000,000 ",100%
TOTALS ,"4,042,563 ",100%
Total Revenue ,"13,255,550 ",100%
Total Revenue ,"3,532,064 ",100%
Total Revenue ,"12,631,400 ",100%
Total Revenue ,"8,066,673 ",100%
Total Revenue ,"3,267,959 ",100%
Total Revenue ,"29,737,459 ",100%
Total Revenue ,"8,000,000 ",100%
Total Revenue ,"5,954,000 ",100%
Total Revenue ,"4,685,082 ",100%
Total Revenue ,"4,600,000 ",100%
TOTALS ,"2,100,000 ",100%
Total Revenue ,"7,155,566 ",100%
Total Revenues ,"4,921,202 ",100%
Total Revenue ,"7,271,243 ",100%
Total Revenue ,"  3,950,000 ",100%
Total Revenue ,"6,307,000 ",100%
TOTALS ,"4,735,280 ",100%
Total Revenue ,"12,174,416 ",100%
Total Revenue ,"5,621,869 ",10
Total Revenue ,"25,344,282 ",100%
Total Income ,"7,301,738 ",100%
Total Revenue ,"3,185,644 ",100%
Total Revenue ,"9,703,081 ",100%
Total Revenue ,"3,383,760 ",100%
Total Revenue ,"3,591,408 ",100%
Total county recurrent expenditure ,"1,691,083",""
Total Revenue ,"4,550,416 ",100%
Total Revenue ,"3,228,455 ",100%
TotaL Revenue ,"4,153,466 ",100%
Total Revenue ,"2,878,870 ",100%
TOTALS ,"3,427,097 ",100%
Total Revenue ,"2,518,590 ",100%
Total Revenue ,"4,424,513 ",100%
Total Revenue ,"8,547,834 ",100%
Total Revenue ,"5,821,338 ",100%
Total Revenue ,"3,228,000 ",100%
Total Revenue ,"6,011,031 ",100%
Total Revenue ,"3,037,075 ",100%

PIPES!

pipes

Pipes are a way to pass the output of one command as the input of another.

sed

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g'

sed

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g'

>

Total Revenue ,"4,443,231
Total Revenue ,"4,607,000
Total Revenue ,"8,702,780
Total Revenue ,"4,429,799
Total Revenue ,"3,518,123
Total Revenue ,"3,246,694
Total Revenue ,"4,847,000
Total Revenue ,"5,304,685
Total Revenue ,"3,000,000
TOTALS ,"4,042,563
Total Revenue ,"13,255,550
Total Revenue ,"3,532,064
Total Revenue ,"12,631,400
Total Revenue ,"8,066,673
Total Revenue ,"3,267,959
Total Revenue ,"29,737,459
Total Revenue ,"8,000,000
Total Revenue ,"5,954,000
Total Revenue ,"4,685,082
Total Revenue ,"4,600,000
TOTALS ,"2,100,000
Total Revenue ,"7,155,566
Total Revenues ,"4,921,202
Total Revenue ,"7,271,243
Total Revenue ,"  3,950,000
Total Revenue ,"6,307,000
TOTALS ,"4,735,280
Total Revenue ,"12,174,416
Total Revenue ,"5,621,869
Total Revenue ,"25,344,282
Total Income ,"7,301,738
Total Revenue ,"3,185,644
Total Revenue ,"9,703,081
Total Revenue ,"3,383,760
Total Revenue ,"3,591,408
Total Revenue ,"4,550,416
Total Revenue ,"3,228,455
TotaL Revenue ,"4,153,466
Total Revenue ,"2,878,870
TOTALS ,"3,427,097
Total Revenue ,"2,518,590
Total Revenue ,"4,424,513
Total Revenue ,"8,547,834
Total Revenue ,"5,821,338
Total Revenue ,"3,228,000
Total Revenue ,"6,011,031
Total Revenue ,"3,037,075

sed

whatis sed

> sed(1) - stream editor for filtering and transforming text

sed

whatis sed

> sed(1) - stream editor for filtering and transforming text

sed

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g' | sed -e 's/[^0-9]//g'

>

4443231
4607000
8702780
4429799
3518123
3246694
4847000
5304685
3000000
4042563
13255550
3532064
12631400
8066673
3267959
29737459
8000000
5954000
4685082
4600000
2100000
7155566
4921202
7271243
3950000
6307000
4735280
12174416
5621869
25344282
7301738
3185644
9703081
3383760
3591408
4550416
3228455
4153466
2878870
3427097
2518590
4424513
8547834
5821338
3228000
6011031
3037075

These numbers don't seem quite right, do they?

sed

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g' |
sed -e 's/[^0-9]//g' | sed 's/$/000/'

>

4443231000
4607000000
8702780000
4429799000
3518123000
3246694000
4847000000
5304685000
3000000000
4042563000
13255550000
3532064000
12631400000
8066673000
3267959000
29737459000
8000000000
5954000000
4685082000
4600000000
2100000000
7155566000
4921202000
7271243000
3950000000
6307000000
4735280000
12174416000
5621869000
25344282000
7301738000
3185644000
9703081000
3383760000
3591408000
4550416000
3228455000
4153466000
2878870000
3427097000
2518590000
4424513000
8547834000
5821338000
3228000000
6011031000
3037075000

wget

wget http://pastebin.com/jrRVyJta -O counties.txt

shell redirection

tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g' |
sed -e 's/[^0-9]//g' | sed 's/$/000/' > revenue.txt

Let's draw a pretty chart, shall we?

Matching our revenue data to counties

paste counties.txt revenue.txt

Let's draw a pretty chart, shall we?

paste counties.txt revenue.txt

>

baringo	4443231000
bomet	4607000000
bungoma	8702780000
busia	4429799000
elgeyo-marakwet	3518123000
embu	3246694000
garissa	4847000000
homa-bay	5304685000
isiolo	3000000000
kajiado	4042563000
kakamega	13255550000
kericho	3532064000
kiambu	12631400000
kilifi	8066673000
kirinyaga	3267959000
kisii	29737459000
kisumu	8000000000
kitui	5954000000
kwale	4685082000
laikipia	4600000000
lamu	2100000000
machakos	7155566000
makueni	4921202000
mandera	7271243000
marsabit	3950000000
meru	6307000000
migori	4735280000
mombasa	12174416000
muranga	5621869000
nairobi	25344282000
nakuru	7301738000
nandi	3185644000
narok	9703081000
nyamira	3383760000
nyandarua	3591408000
nyeri	4550416000
samburu	3228455000
siaya	4153466000
taita-taveta	2878870000
tana-river	3427097000
tharaka-nithi	2518590000
trans-nzoia	4424513000
turkana	8547834000
uasin-gishu	5821338000
vihiga	3228000000
wajir	6011031000
west-pokot	3037075000

paste

whatis paste

> paste(1) - merge corresponding or subsequent lines of files

gnuplot

whatis gnuplot

> gnuplot - an interactive plotting program

gnuplot

gnuplot

gnuplot> set term png
gnuplot> set output "revenue.png"
gnuplot> set boxwidth 0.5
gnuplot> set style fill solid
gnuplot> plot "revenue.txt" using 2:xtic(1) with boxes

Voila, charts!

Voila, charts!

Making the axis legible is left as an exercise for the reader 😊

Thank You!

Further reading

Run man <command> for any of the utilities today. It's that simple.

e.g. man sed

Credits

Pipes photo courtesy of Dave Crosby
https://www.flickr.com/photos/wikidave/4793251540/

 

Cat pie chart photo courtesy of Surviving the World

Made with Slides.com