Working with Data in the Unix shell
What is the shell?
The shell is a text based interface to the inner workings of a Unix-based operating system, such as Linux, OS X, and the BSD family, on which OS X itself is based.
Tools we're using today
wget - A HTTP client.
tabula - A PDF data extraction tool.
grep - A powerful search tool.
sed - A text transformation tool.
paste - A tool for matching corresponding lines in text files
gnuplot - A charting tool for the shell.
PIPES! - Easier to demonstrate than to explain.
What we hope to learn
Automating repetitive data-related tasks
Getting data out of PDFs (argh)
wget
wget comes installed by default on most Unix systems, and if not, is usually available quite easily from the package manager.
wget http://goo.gl/GYuyXF -O kenya-county-budgets-2013-2014.pdf
tabula
tabula is a pdf data extraction tool. It has a web based graphical interface, as well as a command line interface available as a separate download called tabula-extractor.
We're going to use both.
tabula
tabula -a y1,x1,y2,x2 -p 16 kenya-county-budgets-2013-2014.pdf
tabula
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf
So now you have a really messy CSV.
So what?
¯\_(ツ)_/¯
National Revenue ,"",""
a)Equitable Share ,"5,652,000",""
b)Equalization Fund ,"240,000",""
Sub-Totals ,"5,892,000 ",98%
Own Revenue Sources ,"",""
a)Revenue from Local Sources /Subcounties ,"119,031",""
b)Others ,-,""
Sub-Totals ,"119,031 ",2%
"",-,""
Conditional Grants ,-,""
a)Loans/Borrowing ,-,""
b)Contribution in Lieu of Rates ,-,""
c)Donor Funds ,-,""
d)Road Maintenance Levy Fund ,-,""
e)LATF ,-,""
f)Level five hospital ,-,""
Sub-Totals ,- ,0%
Total Revenue ,"6,011,031 ",100%
"",-,""
EXPENDITURE ,-,""
Recurrent Expenditure ,-,""
Recurrent Expenditure ,-,""
a) County Personnel ,"1,274,706",""
b)County Operations & Maintenance ,"1,482,636",""
Sub-Totals ,"2,757,342 ",40%
National Revenue ,-,""
a)Equitable Share ,"3,037,075",""
b)Equalization Fund ,-,""
Sub-Totals ,"3,037,075 ",100%
Own Revenue Sources ,-,""
a)Revenue from Local Sources /Subcounties ,-,""
b)Others ,-,""
Sub-Totals ,- ,0%
Conditional Grants ,-,""
a)Loans/Borrowing ,-,""
b)Contribution in Lieu of Rates ,-,""
c)Donor Funds ,-,""
d)Road Maintenance Levy Fund ,-,""
e)LATF ,-,""
f)Level five hospital ,-,""
Sub-Totals ,- ,0%
Total Revenue ,"3,037,075 ",100%
EXPENDITURE ,"",""
Recurrent Expenditure,"",""
a)County Assembly ,"362,123",""
a)County Executive ,"1,503,557",""
Sub-Totals ,"1,865,680 ",62%
Development Expenditure ,"",""
grep
whatis grep
> grep(1) - file pattern searcher
grep
whatis grep
> grep(1) - file pattern searcher
whatis whatis
> whatis(1) - search the whatis database for complete words
grep
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total'
grep
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total'
Total Revenue ,"4,443,231 ",100%
Total Revenue ,"4,607,000 ",100%
Total Revenue ,"8,702,780 ",100%
Total Revenue ,"4,429,799 ",100%
Total Revenue ,"3,518,123 ",100%
Total Revenue ,"3,246,694 ",100%
Total Revenue ,"4,847,000 ",100%
Total Revenue ,"5,304,685 ",100%
Total Revenue ,"3,000,000 ",100%
TOTALS ,"4,042,563 ",100%
Total Revenue ,"13,255,550 ",100%
Total Revenue ,"3,532,064 ",100%
Total Revenue ,"12,631,400 ",100%
Total Revenue ,"8,066,673 ",100%
Total Revenue ,"3,267,959 ",100%
Total Revenue ,"29,737,459 ",100%
Total Revenue ,"8,000,000 ",100%
Total Revenue ,"5,954,000 ",100%
Total Revenue ,"4,685,082 ",100%
Total Revenue ,"4,600,000 ",100%
TOTALS ,"2,100,000 ",100%
Total Revenue ,"7,155,566 ",100%
Total Revenues ,"4,921,202 ",100%
Total Revenue ,"7,271,243 ",100%
Total Revenue ," 3,950,000 ",100%
Total Revenue ,"6,307,000 ",100%
TOTALS ,"4,735,280 ",100%
Total Revenue ,"12,174,416 ",100%
Total Revenue ,"5,621,869 ",10
Total Revenue ,"25,344,282 ",100%
Total Income ,"7,301,738 ",100%
Total Revenue ,"3,185,644 ",100%
Total Revenue ,"9,703,081 ",100%
Total Revenue ,"3,383,760 ",100%
Total Revenue ,"3,591,408 ",100%
Total county recurrent expenditure ,"1,691,083",""
Total Revenue ,"4,550,416 ",100%
Total Revenue ,"3,228,455 ",100%
TotaL Revenue ,"4,153,466 ",100%
Total Revenue ,"2,878,870 ",100%
TOTALS ,"3,427,097 ",100%
Total Revenue ,"2,518,590 ",100%
Total Revenue ,"4,424,513 ",100%
Total Revenue ,"8,547,834 ",100%
Total Revenue ,"5,821,338 ",100%
Total Revenue ,"3,228,000 ",100%
Total Revenue ,"6,011,031 ",100%
Total Revenue ,"3,037,075 ",100%
PIPES!
pipes
Pipes are a way to pass the output of one command as the input of another.
sed
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g'
sed
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g'
>
Total Revenue ,"4,443,231
Total Revenue ,"4,607,000
Total Revenue ,"8,702,780
Total Revenue ,"4,429,799
Total Revenue ,"3,518,123
Total Revenue ,"3,246,694
Total Revenue ,"4,847,000
Total Revenue ,"5,304,685
Total Revenue ,"3,000,000
TOTALS ,"4,042,563
Total Revenue ,"13,255,550
Total Revenue ,"3,532,064
Total Revenue ,"12,631,400
Total Revenue ,"8,066,673
Total Revenue ,"3,267,959
Total Revenue ,"29,737,459
Total Revenue ,"8,000,000
Total Revenue ,"5,954,000
Total Revenue ,"4,685,082
Total Revenue ,"4,600,000
TOTALS ,"2,100,000
Total Revenue ,"7,155,566
Total Revenues ,"4,921,202
Total Revenue ,"7,271,243
Total Revenue ," 3,950,000
Total Revenue ,"6,307,000
TOTALS ,"4,735,280
Total Revenue ,"12,174,416
Total Revenue ,"5,621,869
Total Revenue ,"25,344,282
Total Income ,"7,301,738
Total Revenue ,"3,185,644
Total Revenue ,"9,703,081
Total Revenue ,"3,383,760
Total Revenue ,"3,591,408
Total Revenue ,"4,550,416
Total Revenue ,"3,228,455
TotaL Revenue ,"4,153,466
Total Revenue ,"2,878,870
TOTALS ,"3,427,097
Total Revenue ,"2,518,590
Total Revenue ,"4,424,513
Total Revenue ,"8,547,834
Total Revenue ,"5,821,338
Total Revenue ,"3,228,000
Total Revenue ,"6,011,031
Total Revenue ,"3,037,075
sed
whatis sed
> sed(1) - stream editor for filtering and transforming text
sed
whatis sed
> sed(1) - stream editor for filtering and transforming text
sed
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g' | sed -e 's/[^0-9]//g'
>
4443231
4607000
8702780
4429799
3518123
3246694
4847000
5304685
3000000
4042563
13255550
3532064
12631400
8066673
3267959
29737459
8000000
5954000
4685082
4600000
2100000
7155566
4921202
7271243
3950000
6307000
4735280
12174416
5621869
25344282
7301738
3185644
9703081
3383760
3591408
4550416
3228455
4153466
2878870
3427097
2518590
4424513
8547834
5821338
3228000
6011031
3037075
These numbers don't seem quite right, do they?
sed
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g' |
sed -e 's/[^0-9]//g' | sed 's/$/000/'
>
4443231000
4607000000
8702780000
4429799000
3518123000
3246694000
4847000000
5304685000
3000000000
4042563000
13255550000
3532064000
12631400000
8066673000
3267959000
29737459000
8000000000
5954000000
4685082000
4600000000
2100000000
7155566000
4921202000
7271243000
3950000000
6307000000
4735280000
12174416000
5621869000
25344282000
7301738000
3185644000
9703081000
3383760000
3591408000
4550416000
3228455000
4153466000
2878870000
3427097000
2518590000
4424513000
8547834000
5821338000
3228000000
6011031000
3037075000
wget
wget http://pastebin.com/jrRVyJta -O counties.txt
shell redirection
tabula -a 130,71,410,524 -p 16-62 kenya-county-budgets-2013-2014.pdf -i |
grep -i -E '^["]*total' | sed -e 's/ ".*$//g' |
sed -e 's/[^0-9]//g' | sed 's/$/000/' > revenue.txt
Let's draw a pretty chart, shall we?
Matching our revenue data to counties
paste counties.txt revenue.txt
Let's draw a pretty chart, shall we?
paste counties.txt revenue.txt
>
baringo 4443231000
bomet 4607000000
bungoma 8702780000
busia 4429799000
elgeyo-marakwet 3518123000
embu 3246694000
garissa 4847000000
homa-bay 5304685000
isiolo 3000000000
kajiado 4042563000
kakamega 13255550000
kericho 3532064000
kiambu 12631400000
kilifi 8066673000
kirinyaga 3267959000
kisii 29737459000
kisumu 8000000000
kitui 5954000000
kwale 4685082000
laikipia 4600000000
lamu 2100000000
machakos 7155566000
makueni 4921202000
mandera 7271243000
marsabit 3950000000
meru 6307000000
migori 4735280000
mombasa 12174416000
muranga 5621869000
nairobi 25344282000
nakuru 7301738000
nandi 3185644000
narok 9703081000
nyamira 3383760000
nyandarua 3591408000
nyeri 4550416000
samburu 3228455000
siaya 4153466000
taita-taveta 2878870000
tana-river 3427097000
tharaka-nithi 2518590000
trans-nzoia 4424513000
turkana 8547834000
uasin-gishu 5821338000
vihiga 3228000000
wajir 6011031000
west-pokot 3037075000
paste
whatis paste
> paste(1) - merge corresponding or subsequent lines of files
gnuplot
whatis gnuplot
> gnuplot - an interactive plotting program
gnuplot
gnuplot
gnuplot> set term png
gnuplot> set output "revenue.png"
gnuplot> set boxwidth 0.5
gnuplot> set style fill solid
gnuplot> plot "revenue.txt" using 2:xtic(1) with boxes
Voila, charts!
Voila, charts!
Making the axis legible is left as an exercise for the reader 😊
Thank You!
Further reading
Run man <command> for any of the utilities today. It's that simple.
e.g. man sed
Credits
Pipes photo courtesy of Dave Crosby
https://www.flickr.com/photos/wikidave/4793251540/
Cat pie chart photo courtesy of Surviving the World
Working with Data in the Unix Shell
By Okal Otieno
Working with Data in the Unix Shell
- 3,118