UNLEASH the POWER of UNIX

Pipes and great little Unix Tools

a 5-minute talk




Johannes Boyne @zweitag about.me/johannesboyne | github.com/johannesboyne |  twitter.com/johannesboyne

UNIX PHILOSOPHY

  • Small is beautiful
  • Make each program do one thing well
  • Expect the output of every program to become
    the input to another
  • Make every program a filter

This is the Unix philosophy:
Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
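The philosophy in one line of shell (sample input invented here for illustration): small filters, each doing one thing, composed with pipes.

```shell
# Count distinct words in a text stream: split, sort, dedupe, count.
printf 'foo bar foo\n' | tr ' ' '\n' | sort | uniq | wc -l
# -> 2 distinct words
```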

UNIX § Rules §

Modularity
Clarity
Composition
Simplicity
Robustness
Silence

UNIX | Pipes |


UNIX  Tools

ls | cat | grep | find | wc | awk | sed | tr | sort | uniq

Examples

 $ cat svn.log | egrep 'JohannesBoyne' | awk '{print $5}' | sort | uniq | wc -l
 $ curl https://us-east.manta.joyent.com/manta/public/examples/shakespeare/ | egrep -o '"name":.*?[^\\]",' | sed 's/"name":/  /' | awk '{print $1}' | tr -d '"|,'
 $ find . -name "*.txt" -ls | awk '{total += $7} END {print total/1000000 " MB"}'
 $ cat *.txt | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | awk '{ x[$2] += $1 } END { for (w in x) { print x[w] " " w } }' | sort -rn | head -10

$ grep -i -c ruby *.txt | sed 's/.*:/ /' | awk '{total += $1} END {print total}'
$ grep -i -c -h 'john' *.txt | awk '{total += $1} END {print total}'  # shorter, not faster
 $ find *.txt | while read fn; do ./indexone.sh "$fn" ; done | awk -f indexmerge.awk | sort | head -10
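One caveat worth a slide note: `uniq` only collapses *adjacent* duplicate lines, which is why these pipelines put `sort` in front of it. A quick demonstration (sample input invented here):

```shell
# uniq alone misses the second "a" because it is not adjacent:
printf 'a\nb\na\n' | uniq | wc -l         # 3 lines survive
# sort first, so duplicates become adjacent:
printf 'a\nb\na\n' | sort | uniq | wc -l  # 2 lines survive
```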

Disclaimer: Many of the examples are borrowed from www.joyent.com/products/manta/job-examples


Unix vs. Hadoop

(careful: this is not science!)
 cat *.txt | tr '[:upper:]' '[:lower:]' | tr -cs a-z '\n' | sort | uniq -c | awk '{ x[$2] += $1 } END { for (w in x) { print x[w] " " w } }' | sort -rn | head -10
vs.
// source from: http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Example%3A+WordCount+v1.0
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));

      JobClient.runJob(conf);
    }
}


Why should you care?


You have a lot of power under your hood!

or...
How about a system like Hadoop built from these great Unix tools?
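A rough sketch of that idea, not from the slides: a "map" phase that counts words per file in parallel, merged by a single "reduce" step. It assumes GNU or BSD `xargs` with the non-POSIX `-P` flag and filenames without spaces.

```shell
# Poor man's MapReduce over *.txt:
#   map:    per-file word counts, 4 files at a time in parallel
#   reduce: awk merges the partial counts, sort/head picks the top 10
ls *.txt \
  | xargs -P 4 -I{} sh -c "tr -cs 'A-Za-z' '\n' < {} | tr 'A-Z' 'a-z' | sort | uniq -c" \
  | awk '{ x[$2] += $1 } END { for (w in x) print x[w], w }' \
  | sort -rn | head -10
```

Interleaved output from the parallel workers is fine here, because the awk merge step only cares about individual lines, not their order.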

thanks