The Hitchhiker's Guide to the Web

(Or, If Sisyphus Had Broadband)

Kienan Knight-Boehm

RCOS // 10.10.14 // kienankb.com

You can
Download
Wikipedia.

You might not want to,

but believe me,

you can.

The BZIP2 file

is 10.6GB total.

The single XML file--

~14.75 million articles*--

is 46GB.

*14,753,874 exactly.

Notepad++

Sublime

Atom?

...MS Notepad?

...oh that's adorable.

How do you explore a file...

...when you can't open the file?

Break it down.
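
A minimal sketch of one way to look inside a file no editor will open: stream the compressed dump and read only the first few lines. The function name `peek` and the default of 20 lines are illustrative assumptions, not part of the original project.

```python
import bz2
from itertools import islice

def peek(path, n=20):
    """Return the first n lines of a bzip2-compressed text file.

    The file is decompressed as a stream, so only the lines actually
    read are ever held in memory.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as f:
        return [line.rstrip("\n") for line in islice(f, n)]
```

Calling `peek` on the downloaded `.bz2` path should be enough to reveal the page/revision layout without unpacking all 46GB.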

<mediawiki xml:lang="en">
    <page>
      <title>Page title</title>
      <restrictions>edit=sysop:move=sysop</restrictions>
      <revision>
        <timestamp>2001-01-15T13:15:00Z</timestamp>
        <contributor><username>Foobar</username></contributor>
        <comment>I have just one thing to say!</comment>
        <text>A bunch of [[text]] here.</text>
        <minor />
      </revision>
      <revision>
        <timestamp>2001-01-15T13:10:27Z</timestamp>
        <contributor><ip>10.0.0.2</ip></contributor>
        <comment>new!</comment>
        <text>An earlier [[revision]].</text>
      </revision>
    </page>
</mediawiki>

Example

Structure

  • 1 file/article
  • [_].xml
  • [_ _ _].xml
  • Froze my laptop
  • Massive files
  • Slowdown over time (?)
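
Reading `[_].xml` and `[_ _ _].xml` as "one file per first character" and "one file per first three characters" (an interpretation of the slide, not something it states), the bucketing could be sketched like this. The helper `bucket_name` and its padding rule are hypothetical.

```python
def bucket_name(title, width=3):
    """Map an article title to a bucket file name built from its first `width` characters."""
    # Assumption: ASCII letters and digits are kept (uppercased);
    # anything else becomes "_" so the name stays filesystem-friendly.
    cleaned = "".join(
        c.upper() if c.isascii() and c.isalnum() else "_"
        for c in title[:width]
    )
    # Short titles are padded so every bucket name has the same length.
    return cleaned.ljust(width, "_") + ".xml"
```

With `width=1` every article lands in one of a few dozen very large files; with `width=3` the files are smaller but far more numerous.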

Parsing/Splitting!

Python XML parsing? Nah.

HOMEBREWED.

(But, seriously, screw DOS artifacts.)

(Just screw 'em.)
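
What a homebrewed, line-based splitter might look like, as a rough sketch rather than the 36 lines actually used. It assumes `<page>`, `</page>`, and `<title>` each sit on their own line, as in the example structure; `iter_pages` is a made-up name.

```python
def iter_pages(lines):
    """Yield (title, page_lines) for each <page>...</page> block in an iterable of lines."""
    title, buf, inside = None, [], False
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            # Start a fresh buffer for the new article.
            title, buf, inside = None, [], True
        if inside:
            buf.append(line)
            if (title is None
                    and stripped.startswith("<title>")
                    and stripped.endswith("</title>")):
                title = stripped[len("<title>"):-len("</title>")]
            if stripped == "</page>":
                yield title, buf
                inside = False
```

Because it is a generator over lines, it can be fed directly from a streamed file and never needs the whole dump in memory.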

Pass 1: Python

  • 36 lines of code
  • Properly parsed & redistributed data
  • Took ~20s for 1000 articles

ETA: 82 hours
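
Where the 82 hours comes from, assuming the ~20s per 1000 articles holds steady across all 14,753,874 articles:

```python
ARTICLES = 14_753_874      # article count from the dump
SECONDS_PER_1000 = 20      # measured rate from Pass 1

def eta_hours(articles, seconds_per_1000):
    """Extrapolate total run time in hours from a per-1000-articles timing."""
    return articles / 1000 * seconds_per_1000 / 3600

print(round(eta_hours(ARTICLES, SECONDS_PER_1000)))  # prints 82
```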

Even more oh god

DOS Artifacts

  • CON
  • PRN
  • AUX
  • NUL
  • COMx
  • LPTx

Pass 2: C++

  • Negligible speed difference (post-refactor)
  • Functionally the same as Python
  • Made me happier

ETA: ?

Metadata Strip

789,577,286 lines

658,465,392 lines
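
That is 131,111,894 fewer lines, roughly a 16.6% cut. The slides don't say which fields were stripped, so the tag list below is purely an assumption drawn from the example structure, and the sketch only handles elements that fit on a single line.

```python
# Assumption: these are the "metadata" elements; the real strip may have differed.
DROP_PREFIXES = ("<timestamp>", "<contributor>", "<comment>", "<minor", "<restrictions>")

def strip_metadata(lines):
    """Yield only the lines that do not start with one of the dropped tags."""
    for line in lines:
        if not line.lstrip().startswith(DROP_PREFIXES):
            yield line
```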

Pass 3: C++ again

  • Rewritten to account for simplified data
  • Felt like a cleaner solution
  • Still C++, so I was still happy

ETA:

I did nothing to deserve this

Literally nothing

Rendering/Display

Demo (Java)

Hardware

WHY?

always know where your towel is

Thanks to:

Sean O'Sullivan

Major Moorthy and Super G

RCOS

Special thanks: DOS artifact guy

Hitchhiking Wikipedia

By Kienan Knight-Boehm

"Wait--but why would you download Wikipedia?" "No really, but why?"