Rory Powell - 40081529
7th December 2015
Project Supervisor: Professor Weiru Liu
Technical Advisor: Ryan McConville
What is it?
Why?
How?
The data set (a collection of hexadecimal files) is parsed into a usable format using bitcoinj
A Java library for working with Bitcoin
As the data set is being parsed, the inputs and outputs of each transaction are extracted and written to disk in the form of a CSV file
Finally, the CSV file is loaded into Gephi and manipulated as desired to highlight the communities present
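The extraction step above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it does not call bitcoinj, the `Tx` class is a hypothetical stand-in for a parsed transaction, and the `source,target` column order and the `coinbase` placeholder label are assumptions. It writes one line per pair of input and output addresses, one transaction at a time, so the full data set never has to sit in memory.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class EdgeListSketch {
    // Hypothetical stand-in for a parsed transaction: sender and receiver addresses only.
    static final class Tx {
        final List<String> inputs;
        final List<String> outputs;

        Tx(List<String> inputs, List<String> outputs) {
            this.inputs = inputs;
            this.outputs = outputs;
        }
    }

    // Writes one "source,target" line per (input address, output address) pair
    // and returns the number of lines written.
    // A transaction with no input addresses is treated as a coinbase: it is either
    // skipped or written with the assumed placeholder source "coinbase".
    static int writeEdges(Iterable<Tx> txs, PrintWriter out, boolean includeCoinbase) {
        int lines = 0;
        for (Tx tx : txs) {
            List<String> sources = tx.inputs;
            if (sources.isEmpty()) {
                if (!includeCoinbase) {
                    continue;
                }
                sources = Collections.singletonList("coinbase");
            }
            for (String from : sources) {
                for (String to : tx.outputs) {
                    out.print(from + "," + to + "\n");
                    lines++;
                }
            }
        }
        out.flush();
        return lines;
    }

    public static void main(String[] args) {
        List<Tx> txs = Arrays.asList(
            new Tx(Collections.<String>emptyList(), Arrays.asList("addrA")),
            new Tx(Arrays.asList("addrA"), Arrays.asList("addrB", "addrC")));
        StringWriter csv = new StringWriter();
        writeEdges(txs, new PrintWriter(csv), true);
        System.out.print(csv);
    }
}
```

Passing `false` for `includeCoinbase` drops the coinbase lines, which would correspond to the smaller of the two line counts reported per file.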
Obtaining a data set is not as straightforward as it seems. I had to use a Bitcoin wallet that pulled down five years of history over a peer-to-peer network
As it turns out, that history is huge. Parsing the whole data set into memory, as I had originally intended, proved impossible even after several attempted workarounds
Extracting the addresses from the parsed data on the fly was much more suitable. However, this was a challenge in itself, as the API is a work in progress and does not handle auto-generated (coinbase) coins very well
Bitcoin addresses are not like bank accounts: any one person can have any number of addresses at a given time
In fact, as a general guideline, users are encouraged to use a new address for every transaction
This does not bode well for community detection: if everyone uses a new address every time, there are no communities
As a first attempt, the results have been promising. However, there is a clear limit on the load that Gephi can handle
A single .dat file (out of a few hundred I have available) produced 503,794 lines of relational indicators. So far I have been able to make Gephi work reliably with only up to 7,000 lines; the application becomes unusable at higher counts
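One simple way to produce a subset small enough for Gephi is to copy only the first few thousand lines of the CSV. The sketch below shows that idea; it is an assumption about how such a subset could be cut and not a description of how the demonstration data was prepared. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

class CsvSubsetSketch {
    // Copies at most maxLines lines from in to out and returns how many were copied.
    // Lines are streamed one at a time, so the input can be larger than memory.
    static long copyFirstLines(BufferedReader in, PrintWriter out, long maxLines) {
        long[] copied = {0};
        in.lines().limit(maxLines).forEach(line -> {
            out.print(line + "\n");
            copied[0]++;
        });
        out.flush();
        return copied[0];
    }

    public static void main(String[] args) {
        // In-memory stand-in for a CSV file; a real run would wrap file readers and writers.
        BufferedReader in = new BufferedReader(new StringReader("a,b\nb,c\nc,d\n"));
        StringWriter subset = new StringWriter();
        copyFirstLines(in, new PrintWriter(subset), 2);
        System.out.print(subset);
    }
}
```

Taking the first lines keeps only the earliest transactions in the file, so any communities found in such a subset may not be representative of the whole file.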
Some Statistics
.dat file | CSV line count | CSV line count including coinbase |
---|---|---|
blk00000 | 297,821 | 300,672 |
blk00080 | 491,387 | 491,930 |
blk00160 | 503,485 | 503,794 |
The underlying algorithm in use within Gephi is the Louvain method for community detection in large networks
It is today one of the most widely used methods for detecting communities in large networks
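The Louvain method works by greedily increasing a score called modularity, which compares the number of edges inside each community with the number expected if edges were placed at random. The sketch below computes only that score for a given partition; it is not the Louvain algorithm itself and not Gephi's implementation. It assumes an undirected, unweighted edge list, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class ModularitySketch {
    // Modularity Q = sum over communities c of (L_c / m) - (d_c / (2m))^2, where
    // m is the total edge count, L_c the number of edges with both ends in c,
    // and d_c the summed degree of the nodes in c.
    // edges: undirected pairs {u, v}; community: node -> community label.
    static double modularity(String[][] edges, Map<String, Integer> community) {
        double m = edges.length;
        Map<Integer, Double> internal = new HashMap<>(); // L_c per community
        Map<Integer, Double> degree = new HashMap<>();   // d_c per community
        for (String[] e : edges) {
            int cu = community.get(e[0]);
            int cv = community.get(e[1]);
            degree.merge(cu, 1.0, Double::sum);
            degree.merge(cv, 1.0, Double::sum);
            if (cu == cv) {
                internal.merge(cu, 1.0, Double::sum);
            }
        }
        double q = 0.0;
        for (Map.Entry<Integer, Double> d : degree.entrySet()) {
            double inside = internal.getOrDefault(d.getKey(), 0.0);
            double share = d.getValue() / (2 * m);
            q += inside / m - share * share;
        }
        return q;
    }

    public static void main(String[] args) {
        // Two triangles joined by a single edge, split into one community per triangle.
        String[][] edges = {
            {"a", "b"}, {"b", "c"}, {"a", "c"},
            {"d", "e"}, {"e", "f"}, {"d", "f"},
            {"c", "d"}};
        Map<String, Integer> community = new HashMap<>();
        for (String n : new String[] {"a", "b", "c"}) community.put(n, 0);
        for (String n : new String[] {"d", "e", "f"}) community.put(n, 1);
        System.out.println(modularity(edges, community));
    }
}
```

A higher score means the partition captures more of the graph's edges inside communities than chance would predict; putting every node in a single community always scores zero.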
I will use the Louvain algorithm directly, rather than through Gephi, and try other community detection algorithms as well
Investigate other visualization tools available
Combat performance issues
Analyse the results from this demonstration and try to identify trends in the data