Introduction to the blockchain

Toptal academy blockchain lectures #1

2018-02-21

ivan.voras@toptal.com

Step 1: Do you need a blockchain?

The End

Thank you for your attention

I'll be here all day...

Let me introduce myself

  • Ivan Voras, PhD <ivan.voras@toptal.com>
  • In the Toptal network since 2014
  • Doing blockchain development before it was cool, did server back-end and kernel development before that
  • Did some early altcoins which didn't survive to this day, so I definitely didn't get rich from it :)
  • Member of the Technical screening team for the blockchain specialisation since it started

Wrote a "for dummies"
book about it

http://scepticsguide.ivoras.net/

(It's not for the technically-minded)

Today's agenda

  • What is the blockchain, as a data structure
  • A few words on cryptography
  • Pros and cons of using blockchains
  • How is it commonly being implemented (Bitcoin, Ethereum)
  • What are the trade-offs in different implementations
  • Merkle trees / hashes
  • Distributed consensus
  • Common attacks and security implications

Consider the linked list

It's commonly implemented like this (as a singly-linked list):

Node 1 (head) --[next node pointer]--> Node 2 --[next node pointer]--> ... --> Node N (tail) --> NULL

Adding a new node

Nodes can be added to either end. We're interested in the case of adding nodes to the front:

New node (new head) --[next node pointer]--> Node 2 (was Node 1, the old head) --[next node pointer]--> Node 3 (was Node 2) --[next node pointer]--> ... --> Node N (tail) --> NULL
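A minimal sketch of adding nodes to the front of a singly-linked list. The names Node, push_front and walk are made up for this illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class Node:
    value: str
    next: Optional["Node"] = None  # "next node pointer"; None (NULL) at the tail


def push_front(head: Optional[Node], value: str) -> Node:
    """Create a new head whose next pointer refers to the old head."""
    return Node(value, next=head)


def walk(head: Optional[Node]) -> Iterator[str]:
    """Yield the stored values from head to tail."""
    while head is not None:
        yield head.value
        head = head.next


head = None
for value in ["a", "b", "c"]:
    head = push_front(head, value)

values = list(walk(head))  # newest node first
```

Note that the old nodes are never modified when a new head is added; only the new node points at existing data.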

From linked lists to blockchains

Step 1: Replace pointers with cryptographic signatures.

Step 2: Done.

New node (new block) --[previous block signature]--> ... --> Node 3 --[previous block signature]--> Node 2 --[previous block signature]--> Node 1 (genesis block) --> Previous: NULL

The new node becomes the "first" (head) node.

Every new node includes a cryptographic signature of the previous node (usually just a hash).

So what are blockchains?

Blockchains are a data structure where data is grouped in blocks, where each block (among other things) contains a cryptographic signature (usually a hash) of the previous block in the chain.

 

This structure guarantees that each new block effectively contains a signature of all previous blocks, making the data stored in the blockchain immutable (i.e. any changes become obvious because the signatures don't match).

 

Because of this, it's suitable for sensitive data, e.g. financial data, and that is why it was used in Bitcoin and other cryptocurrencies.
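A minimal sketch of the idea, assuming blocks are plain dictionaries hashed with SHA-256 over a JSON encoding. The block format and the function names are invented for this example and do not match any real cryptocurrency.

```python
import hashlib
import json
from typing import List, Optional


def block_hash(block: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block (assumed format)."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: List[dict], data: str) -> None:
    """Add a block that stores the hash of the current last block."""
    prev: Optional[str] = block_hash(chain[-1]) if chain else None
    chain.append({"prev_hash": prev, "data": data})


def is_valid(chain: List[dict]) -> bool:
    """Check that every block references the hash of its predecessor."""
    if chain and chain[0]["prev_hash"] is not None:
        return False  # the genesis block has no predecessor
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain: List[dict] = []
for payload in ["genesis", "alice pays bob 5", "bob pays carol 2"]:
    append_block(chain, payload)

ok_before = is_valid(chain)
chain[1]["data"] = "alice pays bob 500"  # tamper with an old block
ok_after = is_valid(chain)               # the next block's prev_hash no longer matches
```

Because each hash covers the previous block's prev_hash field as well, changing any old block invalidates every block after it.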

So what are blockchains?

Having blocks which, among other useful data, contain digital signatures of the preceding block, is sufficient to make a blockchain data structure. On top of that, other, database-like functionalities are usually added.

 

In practice, think of a blockchain as a type of database.

"relational database", "graph database", "nosql document database" and "blockchain" are all types of databases.

 

Bitcoin contains a specific implementation of a
blockchain database, the same as PostgreSQL contains a specific implementation of a relational database.

Short introduction to cryptography

Hash functions

They accept inputs of arbitrary size (buffers of bytes) and produce a fixed-size output. Because of that they are sometimes called "compression functions", but that name is misleading because they are one-way and irreversible (there's no "uncompress").

 

Cryptographic hashes

In addition to these, they have the following properties:

  • It is practically infeasible to find two inputs which produce the same hash output (collision resistance), or to find an input which produces a given hash output (pre-image resistance)
  • Even a small change in the input, e.g. 1 bit, leads to a huge change in the hash output (avalanching)

Common hash functions are SHA256 and SHA3-256.
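A small sketch of the fixed-size output and of avalanching, using SHA-256 from Python's standard hashlib module. The helper names are made up for this example.

```python
import hashlib


def sha256_as_int(data: bytes) -> int:
    """Return the SHA-256 digest of data as a 256-bit integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 256 digest bits differ between two inputs."""
    return bin(sha256_as_int(a) ^ sha256_as_int(b)).count("1")


original = b"blockchain"
flipped = bytes([original[0] ^ 0x01]) + original[1:]  # flip a single input bit

digest_size = len(hashlib.sha256(original).digest())  # always 32 bytes
changed = differing_bits(original, flipped)           # typically around half of 256
```

For a good cryptographic hash, about half of the output bits are expected to change for any one-bit change of the input.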

Short introduction to cryptography

Is CRC32 a good cryptographic hash?

NO.

#1: 32 bits is way too short: even without any advanced techniques or theory, on the order of 2^32 (about 4 billion) trial changes to a custom document you create are enough for it to have the same CRC32() as the original one. This can be done quickly on modern CPUs. If you apply a little theory from Wikipedia (the CRC is a linear function of the input bits, up to a fixed constant), you can find CRC32() collisions much faster.

#2: It deliberately has a structure which was created for non-security purposes (detecting accidental transmission errors), so it does not have good "avalanching".
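A short sketch of the structure mentioned in #2, using zlib.crc32 (the gzip variant). For any three buffers of equal length, the CRC32 of their XOR equals the XOR of their CRC32 values, which a cryptographic hash must never allow. The helper xor_bytes is made up for this example.

```python
import os
import zlib


def xor_bytes(*buffers: bytes) -> bytes:
    """XOR equal-length buffers byte by byte."""
    out = bytearray(len(buffers[0]))
    for buf in buffers:
        for i, byte in enumerate(buf):
            out[i] ^= byte
    return bytes(out)


# Three random inputs of the same length.
a, b, c = (os.urandom(64) for _ in range(3))

lhs = zlib.crc32(xor_bytes(a, b, c))
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
crc_is_predictable = lhs == rhs  # the output is predictable without recomputing the CRC
```

This predictability is what makes CRC32 good at detecting accidental errors and useless against a deliberate attacker.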

 

Homework:

Find a document (a buffer of ASCII bytes) whose CRC32 hash (the gzip variant of CRC32) is 0xbbd1264f.

Short introduction to cryptography

Symmetric ("ordinary") cryptography

This is the one where the same key (often derived from a password) is used for encryption and decryption.

Important: Passwords, when stored in a database, should not be encrypted, because encryption is reversible; they should be hashed, ideally with a salt and a deliberately slow password-hashing function.
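A minimal sketch of storing a password as a hash instead of encrypting it. PBKDF2 with a random salt is one assumed choice here (it is in Python's standard library); the iteration count and the function names are illustrative.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # assumed work factor; tune it for your hardware


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store both in the database, never the password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the candidate password and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse")
```

There is no "decrypt" step: checking a login means hashing the entered password again and comparing the results.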


There are block encryption algorithms and stream encryption algorithms. The former (e.g. AES) only encrypt fixed-size blocks of data (e.g. 16 bytes), and the latter encrypt data of arbitrary size. Commonly, block encryption algorithms can be combined with an additional algorithm (a "mode") to create stream encryption algorithms (e.g. AES-CTR).


Short introduction to cryptography

Symmetric ("ordinary") cryptography

AES256: this algorithm encrypts blocks of 16 bytes (128 bits), and uses 32-byte keys (256 bits, hence the name) to do it. Keys here are often derived from a user-entered password, e.g. by hashing it.

 

In practice, the basic algorithm always needs to be augmented with additional processing to avoid specific forms of attacks. Block ciphers at the very least need to be chained so that the encrypted output of the previous block is XOR-ed with the input of the next block before it is encrypted, which is called the CBC "mode". Hence, AES256-CBC. Other common modes are CTR and GCM, which convert AES to a stream cipher.

See Ferguson, Schneier and Kohno: "Cryptography Engineering", or the "Crypto 101" on-line intro.

Never ever invent your own cipher or mode!

Short introduction to cryptography

Never ever invent your own cipher or mode!

 

Because people have spent their entire lives, or at least academic careers, finding out the right way to encrypt data and combine encryption methods so that they are safe - and still failed.

The probability that you will think something up over the weekend and create a working, secure cipher is ... low, very, very low.

 

Literally - this is mathematics more than CS. You need a career devoted to cryptography just to be aware of all the methods that have previously been tried and failed.

Short introduction to cryptography

Asymmetric (public key) cryptography

Commonly, there are 2 related keys: a public one and a private one. What is encrypted with one of them can be decrypted only with the other. These algorithms are generally much slower than symmetric algorithms, so they are used in combination with them to improve the overall performance. If A wants to send a message to B:

 

  1. A obtains B's public key. This key is not secret
  2. A generates a random key ("password") of 32 bytes
  3. A encrypts the entire message using AES with this random key
  4. A encrypts this random key with B's public key
  5. A sends the encrypted key and the encrypted message
  6. B decrypts the random key with their private key, then decrypts the message using AES

Short introduction to cryptography

Asymmetric (public key) signatures

Signatures can be implemented as encryption of the hash of the message. If A wishes to sign a message (i.e. vouch that the message is authentic so that B can verify its authenticity):

 

  1. A generates a 32-byte hash of the message using SHA256
  2. A encrypts the hash with their private key
  3. A publishes the encrypted hash and their public key
  4. B uses A's public key to decrypt A's encrypted hash
  5. B generates a hash of the message using SHA256
  6. B checks whether their calculated hash matches A's hash

This proves that A had the same message as B and that A intentionally signed it (i.e. encrypted its hash).
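The six steps above can be sketched with textbook RSA in pure Python. This is a toy under loud assumptions: the primes are small, publicly known Mersenne primes, there is no padding, and the hash is simply reduced modulo N, so it is insecure and for illustration only. It needs Python 3.8+ for the modular inverse via pow().

```python
import hashlib

# Toy key pair: INSECURE, the primes are tiny and publicly known.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                          # public modulus
E = 65537                          # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent


def digest_as_int(message: bytes) -> int:
    """SHA-256 of the message, reduced to fit the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Steps 1-2: hash the message, then 'encrypt' the hash with the private key."""
    return pow(digest_as_int(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Steps 4-6: 'decrypt' with the public key, compare with a fresh hash."""
    return pow(signature, E, N) == digest_as_int(message)


message = b"I owe Bob 5 coins"
signature = sign(message)
genuine = verify(message, signature)
forged = verify(b"I owe Bob 500 coins", signature)
```

Real signature schemes add padding and use much larger keys; use a vetted library rather than code like this.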

Short introduction to cryptography

Homework

I've used the "openssl" command line utility (available on Linux, Windows and OSX) in this way:

openssl genrsa -out keypair.pem 2048
openssl rsa -in keypair.pem -out public.pem -outform PEM -pubout

My public key from public.pem is:

-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA0UyKNoJYVqwW3Cte31Ec
HAc8fAUmeR0UfbuCCkpyOSbADirXuVNiVYpQgPkphlml3KrgprAdA/5X1xUujF+l
mxp+Bm3UsskpB+N55Nulep0uW0BdKx1DiCuAj6qVDc+Tqp1i7NvqDbIiUAL2VlUq
9sunUdiCbaztuayam8gqmqnO73dlboMiB5DK/OitmNJcGW9I8LkZvZLiS9znQgJz
lrbCeCK0ym3KbVGo3ZG8ei8Zbf8EiGxdb84V0QDvTm15l3gqiOcmH++HqNkbUeXZ
b+fzCFFZoeHD67od5TXh1YlfeiW3WLP7ZT3kROD4hVgIbjYQmEtLc50WXXRIYTAg
5QIDAQAB
-----END PUBLIC KEY-----

Short introduction to cryptography

Homework

With my private key, I have signed the file found on this link:

https://i.imgur.com/O7yLqqq.jpg

by using this command:

openssl dgst -sha256 -sign keypair.pem -out file.signature O7yLqqq.jpg

I've used the base64 command-line utility to generate this representation of the "file.signature" file:

Btdgy9GhfmX0fphIC77as2s5OU5xLDLnjrpPP3uujg55Wf7vIJ8OW47Kcw0VYCVh/kFZtwLfhsgv
xQbZqPzyr2PEAqA8Y5e7Pp1NtX4w7qgBgV3VGEl6oWKXHwU/z0cMhZ9U6m5IzaENMLUaLjjHvBcT
yYHxCXRMytyh9s5LmlRisjAH9xuJIVqz623dALlwTabypdL8PnwEiwRwH+3KCbKH1LvWu0i696kY
YOR0kTib2mOOI/R5jiQpYuo8Qnm8TwBk04wplSgcZ/OHr7arTeZ9yZTRCKnl8Gq7qc1lPj8BtCs7
5x+gruR5G5LuCPUABhTSVA1KPb50aV8xyM1IkQ==

Your task is to write down the openssl command line which will verify my signature of that file.

Pros and cons to using blockchains

Pros:

  • Blockchains can rather naturally be used as distributed append-only databases: make every node generate data in the form of blocks, and create a procedure for publishing blocks.
  • The data in the blockchain is practically immutable, making it attractive for sensitive, semi-authenticated data.

Cons:

  • If the aim is to create a globally distributed database, the currently popular ways to create blockchains are slow in terms of transactions per second.
  • Everyone does it differently: there are still no established "blockchain products" (partly because a blockchain is rather simple to implement yourself...)

Pros and cons to using blockchains

In many ways, saying that a product is using
"blockchain technology"
is similar to saying a project uses
"linked list technology"

... what matters is not the technology but what you do with it.

Public blockchains

Everyone can participate.

So data needs to be accessible by everyone, including for mining.

Leads to the design of Proof of Work mining.

Security concerns lead to lower performance: the blockchain is global, so its operations need to be globally synchronised.

Private blockchains

Only a known set of nodes can participate.

Those nodes can identify themselves directly.

No need for Proof of Work.

Leads to Proof of Authority design.

Can be more performant as the nodes are known and there's a known number (or upper bound) of them.

Popular blockchains: Bitcoin, Ethereum

Bitcoin contains the first popular implementation of the blockchain; it brought the concept into the spotlight.

Pioneered the concept of combining blockchains with "proof of work" algorithms in order to force the users to "work hard" to "sign" the blocks: mining was born.
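A minimal sketch of the "work hard to sign a block" idea: search for a nonce so that the block's hash falls below a target. This is a simplification under stated assumptions (single SHA-256 over the data plus an 8-byte nonce, difficulty counted in leading zero bits); Bitcoin's real header format and double hashing are not reproduced here.

```python
import hashlib


def pow_hash(block_data: bytes, nonce: int) -> int:
    """Hash the block data together with the nonce, as a 256-bit integer."""
    digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big")


def mine(block_data: bytes, difficulty_bits: int) -> int:
    """Try nonces until the hash has `difficulty_bits` leading zero bits."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while pow_hash(block_data, nonce) >= target:
        nonce += 1
    return nonce


def check(block_data: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Verification takes one hash, no matter how long mining took."""
    return pow_hash(block_data, nonce) < (1 << (256 - difficulty_bits))


data = b"block 42: alice pays bob 5"
nonce = mine(data, difficulty_bits=16)  # about 65,000 attempts on average
```

Each extra difficulty bit doubles the expected mining work, while checking a proposed nonce stays cheap.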

Dirty "implementation is the specification" principle - the "Bitcoin system" is whatever the "Bitcoin code" implements.

Ethereum contains arguably the second most popular blockchain; it extended the original idea.

Has a more-or-less well-defined specification outside the base implementation - there are Ethereum nodes in Go, C++, Python...

A more elegant implementation, extending all features in some way: faster block mining, adding smart contracts, supporting "uncle" blocks, simpler transaction structure...

Trade-offs in blockchain design

  1. Transaction volume / latency vs block size. If a large number of transactions is allowed, block sizes will grow.
  2. Block time. In all popular cryptocurrencies, the algorithms make sure the blocks are generated at approximately constant time intervals (10 minutes for Bitcoin, 15 seconds for Ethereum), so the difficulty of mining rises as more mining power joins the network. As the block time gets shorter, the problems and latencies of distributing each block to the entire network rise.
  3. Method of creating (mining) new blocks. Proof of Work is an environmental disaster: Bitcoin mining uses more electrical power than the country of Ireland. Proof of Stake is a possible replacement, but it directly leads to "rich getting richer" social structures (not relevant for private blockchains).

Common concepts: Merkle trees

Used to construct hashes for large objects from their parts.

Motivation: sometimes the parts are unknown or too big.
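A minimal sketch of computing a Merkle root with SHA-256. Duplicating the last hash on odd-sized levels is an assumed convention for this example; implementations differ in such details.

```python
import hashlib
from typing import List


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(parts: List[bytes]) -> bytes:
    """Hash every part, then hash pairs of hashes until one hash remains."""
    if not parts:
        raise ValueError("need at least one part")
    level = [sha256(part) for part in parts]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumed rule: duplicate the last hash
        level = [
            sha256(level[i] + level[i + 1])
            for i in range(0, len(level), 2)
        ]
    return level[0]


root = merkle_root([b"tx1", b"tx2", b"tx3", b"tx4"])
```

To prove that one part belongs to the root, only the sibling hashes along its path are needed, not the other parts themselves.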

Common concepts: Distributed consensus

As implemented in common cryptocurrency blockchains it literally means "everyone who runs the same executable follows the same rules."

 

Examples of consensus rules are:

  • Which transactions, which blocks are valid
  • Which coins can be spent
  • How mining difficulty is adjusted

 

For common cryptocurrencies, these are often a series of if-then checks which inspect the contents of a block, taking into account previous blocks, and decide if the new one is valid.

For example: "calculate a new difficulty based on past N blocks; if the difficulty from the newly received block is lower, discard it."
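The example rule could look roughly like this. Everything here is hypothetical: the block fields, the window size, the target interval and the adjustment formula are invented to show the shape of such checks, not taken from any real cryptocurrency.

```python
from typing import List

WINDOW = 4            # N: how many past blocks to look at (assumed)
TARGET_INTERVAL = 10  # desired seconds between blocks (assumed)


def expected_difficulty(chain: List[dict]) -> int:
    """Derive the required difficulty from the timing of the last N blocks."""
    recent = chain[-WINDOW:]
    current = recent[-1]["difficulty"]
    if len(recent) < 2:
        return current
    average = (recent[-1]["time"] - recent[0]["time"]) / (len(recent) - 1)
    if average < TARGET_INTERVAL:
        return current + 1          # blocks arrive too fast: make mining harder
    if average > TARGET_INTERVAL:
        return max(1, current - 1)  # blocks arrive too slowly: make it easier
    return current


def accept_block(chain: List[dict], block: dict) -> bool:
    """A row of if-thens; every node running this code reaches the same verdict."""
    if block["time"] <= chain[-1]["time"]:
        return False
    if block["difficulty"] < expected_difficulty(chain):
        return False
    return True


chain = [{"time": t, "difficulty": 5} for t in (0, 5, 10, 15)]
```

Since every node evaluates the same deterministic function on the same chain, they agree on the result without asking each other.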

Common concepts: Hard forks and soft forks

These concepts tie in with the idea of distributed consensus.

Soft fork

 

Introduces limitations on what is valid. New versions of software simply stop producing some forms of transactions / blocks, etc. which were previously valid.

 

Old software continues to accept data created by new versions - the data will simply lack certain features.

Hard fork

 

Introduces new features which old versions of software do not support or recognise as valid. Backwards-incompatible.

 

Old versions of the software will discard data created by new versions - often because the data has features they don't know how to handle.
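A toy sketch of the difference, using a made-up "maximum block size" rule with invented limits. A soft fork tightens the rule, a hard fork loosens it.

```python
OLD_MAX_SIZE = 1000        # rule enforced by old software (hypothetical)
SOFT_FORK_MAX_SIZE = 500   # new software accepts less than before
HARD_FORK_MAX_SIZE = 2000  # new software accepts more than before


def old_node_accepts(block_size: int) -> bool:
    return block_size <= OLD_MAX_SIZE


def soft_fork_node_accepts(block_size: int) -> bool:
    return block_size <= SOFT_FORK_MAX_SIZE


def hard_fork_node_accepts(block_size: int) -> bool:
    return block_size <= HARD_FORK_MAX_SIZE


# Everything a soft-forked node produces is still valid for old nodes.
soft_compatible = all(
    old_node_accepts(size) for size in range(2001) if soft_fork_node_accepts(size)
)

# A hard-forked node can produce blocks which old nodes discard.
hard_incompatible = hard_fork_node_accepts(1500) and not old_node_accepts(1500)
```

In the soft fork case the old nodes keep following the chain; in the hard fork case the two versions can end up on separate chains.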

Common security concerns

Many of these are also tied in with the distributed consensus.

 

  1. Wallet compromise / password stealing / social engineering. This one is most common by far.
  2. Politics. In Bitcoin's case, influential people have continually steered the course of development to their personal benefit.
  3. 51% attack. The way distributed consensus works, the majority wins. The consensus is effectively just program code. So if miners holding 51% of the mining power decide to run an executable with a certain set of rules, they control the blockchain.
  4. Transaction selectivity attacks. Miners (esp. if colluding) basically have the power to pick which transactions go into blocks, and (sometimes more importantly) when.

So the blockchain is not perfect...

... but it's good for the specific niches where its good sides outweigh the bad ones. Specifically:

 

  • As a distributed, append-only, database. Multiple parties can add data (new blocks) to it almost simultaneously on a very large scale (but also on a smaller scale).
  • For storing sensitive records in an immutable way.
  • For creating a wide-spread consensus about data.

 

Outside of those, simpler solutions are likely to be better.

Final word: Interoperability

Algorithms and data structures used for communication are standardised because every party needs to understand them.

 

Pay special attention to the details: e.g. SHA256 is defined as accepting binary input (bytes) and producing binary output (32 bytes).
It is NOT defined as producing 64 hexadecimal characters as output.

When dealing with binary data structures, things such as endianness might be a part of the specification.
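A short sketch of both pitfalls, using Python's standard hashlib and struct modules. Which byte order and which representation is correct depends on the specification you are implementing; none is assumed here.

```python
import hashlib
import struct

raw = hashlib.sha256(b"abc").digest()      # the actual output: 32 raw bytes
text = hashlib.sha256(b"abc").hexdigest()  # a 64-character display form

# Hashing the hex text is NOT the same as hashing the raw bytes.
chained_raw = hashlib.sha256(raw).digest()
chained_text = hashlib.sha256(text.encode("ascii")).digest()
representations_differ = chained_raw != chained_text

# The same 32-bit integer, serialised with two different byte orders.
little = struct.pack("<I", 1)  # little-endian
big = struct.pack(">I", 1)     # big-endian
```

Two implementations which disagree on either detail will compute different hashes for what they both consider "the same" data.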

THE END

ivan.voras@toptal.com

Blockchain lecture #1: Introduction

February 2018

Q&A?

Blockchain lectures #1: Introduction

By Ivan Voras


Introduction: what the blockchain is, what its optimal use cases are, and how it is implemented today in the major cryptocurrencies Bitcoin and Ethereum. What the key components of a blockchain are and how they are implemented. Differences and similarities between blockchain implementations. "Why" in addition to "how".
