js / java / py enthusiast, architect & developer, trainer, community activist
9th April 2013, Warsaw
noSQL databases have been around for a long time
- the term is just new
noSQL is where either:
- relations are not rigid (not schema-based: documents, graphs, wide-column, key-value)
- transactions are not the main goal in themselves (who cares about ACID?)
- data doesn't have to be consistent throughout the system all the time
why do we need noSQL?
Relational databases enforce ACID rules, so you get schema-based, transaction-oriented data stores. This is proven and suitable for 99% of real-world applications. You can do practically anything with relational databases.
But there are limits to speed and scaling when it comes to massive, highly available data stores. For example, Google and Amazon store terabytes of data in big data centers. Querying and inserting do not perform well in these scenarios because of the blocking/schema/transaction nature of RDBMSs. That's why they implemented their own databases (key-value stores, in fact) for massive performance gains and scalability.
some say that...
the rise of noSQL can be explained by CAP Theorem and the inability of the MySQL database to scale consistently
Scaling horizontally (sharding)
Consistency, Scalability, Availability, Replication...
- quick Database Overview
- noSQL in your company
- noSQL implementations
- www resources
there will be no explicit demo
only some code samples will be shown
Quick Database Overview
set of properties that guarantee that database transactions are processed reliably
transaction = a single logical operation on the data (including multiple changes)
transaction example: transfer of funds from one bank account to another (lower the FROM account and raise the TO account)
A for Atomicity
Atomicity requires that each transaction is "all or nothing"
Subtract value of x from A and add it to B. Subtracting x from A succeeded, but the system is unable to add x to B.
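The all-or-nothing behaviour can be sketched with SQLite (accounts and amounts here are illustrative, not from the talk):

```python
import sqlite3

# In-memory database standing in for a bank ledger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('A', 100), ('B', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically: both updates happen or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on exception
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            # Simulate a failure before the second update completes:
            if dst == "CRASH":
                raise RuntimeError("system died mid-transfer")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
    except RuntimeError:
        pass  # the rollback has already restored the original balances

transfer(conn, "A", "CRASH", 10)  # fails after the subtraction -> rolled back
balance_a = conn.execute("SELECT balance FROM accounts WHERE name='A'").fetchone()[0]
print(balance_a)  # still 100: the partial subtraction never became visible
```

The `with conn:` block is what gives the "all or nothing" guarantee: a failed transfer leaves no trace.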
C for Consistency
The consistency property ensures that any transaction will bring the database from one valid state to another valid state.
Consistency is a very general term which demands that the data meet all validation rules. Assume a transaction subtracts x from A without adding x to B. Before the transaction, A + B = y. After the transaction, A + B = y - x.
or imagine a simple 1:n relation with an ON UPDATE/DELETE CASCADE that doesn't work properly. We'd end up with a Comment row related to a non-existent BlogPost row.
I for Isolation
The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e. one after the other.
T1 subtracts 10 from A
T1 adds 10 to B
T2 subtracts 10 from B
T2 adds 10 to A
If T2 waits for T1 to complete, everything is fine. But if it doesn't, consider that T1 fails after subtracting and T2 comes in between...
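A minimal sketch of serialized execution, using one coarse lock as a stand-in for a database's isolation mechanism (account names and amounts are illustrative):

```python
import threading

balances = {"A": 100, "B": 100}
lock = threading.Lock()  # stand-in for the DBMS's isolation machinery

def transfer(src, dst, amount):
    # Holding the lock makes the read-modify-write of both accounts one
    # indivisible step, so concurrent transfers behave as if run serially,
    # one after the other.
    with lock:
        balances[src] -= amount
        balances[dst] += amount

t1 = threading.Thread(target=transfer, args=("A", "B", 10))
t2 = threading.Thread(target=transfer, args=("B", "A", 10))
t1.start(); t2.start()
t1.join(); t2.join()
print(balances)  # {'A': 100, 'B': 100} - the two opposite transfers cancel out
```

Without the lock, the two transfers could interleave between the subtraction and the addition, which is exactly the hazard described above.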
D for Durability
Come what may, our data is safe.
Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.
x is subtracted from A and everything is saved. After x has been added to B, the system returned an "OK" message but stored the operation only in the cache (RAM). Meanwhile, a power failure occurred.
Eric Brewer stated it in 2000 during the Symposium on Principles of Distributed Computing:
it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
- Consistency (aCid - all DBMS nodes see the same data at the same time)
- Availability (a guarantee that every request receives a response about whether it was successful or failed)
- Partition Tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Network Distributed Systems
CAP theorem applies to distributed DBMS used by big network systems (that's why we talk about Facebook, Google, etc).
If we are considering a single-machine DBMS, Partition Tolerance is irrelevant. Consistency and Availability are easily achievable on one machine.
2 out of 3
in other words...
Consistency means that data is the same across the cluster, so you can read or write to/from any node and get the same data.
Availability means the ability to access the cluster even if a node in the cluster goes down.
Partition Tolerance means that the cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).
CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the partition is resolved)
CP - data is consistent between all nodes, and maintains partition tolerance (preventing data desync) by becoming unavailable when a node goes down
AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition)
Andrew Oliver, InfoWorld:
Part of the reason there are so many different types of NoSQL databases lies in the CAP theorem, aka Brewer’s Theorem. The CAP theorem states you can provide only two out of the following three characteristics: consistency, availability, and partition tolerance.
Different datasets and different runtime rules cause you to make different trade-offs. Different database technologies focus on different trade-offs. The complexity of the data and the scalability of the system also come into play.
Relational databases are based on relational algebra, which is more or less an outgrowth of set theory. Relationships based on set theory are effective for many datasets, but where parent-child or distance of relationships are required, set theory isn’t very effective. You may need graph theory to efficiently design a data solution.
In other words, relational databases are overkill for data that can be effectively used as key-value pairs and underkill for data that needs more context. Overkill costs you scalability; underkill costs you performance.
an alternative to ACID:
- Basically Available
- Soft state (panta rhei)
- Eventual consistency (given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and the replicas will be consistent)
stands in contrast to more traditional relational databases, which guarantee ACID properties
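Eventual consistency can be sketched with a toy last-write-wins replication scheme (the `Replica` class and its versioning are an illustrative simplification, not any particular product's protocol):

```python
# Each replica keeps its own copy of the data plus a per-key version;
# syncing merges by keeping the newer version ("last write wins").
class Replica:
    def __init__(self):
        self.data = {}      # key -> value
        self.versions = {}  # key -> monotonically growing version number

    def write(self, key, value, version):
        self.data[key] = value
        self.versions[key] = version

    def sync_from(self, other):
        for key, version in other.versions.items():
            if version > self.versions.get(key, -1):
                self.write(key, other.data[key], version)

r1, r2 = Replica(), Replica()
r1.write("user:42", "alice", version=1)
print(r2.data.get("user:42"))  # None - r2 hasn't seen the write yet (soft state)
r2.sync_from(r1)               # once updates have propagated...
print(r2.data.get("user:42"))  # 'alice' - the replicas converge
```

Between the write and the sync, the system is available but inconsistent - exactly the "state of flux" BASE accepts.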
BASE: why ACID alternative?
Soft State. BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.
Basic Availability. The availability of BASE is achieved through supporting partial failures without total system failure. Here is a simple example: if users are partitioned across five database servers, BASE design encourages crafting operations in such a way that a user database failure impacts only the 20 percent of the users on that particular host. There is no magic involved, but this does lead to higher perceived availability of the system.
BASE vs ACID by means of CAP
CAP is basically a continuum along which BASE and ACID are on opposite ends.
CAP is Consistency, Availability, and Partition tolerance. Basically you can pick 2 of those but you can't do all 3.
- ACID focuses on Consistency and Availability.
- BASE focuses on Partition Tolerance and Availability and throws consistency out the window.
CAP as ACID/BASE continuum
You can decide how close you want to be to one end of the continuum or the other according to your priorities.
- data orientation
- price :)
main problem: D in ACID
MySQL (ENGINE = MEMORY)
Memcached (access RAM via cURL)
noSQL in your company
Suppose your company name is X...
Does X have its own noSQL just like Google, Amazon, Facebook, etc.?
X operates on billions of records daily, but it has only its own file system.
Well, does it use any available noSQL implementation?
X has a long tradition of reinventing the wheel with C++.
I assumed no C++ developers will come to this seminar...
Now we may discuss
written in Erlang
HTTP RESTful API
JSON document format
first released in 2005
became an Apache project in 2008
Distributed Architecture with Replication
CouchDB was designed with bi-directional replication (or synchronization) and off-line operation in mind. That means multiple replicas can have their own copies of the same data, modify it, and then sync those changes at a later time.
Haskell? not really...
JS/CouchDB MapReduce: http://www.slideshare.net/okurow/couchdb-mapreduce-13321353
a relational database stores data and relationships in tables
in CouchDB, a database is a collection of independent documents
CouchDB docs: http://wiki.apache.org/couchdb/
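Since CouchDB's whole API is plain HTTP + JSON, document operations map directly onto HTTP requests. A minimal sketch of that mapping, assuming a default local install on port 5984 (database and document names are made up); sending the requests is left to any HTTP client:

```python
import json

COUCH = "http://localhost:5984"  # assumed default local CouchDB address

def put_doc(db, doc_id, doc):
    # Creating or updating a document is a PUT of a JSON body.
    return ("PUT", f"{COUCH}/{db}/{doc_id}", json.dumps(doc))

def get_doc(db, doc_id):
    # Reading a document is a plain GET of its URL.
    return ("GET", f"{COUCH}/{db}/{doc_id}", None)

method, url, body = put_doc("blog", "post-1", {"title": "hello", "tags": ["nosql"]})
print(method, url)  # PUT http://localhost:5984/blog/post-1
```

No driver, no wire protocol: curl or any HTTP library is a complete CouchDB client.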
huMONGOus (extremely big)
- written in C++
reportedly the most popular noSQL DBMS (see Wikipedia)
- indices instead of map/reduce functions
- ad-hoc queries (search by field, range queries, regular expression searches)
- document-oriented database
BSON document format (binary JSON)
MongoDB supports master-slave replication
A master can perform reads and writes
A slave copies data from the master and can only be used for reads or backup (not writes)
The slaves have the ability to select a new master if the current one goes down
2 bears comparing MySQL and MongoDB:
main features: https://en.wikipedia.org/wiki/MongoDB#Main_features
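MongoDB's ad-hoc queries are expressed as documents themselves (e.g. `{"views": {"$gt": 100}}`). With a real server you would pass such a dict to a driver like pymongo's `find()`; here a toy in-memory matcher (my own illustration, not MongoDB code) shows the semantics of field, range, and regex queries:

```python
import re

# Sample "collection": BSON documents are just dict-like records.
posts = [
    {"title": "intro to nosql", "views": 120},
    {"title": "advanced sql",   "views": 45},
]

def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):  # operator form, e.g. {"$gt": 100}
            if "$gt" in cond and not doc[field] > cond["$gt"]:
                return False
            if "$regex" in cond and not re.search(cond["$regex"], doc[field]):
                return False
        elif doc.get(field) != cond:  # exact-match form, e.g. {"views": 45}
            return False
    return True

def find(collection, query):
    return [d for d in collection if matches(d, query)]

print(find(posts, {"views": {"$gt": 100}}))          # range query
print(find(posts, {"title": {"$regex": "^intro"}}))  # regex search
```

The point is that no map/reduce function is needed for such lookups - the query is data, and indexes serve it.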
- in-memory key-value storage
- written in C
not supposed to be a database of any kind
- often used to speed up dynamic database-driven websites
- used by sites including YouTube, Reddit, Zynga, Facebook, Orange, Twitter and Wikipedia
- simple PHP memcache demo: http://www.9lessons.info/2012/02/memcached-with-php.html
Official PHP modules
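The usual way memcached speeds up a database-driven site is the cache-aside pattern. A minimal sketch, with a plain dict standing in for a memcached client (a real client such as pymemcache exposes the same get/set idea) and a made-up `slow_db_query`:

```python
import time

cache = {}  # dict standing in for memcached: key -> (value, expiry time)
TTL = 60    # cache entries live for 60 seconds

def slow_db_query(user_id):
    # Stands in for an expensive relational query.
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    """Cache-aside: check the cache first, fall back to the DB, then cache."""
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and entry[1] > time.time():     # hit, and not yet expired
        return entry[0]
    value = slow_db_query(user_id)           # miss: hit the database...
    cache[key] = (value, time.time() + TTL)  # ...and populate the cache
    return value

print(get_user(7))  # first call: goes to the "database"
print(get_user(7))  # second call: served from RAM
```

The database is only touched on a miss or after the TTL expires, which is where the speed-up on read-heavy sites comes from.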
- wide-column storage (append lots of columns to a row)
- partitioning, replication
- eventually consistent
- initially developed @Facebook and powered their Inbox Search (until late 2010, then became Apache project)
- written in Java
- communication via Thrift
- MapReduce support: Hadoop integration
add new column + value dynamically
- Column = (key, value, timestamp)
- SuperColumn = (key => value; value = Column) = a map of Columns
- ColumnFamily - structure with infinite number of rows, more or less a RDBMS-alike table
- SuperColumnFamily - the largest container, infinite number of rows (row => ColumnFamily), one more dimension than ColumnFamily
- Keyspace - contains ColumnFamilies, RDBMS-alike schema
- CQL (Cassandra Query Language)
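The layered containers above can be sketched as nested maps (a toy model of the pre-CQL data layout; keyspace, family, and column names are made up):

```python
import time

# Keyspace -> ColumnFamily -> row key -> column name -> (value, timestamp)
keyspace = {"Blog": {}}          # Keyspace ~ an RDBMS schema
keyspace["Blog"]["Posts"] = {}   # ColumnFamily ~ an RDBMS table

def insert(cf, row_key, column, value):
    # Columns are appended to a row dynamically - there is no fixed schema,
    # and each Column carries its own timestamp.
    cf.setdefault(row_key, {})[column] = (value, time.time())

posts = keyspace["Blog"]["Posts"]
insert(posts, "post-1", "title", "hello cassandra")
insert(posts, "post-1", "author", "tomasz")   # new column added on the fly
insert(posts, "post-2", "title", "wide rows")

print(sorted(posts["post-1"]))   # ['author', 'title'] - each row has its own column set
print(posts["post-2"]["title"][0])
```

Note that `post-1` and `post-2` have different column sets - that per-row flexibility is the point of the wide-column model.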
PHP Object-Graph Mapper (OGM):
Practice = when to use noSQL instead of RDBMS?
NoSQL solutions are usually meant to solve a problem that:
- RDBMSs are not well suited for
- RDBMSs are too expensive for (e.g. Oracle)
- requires you to implement something that breaks the relational nature of your DB anyway
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS, there is no reason to choose NoSQL.
when to use noSQL
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on a high traffic site
the data makes no sense in SQL; you find yourself doing multiple JOIN queries to access a single piece of information
you are breaking the relational model, you have CLOBs that store denormalized data and you generate external indexes to search that data
It is perfectly fine to use any technology in conjunction with another, e.g. MongoDB and MySQL work fine together as long as they aren't on the same machine
Typical RDBMSs make strong guarantees about consistency. This requires, to some extent, communication between nodes for every transaction, which limits the ability to scale out, because more nodes means more communication.
NoSQL systems make different trade-offs. For example, they don't guarantee that a second session will immediately see data committed by a first session, thereby decoupling the transaction of storing some data from the process of making that data available to every user (eventual consistency). So a single transaction doesn't need to wait for any (or waits for much less) inter-node communication. Therefore these systems can utilize a large number of nodes much more easily.
CAP theorem: 2 out of 3 comparison
www resources, part 1
- [!] noSQL data modelling techniques: http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
- which noSQL to use: http://www.infoworld.com/d/data-management/which-freaking-database-should-i-use-199184
- CAP theorem - 2 out of 3: http://stackoverflow.com/questions/12346326/nosql-cap-theorem-availability-and-partition-tolerance
- why do we need noSQL: http://stackoverflow.com/questions/3713313/when-should-i-use-a-nosql-database-instead-of-a-relational-database-is-it-okay
- MySQL and memcache: end of an era: http://highscalability.com/blog/2010/2/26/mysql-and-memcached-end-of-an-era.html
- memcache and noSQL: http://natishalom.typepad.com/nati_shaloms_blog/2010/10/marrying-memcache-and-nosql.html
www resources, part 2
By Tomasz Ducin