relations are not rigid (not schema-based: documents, graphs, wide-column, key-value)
or
transactions are not the main goal by itself (who gives the fuck about ACID?)
or
data doesn't have to be consistent throughout the system all the time
why do we need noSQL?
Relational databases enforce ACID rules. So, you will have schema-based transaction-oriented data stores. It's proven and suitable for 99% of the real world applications. You can practically do anything with relational databases.
But, there are limitations on speed and scaling when it comes to massive high availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting is not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMs. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gain and scalability.
some of they say that...
the rise of noSQL can be explained by CAP Theorem and the inability of the MySQL database to scale consistently
set of properties that guarantee that database transactions are processed reliably
transaction = a single logical operation on the data (including multiple changes)
transaction example: transfer of funds from one bank account to another (lower the FROM account and raise the TO account)
A for Atomicity
Atomicity requires that each transaction is "all or nothing"
failure example
Subtract value of x from A and add it to B. Substracting from A has succeeded, but the system is unable to add the x value to B.
C for Consistency
The consistency property ensures that any transaction will bring the database from one valid state to another valid state.
failure example
Consistency is a very general term which demands that the data must meet all validation rules. Assume that transaction is going to subtract x from A without adding x to B. Before the transaction, A + B = y. After the transaction, A + B = y - 10.
or imagine a simple 1:n relation with a ON UPDATE/DELETE CASCADE that doesn't work properly. We'd end up with a Comment row related to a non-existent BlogPost row.
I for Isolation
The isolation property ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed serially, i.e. one after the other.
failure example
T[1] subtracts 10 from A
T[1] adds 10 to B
T[2] subtracts 10 from B
T[2] adds 10 to A
If T[2] waits for T[1] to be completed, everything is fine. But if it doesn't, consider that T[1] fails after subtracting and T[2] comes inbetween...
D for Durability
Come what may, our data is safe.
Durability means that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.
failure example
x is subtracted from A, everything is saved. After x has been added to B, the system returned "OK" message and stored the operation in the cache (RAM). Meanwhile, an electricity failure went out.
Eric Brewer stated it in 2000 during Symposium on Principles of Distributed Computing:
it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
Consistency (aCid - all DBMS nodes see the same data at the same time)
Availability (a guarantee that every request receives a response about whether it was successful or failed)
Partition Tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
Network Distributed Systems
CAP theorem applies to distributed DBMS used by big network systems (that's why we talk about Facebook, Google, etc).
If we are considering single-machine DBMS, Partition Tolerance is irrelevant. Consistency and Availablity are easly achievable on one machine.
2 out of 3
in other words...
Consistency means that data is the same across the cluster, so you can read or write to/from any node and get the same data.
Availability means the ability to access the cluster even if a node in the cluster goes down.
Partition Tolerance means that the cluster continues to function even if there is a "partition" (communications break) between two nodes (both nodes are up, but can't communicate).
combinations
CA - data is consistent between all nodes - as long as all nodes are online - and you can read/write from any node and be sure that the data is the same, but if you ever develop a partition between nodes, the data will be out of sync (and won't re-sync once the partition is resolved)
CP - data is consistent between all nodes, and maintains partition tolerance (preventing data desync) by becoming unavailable when a node goes down
AP - nodes remain online even if they can't communicate with each other and will resync data once the partition is resolved, but you aren't guaranteed that all nodes will have the same data (either during or after the partition)
Andrew Oliver, InfoWorld:
Part of the reason there are so many different types of NoSQL databases lies in the CAP theorem, aka Brewer’s Theorem. The CAP theorem states you can provide only two out of the following three characteristics: consistency, availability, and partition tolerance.
Different datasets and different runtime rules cause you to make different trade-offs. Different database technologies focus on different trade-offs. The complexity of the data and the scalability of the system also come into play.
Relational databases are based on relational algebra, which is more or less an outgrowth of set theory. Relationships based on set theory are effective for many datasets, but where parent-child or distance of relationships are required, set theory isn’t very effective. You may need graph theory to efficiently design a data solution.
In other words, relational databases are overkill for data that can be effectively used as key-value pairs and underkill for data that needs more context. Overkill costs you scalability; underkill costs you performance.
BASE
an alternative to ACID:
Basically Available
Soft state (panta rei)
Eventual consistency (given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate eventually through the system and the replicas will be consistent)
stands in contrast to other, more traditional relational databases whose data has ACID
BASE: why ACID alternative?
Soft State. BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux. Although this sounds impossible to cope with, in reality it is quite manageable and leads to levels of scalability that cannot be obtained with ACID.
Basic Availability. The availability of BASE is achieved through supporting partial failures without total system failure. Here is a simple example: if users are partitioned across five database servers, BASE design encourages crafting operations in such a way that a user database failure impacts only the 20 percent of the users on that particular host. There is no magic involved, but this does lead to higher perceived availability of the system.
BASE vs ACID by means of CAP
CAP is basically a continuum along which BASE and ACID are on opposite ends.
CAP is Consistency, Availability, and Partition tolerance. Basically you can pick 2 of those but you can't do all 3.
ACID focuses on Consistency and Availability.
BASE focuses on Partition Tolerance and Availability and throws consistency out the window.
CAP as ACID/BASE continuum
You can decide how close you want to be to one end of the continuum or the other according to your priorities.
CouchDB was designed with bi-direction replication (or synchronization) and off-line operation in mind. That means multiple replicas can have their own copies of the same data, modify it, and then sync those changes at a later time.
NoSQL solutions are usually meant to solve a problem that:
RDBMS are either not well suited for
RDBMS are too expansive to use (Oracle)
require you to implement something that breaks the relational nature of your db anyway
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS, there is no reason to choose NoSQL.
when to use noSQL
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on a high traffic site
the data makes no sense in SQL, you find yourself doing multiple JOIN queries for accessing some piece of information
you are breaking the relational model, you have CLOBs that store denormalized data and you generate external indexes to search that data
It is perfectly fine to use any technology in conjunction with another, e.g. MongoDB and MySQL work fine together as long as they aren't on the same machine
Scalability
Typical RDBMSs make strong guaranties about consistency. This requires to some extend communication between nodes for every transaction. This limites the ability to scale out, because more nodes means more communication.
NoSql systems make different trade offs. For example they don't guarantee that a second session will see immediately data commited by a first session. Thereby decoupling the transaction of storing some data from the process of making that data available for every user (eventually consistent). So a single transaction doesn't need to wait for any (or for much less) inter node communication. Therefore they are able to utilize a large amount of nodes much more easily.