Middleware demystified

On to data grids...

Vineet Reynolds
JBoss Developer Experience, Red Hat

Agenda

  • Shared Everything vs Shared Nothing database architectures
  • An overview of Data Grids
  • Impact on storage

Let's talk about databases first

General Relational Database Structure

Shared Everything database architectures

DeWitt, David; Gray, Jim (1992). "Parallel database systems: The future of high performance database systems"

Shared Disk database architectures

  • We won't discuss shared memory designs; they do not help much in understanding data grids.
  • Nodes have their own memory, and share mass storage.
  • Disk interconnect may be achieved via SANs.
  • OLTP databases like Oracle, IBM DB2, MySQL, PostgreSQL are architected this way.

Issues with shared disk architectures

  • Shared storage becomes a single point of contention. Limits horizontal scaling.
  • Improved performance is achieved through vertical scaling. Faster nodes == better performance.
  • Consistent writes require either:
    • disk-based lock tables, or
    • synchronization among individual nodes,
    because concurrent writers can otherwise cause lost updates.
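
A minimal sketch of the lost-update problem and the lock-table remedy mentioned above. `SharedDisk`, `unsafe_deposit_pair` and `locked_deposit` are hypothetical names made up for illustration. Two "nodes" are simulated inside one process, and a `threading.Lock` stands in for a disk-based lock table.

```python
import threading


class SharedDisk:
    """Stand-in for shared mass storage (assumption: one value per key)."""

    def __init__(self):
        self.blocks = {"balance": 100}
        # Hypothetical lock table: one lock per stored key.
        self.lock_table = {"balance": threading.Lock()}


def unsafe_deposit_pair(disk, amount_a, amount_b):
    """Two nodes read the same value before either writes it back."""
    read_by_a = disk.blocks["balance"]
    read_by_b = disk.blocks["balance"]
    disk.blocks["balance"] = read_by_a + amount_a
    # Node B overwrites node A's write: A's update is lost.
    disk.blocks["balance"] = read_by_b + amount_b


def locked_deposit(disk, amount):
    """Read-modify-write performed while holding the lock for the key."""
    with disk.lock_table["balance"]:
        disk.blocks["balance"] = disk.blocks["balance"] + amount


disk = SharedDisk()
unsafe_deposit_pair(disk, 10, 20)
print(disk.blocks["balance"])  # 120, not 130: the deposit of 10 was lost

safe = SharedDisk()
locked_deposit(safe, 10)
locked_deposit(safe, 20)
print(safe.blocks["balance"])  # 130
```

The lock serializes the read-modify-write cycle, which is exactly the coordination cost that limits horizontal scaling on shared storage.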

Shared Nothing architectures

DeWitt, David; Gray, Jim (1992). "Parallel database systems: The future of high performance database systems"

Characteristics

    In shared-nothing architectures:
  • Nodes exhibit independence and self-sufficiency with no single point of contention.
  • Horizontal scaling is done by adding more nodes and typically does not come at the cost of performance.
  • A node can typically operate only on data available locally, which leads to data partitioning techniques.
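
One way to picture data partitioning is hash partitioning: every key is mapped to exactly one owning node. The sketch below is an illustration only. `owner_of` and `PartitionedStore` are hypothetical names, and plain dictionaries stand in for each node's local memory.

```python
import hashlib


def owner_of(key, nodes):
    """Pick the owning node for a key with a stable hash (not Python's hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


class PartitionedStore:
    """Each node holds only the keys that hash to it (shared nothing)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.local = {node: {} for node in self.nodes}

    def put(self, key, value):
        self.local[owner_of(key, self.nodes)][key] = value

    def get(self, key):
        return self.local[owner_of(key, self.nodes)].get(key)


nodes = ["node-a", "node-b", "node-c"]
store = PartitionedStore(nodes)
for i in range(9):
    store.put(f"customer-{i}", {"id": i})

# Every key lives on exactly one node.
for node, entries in store.local.items():
    print(node, sorted(entries))
```

Simple modulo hashing remaps most keys when a node joins or leaves, so this sketch only suits a fixed set of nodes. Consistent hashing is a common way to limit that remapping.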

Comparison with shared-disk architectures

  • Shared-nothing architectures incur data-shipping problems when operating on data that spans multiple nodes. (Think of access plans for queries that span nodes or involve joins.)
  • Without good data affinity or partitioning techniques, loads will not be uniformly distributed.
  • Without data replication, nodes are single points of failure.
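
The last point can be sketched by giving every key more than one owner. This is an illustration under stated assumptions: `owners_of`, `ReplicatedStore` and the `num_owners` parameter are hypothetical names, backups are simply the next nodes in list order, and node failure is simulated by clearing a dictionary.

```python
import hashlib


def owners_of(key, nodes, num_owners=2):
    """Primary owner by stable hash, followed by the next nodes as backups."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    count = min(num_owners, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]


class ReplicatedStore:
    def __init__(self, nodes, num_owners=2):
        self.nodes = list(nodes)
        self.num_owners = num_owners
        self.local = {node: {} for node in self.nodes}
        self.alive = set(self.nodes)

    def put(self, key, value):
        # Write the entry to every live owner of the key.
        for node in owners_of(key, self.nodes, self.num_owners):
            if node in self.alive:
                self.local[node][key] = value

    def fail(self, node):
        # Simulated crash: the node's in-memory data is gone.
        self.alive.discard(node)
        self.local[node].clear()

    def get(self, key):
        # Read from the first live owner that still has the entry.
        for node in owners_of(key, self.nodes, self.num_owners):
            if node in self.alive and key in self.local[node]:
                return self.local[node][key]
        return None


nodes = ["node-a", "node-b", "node-c"]
store = ReplicatedStore(nodes, num_owners=2)
store.put("order-1", "paid")
store.fail(owners_of("order-1", nodes)[0])  # lose the primary owner
print(store.get("order-1"))  # paid (served by the backup owner)
```

With `num_owners=1` the same failure loses the entry, which is the single point of failure the slide warns about.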

Defining Data Grids

    Most people know what a database is. Very few know what a data grid is. A data grid is not:
  • an in-memory relational database, or
  • a simple data caching solution, or
  • even a NoSQL database.
"A Data Grid is a system composed of multiple servers that work together to manage information and related operations – such as computations – in a distributed environment."
Cameron Purdy (2008). "Defining a Data Grid"
Data grids are distributed databases designed for scalability having the characteristics of shared-nothing architectures.
Note the lack of a 'relational' qualifier for the database. Data grids typically store objects, not tuples.
In-memory data grids store data in-memory for fast access to large volumes of data.

Data grids and storage

Data grids do not impose one consistent set of design constraints on storage systems. The constraints depend on the implementation of the data grid and on the applications accessing it.

Online and offline writes

    Data grids can be instructed to write to disk* in online or offline mode. This is usually a capability of the data grid itself.
  • Online writes == Write-through mode. Clients block until the write is complete.
  • Offline writes == Write-behind mode. Clients don't block waiting for the write. The write is completed in the background.
* "Disk" here stands for block I/O devices as well as a DBMS. JBoss Infinispan (an in-memory data grid) can use files, an RDBMS and other stores for passivation.
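
The difference between the two modes can be sketched as follows. This is not how any particular data grid implements it. `SlowStore`, `WriteThroughCache` and `WriteBehindCache` are hypothetical names, and a background thread with a queue stands in for the grid's asynchronous writer.

```python
import queue
import threading


class SlowStore:
    """Hypothetical persistent store ("disk" in the broad sense used above)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value


class WriteThroughCache:
    def __init__(self, store):
        self.memory = {}
        self.store = store

    def put(self, key, value):
        self.memory[key] = value
        # The caller blocks here until the store write has returned.
        self.store.write(key, value)


class WriteBehindCache:
    def __init__(self, store):
        self.memory = {}
        self.store = store
        self.pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        self.memory[key] = value
        # Only enqueue the write; the caller returns immediately.
        self.pending.put((key, value))

    def _drain(self):
        # Background writer: applies queued writes to the store in order.
        while True:
            key, value = self.pending.get()
            self.store.write(key, value)
            self.pending.task_done()

    def flush(self):
        """Wait until all queued writes have reached the store."""
        self.pending.join()


through_store = SlowStore()
WriteThroughCache(through_store).put("k", 1)
print(through_store.data)  # {'k': 1}

behind_store = SlowStore()
behind = WriteBehindCache(behind_store)
behind.put("k", 1)
behind.flush()  # only needed here to observe the background write
print(behind_store.data)  # {'k': 1}
```

Write-behind trades durability for latency: entries still sitting in the queue are lost if the node fails before they are written.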

Modelling IO operations - #1

    Data grids modelled after traditional databases will typically:
  • perform a write to disk whenever objects in the grid are updated.
  • evict objects (remove from memory + no local disk write) from the grid based on policies - LRU, LIRS, etc.
  • expire objects (remove from memory + cluster disk write) from the grid based on policies.
  • read objects from memory if available, else read from disk.
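
A minimal single-node sketch of this model, assuming LRU eviction. `GridNode` is a hypothetical name, a plain dict stands in for the disk, and expiration is left out to keep the sketch short.

```python
from collections import OrderedDict


class GridNode:
    def __init__(self, max_entries, disk):
        self.max_entries = max_entries
        self.disk = disk              # dict standing in for the persistent store
        self.memory = OrderedDict()   # least recently used entry comes first
        self.disk_writes = 0

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        self.disk[key] = value        # write to disk on every update
        self.disk_writes += 1
        self._evict_if_needed()

    def get(self, key):
        if key in self.memory:        # served from memory
            self.memory.move_to_end(key)
            return self.memory[key]
        if key not in self.disk:
            return None
        value = self.disk[key]        # memory miss: read from disk
        self.memory[key] = value
        self._evict_if_needed()
        return value

    def _evict_if_needed(self):
        while len(self.memory) > self.max_entries:
            # No disk write on eviction: the entry was persisted on update.
            self.memory.popitem(last=False)


node = GridNode(max_entries=2, disk={})
node.put("a", 1)
node.put("b", 2)
node.put("c", 3)                      # evicts "a" from memory
print(list(node.memory))              # ['b', 'c']
print(node.get("a"))                  # 1, read back from disk
print(node.disk_writes)               # 3
```

Every update costs one disk write in this model, so write throughput is bounded by the store.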

Modelling IO operations - #2

    As data grids evolve, implementations may move to:
  • write to disk only during eviction!
  • ensure HA through replication.
Kallman, Robert; et al. (2008). "H-Store: A High-Performance, Distributed Main Memory Transaction Processing System"
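
A sketch of the write-on-eviction idea, again assuming LRU eviction on a single node. `EvictionWriteNode` is a hypothetical name and is not taken from the cited paper or any product. Replication for HA is not shown.

```python
from collections import OrderedDict


class EvictionWriteNode:
    def __init__(self, max_entries, disk):
        self.max_entries = max_entries
        self.disk = disk              # dict standing in for the persistent store
        self.memory = OrderedDict()   # least recently used entry comes first
        self.disk_writes = 0

    def put(self, key, value):
        # Updates touch memory only.
        self.memory[key] = value
        self.memory.move_to_end(key)
        self._evict_if_needed()

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key]
        if key not in self.disk:
            return None
        value = self.disk[key]        # bring an evicted entry back into memory
        self.memory[key] = value
        self._evict_if_needed()
        return value

    def _evict_if_needed(self):
        while len(self.memory) > self.max_entries:
            old_key, old_value = self.memory.popitem(last=False)
            self.disk[old_key] = old_value   # the only place disk is written
            self.disk_writes += 1


node = EvictionWriteNode(max_entries=2, disk={})
for version in range(100):
    node.put("hot", version)          # 100 updates, no disk writes
print(node.disk_writes)               # 0
node.put("b", 1)
node.put("c", 2)                      # evicts "hot", which is written once
print(node.disk_writes, node.disk)    # 1 {'hot': 99}
```

Entries that have not been evicted exist only in memory, which is why this model depends on replication rather than disk for availability.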

Questions?
