PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE

NoSQL Databases

Hui Hu Ph.D.

Department of Epidemiology

College of Public Health and Health Professions & College of Medicine

April 6, 2020

NoSQL Databases and Semi-structured Data Models
 

Introduction to MongoDB

 

Lab: MongoDB

NoSQL Databases and Semi-structured Data Model

NoSQL

  • NoSQL: "non SQL", "non relational" or "not only SQL"
     
  • It provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases
     
  • NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages.

Types of NoSQL Databases

  • Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
     
  • Document: Apache CouchDB, ArangoDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
     
  • Key-value: Aerospike, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB
     
  • Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
     
  • Multi-model: Alchemy Database, ArangoDB, CortexDB, Couchbase, FoundationDB, MarkLogic, OrientDB

Semi-structured Data

  • A form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables
     
  • It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
     
  •  Also known as self-describing structure
     
  • The data model behind the web

html

  • Semi-structured data are usually tree-structured

body

h1

p

p

li

li

Tree Data Structure

  • Paper
    -  getParent -> document
    -  getChildren -> title, author, source
    -  getSibling -> report
     
  • "June 12, 2000"
    -  root-to-node path: document/report/date/"June 12, 2000"
     
  • Queries need tree navigation
    -  e.g. author of "XML query data model"

xml

  • The entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important

Extensive Markup Language

  • Allows the querying of both schema and data
    -  What is the name of the element which contains a sub-element whose value is "CellType"?

JSON

JavaScript Object Notation

  • Key-value pairs
  • Tuple
  • Arrays (indicated by square brackets)
[
    {
     _id: 1,
     name: "john",
     age: 21,
     type: 1,
     status: "A",
     favorites: {artist:"Picasso",food:"pizza"},
     finished: [17,3],
     badges: ["blue","black"],
     points: [
         {points: 85, bonus: 10},
         {points: 70, bonus: 5}
     ]
    },
    {
     _id: 2,
     name: "mike",
     age: 20
    }
]

One structure can be embedded in another structure

Introduction to MongoDB

MongoDB

  • MongoDB (from "humongous") is a cross-platform document-oriented database.
  • Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (BSON), making the integration of data in certain types of applications easier and faster
  • SQL to MongoDB Mapping Chart

Connect to MongoDB using "mongolite" package in R

  • MongoDB is a collection of documents
con <- mongo(collection="collectionName", db="dbName", url="url")

Similar to Table in relational databases

con <- mongo(collection="Students", db="phc7065", url="mongodb://localhost:27017")
  • Example:

find()

SELECT

con$find()
  • Query 1:
    -  SQL: SELECT * FROM Student
    -  MongoDB: con$find()
     
  • Query 2:
    -  SQL: SELECT a, b FROM Student
    -  MongoDB:
       con$find(
         {},
         fields='{"a":1,"b":1,"_id":0}'
       )
     
  • Query 3:
    -  SQL: SELECT a FROM Student WHERE b="c"
    -  MongoDB:
       con$find( '{"b":"c"}', fields='{"a":1,"_id":0}' )

Examples (continued)

  • Query 4:
    -  SQL: SELECT DISTINCT a FROM Student WHERE c>10
    -  MongoDB:
       unique(con$find( '{"c":{"$gt":10}}',fields: '{"a":1,"_id":0}' ))
     
  • Some Operators of MongoDB:
    $eq, $gt, $gte, $lt, $lte, $ne
    $in, $nin
    $or, $and, $not, $nor

Array Operations

  • Find students who are taking courses "a" and "b"
    -  con$find( '{"courses": {"$in":["a","b"]} }' )
     
  • Find students who are not taking courses "a" and "c"
    -  con$find( '{"courses": {"$nin":["a","c"]} }' )
     
  • Find the second and third elements of courses
    -  con$find( '{}', fields='{"courses": {"$slice": [1,2]} }' )
    -  con$find( '{}', fields='{"courses": {"$slice": -2} }' )

     
  • Find a student whose second element in courses is "b"
    -  con$find('{"courses.1":"b"}')
{_id:1,
name:"John",
age:20,
courses: ["a","b","c"]}

Compound Statements

  • Find students with age 20 or 21 and who are taking course "a" or "b" as their first course
     
  • SELECT * FROM Students
    WHERE (age=20 OR age=21) AND (course0="a" OR course0="b")
     
  •    con$find('{
       "$and":[
          {"$or": [{"age":20},{"age":21}] },
          {"$or": [{"courses.0":"a"},{"courses.0":"b"}] }
       ]
       }')
{_id:1,
name:"John",
age:20,
courses: ["a","b","c"]}

Queries over Nested Elements

_id:1,
    points: [
        {points: 90, bonus:10}
        {points: 60, bonus:20}
    ]
_id:2,
    points: [
        {points: 53, bonus:20}
        {points: 64, bonus:11}
    ]
_id:1,
    points: [
        {points: 99, bonus:17}
        {points: 10, bonus:24}
    ]
  • con$find( '{"points.0.points": {"$lte": 80} }' )
     
  • con$find( '{"points.points": {"$lte": 80} }' )
     
  • con$find( '{ "points.points":{"$lte":80}, "points.bonus": {"$lte": 20} }' )
     

Counting and Distinct

con$count('{"age":{"$exists":true}}')
SELECT COUNT(age) FROM Students
SELECT COUNT(DISTINCT age) FROM Students
length(con$distinct('age'))

Aggregation

Aggregation

con$aggregate('[
    {"$group":{"_id":"$department","meanAge":{"$avg":"$age"}}}
]')
SELECT MEAN(age) FROM Students GROUP BY department;
SELECT MEAN(age) FROM Students 
WHERE course0='a' OR course1='a' OR course2='a' 
GROUP BY department;
con$aggregate('[
    {"$match":{"courses":{"$in":["a"]}}}
    {"$group":{"_id":"$department","meanAge":{"$avg":"$age"}}}
]')

Multi-attribute Aggregation

con$aggregate('[
    {
     "$group":{
        "_id":{"department":"$department","college":"$college"},
        "meanAge":{"$avg":"$age"}
        }
    }
]')
SELECT MEAN(age) FROM Students GROUP BY department, college;

Aggregation with Text Search

con$aggregate('[
    {"$match":{"$text":{"$search":"Miami Florida"}}},
    {"$sort":{"score":{"$meta":"textScore"}}},
    {"$project":{"hometown":1,"name":1}}
]')

Join

Join

Colleges

Students

con$aggregate('[
    {
     "$lookup":{
        "from":"Students",
        "localField":"name",
        "foreignField":"college",
        "as":"students"
      }
    }
]'')
con$aggregate('[
    {
     "$lookup":{
        "from":"Colleges",
        "localField":"college",
        "foreignField":"name",
        "as":"colleges"
      }
    }
]')

MongoDB

git pull

PHC7065-Spring2020-Lecture8

By Hui Hu

PHC7065-Spring2020-Lecture8

Slides for Lecture 8, Spring 2020, PHC7065 Critical Skills in Data Manipulation for Population Science

  • 785