PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE
NoSQL Databases
Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
April 6, 2020
NoSQL Databases and Semi-structured Data Models
Introduction to MongoDB
Lab: MongoDB
NoSQL Databases and Semi-structured Data Model
NoSQL
- NoSQL: "non SQL", "non relational" or "not only SQL"
- It provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases
- NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages.
Types of NoSQL Databases
-
Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
-
Document: Apache CouchDB, ArangoDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
-
Key-value: Aerospike, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB
-
Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
- Multi-model: Alchemy Database, ArangoDB, CortexDB, Couchbase, FoundationDB, MarkLogic, OrientDB
Semi-structured Data
- A form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables
- It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
- Also known as self-describing structure
- The data model behind the web
html
- Semi-structured data are usually tree-structured
body
h1
p
p
li
li
Tree Data Structure
- Paper
- getParent -> document
- getChildren -> title, author, source
- getSibling -> report
- "June 12, 2000"
- root-to-node path: document/report/date/"June 12, 2000"
- Queries need tree navigation
- e.g. author of "XML query data model"
xml
- The entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important
Extensive Markup Language
- Allows the querying of both schema and data
- What is the name of the element which contains a sub-element whose value is "CellType"?
JSON
JavaScript Object Notation
- Key-value pairs
- Tuple
- Arrays (indicated by square brackets)
[
{
_id: 1,
name: "john",
age: 21,
type: 1,
status: "A",
favorites: {artist:"Picasso",food:"pizza"},
finished: [17,3],
badges: ["blue","black"],
points: [
{points: 85, bonus: 10},
{points: 70, bonus: 5}
]
},
{
_id: 2,
name: "mike",
age: 20
}
]
One structure can be embedded in another structure
Introduction to MongoDB
MongoDB
- MongoDB (from "humongous") is a cross-platform document-oriented database.
- Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (BSON), making the integration of data in certain types of applications easier and faster
- SQL to MongoDB Mapping Chart
Connect to MongoDB using "mongolite" package in R
- MongoDB is a collection of documents
con <- mongo(collection="collectionName", db="dbName", url="url")
Similar to Table in relational databases
con <- mongo(collection="Students", db="phc7065", url="mongodb://localhost:27017")
- Example:
find()
SELECT
con$find()
- Query 1:
- SQL: SELECT * FROM Student
- MongoDB: con$find()
- Query 2:
- SQL: SELECT a, b FROM Student
- MongoDB:
con$find(
{},
fields='{"a":1,"b":1,"_id":0}'
)
- Query 3:
- SQL: SELECT a FROM Student WHERE b="c"
- MongoDB:
con$find( '{"b":"c"}', fields='{"a":1,"_id":0}' )
Examples (continued)
- Query 4:
- SQL: SELECT DISTINCT a FROM Student WHERE c>10
- MongoDB:
unique(con$find( '{"c":{"$gt":10}}',fields: '{"a":1,"_id":0}' ))
- Some Operators of MongoDB:
$eq, $gt, $gte, $lt, $lte, $ne
$in, $nin
$or, $and, $not, $nor
Array Operations
- Find students who are taking courses "a" and "b"
- con$find( '{"courses": {"$in":["a","b"]} }' )
- Find students who are not taking courses "a" and "c"
- con$find( '{"courses": {"$nin":["a","c"]} }' )
-
Find the second and third elements of courses
- con$find( '{}', fields='{"courses": {"$slice": [1,2]} }' )
- con$find( '{}', fields='{"courses": {"$slice": -2} }' )
- Find a student whose second element in courses is "b"
- con$find('{"courses.1":"b"}')
{_id:1,
name:"John",
age:20,
courses: ["a","b","c"]}
Compound Statements
- Find students with age 20 or 21 and who are taking course "a" or "b" as their first course
- SELECT * FROM Students
WHERE (age=20 OR age=21) AND (course0="a" OR course0="b")
- con$find('{
"$and":[
{"$or": [{"age":20},{"age":21}] },
{"$or": [{"courses.0":"a"},{"courses.0":"b"}] }
]
}')
{_id:1,
name:"John",
age:20,
courses: ["a","b","c"]}
Queries over Nested Elements
_id:1,
points: [
{points: 90, bonus:10}
{points: 60, bonus:20}
]
_id:2,
points: [
{points: 53, bonus:20}
{points: 64, bonus:11}
]
_id:1,
points: [
{points: 99, bonus:17}
{points: 10, bonus:24}
]
- con$find( '{"points.0.points": {"$lte": 80} }' )
-
con$find( '{"points.points": {"$lte": 80} }' )
- con$find( '{ "points.points":{"$lte":80}, "points.bonus": {"$lte": 20} }' )
Counting and Distinct
con$count('{"age":{"$exists":true}}')
SELECT COUNT(age) FROM Students
SELECT COUNT(DISTINCT age) FROM Students
length(con$distinct('age'))
Aggregation
Aggregation
con$aggregate('[
{"$group":{"_id":"$department","meanAge":{"$avg":"$age"}}}
]')
SELECT MEAN(age) FROM Students GROUP BY department;
SELECT MEAN(age) FROM Students
WHERE course0='a' OR course1='a' OR course2='a'
GROUP BY department;
con$aggregate('[
{"$match":{"courses":{"$in":["a"]}}}
{"$group":{"_id":"$department","meanAge":{"$avg":"$age"}}}
]')
Multi-attribute Aggregation
con$aggregate('[
{
"$group":{
"_id":{"department":"$department","college":"$college"},
"meanAge":{"$avg":"$age"}
}
}
]')
SELECT MEAN(age) FROM Students GROUP BY department, college;
Aggregation with Text Search
con$aggregate('[
{"$match":{"$text":{"$search":"Miami Florida"}}},
{"$sort":{"score":{"$meta":"textScore"}}},
{"$project":{"hometown":1,"name":1}}
]')
Join
Join
Colleges
Students
con$aggregate('[
{
"$lookup":{
"from":"Students",
"localField":"name",
"foreignField":"college",
"as":"students"
}
}
]'')
con$aggregate('[
{
"$lookup":{
"from":"Colleges",
"localField":"college",
"foreignField":"name",
"as":"colleges"
}
}
]')
MongoDB
git pull
PHC7065-Spring2020-Lecture8
By Hui Hu
PHC7065-Spring2020-Lecture8
Slides for Lecture 8, Spring 2020, PHC7065 Critical Skills in Data Manipulation for Population Science
- 891