PHC7065 CRITICAL SKILLS IN DATA MANIPULATION FOR POPULATION SCIENCE
NoSQL Databases 1
Hui Hu Ph.D.
Department of Epidemiology
College of Public Health and Health Professions & College of Medicine
February 12, 2017
NoSQL Databases and Semi-structured Data Models
Introduction to MongoDB
Lab: MongoDB 1
NoSQL Databases and Semi-structured Data Model
NoSQL
- NoSQL: "non SQL", "non relational" or "not only SQL"
- It provides a mechanism for storage and retrieval of data which is modeled in means other than the tabular relations used in relational databases
- NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages.
Types of NoSQL Databases
-
Column: Accumulo, Cassandra, Druid, HBase, Vertica, SAP HANA
-
Document: Apache CouchDB, ArangoDB, Clusterpoint, Couchbase, DocumentDB, HyperDex, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB
-
Key-value: Aerospike, ArangoDB, Couchbase, Dynamo, FairCom c-treeACE, FoundationDB, HyperDex, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, Berkeley DB
-
Graph: AllegroGraph, ArangoDB, InfiniteGraph, Apache Giraph, MarkLogic, Neo4J, OrientDB, Virtuoso, Stardog
- Multi-model: Alchemy Database, ArangoDB, CortexDB, Couchbase, FoundationDB, MarkLogic, OrientDB
Semi-structured Data
- A form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables
- It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
- Also known as self-describing structure
- The data model behind the web
html
- Semi-structured data are usually tree-structured
body
h1
p
p
li
li
Tree Data Structure
- Paper
- getParent -> document
- getChildren -> title, author, source
- getSibling -> report
- "June 12, 2000"
- root-to-node path: document/report/date/"June 12, 2000"
- Queries need tree navigation
- e.g. author of "XML query data model"
xml
- The entities belonging to the same class may have different attributes even though they are grouped together, and the attributes' order is not important
Extensive Markup Language
- Allows the querying of both schema and data
- What is the name of the element which contains a sub-element whose value is "CellType"?
JSON
JavaScript Object Notation
- Key-value pairs
- Tuple
- Arrays (indicated by square brackets)
[
{
_id: 1,
name: "john",
age: 21,
type: 1,
status: "A",
favorites: {artist:"Picasso",food:"pizza"},
finished: [17,3],
badges: ["blue","black"],
points: [
{points: 85, bonus: 10},
{points: 70, bonus: 5}
]
},
{
_id: 2,
name: "mike",
age: 20
}
]
One structure can be embedded in another structure
Introduction to MongoDB
MongoDB
- MongoDB (from "humongous") is a cross-platform document-oriented database.
- Classified as a NoSQL database, MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (BSON), making the integration of data in certain types of applications easier and faster
- SQL to MongoDB Mapping Chart
find()
SELECT
- MongoDB is a collection of documents
db.collection.find(<query filter>,<projection>).<cursor modifier>
Similar to FROM, specifies the collection to use
Similar to WHERE, specifies which documents to return
Projection variables in SELECT
How many results to return
Examples
- Query 1:
- SQL: SELECT * FROM Student
- MongoDB: db.Student.find()
- Query 2:
- SQL: SELECT a, b FROM Student
- MongoDB:
db.Student.find(
{},
{a:1,b:1,_id:0}
)
- Query 3:
- SQL: SELECT a FROM Student WHERE b="c"
- MongoDB:
db.Student.find( {b:"c"}, {a:1,_id:0} )
Examples (continued)
- Query 4:
- SQL: SELECT DISTINCT a FROM Student WHERE c>10
- MongoDB:
db.Student.find( {c:{$gt:10}},{a:1,_id:0} ).distinct(a)
- Some Operators of MongoDB:
$eq, $gt, $gte, $lt, $lte, $ne
$in, $nin
$or, $and, $not, $nor
- Query 5:
- count the number of students whose names have the partial string "joh" in it - must be case insensitive
- db.Student.find(name: {$regex: /joh/i}).count()
- Query 6:
- same query, but name starts with "Joh"
- db.Student.find(name: {$regex:/^Joh/}).count()
- Query 7:
- starts with "Joh" ends with "on"
- db.Student.count(name: {$regex:/^Jon.*on$/}
- Regular Expressions
Array Operations
- Find students who are taking courses "a" and "b"
- db.Student.find( {courses: {$in:["a","b"]} } )
- Find students who are not taking courses "a" and "c"
- db.Student.find( {courses: {$nin:["a","c"]} } )
-
Find the second and third elements of courses
- db.Student.find( {}, {courses: {$slice: [1,2]} } )
- db.Student.find( {}, {courses: {$slice: -2} } )
- Find a student whose second element in courses is "b"
- db.Student.find(courses.1:"b")
{_id:1,
name:"John",
age:20,
courses: ["a","b","c"]}
Compound Statements
- Find students with age 20 or 21 and who are taking course "a" or "b" as their first course
- SELECT * FROM Students
WHERE (age=20 OR age=21) AND (course0="a" OR course0="b")
- db.Student.find({
$and:[
{$or: [{age:20},{age:21}] },
{$or: [{courses.0:"a"},{courses.0:"b"}] }
]
})
{_id:1,
name:"John",
age:20,
courses: ["a","b","c"]}
Queries over Nested Elements
_id:1,
points: [
{points: 90, bonus:10}
{points: 60, bonus:20}
]
_id:2,
points: [
{points: 53, bonus:20}
{points: 64, bonus:11}
]
_id:1,
points: [
{points: 99, bonus:17}
{points: 10, bonus:24}
]
- db.Points.find( {'points.0.points': {$lte: 80} } )
-
db.Points.find( {'points.points': {$lte: 80} } )
- db.Points.find( { 'points.points':{$lte:81}, 'points.bonus': {$lte 20} } )
MongoDB 1
git pull
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-ubuntu/
PHC7065-Spring2018-Lecture5
By Hui Hu
PHC7065-Spring2018-Lecture5
Slides for Lecture 5, Spring 2018, PHC7065 Critical Skills in Data Manipulation for Population Science
- 668