Social and Political Data Science: Introduction

Information Management

Karl Ho

School of Economic, Political and Policy Sciences

University of Texas at Dallas

Introduction

Why takes this course

Source: Khuyen Tran. 2021. What skills and backgrounds do data scientists have in common?

Not the salary

Part One Relational Languages

Part Two Database Design

Part Three Application Design and Development

Part Four Big Data Analytics

Part Five Storage Management and Indexing

Part Six Query Processing and Optimization

Part Seven Transaction Management

Part Eight Parallel and Distributed Databases

Part Nine Advanced Topics

Part Ten Appendix A

Part Eleven Online Chapters

SKS Book structure

Part One Relational Languages

Part Two Database Design

Part Three Application Design and Development

Part Four Big Data Analytics

Part Five Storage Management and Indexing

Part Six Query Processing and Optimization

Part Seven Transaction Management

Part Eight Parallel and Distributed Databases

Part Nine Advanced Topics

Part Ten Appendix A

Part Eleven Online Chapters

Part One Relational Languages

Part Two Database Design

Part Three Application Design and Development

Part Four Big Data Analytics

Part Five Storage Management and Indexing

Part Six Query Processing and Optimization

Part Seven Transaction Management

Part Eight Parallel and Distributed Databases

Part Nine Advanced Topics

Part Ten Appendix A

Part Eleven Online Chapters

SKS Book structure

Part One Relational Languages

Part Two Database Design

Part Three Application Design and Development

Part Four Big Data Analytics

Part Five Storage Management and Indexing

Part Six Query Processing and Optimization

Part Seven Transaction Management

Part Eight Parallel and Distributed Databases

Part Nine Advanced Topics

Part Ten Appendix A

Part Eleven Online Chapters

Part One Relational Languages

Part Two Database Design

Part Three Application Design and Development

Part Four Big Data Analytics

Part Five Storage Management and Indexing

Part Six Query Processing and Optimization

Part Seven Transaction Management

Part Eight Parallel and Distributed Databases

Part Nine Advanced Topics

Part Ten Appendix A

Part Eleven Online Chapters

SKS Book structure

Part One Relational Languages

Part Two Database Design

Part Three Application Design and Development

Part Four Big Data Analytics

Part Five Storage Management and Indexing

Part Six Query Processing and Optimization

Part Seven Transaction Management

Part Eight Parallel and Distributed Databases

Part Nine Advanced Topics

Part Ten Appendix A

Part Eleven Online Chapters

Class structure

Topic	Reading
Introduction	IF 1, 4
Relational Languages	SKS 1, 2
SQL I	SKS 3, 4
SQL II	SKS 5
ERM and Relational Database Design	SKS 6, 7
Complex Data Types	SKS 8
Application Design	SKS 9
Big Data	SKS 10; IF 5
Data Analytics	SKS 11
Query Processing	SKS 15
Database-System Architectures	SKS 20
Parallel and Distributed Systems	SKS 21-22
Blockchain Databases	SKS 26
New Developments in Data/Information systems	SKS 28, 29, 30, 32; Beckman reports 2013, 2016

Topic	Reading
Introduction	IF 1, 4
Relational Languages	SKS 1, 2
SQL I	SKS 3, 4
SQL II	SKS 5
ERM and Relational Database Design	SKS 6, 7
Complex Data Types	SKS 8
Application Design	SKS 9
Big Data	SKS 10; IF 5
Data Analytics	SKS 11
Query Processing	SKS 15
Database-System Architectures	SKS 20
Parallel and Distributed Systems	SKS 21-22
Blockchain Databases	SKS 26
New Developments in Data/Information systems	SKS 28, 29, 30, 32; Beckman reports 2013, 2016

Types of Data (by class)

Quantitative Data: Numerical. e.g. height, weight
1. INTeger
2. NUMeric
Categorical Data: Data that can be labeled or divided into groups. e.g. race, sex, hair color.
1. Factor
Text Data
1. CHARacter
Logical Data
1. TRUE/FALSE

Types of Data (by structure)

Structured: Data that has predefined structures. e.g. tables, spreadsheets, or relational databases.
Unstructured Data: Data with no predefined structure, comes in any size or form, cannot be easily stored in tables. e.g. blobs of text, images, audio

A database is a structured collection of data about entities and their relationships.

Database

Object - entities (e.g. individuals, organization, etc.)
Relationship - (John works in WTO)

Computer Science
Business/Management
Physical sciences
Policy Sciences
Social Sciences

Database use

The university database example will be used to illustrate all the concepts.

Data consists of information about:
- Students
- Instructors
- Classes
Application program examples:
- Add new students, instructors, and courses
- Register students for courses, and generate class rosters
- Assign grades to students, compute grade point averages (GPA) and generate transcripts

SKS example: University Database

1950s and early 1960s
1. Data processing using magnetic tapes for storage
2. Tapes provided only sequential access
3. Punched cards for input

History of Database Systems

Late 1960s and 1970s
1. Hard disks allowed direct access to data
2. Network and hierarchical data models in widespread use
  - Edgar (Ted) Frank Codd defines the relational data model
  - IBM Research begins System R prototype
  - UC Berkeley (Michael Stonebraker) begins Ingres prototype
  - Oracle releases first commercial relational database
  - High-performance (for the era) transaction processing

History of Database Systems

1980s
1. Research relational prototypes evolve into commercial systems
  1. SQL becomes industrial standard
2. Parallel and distributed database systems
  1. Wisconsin, IBM, Teradata
3. Object-oriented database systems

History of Database Systems

1990s
1. Large decision support and data-mining applications
2. Large multi-terabyte data warehouses
3. Emergence of Web commerce

History of Database Systems

2000s
1. Big data storage systems
  1. Google BigTable, Yahoo PNuts, Amazon,
  2. “NoSQL” systems.
2. Big data analysis: beyond SQL
  1. Map reduce and friends

History of Database Systems

2010s
1. SQL reloaded
  1. SQL front end to Map Reduce systems
  2. Massively parallel database systems
  3. Multi-core main-memory databases

History of Database Systems

“Those that do not understand the mistakes of their ancestors will end up repeating them”

- Michael Stonebraker

Edgar Frank Codd (1923-2003)

Most known as Ted Codd
Turing Award (1981)
British born American computer scientist
Education: Oxford, Michigan
Work: IBM
Invented \("Relational"\) database when working in IBM in the early 1970s
Famous paper:
"A Relational Model of Data for Large Shared Data Banks"

https://history.computer.org/pioneers/codd.html

Relational model

All the data is stored in various tables.
Example of tabular data in the relational model

Relational model

Non-nested tables

Michael Stonebraker

Turing award laureate (2014)
Education: Princeton, Michigan
Work: UC Berkeley, Michigan, MIT
Create:
- INGRES
- PostGRES
- SciDB
Founded 9 startups
Current startup:
- paradigm4 (SciDB)
Famous publication:
The Red Book (Readings in Database Systems)

PostgreSQL

Ingress means: a place or means of access, an entrance. POSTGRES, that is the successor to the INGRES relational database system.

Oracle

SciDB

Database types

Relational DBMS
Key-value stores
Document stores
Graph DBMS
Wide-column stores
Search engines
Time series DBMS

Key-value store

A key-value storage system (or key-value store) is a system that provides a way to store or update a record (value) with an associated key and to retrieve the record with a given key.

Document store

Document store is a semi-structured, document-oriented database designed and optimized to store and work with XML documents such as MongoDB.

Abstraction allows complexity to be manageable even without full knowledge of the system.

Concept of Abstraction

Abstract view of the information enables users and application programmers to administer the database without understanding the full details of how data are stored and organized.

A database system is a collection of interrelated data and a set of programs that allow users to access and modify these data. A major purpose of a database system is to provide users with an abstract view of the data. That is, the system hides certain details of how the data are stored and maintained.

Abstract view of database

A major purpose of a database system is to provide users with an abstract view of the data
- Data models
  - A collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.
- Data abstraction
  - Hide the complexity of data structures to represent data in the database from users through several levels of data abstraction.

Abstract view of database

Architecture of database

SKS, p. 10

Physical level: describes how a record is stored.

Logical level: describes data stored in database, and the relationships among the data.

View level: Views can hide information (such as person's ID) for security purposes.

a collection of conceptual tools for describing data, data relationships, data semantics, and consistency constraints.
- Relational Model
- Entity-Relationship Model
- Semi-structured Data Model
- Object-Based Data Model
- Semi-structured data model (XML)
- Other older models:
  - Network model
  - Hierarchical model

Data model

A data model specifies the data elements associated with a problem domain, the properties of those data elements, and how those data elements relate to one another.

Data model

In developing a data model, we commonly first identify the entities that are to be modeled and then define their properties and relationships.

Data model

For example, the entities include individuals, institutions, parties, each of which has various properties (e.g., for individuals, name, address, employer)

The relationships include “is employed by” and “support” This conceptual data model can then be translated into relational tables or some other database representation.

A table (also referred to as a relation) is a set of rows (also referred to as tuples, records, or observations), each with the same columns (also referred to as fields, attributes or variables). A database consists of multiple tables.

Data table

There are over 300 database system in the market, according to DB Engines. The top three are:

Database

Oracle
MySQL
Microsoft SQL Server

In the early days, database applications were built directly on top of file systems.

File-processing system

What could be the problem for keeping records in files and steel cabinets?

https://www.databankimx.com/

https://history.nasa.gov/computers/Ch8-2.html

Disadvantages

Data redundancy and inconsistency
Difficulty in Data access
Data isolation
Integrity
Atomicity
Concurrent-access anomalies
Security

File-processing system

Data redundancy and inconsistency:
- data is stored in multiple file formats resulting in duplication of information in different files
Difficulty in accessing data
- Need to write a new program to carry out each new task
Data isolation
- Multiple files and formats
Integrity problems
- Integrity constraints (e.g., account balance > 0) become “buried” in program code rather than being stated explicitly
- Hard to add new constraints or change existing ones

File-processing system

Atomicity of updates
- Failures may leave database in an inconsistent state with partial updates carried out
- Example: Transfer of funds from one account to another should either complete or not happen at all

File-processing system

Atomicity:

The process must happen in its entirety or not at all.
e.g. fund transfer from one account to the other account

Concurrent access by multiple users
- Concurrent access needed for performance
- Uncontrolled concurrent accesses can lead to inconsistencies
  - E.g. Two people reading a balance (say 100) and updating it by withdrawing money (say 50 each) at the same time
Security problems
- Hard to provide user access to some, but not all, data

File-processing system

A database management system is a software suite designed to safely store and efficiently manage databases, and to assist with the maintenance and discovery of the relationships that database represents.

Database Management Systems (DBMS)

DBMS contains information about a particular enterprise
- Collection of interrelated data
- Set of programs to access the data
- An environment that is both convenient and efficient to use
Database systems are used to manage collections of data that are:
- Highly valuable
- Relatively large
- Accessed by multiple users and applications, often at the same time.
Databases touch all aspects of our lives

Database Management Systems (DBMS)

A database management system coordinates both the physical and the logical access to the data, whereas a file-processing system coordinates only the former
DBMS reduces the amount of data duplication by ensuring that a physical piece of data is available to all programs authorized to have access to it, whereas file-processing system is for one program only.

File-processing system vs. DBMS

DBMS allows flexible access to data (i.e., queries), whereas a file-processing system does pre-determined access to data (i.e., compiled programs).
DBMS can coordinate multiple users accessing the same data at the same time. A file-processing system is usually designed to allow one or more programs to access different data files at the same time. In a file-processing system, a file can be accessed by two programs concurrently only if both programs have read-only access to the file.

File-processing system vs. DBMS

Three key components:

Data model
- defines how data are represented
Query language
- defines how the user interacts with the data
Transactions and crash recovery
- to ensure reliable execution despite system failures

Database Management Systems (DBMS)

Five responsibilities:

interaction with the file manager
integrity enforcement
security enforcement
backup and recovery
concurrency control

Database Management Systems (DBMS)

Key components of DBMS

Foster et al. 2016, p. 97

Database types

Foster et al. 2016, p. 97

Foster et al. 2016, p. 99

ACID: Atomicity, Consistency, Isolation, Durability

Relational Database and NoSQL

Foster et al. 2016, p. 100

document-oriented database
key-value database
column-oriented database
graph database
Hadoop system

NoSQL

document-oriented database
key-value database
column-oriented database
graph database
Hadoop system

What's new?

PostgreSQL is an open-source object-relational database management system, first developed by Michael Stonebraker at the University of California, Berkeley.
The name “postgres” is derived from the name of a pioneering relational database system, Ingres, also developed by Stonebraker.
PostgreSQL supports many aspects of SQL:2003 and offers features such as complex queries, foreign keys, triggers, views, transactional integrity, full-text searching, and limited data replication.
Extensible with new data types, functions, operators, or index methods.
PostgreSQL supports a variety of programming languages (including C, C++, Java, Perl, Tcl, and Python) as well as the database interfaces JDBC and ODBC.

PostgreSQL

Cassandra is an open-source, distributed NoSQL database management system by Apache.

Cassandra

The wide column store is designed to handle large amounts of data across many commodity servers. It is built with Java so it is cross-platform.

When designing a data structure, we have to think about how to manage its growth and the possible implications of the chosen technique.

Database design tips:

A sample relational database

SKS, p. 9

Query processing

SKS, p. 689

Query processing

SKS, p. 689

Query processing

Query processing refers to the range of activities involved in extracting data from a database including:

translation of queries in high-level database languages into expressions that can be used at the physical level of the file system
variety of query-optimizing transformations
actual evaluation of queries.

Steps in query processing:

Parsing and translation.
Optimization.
Evaluation.

Information Management 2025: Introduction

By Karl Ho

Information Management 2025: Introduction

Information Management: Introduction

3 months ago
166

Karl Ho

Data Generation datageneration.io

Information Management

Introduction

Why takes this course

SKS Book structure

SKS Book structure

SKS Book structure

Class structure

Types of Data (by class)

Quantitative Data: Numerical. e.g. height, weight

Categorical Data: Data that can be labeled or divided into groups. e.g. race, sex, hair color.

Text Data

Logical Data

Types of Data (by structure)

Structured: Data that has predefined structures. e.g. tables, spreadsheets, or relational databases.

Unstructured Data: Data with no predefined structure, comes in any size or form, cannot be easily stored in tables. e.g. blobs of text, images, audio

A database is a structured collection of data about entities and their relationships.

Database

Object - entities (e.g. individuals, organization, etc.)

Relationship - (John works in WTO)

Computer Science

Business/Management

Physical sciences

Policy Sciences

Social Sciences

Database use

SKS example: University Database

1950s and early 1960s

Data processing using magnetic tapes for storage

Tapes provided only sequential access

Punched cards for input

History of Database Systems

Late 1960s and 1970s

Hard disks allowed direct access to data

Network and hierarchical data models in widespread use

History of Database Systems

1980s

Research relational prototypes evolve into commercial systems

SQL becomes industrial standard

Parallel and distributed database systems

Wisconsin, IBM, Teradata

Object-oriented database systems

History of Database Systems

1990s

Large decision support and data-mining applications

Large multi-terabyte data warehouses

Emergence of Web commerce

History of Database Systems

2000s

Big data storage systems

Google BigTable, Yahoo PNuts, Amazon,

“NoSQL” systems.

Big data analysis: beyond SQL

Map reduce and friends

History of Database Systems

2010s

SQL reloaded

SQL front end to Map Reduce systems

Massively parallel database systems

Multi-core main-memory databases

History of Database Systems

History of Database Systems

“Those that do not understand the mistakes of their ancestors will end up repeating them”

Edgar Frank Codd (1923-2003)

Relational model

Relational model

Michael Stonebraker

PostgreSQL

Oracle

SciDB

Database types

Relational DBMS

Key-value stores

Document stores

Graph DBMS

Wide-column stores

Search engines

Time series DBMS

Key-value store

A key-value storage system (or key-value store) is a system that provides a way to store or update a record (value) with an associated key and to retrieve the record with a given key.

Document store