Data

You keep using that word...

Avishai Ish-Shalom (@nukemberg)

Define "Data"

My definition of Data

A representation of information in a computer-storable form

Data is

  • Bounded, finite
  • Discrete (resolution)
  • Structured
  • Has known format

Data is a snapshot of information

How does information become data?

Data Modeling

Data modeling

The process of defining how to extract data from real world information

Data modeling is lossy

What do we lose?

  • Everything we didn't collect
  • Accuracy
  • Context

Data modeling is biased

  • We select a subset of information

  • The conversion is also biased

Formats matter

Let me tell you a story

Oops

Let's model a person

Names

  • First name, last name
  • Middle name?
  • How long?

Longest name in Africa

Uvuvwevwevwe onyetenvewve ugwemubwem ossas

Longest name in the world

Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Zeus Wolfe­schlegel­stein­hausen­berger­dorff­welche­vor­altern­waren­gewissen­haft­schafers­wessen­schafe­waren­wohl­gepflege­und­sorg­faltig­keit­be­schutzen­vor­an­greifen­durch­ihr­raub­gierig­feinde­welche­vor­altern­zwolf­hundert­tausend­jah­res­voran­die­er­scheinen­von­der­erste­erde­mensch­der­raum­schiff­genacht­mit­tung­stein­und­sieben­iridium­elek­trisch­motors­ge­brauch­licht­als­sein­ur­sprung­von­kraft­ge­start­sein­lange­fahrt­hin­zwischen­stern­artig­raum­auf­der­suchen­nach­bar­schaft­der­stern­welche­ge­habt­be­wohn­bar­planeten­kreise­drehen­sich­und­wo­hin­der­neue­rasse­von­ver­stand­ig­mensch­lich­keit­konnte­fort­pflanzen­und­sicher­freuen­an­lebens­lang­lich­freude­und­ru­he­mit­nicht­ein­furcht­vor­an­greifen­vor­anderer­intelligent­ge­schopfs­von­hin­zwischen­stern­art­ig­raum, Senior.

Mr. Kim

Mr. Un

Or

?

Names are hard

  • Cultural differences
  • Can change
  • Not unicode
  • Min length? max length?

Age?

  • Changes every year
  • Unless it's a dead person
  • Min? Max?
  • Why int?

Gender?

  • Physiological
  • Social
  • Legal
  • Definitely not binary

Addresses?

  • Does everyone have one?
  • Zip codes
  • Can change
  • Very quirky

Country?

  • When?
  • Depends who you ask...

Identity?

Dates

Calendars

  • Civil
  • Gregorian
  • Jewish
  • Muslim
  • etc

Timezones

  • Start/end changes
  • Zones change
  • Can geographically overlap
  • Some dates don't exist
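
The last bullet can be checked directly. A minimal sketch using java.time, assuming the JVM's bundled time zone database; the helper name `existsIn`, the zone and the dates are illustrative choices, not taken from the talk.

```scala
import java.time.{LocalDateTime, ZoneId}

object LocalTimeGaps {
  // Hypothetical helper: true when the wall-clock time actually occurs in the zone.
  // A local time that falls into a "spring forward" gap has no valid UTC offset.
  def existsIn(zone: ZoneId, local: LocalDateTime): Boolean =
    !zone.getRules.getValidOffsets(local).isEmpty

  val newYork: ZoneId = ZoneId.of("America/New_York")

  // Clocks in New York jumped from 02:00 to 03:00 on 2021-03-14,
  // so 02:30 on that day never happened there.
  val skipped: LocalDateTime = LocalDateTime.of(2021, 3, 14, 2, 30)
  val ordinary: LocalDateTime = LocalDateTime.of(2021, 3, 15, 2, 30)
}
```

Here `LocalTimeGaps.existsIn(newYork, skipped)` is false while the same check for `ordinary` is true, so a "local date and time" field can hold values that name no real moment.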

UTC?

  • Day or night?
  • What day was it?
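
"What day was it?" has no single answer for a UTC timestamp. A small sketch, assuming java.time; the instant and the two zones are arbitrary examples chosen for illustration.

```scala
import java.time.{Instant, LocalDate, ZoneId}

object WhatDayWasIt {
  // The calendar date of one and the same instant depends on where you ask.
  def localDate(instant: Instant, zone: String): LocalDate =
    instant.atZone(ZoneId.of(zone)).toLocalDate

  val instant: Instant = Instant.parse("2024-01-01T02:00:00Z")

  val inLosAngeles: LocalDate = localDate(instant, "America/Los_Angeles") // still 2023-12-31
  val inTokyo: LocalDate      = localDate(instant, "Asia/Tokyo")          // already 2024-01-01
}
```

Storing only the UTC instant keeps the fact, but the local day (and whether it was day or night) can only be recovered if the zone or location is stored as well.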

When Time itself goes wrong

  • 24 hours in a day (daylight saving time)
  • 365 days in a year (leap year)
  • 60 seconds in a minute (leap second)
  • Infinite (Y2K anyone?)
  • Monotonic (clock sync)
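
The first two assumptions are easy to break on purpose. A sketch assuming java.time and the America/New_York zone as an example; `hoursInDay` is a hypothetical helper name.

```scala
import java.time.{Duration, LocalDate, ZoneId}

object DayLength {
  // Measures a calendar day as the elapsed time between two local midnights.
  def hoursInDay(date: LocalDate, zone: ZoneId): Long =
    Duration.between(date.atStartOfDay(zone), date.plusDays(1).atStartOfDay(zone)).toHours

  val newYork: ZoneId = ZoneId.of("America/New_York")

  val springForward: Long = hoursInDay(LocalDate.of(2021, 3, 14), newYork) // 23
  val fallBack: Long      = hoursInDay(LocalDate.of(2021, 11, 7), newYork) // 25
  val daysIn2024: Int     = LocalDate.of(2024, 1, 1).lengthOfYear          // 366
}
```

Code that adds 24 hours to "get tomorrow" or multiplies by 365 to "get next year" silently encodes these assumptions into the data it produces.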

Numbers

  • Integers/Floats
  • Signed/Unsigned
  • Resolution
  • Bounds

Integers

scala> val n = Int.MaxValue
n: Int = 2147483647

scala> n.getClass
res3: Class[Int] = int

scala> n > 0
res4: Boolean = true

scala> n + 1 > 0
res5: Boolean = false
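
Two common ways to avoid the silent wrap-around, sketched under the assumption that an overflow should either be detected or made impossible; the helper names `checkedAdd` and `widenedAdd` are hypothetical.

```scala
import scala.util.Try

object SafeIntMath {
  // java.lang.Math.addExact throws ArithmeticException on overflow instead of wrapping,
  // so the failure becomes visible as None.
  def checkedAdd(a: Int, b: Int): Option[Int] =
    Try(Math.addExact(a, b)).toOption

  // BigInt is arbitrary precision: it grows instead of overflowing.
  def widenedAdd(a: Int, b: Int): BigInt =
    BigInt(a) + BigInt(b)
}
```

`SafeIntMath.checkedAdd(Int.MaxValue, 1)` yields `None`, while `SafeIntMath.widenedAdd(Int.MaxValue, 1)` yields 2147483648. Which one fits depends on whether the bound is part of the data model or just an accident of the chosen type.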

Floats

scala> 0.1 + 0.2
res0: Double = 0.30000000000000004

Money

  • Should never overflow
  • Should not have rounding errors
  • But things that convert to money can be float
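
One way to satisfy these bullets is exact decimal arithmetic with a single, explicit rounding step. This is a sketch, not a prescribed method: `pay`, the two-decimal scale and the HALF_EVEN rounding rule are assumptions made for illustration.

```scala
import scala.math.BigDecimal.RoundingMode

object MoneyMath {
  // Decimals parsed from strings are exact, unlike binary doubles.
  val exactSum: BigDecimal = BigDecimal("0.1") + BigDecimal("0.2")

  // Hypothetical conversion: hours worked arrive as a Double, the rate is exact money.
  // Precision is dropped in exactly one place, with the rounding rule spelled out.
  def pay(hoursWorked: Double, hourlyRate: BigDecimal): BigDecimal =
    (BigDecimal.decimal(hoursWorked) * hourlyRate).setScale(2, RoundingMode.HALF_EVEN)
}
```

`MoneyMath.exactSum` equals `BigDecimal("0.3")`, and `MoneyMath.pay(2.3, BigDecimal("19.99"))` rounds 45.977 to 45.98. Storing integer minor units (cents) in a `Long` or `BigInt` is another common choice with the same goal.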

Data Modeling like a boss

Store facts

  • Absolute
  • Do not change with time
  • New facts can be added
  • E.g. Date of birth vs Age
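
The date of birth versus age example can be sketched as follows: store the fact, derive the changing value on demand. `Person` and `ageOn` are hypothetical names, and java.time is assumed.

```scala
import java.time.{LocalDate, Period}

object Facts {
  // The stored fact: it does not change as time passes.
  final case class Person(dateOfBirth: LocalDate)

  // The derived value: always computed relative to an explicit "as of" date.
  def ageOn(person: Person, asOf: LocalDate): Int =
    Period.between(person.dateOfBirth, asOf).getYears
}
```

For a person born on 1990-06-15, `ageOn` gives 33 on 2024-06-14 and 34 one day later, without any stored record having to be updated.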

Extensible formats/data types

  • Data models evolve
  • Some data may be missing
  • Express missing data
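
A possible way to express missing data in the type itself, rather than with empty strings or magic values. The `Field` type and its cases are hypothetical; a plain `Option` works when the reason for absence does not matter.

```scala
object MissingData {
  // Why a value is absent is itself information worth keeping.
  sealed trait Field[+A]
  final case class Known[A](value: A) extends Field[A]
  case object NotCollected extends Field[Nothing]  // e.g. an older schema never asked
  case object NotApplicable extends Field[Nothing] // e.g. the person has no middle name

  def render(field: Field[String]): String = field match {
    case Known(value)  => value
    case NotCollected  => "<not collected>"
    case NotApplicable => "<none>"
  }
}
```

When a new field is added to an evolving model, old records can be read as `NotCollected` instead of being given an invented default.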

Store metadata

  • When (date/version)
  • Where (location/system)
  • How (collection/conversion method)
  • Who (system/person)

1. Predefine intended usage

  • What do we need the data for?
  • How are we going to use it?
  • Multiple versions for multiple use cases

2. Select needed subset of information 

  • Think what you need and what you don't
  • Document why

3. Define constraints

  • Resolution
  • Bounds
  • Type

4. Define collection method

  • Analog to Digital
  • Digital converters
  • Lossy/non-lossy
  • Possible artifacts
  • Versioned

5. Select compatible data type

  • Universal/specific
  • Secondary formats
  • Possible artifacts
  • Versioned

Let's model a person

For an HR system

Data

  • Surname, full legal name
  • Date of birth (ISO 8601)
  • Government ID number
  • OR independent unique ID
  • Facial photo (JPEG)
  • Address - free text (1000 chars), zip code alphanumeric (10 chars)
  • Bank account

Metadata

  • Collection date
  • Collection system
  • Collection SW version
  • User ID
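
The HR model above could be written down roughly like this. It is a sketch only: all type and field names are hypothetical, the bank account is left as an opaque string, and the only constraints enforced are the ones stated on the slide (1000 characters of free text, 10 alphanumeric characters of zip code).

```scala
import java.time.{Instant, LocalDate}
import java.util.UUID

object HrModel {
  // Either a government-issued number or an independent ID we mint ourselves.
  sealed trait PersonId
  final case class GovernmentId(number: String) extends PersonId
  final case class InternalId(value: UUID) extends PersonId

  final case class Address(freeText: String, zipCode: String)

  object Address {
    // Rejects values outside the declared bounds instead of truncating them.
    def validated(freeText: String, zipCode: String): Either[String, Address] =
      if (freeText.length > 1000) Left("address longer than 1000 chars")
      else if (zipCode.length > 10 || !zipCode.forall(_.isLetterOrDigit))
        Left("zip code must be at most 10 alphanumeric chars")
      else Right(Address(freeText, zipCode))
  }

  // Metadata: when, where, with which software version and by whom it was collected.
  final case class Meta(collectedAt: Instant, system: String, softwareVersion: String, userId: String)

  final case class Employee(
    surname: String,
    fullLegalName: String,
    dateOfBirth: LocalDate, // a fact; age is derived when needed
    id: PersonId,
    photoJpeg: Vector[Byte],
    address: Address,
    bankAccount: String,
    meta: Meta
  )
}
```

Note that the "alphanumeric" rule rejects zip codes written with a space or hyphen, so the collection method would have to normalize them first or the constraint would need to be relaxed.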

Thanks! Questions?

Data: You keep using that word

By Avishai Ish-Shalom

Structured data, dynamic data, big data, data-driven... we hear about data all the time. But what is "data" exactly? The term is used frequently, yet it is rarely defined or thought about, and it turns out the answer to "what is data" is not simple at all. "Data" is a software concept that describes real-world properties and information, and there must be some process that creates "data" from those properties. This "data modeling" process is one of the most fundamental and complex exercises in software engineering, but it is often overlooked and taken for granted. In addition, once you have data it must be actively maintained. Corrupted data is relatively easy to detect, but how do you know if you are _missing_ data? This talk reviews what "data" is and how it is created, defined and maintained, and shows real-world examples of the complexities, problems and solutions of these often-ignored processes.
