Data Science for Physical Scientists

dr. federica bianco | fbb.space | fedhere
Scientific Visualizations

why?

a few historical plots and why they made history

  • Descriptive data viz
    • Lie with statistics
    • Tufte’s rules
  • Exploratory data viz

Jer Thorp

 

  • Psychophysics
  • Esthetics vs(??) functionality
    • color blindness
    • the third dimension
  • Interactivity

why?

computers understand data as numbers,

we (people) do not.

what is this?

Van Gogh starry night

      I             II            III            IV
  x      y      x      y      x      y      x      y
 10    8.04    10    9.14    10    7.46     8    6.58
  8    6.95     8    8.14     8    6.77     8    5.76
 13    7.58    13    8.74    13   12.74     8    7.71
  9    8.81     9    8.77     9    7.11     8    8.84
 11    8.33    11    9.26    11    7.81     8    8.47
 14    9.96    14    8.10    14    8.84     8    7.04
  6    7.24     6    6.13     6    6.08     8    5.25
  4    4.26     4    3.10     4    5.39    19   12.50
 12   10.84    12    9.13    12    8.15     8    5.56
  7    4.82     7    7.26     7    6.42     8    7.91
  5    5.68     5    4.74     5    5.73     8    6.89

what is this?

Anscombe's quartet?

(Francis Anscombe, 1973) comprises four datasets that have nearly identical simple descriptive statistics, yet appear very different when graphed. Each dataset consists of eleven (x,y) points.

 

the moral of the story is: look at your data!
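
A minimal sketch of the claim that the four datasets share nearly identical descriptive statistics. The values are transcribed from the table in these notes; the helper name `describe` and the choice of statistics (mean, sample variance, Pearson correlation, least-squares line) are assumptions made for illustration, not necessarily what the linked notebook does.

```python
import numpy as np

# Anscombe's quartet, transcribed from the table in these notes.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def describe(x, y):
    """Simple descriptive statistics for one (x, y) dataset (hypothetical helper)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
    return {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),   # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }


summary = {name: describe(x, y) for name, (x, y) in quartet.items()}
for name, stats in summary.items():
    print(name, {key: round(float(value), 2) for key, value in stats.items()})
```

The printed rows agree to about two decimal places, which is exactly why the numbers alone cannot tell the four datasets apart and a plot is needed.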

https://github.com/fedhere/DSPS_FBianco/blob/master/labs/Anscombe's_Quartet.ipynb

A common problem: too many points

 

plt.plot(Teff, logg, 'k.')

The problem with big data

the larger the data, and especially the higher the number of dimensions, the harder it is to design an effective visualization

The importance of visualizations:

Anscombe's quartet

https://github.com/fedhere/MLPNS_FBianco/blob/main/viz/Anscombe's_Quartet.ipynb

Visualization of derived data products may be effective 

the larger the data the harder it is to collectively understand the descriptive  statistics

A common problem: too many dimensions

 

too many time series

Tufte's small multiples and sparklines

  1. In time-series displays of <money>, deflated and standardized units of monetary measurement are nearly always better than nominal units.

 

enable comparison  by giving the data center stage

too many time series

Time series heatmaps

 

enable comparison  by giving the data center stage

how?

a few historical plots and why they made history

W.E.B. Du Bois

February 23, 1868 – August 27, 1963

 American sociologist, socialist, historian, civil rights activist, Pan-Africanist, author, writer and editor

After graduating with a Ph.D. in history from Harvard University, W.E.B. Du Bois, the prominent African-American intellectual, sought a way to process all this information showing why the African diaspora in America was being held back in a tangible, contextualized form.

https://www.smithsonianmag.com/history/first-time-together-and-color-book-displays-web-du-bois-visionary-infographics-180970826/

 

“The colorful charts, graphs, and maps presented at the 1900 Paris Exposition by famed sociologist and black rights activist W. E. B. Du Bois offered a view into the lives of black Americans, conveying a literal and figurative representation of 'the color line'.”

 

W.E.B. Du Bois 1868-1963, sociologist, black rights activist, graphic designer ante litteram

Smithsonian Magazine

“Du Bois was aware that while unmoving prose and dry presentations of charts and graphs might catch attention from specialists, this approach would not garner notice beyond narrow circles of academics,” Aldon Morris writes in the essay “American Negro at Paris, 1900.” “Such social science was useless to the liberation of oppressed peoples. Breaking from tradition, Du Bois was among the first great American public intellectuals whose reach extended beyond the academy to the masses.”

https://hyperallergic.com/476334/how-w-e-b-du-bois-meticulously-visualized-20th-century-black-america/

a few historical plots and why they made history

Figurative Map of the successive losses in men of the French Army in the Russian campaign 1812-1813. Charles Joseph Minard (November 20th, 1869)

 

The numbers of men present are represented by the widths of the colored zones in a rate of one millimeter for ten thousand men; these are also written beside the zones. Red designates men moving into Russia, black those on retreat. — The information used for drawing the map was taken from the works of Messrs. Thiers, de Ségur, de Fezensac, de Chambray and the unpublished diary of Jacob, pharmacist of the Army since 28 October.

In order to facilitate the judgment of the eye regarding the diminution of the army, I supposed that the troops under Prince Jérôme and under Marshal Davoust, who were sent to Minsk and Mobilow and who rejoined near Orscha and Witebsk, had always marched with the army.

a few historical plots and why they made history

Florence Nightingale Coxcombs

Diagram of the causes of mortality in the army in the East

a few historical plots and why they made history

H-R diagram:

the life of a star

https://en.wikipedia.org/wiki/Hertzsprung%E2%80%93Russell_diagram

a few historical plots and why they made history

Feynman Diagrams

https://inspirehep.net/record/1082448/plots

Exchange of one quantum between two electrons.

SPACE

I would argue R. Feynman had the first "science outreach" program: driving a van with Feynman diagrams painted on it. The visual effectiveness of these diagrams is what made them suitable for this.

The dark part of his legacy is his well-documented, blatant misogyny and sexism: please do read this to get a complete picture of the person. https://thebaffler.com/outbursts/surely-youre-a-creep-mr-feynman-mcneill .

Feynman

what makes a bad visualization?

Ambiguity  |  distortion  |   distraction.

 
 

what is wrong with this plot????

Prof. Vern Lindberg

6 wrong things with this plot…

An example of ambiguity in visualizations that is common in peer-reviewed physics

different stretch

Ambiguity  |  distortion (= misleading)  |  distraction.

ordinary matter

Exactly this plot is on the front page of the Planck collaboration website! Planck is an $800M mission to study the earliest Universe

Dark Matter

Dark Energy

An example of ambiguity in visualizations that is common in peer-reviewed physics

duplication of data: planet transit and eclipsing binary datasets are commonly plotted twice (consecutively along the x axis)

A highly unequal-mass eclipsing M-dwarf binary in the WFCAM Transit Survey

Nefs, S.V. et al. MNRAS. 431 (2013) 3240 arXiv:1303.0945 [astro-ph.SR]

Mollweide projection

equirectangular projection

necessary distortions

Hajime Narukawa, a Tokyo-based architect and artist, broke the globe up into 96 regions and folded it into a tetrahedron and then a pyramid before finally flattening it into a two-dimensional sheet. This won him Japan’s prestigious Good Design prize in 2016.

Ambiguity  |  distortion  |   distraction.

 
 

Sometimes the distraction is a consequence of the complexity of the data.

 

distraction

data-ink ratio (Edward Tufte)

what makes a good visualization?

Tufte's rules

Edward Tufte

Tufte’s rules:

Lie factor = (size of the effect in the graphic) / (size of the effect in the data)
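
A small worked example of the lie factor, assuming the usual reading of "size of the effect" as the relative change between two values. The function names are hypothetical, and the numbers are illustrative, modeled on the fuel-economy chart often quoted in discussions of Tufte's work (a quantity going from 18 to 27.5, drawn with lines of length 0.6 and 5.3).

```python
def effect_size(first, second):
    """Relative change between two values: |second - first| / |first|."""
    return abs(second - first) / abs(first)


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Ratio of the effect shown in the graphic to the effect in the data."""
    return effect_size(graphic_first, graphic_second) / effect_size(data_first, data_second)


# Illustrative numbers: the data grow by about 53%, the drawn lengths by about 783%.
lf = lie_factor(graphic_first=0.6, graphic_second=5.3, data_first=18.0, data_second=27.5)
print(f"lie factor = {lf:.1f}")
```

A value close to 1 means the graphic is proportional to the data; values far above 1 exaggerate the effect and values far below 1 hide it.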

Tufte’s rules:

  1. The representation of numbers, as physically measured on the surface of the graph itself, should be directly proportional to the numerical quantities represented ("lie factor")
  2. Clear, detailed and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graph itself. Label important events in the data
  3. Show data variation, not design variation
  4. In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.
  5. The number of information carrying (variable) dimensions depicted should not exceed the number of dimensions in the data. Graphics must not quote data out of context.

lie factor ~ 1

data/ink -> large

no chart junk 

use small-multiples

avoid redundancy in communication

Tufte’s rules:

Keep lie factor ~ 1: the size of the effect in the graphic should match the size of the effect in the data

Tufte’s rules:

Data-ink ratio = (amount of data) / (amount of ink)

Tufte’s rules:

Chart Junk

the excessive and unnecessary

use of graphical effects

Tufte’s rules:

Small multiples      

 

 encourage comparison

sparkline graph
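
A minimal sketch of small multiples in matplotlib, assuming synthetic random-walk time series in place of real data. Sharing both axes across the panels is what makes the comparison honest: every panel uses the same scale. The grid size and variable names are arbitrary choices made for illustration.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
t = np.arange(100)
# Hypothetical data: 12 random-walk time series
series = np.cumsum(rng.normal(size=(12, t.size)), axis=1)

# One small panel per series, all on the same x and y scale
fig, axes = plt.subplots(3, 4, sharex=True, sharey=True, figsize=(8, 5))
for i, (ax, y) in enumerate(zip(axes.flat, series)):
    ax.plot(t, y, "k-", lw=0.8)          # black line only: high data/ink ratio
    ax.set_title(f"series {i}", fontsize=8)
fig.tight_layout()
```

Dropping the titles, ticks, and frames from each panel turns the same grid into a set of sparklines.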

Tufte’s rules:

Small multiples      

 

Galileo Galilei, Jupiter moons, 1610

Tufte’s rules:

Small multiples      

work really well with maps!

 

https://mahb.stanford.edu/whats-happening/167-tiny-maps-tell-major-story-climate-change/   

 

Galileo Galilei, Jupiter moons, 1610

Tufte’s rules:

Small multiples 

... missing the point.

 

https://vividmaps.com/comparing-metropolitan-form-density/

Tufte’s rules:

Small multiples

Kieran Healy

(Data Visualization: A Practical Introduction)

 

Galileo Galilei, Jupiter moons, 1610

Tufte’s rules:

every feature should be associated with only 1 graphical element

 

(here color is redundant with length)

what makes a good visualization?

last round, including animation and interactivity:

Tamara Munzner's rules

Rules of thumb for a good visualization

Tamara Munzner

Chapter 6

Function first, Form next

no unjustified beauty

(Tufte's no chart junk)

Get it right in Black & White

no unjustified color

consider designing your plot in BW first

Examples of functional use of color (and distortion)

functional use of color is the standard in anatomy drawings

 Häggström, Mikael (2014). "Medical gallery of Mikael Häggström 2014". WikiJournal of Medicine 1 (2). DOI:10.15347/wjm/2014.008. ISSN 2002-4436. Public Domain. or By Mikael Häggström, used with permission. - Image:Gray507.png

Examples of functional use of color (and distortion)

functional use of color

functional use of deformation

(the boroughs' size is changed to make the distance between subway stops similar)

but consider the psychological and social implications of blowing up Manhattan...

MTA  NYC subway map

No Unjustified 3D

use 3D only if your 3rd dimension cannot be reduced.

Alternatives:

color,

small multiples,

animation

from Tamara Munzner chapter 6

 

obstruction

clutter

deformation

No Unjustified 3D

from private communication...

No Unjustified 3D

distortion

techniques

to overcome obstruction

 

downside: distortion, clutter

from Tamara Munzner chapter 6

 

No Unjustified 3D

transparency

to overcome obstruction

Zosia Rostomian, LBNL; Nic Ross, BOSS Lyman-alpha team, LBNL; and Springel et al, Virgo Consortium and the Max Planck Institute for Astrophysics

No Unjustified 3D

marginalized posteriors

in MCMC

"Corner Plot"
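
A minimal sketch of a corner plot built with plain matplotlib, assuming samples drawn from a correlated Gaussian as a stand-in for an MCMC chain. The parameter names and covariance are hypothetical. The diagonal shows the 1D marginalized posteriors and the lower triangle the 2D marginals, so a 3-parameter posterior is shown without any 3D rendering.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
# Hypothetical stand-in for an MCMC chain: 5000 samples of 3 correlated parameters
cov = [[1.0, 0.6, 0.0],
       [0.6, 1.0, -0.3],
       [0.0, -0.3, 1.0]]
samples = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=5000)
labels = ["a", "b", "c"]
ndim = samples.shape[1]

fig, axes = plt.subplots(ndim, ndim, figsize=(6, 6))
for i in range(ndim):
    for j in range(ndim):
        ax = axes[i, j]
        if j > i:
            ax.axis("off")  # upper triangle would only repeat the lower one
        elif i == j:
            # 1D marginal: histogram of one parameter, all others integrated out
            ax.hist(samples[:, i], bins=40, color="k", histtype="step")
        else:
            # 2D marginal of parameters j (x axis) and i (y axis)
            ax.hist2d(samples[:, j], samples[:, i], bins=40, cmap="Greys")
        if i == ndim - 1:
            ax.set_xlabel(labels[j])
        if j == 0 and i > 0:
            ax.set_ylabel(labels[i])
fig.tight_layout()
```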

Short Cases in Surgery (Color Edition) by MRCS Dr R Rajamahendran MS

An Investigation of Issues and Techniques in Highly Interactive Computational Visualization, Michael John McGuffin

No Unjustified 3D

Anatomical drawing style deals with obstruction while preserving the context

No Unjustified 3D

Also:
No Unjustified 2D!

Eyes over Memory

no unjustified animation

Interactivity

interactive visualization rules of thumb:

Resolution over immersion

Details on demand

Avoid latency

how would you improve these plots?

A true story of a plot created for inclusion in a paper of mine

how would you improve these plots?

Plot A

E. Hubble 1929

how would you improve these plots?

Plot B

 2020

how would you improve these plots?

(an option: log scale)

how would you improve these plots?

Plot C

 

how would you improve these plots?

I would say this plot is at the limit of confusion (information saturation)

how would you improve these plots?

Plot D

(no author, no need to shame, but this was published in a peer reviewed journal)

how would you improve these plots?

Plot E

 

how would you improve these plots?

Plot F

 

how would you improve these plots?

Plot G

 

how would you improve these plots?

Plot H

 

how would you improve these plots?

Tufte’s rules:

Tufte’s rules:

chart junk

2 graphical elements for frequency

(color and position)

low data/ink ratio

no comparison

Tufte’s rules:

chart junk

2 graphical elements for frequency

(color and position)

low data/ink ratio

no comparison

comparison but scale out of context

high effect-size due to the choice of color map (more on this later)

consider this:

Graphic Vocabulary

What graphical elements are available and what elements are appropriate to convey certain information?

Graphic Vocabulary

The ideal of all research is:

1. precise investigation of each individual phenomenon — in isolation,
2. the reciprocal effect of phenomena upon each other — in combinations,
3. general conclusions which are to be drawn from the above two divisions.

My objective in this book extends only to the first two parts. The material in this book does not suffice to cover the third part which, in any case, cannot be rushed.

The investigation should proceed in a meticulously exact and pedantically precise manner. Step by step, this "tedious" road must be traversed — not the smallest alteration in the nature, in the characteristics, in the effects

Point, Line, and Plane, Wassily Kandinsky, 1926

Jacques Bertin: Semiology of Graphics, 1967 Gauthier-Villars, 1998 EHESS

point

plane

line

  • Continuous:    distance to the closest star (can take any value)

Continuous data may be:

  • Continuous Ordinal:    Earthquakes (nonlinear scale)
  • Interval:          F temperature - interval size preserved
  • Ratio:              Car speed - 0 is naturally defined

  • Discrete:         any countable, e.g. number of brain synapses

Discrete data may be:

  • Counts:          number of bacteria at time t in section A
  • Ordinal:         survey response Good/Fair/Poor

  • Categorical:     fermions - bosons: any object by class

Data may also be:

  • Censored:       star mass >30 Msun
  • Missing:          “Prefer not to answer” (NA / NaN)

data types

graphical elements work differently on different data types

The study of human perception

psychophysics

why do we visualize? because our brain is best suited to receive visual stimuli (compared to tactile or auditory stimuli, for example)

psychophysics

exploring these alternatives is important as a social justice issue, to make science and data science accessible to vision-impaired people, but also: can the different ways in which we process information give new insight?

e.g.: we process visual stimuli but sound stimuli holistically (we hear a chord, not the notes in it)

 

psychophysics

The study of human perception

here limited to visual stimuli

The apparent magnitude of all sensory channels follows  a power law based on the stimulus intensity

 

S sensation, I intensity

Psychophysical power law

Stevens 1975

S=I^n

Stevens 1975

response to length:

when shown something 4x as long we perceive it as being 4x as long

S=I

response to brightness:

when shown something 4x as bright we perceive it as being 2x as bright

S=\sqrt{I}

response to saturation:

when shown something 4x as saturated we perceive it as being 11x as saturated

S=I^{1.7}

Stevens 1975

response to electroshock:

when given an electroshock 4x as strong

we perceive it as 128x as strong 

S=I^{3.5}

(personally, I do not know of any electroshock based visualizations)
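
The four responses above can be checked with a few lines of arithmetic. This is a sketch that takes the power law exactly as written in these notes, S = I^n, with the exponents quoted on the slides; the function name is hypothetical.

```python
# Exponents n in S = I**n, as quoted in these notes (after Stevens 1975)
exponents = {
    "length": 1.0,
    "brightness": 0.5,
    "saturation": 1.7,
    "electroshock": 3.5,
}


def perceived_ratio(stimulus_ratio, n):
    """Factor by which the sensation changes when the stimulus changes by stimulus_ratio."""
    return stimulus_ratio ** n


for channel, n in exponents.items():
    ratio = perceived_ratio(4.0, n)
    print(f"{channel:>12}: 4x the stimulus is perceived as {ratio:.1f}x")
```

Only channels with n close to 1, such as length, can be read quantitatively without a mental correction, which is one argument for encoding the most important variable as position or length.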

Heer and Bostock 2010

a modern version estimates the uncertainties on these quantities by crowdsourcing the tests

The detectable difference in stimulus intensity is a fixed percentage of the object magnitude

 

δI / I = K

I intensity, K constant

Weber law

We judge based on relative differences
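
A short sketch of what Weber's law implies in practice: the same absolute step can be visible on a small baseline and invisible on a large one. The Weber fraction K = 0.1 is a made-up value for illustration (the real constant depends on the sensory channel), and the function names are hypothetical.

```python
K = 0.1  # hypothetical Weber fraction, chosen only for illustration


def just_noticeable_difference(intensity, k=K):
    """Smallest detectable change at a given intensity: delta_I = K * I."""
    return k * intensity


def is_detectable(i1, i2, k=K):
    """True if the change from i1 to i2 exceeds the Weber threshold at i1."""
    return abs(i2 - i1) >= just_noticeable_difference(i1, k)


print(is_detectable(10, 12))    # step of 2 on a baseline of 10
print(is_detectable(100, 102))  # the same step of 2 on a baseline of 100
```

For a plot this suggests that two bars of height 100 and 102 are hard to tell apart by eye, while bars of height 10 and 12 are not, even though the absolute difference is the same.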

Color

theory

(and good practice)

Good and Bad color choices

very real consequences of bad color choices

Borkin et al. 2011

Eye Physiology and color perception deficiencies

color blindness

Color blindness (color vision deficiency, CVD) affects approximately

1 in 12 men (8%) and 1 in 200 women

in the world.

Worldwide, there are approximately 300 million people with colour blindness, almost the same number of people as the entire population of the USA!

color blindness

Protanopia (red-blind)

Deuteranopia (green-blind)

Tritanopia (blue-blind)

Rods   |  Cones

Brightness |  Color

small differences can still be perceived, as colors are also associated with brightness

brightness: R 31% | G 59% | B 10%
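
A rough sketch of how the brightness weights above can be used to check whether two colors would still differ in a black-and-white rendering. It applies the percentages quoted in these notes directly to the RGB values of a hex color; this ignores gamma correction, so it is an approximation rather than a colorimetric calculation. The function name is hypothetical.

```python
WEIGHTS = (0.31, 0.59, 0.10)  # R, G, B contributions to brightness quoted in these notes


def brightness(hex_color):
    """Approximate brightness (0 to 1) of a '#rrggbb' color, ignoring gamma."""
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5))
    w_r, w_g, w_b = WEIGHTS
    return w_r * r + w_g * g + w_b * b


for name, color in [("red", "#ff0000"), ("green", "#00ff00"), ("blue", "#0000ff")]:
    print(name, round(brightness(color), 2))
```

Two colors with nearly equal brightness will be hard to tell apart in grayscale or for readers who rely on brightness cues, whatever their hue.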

use the http://colororacle.org/ app to test your plots for color-blindness

Kelly 1965 designed a list of 22 maximally contrasting colors for colorblind compliance (the “Kelly colors”):

"#023fa5", "#7d87b9", "#bec1d4", "#d6bcc0", "#bb7784", "#8e063b", "#4a6fe3", "#8595e1", "#b5bbe3", "#e6afb9", "#e07b91", "#d33f6a", "#11c638", "#8dd593", "#c6dec7", "#ead3c6", "#f0b98d", "#ef9708", "#0fcfc0", "#9cded6", "#d5eae7", "#f3e1eb", "#f6c4e1", "#f79cd4"

Jer Thorp

visualizations for data exploration?

Jer Thorp

there definitely are historical precedents:

John Snow's map of cholera,

considered the first

"data science project"

uses "clustering" to drive causal inference

https://en.wikipedia.org/wiki/1854_Broad_Street_cholera_outbreak#cite_ref-FOOTNOTESnow1855[httpsarchiveorgstreamb28985266page38mode1up_38]_19-0

 

 

Using visualizations to understand the data, not to communicate a result

John Snow - Published by C.F. Cheffins, Lith, Southhampton Buildings, London, England, 1854 in Snow, John. On the Mode of Communication of Cholera, 2nd Ed, John Churchill, New Burlington Street, London, England, 1855.

why the paradigm shift?

but only recently have visualizations to aid scientific exploration become a well-developed and active field of research

increased data volume

Big data:

One of Thorp’s projects is a visualization of the number of times the terms “communism” (bottom) and “terrorism” (top) appeared in The New York Times, from 1981 until 2009. The spike for “terrorism” is the reflection of 9/11. As the word “terrorism” is used more and more, the use of the word “communism” decreases. (Image courtesy Jer Thorp; flickr.com/photos/blprnt/)

 

increased data complexity

Big data:

https://www.flickr.com/photos/blprnt/3291268016/in/album-72157614008027965/

These visualizations show the top organizations and personalities for every year from 1985 to 2001. Connections between these people & organizations are indicated by lines.

Data is from the newly-released NYTimes Article Search API: developer.nytimes.com

For more information, and source code to access the NYTimes API, visit my blog: blog.blprnt.com

A common problem: too many points

 

solution: subsample

plt.plot(Teff, logg, 'k.')              # all points: the plot saturates
plt.plot(Teff[::10], logg[::10], 'k.')  # every 10th point only

solution: alpha

plt.plot(Teff, logg, 'k.', alpha=0.1)   # all points, mostly transparent

A common problem: too many points

 

solution: density histograms

plt.hist2d(Teff, logg, bins=(50, 50), cmap=plt.cm.Greys)

solution: scatter contours
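
A runnable sketch that puts the four solutions side by side. The arrays `Teff` and `logg` are synthetic stand-ins generated here (not real stellar data), and the contour recipe, drawing contours of a 2D histogram over a thinned scatter, is one possible way to make "scatter contours", not necessarily the one used for the slides.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 100_000
# Synthetic stand-ins for effective temperature and surface gravity
Teff = rng.normal(5500, 800, n)
logg = 4.4 - 0.0003 * (Teff - 5500) + rng.normal(0, 0.3, n)

fig, axes = plt.subplots(1, 4, figsize=(14, 3.5), sharex=True, sharey=True)

axes[0].plot(Teff[::10], logg[::10], "k.", ms=1)          # every 10th point
axes[0].set_title("subsample")

axes[1].plot(Teff, logg, "k.", ms=1, alpha=0.1)           # transparency
axes[1].set_title("alpha")

axes[2].hist2d(Teff, logg, bins=(50, 50), cmap=plt.cm.Greys)
axes[2].set_title("density histogram")

# Scatter + contours: contour the 2D histogram of the points
counts, xedges, yedges = np.histogram2d(Teff, logg, bins=(50, 50))
xc = 0.5 * (xedges[:-1] + xedges[1:])                     # bin centers
yc = 0.5 * (yedges[:-1] + yedges[1:])
axes[3].plot(Teff[::10], logg[::10], "k.", ms=1, alpha=0.2)
axes[3].contour(xc, yc, counts.T, levels=5, colors="C1")  # transpose: rows must follow y
axes[3].set_title("scatter + contours")

for ax in axes:
    ax.set_xlabel("Teff")
axes[0].set_ylabel("logg")
fig.tight_layout()
```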

A common problem: too many points

 

Bad Color Choice! 

A common problem: too many dimensions

 

Solution: dimensionality reduction to 2 dimensions

Bianco et al 2015 https://arxiv.org/abs/1611.04633

e.g. Principal Component Analysis
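
A minimal sketch of Principal Component Analysis for visualization, implemented with a singular value decomposition in NumPy. The data are synthetic (10 features driven by 2 hidden variables plus noise) and the function name `pca_2d` is hypothetical; a library implementation would normally be used in practice.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)                  # variance fraction per component
    return Xc @ Vt[:2].T, explained[:2]


rng = np.random.default_rng(0)
# Hypothetical data: 500 objects, 10 features generated from 2 hidden variables
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(500, 10))

Y, explained = pca_2d(X)
print(Y.shape, "fraction of variance kept:", round(float(explained.sum()), 3))
```

The two columns of `Y` can be passed straight to a scatter plot; the fraction of variance kept says how much of the original spread that 2D picture actually shows.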

A common problem: too many dimensions

 

Solution: dimensionality reduction to 2 dimensions

e.g. t-SNE

The t-distributed Stochastic Neighbor Embedding (t-SNE) method (Maaten & Hinton, 2008) improves upon SNE (Hinton & Roweis 2003). SNE works by representing multidimensional Euclidean distances with conditional probabilities. The similarity between x_i and x_j is the conditional probability P(x_j | x_i) that x_i will choose x_j as a neighbor under a normal distribution centered on x_i. The same is done in the full-dimensional and in the lower-dimensional space; SNE then attempts to minimize the Kullback-Leibler (KL) divergence between the two probability distributions. However, SNE is computationally very expensive; t-SNE attempts to resolve this issue by using a “symmetric” SNE and by redefining the lower-dimensional distribution with a Student t-distribution.
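
A sketch of the two ingredients named above: the conditional neighbor probabilities and the KL divergence between the high- and low-dimensional versions. It is a simplification, not an implementation of SNE or t-SNE: it uses one fixed Gaussian width `sigma` for all points (the real method tunes a width per point from the perplexity), it evaluates the cost without minimizing it, and the "embedding" is just the first two coordinates. All names are hypothetical.

```python
import numpy as np


def conditional_probabilities(X, sigma=1.0):
    """P[i, j] = p(x_j | x_i) under a Gaussian of width sigma centered on x_i."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dist / (2.0 * sigma**2)
    np.fill_diagonal(logits, -np.inf)             # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)       # each row sums to 1


def kl_divergence(P, Q, eps=1e-12):
    """Sum over points i of KL(P_i || Q_i), the cost that SNE minimizes."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


rng = np.random.default_rng(0)
X_high = rng.normal(size=(50, 10))   # hypothetical 10-dimensional data
Y_low = X_high[:, :2]                # naive 2D "embedding": keep two coordinates

P = conditional_probabilities(X_high)
Q = conditional_probabilities(Y_low)
print("KL divergence of the naive embedding:", kl_divergence(P, Q))
```

An actual embedding method would move the points in `Y_low` to make this number as small as possible.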

t-SNE

embedding multidimensional Euclidean distances with conditional probabilities: the similarity between xi and another data point xi′ is the conditional probability P(xi′|xi) that xi will choose xi′ as a neighbour under the normal distribution

 

Maaten & Hinton (2008)

extremely sensitive to hyperparameters

(chosen with kl-divergence) 

perplexity = 200,

early_exaggeration = 5.0

UMAP

 create a k-neighbour weighted graph by considering k-neighbours of each xi, and adding an edge in the graph with a defined weight w that depends on the diameter of the k-neighbourhood of xi, and the distance between xi and the closest neighbour. 

McInnes et al. (2018)

a = w(xi, xj), b = w(xj, xi),

w′(xi, xj) = a + b − ab

UMAP minimizes the cross-entropy between the weight functions in the original and reduced space
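
A small sketch of the symmetrization step w′(xi, xj) = a + b − ab written above, applied to a whole weight matrix at once. The 3-point weight matrix is invented for illustration and the function name is hypothetical; building the k-neighbour graph and the optimization itself are not shown.

```python
import numpy as np


def symmetrize(W):
    """w'(xi, xj) = a + b - a*b, with a = w(xi, xj) and b = w(xj, xi)."""
    return W + W.T - W * W.T


# Hypothetical directed edge weights in [0, 1] for 3 points
W = np.array([[0.0, 1.0, 0.2],
              [0.5, 0.0, 0.0],
              [0.0, 0.4, 0.0]])

W_sym = symmetrize(W)
print(W_sym)
```

If the weights are read as probabilities that an edge exists, a + b − ab is the probability that at least one of the two directed edges exists, so the result stays between 0 and 1 and no longer depends on the direction.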

too many time series

Tufte's small multiples and sparklines

  1. In time-series displays of <money>, deflated and standardized units of monetary measurement are nearly always better than nominal units.

 

enable comparison  by giving the data center stage

too many time series

Time series heatmaps

 

enable comparison  by giving the data center stage
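
A minimal sketch of a time series heatmap: each row of the image is one series and the gray level encodes its value, so hundreds of series fit in one panel. The data are synthetic random walks, and standardizing each row (zero mean, unit standard deviation) is an assumption made here so that series with different amplitudes share one color scale.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n_series, n_times = 200, 365
# Hypothetical data: 200 random-walk time series of 365 steps each
series = np.cumsum(rng.normal(size=(n_series, n_times)), axis=1)

# Standardize each series so that rows are comparable
z = (series - series.mean(axis=1, keepdims=True)) / series.std(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(8, 4))
im = ax.imshow(z, aspect="auto", cmap="Greys", interpolation="nearest")
ax.set_xlabel("time step")
ax.set_ylabel("series index")
fig.colorbar(im, ax=ax, label="standardized value")
fig.tight_layout()
```

Sorting the rows (for example by the time of the maximum) before plotting often reveals structure that a random row order hides.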

Minard's Russian campaign: so why is this plot so good?

Figurative Map of the successive losses in men of the French Army in the Russian campaign 1812-1813.

 

The numbers of soldiers present are represented by the widths of the colored zones in a rate of one millimeter for ten thousand soldiers; these are also written beside the zones. Red designates men moving into Russia, black those on retreat. — The information used for drawing the map was taken from the works of Messrs. Thiers, de Ségur, de Fezensac, de Chambray and the unpublished diary of Jacob, pharmacist of the Army since 28 October. In order to facilitate the judgement of the eye regarding the diminution of the army, I supposed that the troops under Prince Jérôme and under Marshal Davoust, who were sent to Minsk and Mobilow and who rejoined near Orscha and Witebsk, had always marched with the army.

so there is a thing called "the rule of 7": you cannot put more than 7 pieces of information in your plot because that is the maximum number of things a person can remember. Well, that 7 comes from a test where people are told several words and asked to repeat them back. On average people remember 7... +/-2 ...

The number of information elements that can be shown in a plot depends on how effectively you can show them. This plot contains (at least) the following features: space (distance, however approximate), time, size of the army, rate of lives lost (highly covariant with the size of the army), purpose (going on the attack toward Moscow or retreating, indicated by the color), topography (changes of direction, rivers), and temperature. The last two convey a causal connection by showing the lives lost (the decrease in the width of the band) in conjunction with critical temperatures and rivers.

Tufte's rules: lie factor=1, data/ink ratio high, no chart junk, #graphical elements<#features

Munzner's rules: functional use of color, no unjustified 3D, eyes over memory (granted.... they did not have animations back then)

key concepts

 

Be thoughtful and make sure your visualizations are (in this order):

honest

clear

convincing

beautiful

key concepts

 

Identify the purpose of your visualization:

 

visualize to communicate results

visualize to understand data and guide analysis

resources

 

Edward Tufte (anything)

Tamara Munzner

Visualization Analysis & Design, 2014

(link to a talk slide-deck about her book: http://www.cs.ubc.ca/~tmm/talks/minicourse14/vad15london.pdf)

Wassily Kandinsky, Point, Line, and Plane,  1926

Six Lessons from the Bauhaus: Masters of the Persuasive Graphic

http://blog.visual.ly/six-lessons-from-the-bauhaus-masters-of-the-persuasive-graphic/

resources

 

reading

 

Any of these papers:

Create a plot of whatever data (and models, if you want) you choose from open data (if you have doubts about whether your dataset is relevant for this homework, please email me).

You can make the plot in any coding language you want (e.g. python, javascript, R...), as long as you upload the code that generates the plot onto your repo (which means no Tableau, or any other non-reproducible tool).

Create a directory HW8_<firstLast> in your DSPS repo. The plot needs to be uploaded to the HW8 folder in your github DSPS repo and be embedded in the README.md. That means: when I click on the HW8 link the plot must be rendered on the front page of the repo. Your README must contain the plot and a brief caption. If it is an interactive graphic, upload a static image of it in the README and provide a link to the interactive version.

Please make an effort to make it a good, compelling graphic. Put thought into the aesthetics of the plot and how clearly the content is communicated, avoid clutter, avoid misleading elements, and mind your choice of colors according to what was discussed in class.

Each of you needs to upload their own plot, no group submissions.

If your plot shows up as I described above in the repo and the code is also uploaded you will get 100% of the HW points. (Next week you will be tasked to review 3 plots of your classmates and you will be graded on the quality of the review.)

4

homework

 

1

Follow the skeleton notebook to create an H-R diagram visualization with datapoints and contours

EC: make your visualization interactive so that hovering over any datapoint provides information about the data

homework

 

2

Data Science for Physical Scientists

By federica bianco

some notes on visualizations