Data Organisation in Spreadsheets

Simon Hettrick

https://www.software.ac.uk/policy

...but it's not practical to ignore them

https://goo.gl/Ahn1Yx

What's wrong with spreadsheets?

  1. Analysis is not easily reproduced
  2. Mistakes are easy to introduce
  3. They're overloaded

What's the plan?

Will teach

Good data entry practices

Avoiding common mistakes

Dates as data

Quality control

Exporting

Won't teach

Stats!

Plotting

Writing code in spreadsheets

Formatting data in spreadsheets

Structuring data

  1. Raw data should be left untouched

Structuring data

  1. Raw data should be left untouched
  2. Keep a record of the steps in your analysis

Structuring data

  1. Raw data should be left untouched
  2. Keep a record of the steps in your analysis
  3. Variables in columns, observations in rows

Structuring data

  1. Raw data should be left untouched
  2. Keep a record of the steps in your analysis
  3. Variables in columns, observations in rows
  4. Export resulting data to a csv

"CSV": comma separated values

Structuring data

  1. Raw data should be left untouched
  2. Keep a record of the steps in your analysis
  3. Variables in columns, observations in rows
  4. Export resulting data to a csv
  5. Don't combine cells

Structuring data

  1. Raw data should be left untouched
  2. Keep a record of the steps in your analysis
  3. Variables in columns, observations in rows
  4. Export resulting data to a csv
  5. Don't combine cells

Making life easier for computers

...and, occasionally, more difficult for us

Messy data

 

  1. Open: survey_data_spreadsheet_messy.xls
  2. Will this data be easy to work with?
  3. Discuss your thoughts with your neighbours

Try it out...

(If you can't find the spreadsheet, go here: https://goo.gl/BLhpRF)

Dates are messy

What day is it?

 

1 August 2017

August 1, 2017

1/8/17

01/08/2017

1 Aug '17

 

And that's not all...

And that's not all...

42948

And that's not all...

42948

or

41488

And that's not all...

History since 1900

1900 -- Psychotherapy

1901 -- Wireless radio transmission

1903 -- Development of flight

1905 -- Einstein and relativity

1906 -- Recorded music

1907 -- Plastics

1909 -- Poles explored

1909 -- Assembly line

1910 -- First efficient light bulb

1913 -- Quantum mechanics

1914 -- Panama Canal opens

 

1920s -- Leisure time

1920s -- Teen-age culture

1920s -- Fractional horsepower

1920 -- Liberation of women

1921 -- Insulin

1927 -- Lindbergh flies Atlantic

1928 -- New visions of the universe

1928 -- Antibiotics and penicillin

1933 -- Spread of electricity

1935 -- Unions and workers' rights

1939 -- Plasma and blood typing

 

1946 -- Computers

1945 -- United Nations is founded

1945 -- Spread of democracy

1947 -- Transistor

1950s -- Rock 'n' roll

1951 -- Aspirin tablet

1953 -- DNA and biotechnology

1953 -- Everest scaled

1954 -- Polio, small pox vaccines

1960 -- Lasers

etc.

History before 1900

(according to Excel)

Nothing

Try it out...

There is an unambiguous date format...

 

... but it is really ugly

Day = 1

Month = 8

Year = 2017

 

 

 

Try it out...

Be careful with CSVs

Try it out...

Quality control

Data validation helps ensure valid data entry

(If you can't find the spreadsheet, go here: https://goo.gl/3uy4Qh)

Open: survey_data_spreadsheet_messy_quality_control.xls

Sorting helps identify problems

Try it out...

Find errors with

conditional formatting

Try it out...

Exporting data

Why?

 

  • Nothing lasts forever

  • Different versions of Excel may handle data differently

  • More journals and grant agencies are requiring data deposit, and most do not accept Excel

  • (This also applies to other open data formats. LibreOffice / Open Office)

Try it out...

Data Organisation in Spreadsheets

By Simon Hettrick

Data Organisation in Spreadsheets

  • 1,140