Survey Data Basics
Ryan Clement
Middlebury College
Data Services Librarian
May 4, 2021
What are we covering?
- Types of data
- Survey data basics -- what to look at
- Some useful techniques and visualizations for looking at your data
- Questions about using Qualtrics to look at data
- Suggestions for other tools
What are we NOT covering?
- Rigorous statistical analysis
- How to use R/other tools
Survey Data
what kind of data do we have?
Different types of data
- Categorical/nominal data
- Ordinal data
- Interval data
- Ratio data
- Natural language
Categorical/nominal data
- Constrained to "categories" -- can be characters or numbers
- Race, gender, religion, yes/no, success/fail
- Yes/no, True/false are a subset known as binary data
- Cannot compare along a relative scale
- Cannot use mean/median, must use mode
- Very useful in creating crosstabs
Ordinal data
- Ordered along a scale, but the distances between categories are not known
- For instance, Likert scale questions -- 5 is higher than 4, but is the gap between my 4 and my 5 the same as the gap between your 4 and your 5?
- Other examples: Income brackets, highest level of education
- Can use to create a ranking, but a mean is hard to interpret -- What does "3.5 out of 5" mean?
- Look at relative numbers of responses, like nominal data
Interval data
- Order as well as the differences between responses (i.e. the "intervals") are known
- Fahrenheit/Celsius temps, time/date, income vs. spending
- No "true zero" -- negative values can exist, and zero is just another point on the scale (it does not mean "does not exist")
- Can say one value is "higher" than the other, but not necessarily "twice" as high
Ratio data
- Just like interval data, but now there is a "true zero" that means "does not exist"
- Kelvin temp, height/weight, number of children, years of education, income
- Zero means "does not exist" -- so you can more easily say "X is 2x higher than Y"
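A minimal sketch of the interval vs. ratio distinction, using the temperature examples above. The two temperatures (10 and 20 degrees Celsius) and the helper name `to_kelvin` are made up for illustration.

```python
# The same two temperatures on an interval scale (Celsius)
# and on a ratio scale (Kelvin). Values are illustrative only.
celsius_a, celsius_b = 10.0, 20.0


def to_kelvin(celsius):
    """Convert Celsius to Kelvin, a scale with a true zero."""
    return celsius + 273.15


kelvin_a, kelvin_b = to_kelvin(celsius_a), to_kelvin(celsius_b)

# Differences are meaningful on both scales, and they agree.
diff_celsius = celsius_b - celsius_a
diff_kelvin = kelvin_b - kelvin_a

# Ratios are only meaningful on the ratio scale.
naive_ratio = celsius_b / celsius_a  # 2.0, but 20 C is not "twice as hot" as 10 C
true_ratio = kelvin_b / kelvin_a     # roughly 1.035

print(diff_celsius, round(diff_kelvin, 2))
print(naive_ratio, round(true_ratio, 3))
```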
Natural Language
- Written by the respondent themselves in a free-text field
- Usually needs to be coded in some way before it can be analyzed
- Can possibly use text analysis tools on this type of data too
What type of data is each of these?
- Age
- Grades
- Political party
- Number of poor mental health days in the past week
Mean, median, mode, and standard deviation
Describing the tendencies of data
Mode
- The value that occurs most often
- Almost completely unaffected by outliers
- Can be unimodal (one mode): [1, 1, 2, 2, 4, 4, 4, 5, 6, 6]
- Or have more than one mode: [1, 1, 1, 2, 2, 3, 4, 4, 4] (bimodal)
- Often used for categorical data
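One way to check the mode examples above, sketched with Python's standard library (`statistics.multimode`, available in Python 3.8+). The `parties` list is made-up categorical data, included only to show that the mode also works on non-numeric values.

```python
from statistics import multimode

unimodal = [1, 1, 2, 2, 4, 4, 4, 5, 6, 6]
bimodal = [1, 1, 1, 2, 2, 3, 4, 4, 4]
parties = ["Dem", "Rep", "Ind", "Dem", "Dem", "Rep"]  # hypothetical responses

# multimode returns every value tied for most frequent,
# in the order each was first seen.
modes_unimodal = multimode(unimodal)
modes_bimodal = multimode(bimodal)
modes_parties = multimode(parties)

print(modes_unimodal)  # [4]
print(modes_bimodal)   # [1, 4]
print(modes_parties)   # ['Dem']
```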
Mean
- We're talking about the arithmetic mean (there are others)
- Often referred to as "the average"
- Calculated by adding up the values and dividing by the number of values: (1+1+2+4+5+5)/6 = 3
- Can be very sensitive to outliers:
- (1+1+2+4+5+20)/6 = 5.5
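The two calculations above can be reproduced with a short sketch using the standard library; the variable names are illustrative.

```python
from statistics import mean

values = [1, 1, 2, 4, 5, 5]
with_outlier = [1, 1, 2, 4, 5, 20]

# By hand: add up the values and divide by how many there are.
mean_by_hand = sum(values) / len(values)

# Same result from the standard library.
mean_values = mean(values)
mean_with_outlier = mean(with_outlier)

print(mean_by_hand)       # 3.0
print(mean_with_outlier)  # 5.5
```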
Median
- The "middle" value in data -- the number that separates the top half from the bottom half
- Calculated by lining up the values and finding the middle value: [1, 1, 2, 4, 4, 5, 5] -> 4
- If you have an even number, take the mean of the middle two values:
- [1, 1, 2, 4, 5, 5] -> (2+4)/2 = 3
- Less sensitive to outliers than the mean:
- [1, 1, 2, 4, 5, 5] -> median 3 (mean 3)
- [1, 1, 2, 4, 5, 20] -> median still 3 (mean 5.5)
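A small sketch of the same comparison in Python, using the lists from this slide. It shows the median staying put when one value becomes an outlier.

```python
from statistics import median

odd_count = [1, 1, 2, 4, 4, 5, 5]
even_count = [1, 1, 2, 4, 5, 5]
with_outlier = [1, 1, 2, 4, 5, 20]

# Odd number of values: the middle value itself.
median_odd = median(odd_count)

# Even number of values: the mean of the two middle values, (2 + 4) / 2.
median_even = median(even_count)

# Replacing the last 5 with 20 does not move the median.
median_outlier = median(with_outlier)

print(median_odd, median_even, median_outlier)  # 4 3.0 3.0
```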
Standard Deviation
- Measure of how "spread out" the data is
- Need to use the mean to calculate (we won't get into that)
- Lower SD means numbers are mostly clustered around the mean, higher SD means they are more spread out (i.e. there is more variance)
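For anyone curious, here is a rough sketch of how the mean enters the calculation. It uses the population standard deviation (`statistics.pstdev`); a sample standard deviation (`statistics.stdev`) divides by n - 1 instead and gives slightly larger numbers. Both lists are made up and share the same mean of 5.

```python
from math import sqrt
from statistics import mean, pstdev

clustered = [4, 5, 5, 5, 6]  # hypothetical: values close to the mean
spread = [1, 3, 5, 7, 9]     # hypothetical: same mean, more spread out

sd_clustered = pstdev(clustered)
sd_spread = pstdev(spread)

# By hand: square each value's distance from the mean,
# average those squares, then take the square root.
center = mean(spread)
sd_by_hand = sqrt(sum((x - center) ** 2 for x in spread) / len(spread))

print(round(sd_clustered, 3))  # 0.632
print(round(sd_spread, 3))     # 2.828
```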
A few useful visualizations + techniques
Look at summaries of your data
- What type of data is each of your variables?
- How many "complete" cases do you have? (i.e. how much missing data do you have?) Is there meaning/pattern behind this?
- Look at the central tendencies in your data (mode, median, mean) - do they make sense?
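A minimal sketch of counting complete cases and missing answers, assuming the survey export has been loaded as a list of dictionaries with `None` for skipped questions. The question names and responses are invented for illustration.

```python
# Hypothetical export: one dict per respondent, None = no answer.
responses = [
    {"age": 19, "party": "Dem", "satisfaction": 4},
    {"age": 21, "party": None, "satisfaction": 5},
    {"age": None, "party": "Rep", "satisfaction": None},
    {"age": 20, "party": "Ind", "satisfaction": 3},
]

# A "complete" case has an answer for every question.
complete_cases = [
    r for r in responses if all(value is not None for value in r.values())
]

# Missing answers per question, to spot questions people tend to skip.
missing_per_question = {
    question: sum(1 for r in responses if r[question] is None)
    for question in responses[0]
}

print(len(complete_cases), "of", len(responses), "cases are complete")
print(missing_per_question)
```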
Distributions - histograms
Distributions - bar charts
Distributions - pie charts
2 categorical variables - crosstabs
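A crosstab is just a count of every combination of two categorical answers. The sketch below builds one with the standard library; the class-year and yes/no responses are made up.

```python
from collections import Counter

# Hypothetical (class year, answer) pairs, one per respondent.
responses = [
    ("first-year", "yes"),
    ("first-year", "no"),
    ("senior", "yes"),
    ("senior", "yes"),
    ("first-year", "yes"),
    ("senior", "no"),
]

counts = Counter(responses)
rows = sorted({year for year, _ in responses})
cols = sorted({answer for _, answer in responses})

# table[row][column] = number of respondents with that combination.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

for r in rows:
    print(r, table[r])
```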
Likert scale data -- stacked bar charts
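Each segment of a stacked bar is the share of respondents who chose that level. A rough sketch of that calculation follows, assuming a 1-5 scale; the answers and the helper name `level_shares` are hypothetical.

```python
from collections import Counter

LEVELS = [1, 2, 3, 4, 5]  # assumed 5-point Likert scale

# Hypothetical answers to one Likert question.
answers = [5, 4, 4, 3, 5, 2, 4, 5]


def level_shares(values):
    """Return the percentage of responses at each level, in scale order."""
    counts = Counter(values)
    return [100 * counts[level] / len(values) for level in LEVELS]


shares = level_shares(answers)

# These percentages are the segment widths of one stacked bar.
for level, share in zip(LEVELS, shares):
    print(level, f"{share:.1f}%")
```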
Tool Options
- Qualtrics -- just use the built-in visualizations and cross-tab tools!
- Excel -- export your data from Qualtrics and use Excel to manipulate and visualize your data
- Datawrapper, Infogram, Tableau (also has free student license) -- must use Excel (or something else) to clean and manipulate data first, but great for nicer/more complex visuals
- Voyant -- useful for simple text visuals (if you have a lot of "natural language" data)
- R + RMarkdown -- steeper learning curve, but great for cleaning/manipulating data, visualizing, and "writing" all in one