Data Acquisition, Use, Management; Data Fabrication and Falsification
University of Michigan Ross School of Business
RCRS
Ethical considerations in the use of secondary data
- When is analysis of secondary data considered human subjects research?
- What ethical obligations do researchers have in the storage and use of secondary data to the subjects found within the data?
- What ethical obligations do researchers have in the storage and use of secondary data to the scientific community more broadly?
- Data resources
First consideration: Is it "human subjects" research?
Human subjects research: human beings are your research subjects and may be directly affected by the research.
If it is human subjects research: your research needs additional oversight by the Institutional Review Board (IRB).
- Submit application describing your project
- IRB determines whether research follows ethical principles
- May ask you to make changes to ensure ethical principles are upheld
- IRB website: https://eresearch.umich.edu/
What constitutes human subjects research with secondary data
Not human subjects research:
Publicly available data
Data that cannot ever be re-identified (crosswalk to identifying information has been destroyed)
Data that the researcher has no way to re-identify, including through indirect means such as combining variables or linking to other data
Caution: even without direct identifiers, data that include very detailed information (exact date of birth, zip code of residence, etc.) may be considered human subjects research.
If your data fall into one of the not-human-subjects categories listed above, you do not need the IRB to approve your project.
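One rough way to gauge whether detailed fields make records re-identifiable is to count how many records share each combination of those fields. The sketch below is illustrative only: the field names (dob, zip), the toy records, and the coarsening rules (birth year, first three zip digits) are assumptions, not an IRB standard. A smallest group of size 1 means at least one record is unique on those fields. The IRB makes the final determination.

```python
from collections import Counter

# Toy records; field names and values are hypothetical.
records = [
    {"dob": "1980-03-14", "zip": "48104"},
    {"dob": "1980-07-02", "zip": "48105"},
    {"dob": "1980-11-23", "zip": "48109"},
    {"dob": "1975-01-09", "zip": "48197"},
    {"dob": "1975-06-30", "zip": "48198"},
]


def smallest_group(rows, key):
    """Return the size of the smallest group of rows sharing the same key."""
    counts = Counter(key(row) for row in rows)
    return min(counts.values())


def exact_key(row):
    # Exact date of birth and full zip code.
    return (row["dob"], row["zip"])


def coarse_key(row):
    # Keep only the birth year and the first three zip digits.
    return (row["dob"][:4], row["zip"][:3])


k_exact = smallest_group(records, exact_key)
k_coarse = smallest_group(records, coarse_key)
print("smallest group, exact fields:", k_exact)
print("smallest group, coarsened fields:", k_coarse)
```

In this toy data every record is unique on the exact fields, while coarsening leaves no record alone in its group. Real data sets usually have more quasi-identifiers than two, so a check like this is a starting point rather than proof of de-identification.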
What constitutes human subjects research with secondary data
Human subjects research:
Data sets with any direct identifying information (such as names, addresses, or Social Security numbers).
Data with enough detailed information that subjects can be re-identified.
If you are conducting human subjects research, you will need IRB approval. Submit your application via eresearch.umich.edu.
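When a data set arrives with direct identifiers, one common precaution is to keep them out of the analysis file by replacing them with pseudonyms. The sketch below assumes a keyed hash (HMAC-SHA256 from the Python standard library); the function name pseudonymize, the key, and the toy rows are all hypothetical. As long as the key or any crosswalk still exists, the data can be re-identified, so this step alone does not take a project out of human subjects research.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Map an identifier to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


# Hypothetical key; in practice it would be stored separately from the data.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

# Toy rows with made-up identifiers.
rows = [
    {"name": "Alex Example", "ssn": "000-00-0000", "earnings": 52000},
    {"name": "Sam Sample", "ssn": "000-00-0001", "earnings": 61000},
]

# The analysis file keeps a pseudonymous ID and drops name and SSN.
analysis_rows = [
    {"person_id": pseudonymize(r["ssn"], SECRET_KEY), "earnings": r["earnings"]}
    for r in rows
]

for r in analysis_rows:
    print(r)
```

The same input always maps to the same pseudonym, so records can still be linked across files without the raw identifier appearing in the analysis data.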
What constitutes human subjects research with secondary data
Human subjects research:
Often, if data are identifiable, consent from participants is required. However, this is often infeasible. The IRB may grant a waiver of informed consent if the researcher can demonstrate that:
- There is no more than minimal risk to subjects
- The research could not be carried out practicably without the waiver
- The waiver or alteration will not adversely affect the rights and welfare of the subjects
- There is no way to do this research using de-identified data
What constitutes human subjects research with secondary data
How can we demonstrate this?
- There is no more than minimal risk to subjects: what are the risks to subjects when you analyze identifiable data?
- The research could not be carried out practicably without the waiver
- The waiver or alteration will not adversely affect the rights and welfare of the subjects
- There is no way to do this research using de-identified data
Ethical values in using secondary data
Privacy: control over the extent, timing, and circumstances of sharing oneself (physically, behaviorally, or intellectually) with others.
Confidentiality: the treatment of information that an individual has disclosed in a relationship of trust, with the expectation that it will not be divulged to others without permission in ways inconsistent with the understanding of the original disclosure.
Acquiring data
Many data providers will give you data if you sign a data use agreement: a legal contract governing how you may use the data and what you may publish with it.
It is important to submit these to the UM Office of Research and Sponsored Projects for review, so that they receive an institutional signature.
Why?
- Lawyers will review it and make sure it is not asking for anything unexpected
- Protects you from unexpected legal consequences
- Can help make sure you are compliant with the data agreement
Acquiring data
Data use agreements often require specific data security measures.
The Ross office of technology can help with this.
- Identify servers on campus that already have appropriate measures in place (e.g. Armis2 and Yottabyte servers are designed to be HIPAA-compliant).
- Can install software on your Ross machine to make sure you are compliant (e.g. encryption software).
Data falsification/fabrication
Researchers also have an obligation to the scientific community and to the public to present results honestly and accurately.
Your number one asset as a researcher is your reputation for doing honest and accurate research.
Case Study: "When contact changes minds: An experiment on transmission of support for gay equality"
In 2014, a bombshell article published in Science reported that contact with gay political canvassers led people to hold more tolerant views on gay marriage.
The junior author, a PhD student in political science, was a star on the job market and secured an assistant professor position at Princeton.
Other political scientists were impressed and wanted to conduct follow-up work.
Upon investigating, they discovered that the data had been falsified.
Case Study: "When contact changes minds: An experiment on transmission of support for gay equality"
What happened?
Paper was retracted from Science
Junior author's job offer at Princeton was rescinded; the author ended up leaving academia entirely
National news coverage of the event
In my experience, every interesting paper gets replicated, no exceptions.
Data falsification
Fabricating data is extreme. But many data analysis practices fall into an ethical gray zone that may amount to data falsification.
Data falsification: manipulating data to give a false impression of the results. Could be:
- Dropping "outliers" or other observations without justification.
- Failing to report specifications that do not confirm your hypothesis.
- Running many specifications in hopes of finding a certain result.
- Making changes or edits to the data set that are not documented in the paper.
- Misrepresenting or incorrectly describing your analysis.
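To see why running many specifications and reporting only the one that "works" gives a false impression, consider how quickly chance findings accumulate. The sketch below assumes independent tests at a 5% significance level with no true effect. Specifications run on the same data are not independent, so the numbers are illustrative only, and the function name is hypothetical.

```python
def prob_any_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests is significant
    at level alpha when every true effect is zero."""
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 20):
    print(f"{n} specifications: {prob_any_false_positive(n):.3f}")
```

Under these assumptions, 20 specifications yield at least one "significant" result about 64 percent of the time, which is why unreported specifications matter to readers.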
In my experience you are always better off avoiding doing anything that does not stand up to scrutiny. This isn't the 90s anymore! Papers get replicated and if your result is shaky, this will get uncovered.
Reproducibility
Assume your paper will be replicated. This is a good thing!
- This means people like and want to build on your work.
Your work will have more influence if it is easy to replicate.
Many journals now require this as a condition of publication.
Make it easy to replicate by:
- Write "clean" code that minimizes "hard coding."
- You should be able to run your programs and reproduce every number in the paper.
- Include many comments explaining what you are doing.
- Describe all sample inclusion/exclusion criteria in your manuscript or in a data appendix.
- Post your code and data online if allowed.
- Pre-specify analyses to the extent possible.
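As one illustration of the points above, the sketch below keeps analysis parameters in one place rather than hard-coding them throughout, and records how many observations each inclusion/exclusion rule removes so the counts can go straight into a data appendix. The thresholds, field names, and toy rows are hypothetical.

```python
# Analysis parameters live in one place instead of being scattered through the code.
MIN_AGE = 18             # hypothetical inclusion rule
MAX_EARNINGS = 500_000   # hypothetical exclusion rule for implausible values


def build_sample(rows):
    """Apply each criterion in order and record how many rows remain after it."""
    log = [("raw rows", len(rows))]
    steps = [
        (f"age >= {MIN_AGE}", lambda r: r["age"] >= MIN_AGE),
        (f"earnings <= {MAX_EARNINGS}", lambda r: r["earnings"] <= MAX_EARNINGS),
    ]
    for label, keep in steps:
        rows = [r for r in rows if keep(r)]
        log.append((label, len(rows)))
    return rows, log


# Toy input data.
raw = [
    {"age": 17, "earnings": 8_000},
    {"age": 34, "earnings": 52_000},
    {"age": 45, "earnings": 900_000},
    {"age": 29, "earnings": 61_000},
]

sample, log = build_sample(raw)
for label, n in log:
    print(f"{label}: {n} rows remain")
```

Because every exclusion passes through one function, a replicator can rerun the script and see exactly which rule dropped which observations.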
Data Resources on Campus
Federal Statistical Research Data Center:
- Longitudinal Business Database: firm- and establishment-level data for the entire non-farm private sector in the United States (earnings, employment, openings and closings)
- Longitudinal Employer-Household Dynamics: quarterly earnings for individuals with UI-covered wages (~90% of private employment); can track employees across firms/establishments
- Numident: all deaths among SSN holders in the United States, linked to all Census data, including Census surveys (detailed info on education, occupation, earnings, etc.)
Data Resources on Campus
Institute for Healthcare Policy and Innovation:
Private and public health insurance claims for millions of beneficiaries
All hospitalizations/ED visits for a large number of states
University of Michigan electronic health records (EHR)
Ethical Data Management and Use
Researchers cannot influence scientific progress, the public, policy makers or the business world without credibility.
Study participants and companies will not agree to share data if they are worried researchers will disrespect their privacy or confidentiality.
We all have a responsibility to conduct data analysis ethically and minimize participant risks.
Questions
RCRS
By umich