Data Management and Organization

In the digital age, data is an everyday product of scientific research.

However, exponential growth in data volume and complexity brings an increased risk of mismanagement, which can undermine years of painstaking work.

Why Care About Data Management

  • Reproducibility crisis.
  • Data usability and interpretability decline over time, especially if metadata isn't properly maintained (Information Entropy).
  • Research data availability decreases by 17% per year, and irreproducible research costs an estimated US $28 billion a year.

Tabular Data

Tabular data is organized in rows and columns. Each row is a different sample, and each column represents a different variable.
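
As an illustration of the row and column layout, here is a minimal Python sketch. The column names and values are made up for the example; they are not from a real dataset.

```python
import csv
import io

# Hypothetical example data: one row per sample, one column per variable.
RAW_CSV = """sample_id,site,temperature_c,mass_g
S001,north,21.5,10.2
S002,north,22.1,9.8
S003,south,19.4,11.0
"""

def read_table(text):
    """Parse CSV text into a list of rows, each a dict keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

rows = read_table(RAW_CSV)

# A row holds every variable for one sample.
print(rows[0])

# A column holds one variable across all samples.
masses = [float(row["mass_g"]) for row in rows]
print(masses)
```

Because every row has the same columns, pulling out one variable for all samples is a one-line operation.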

Importance of Tabular Data:

  • It stores your data in its rawest form.
  • Most statistical tools (including Excel and R) expect data in tabular form, with one row per sample and one column per variable.
  • You'll thank yourself later for keeping an organized version of your raw data in table form (I promise).
  • Other reasons we will get to!

Metadata

Metadata is data about data. It gives description and context to your data, making it understandable and interpretable.

Why care:

  • It clarifies what each column in your dataset represents.
  • It lets someone understand and use your dataset when you are not available to help.
  • Units, units, units: they should be defined correctly to avoid misinterpretation or errors of analysis.

Your metadata must include:

  • Name of Column: The exact name as it appears in your dataset.

  • Description: A brief description of the data represented in that column.

  • Type of Data: Categorical, Numerical, Date, etc.

  • Units: The units in which the data is measured.

  • Notes: Any other notes or context regarding the data in that column.
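
One possible way to keep these five fields is a small "data dictionary" table saved next to the dataset. The sketch below is an assumption about how that might look in Python: the field names (column_name, data_type, and so on), the example entries, and the helper functions are all hypothetical.

```python
import csv
import io

# The five fields listed above, used as the data dictionary's header.
METADATA_FIELDS = ["column_name", "description", "data_type", "units", "notes"]

# Hypothetical entries describing a hypothetical dataset.
DATA_DICTIONARY = [
    {"column_name": "sample_id", "description": "Unique sample identifier",
     "data_type": "Categorical", "units": "", "notes": "Assigned at collection"},
    {"column_name": "temperature_c", "description": "Air temperature at sampling",
     "data_type": "Numerical", "units": "degrees Celsius", "notes": ""},
    {"column_name": "collected_on", "description": "Date the sample was collected",
     "data_type": "Date", "units": "", "notes": "ISO 8601 format (YYYY-MM-DD)"},
]

def undocumented_columns(dataset_columns, dictionary):
    """Return dataset columns that have no entry in the data dictionary."""
    documented = {entry["column_name"] for entry in dictionary}
    return [name for name in dataset_columns if name not in documented]

def dictionary_as_csv(dictionary):
    """Render the data dictionary as CSV text, ready to save beside the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=METADATA_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(dictionary)
    return buffer.getvalue()

# Check a dataset's header against the dictionary: mass_g has no entry yet.
missing = undocumented_columns(["sample_id", "temperature_c", "mass_g"], DATA_DICTIONARY)
print(missing)
print(dictionary_as_csv(DATA_DICTIONARY))
```

Running a check like undocumented_columns whenever a column is added is one way to keep the metadata from falling behind the data.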

Always Remember

  • Keep the Original: Always make a copy of your original dataset (with metadata) before starting any analysis or manipulation, and work only on the copy. This keeps your raw data exactly as it was collected.
  • Backup: Raw and processed data should be backed up regularly to prevent losses.
  • Version Control: Save versions at appropriate steps along your analysis. This way, you can see clearly how your data handling has evolved and revert to an earlier version if necessary.
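
As a rough sketch of the "Keep the Original" and "Backup" points, the Python below copies a raw file into a separate folder and confirms the copy is byte-for-byte identical by comparing SHA-256 checksums. The function names, folder name, and file contents are hypothetical, and the demonstration runs in a temporary folder rather than on real data.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def archive_original(raw_path, archive_dir):
    """Copy the raw file into archive_dir and verify the copy matches."""
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    copy_path = archive_dir / Path(raw_path).name
    shutil.copy2(raw_path, copy_path)  # copy2 also keeps file timestamps
    if sha256_of(raw_path) != sha256_of(copy_path):
        raise IOError(f"Copy of {raw_path} does not match the original")
    return copy_path

# Demonstration in a temporary folder with a hypothetical raw file.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_data.csv"
    raw.write_text("sample_id,mass_g\nS001,10.2\n")
    archived = archive_original(raw, Path(tmp) / "original")
    copy_matches = sha256_of(raw) == sha256_of(archived)
    print(archived.name, copy_matches)
```

Recording the checksum alongside the archived file also gives a simple way to confirm later that the original has not been altered.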

More about version control:

Save your derived datasheets (XLSX or R outputs) with dates at the end; for example, data_21aug2024.xlsx. Do not use words like “final” to describe your data!
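
If you save files from a script, a small helper can build the dated name for you. This is a minimal Python sketch: the function name dated_filename is hypothetical, and zero-padding single-digit days (for example, 05jan2024) is an assumption, since the example above only shows a two-digit day.

```python
from datetime import date

# Month abbreviations spelled out so the result does not depend on system locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

def dated_filename(stem, extension, on=None):
    """Build a name like data_21aug2024.xlsx from a stem, extension, and date."""
    on = on or date.today()
    return f"{stem}_{on.day:02d}{MONTHS[on.month - 1]}{on.year}.{extension}"

print(dated_filename("data", "xlsx", date(2024, 8, 21)))  # data_21aug2024.xlsx
```

Calling dated_filename("data", "xlsx") with no date uses today's date.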

Other Pearls of Wisdom 1

  1. Plan Comprehensively: Before beginning any research project, develop a detailed data management plan that covers collection, processing, and preservation strategies.
  2. Implement Rigorous Data Input Practices: Use data validation tools and input forms to reduce the chance of errors during data collection.
  3. Document Meticulously: Stay current on documentation throughout the project. Remember, what's obvious now may be mysterious in a few months or years.
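
To make the data validation point concrete, here is a small Python sketch that checks one row at entry time and reports every problem it finds. The column names, the allowed site values, and the rules themselves are hypothetical; real rules would come from your own data management plan.

```python
from datetime import date

# Hypothetical rules for a hypothetical dataset.
ALLOWED_SITES = {"north", "south"}

def validate_row(row):
    """Return a list of problems found in one data-entry row (empty if none)."""
    problems = []
    if not row.get("sample_id", "").strip():
        problems.append("sample_id is missing")
    if row.get("site") not in ALLOWED_SITES:
        problems.append(f"site must be one of {sorted(ALLOWED_SITES)}")
    try:
        mass = float(row.get("mass_g", ""))
        if mass <= 0:
            problems.append("mass_g must be positive")
    except ValueError:
        problems.append("mass_g is not a number")
    try:
        date.fromisoformat(row.get("collected_on", ""))
    except ValueError:
        problems.append("collected_on must be a date in YYYY-MM-DD form")
    return problems

good = {"sample_id": "S001", "site": "north", "mass_g": "10.2", "collected_on": "2024-08-21"}
bad = {"sample_id": "", "site": "nroth", "mass_g": "ten", "collected_on": "21/08/2024"}

print(validate_row(good))  # no problems
print(validate_row(bad))   # one problem per field
```

Catching a typo such as "nroth" at entry time is far cheaper than finding it during analysis.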

Other Pearls of Wisdom 2

  1. Back Up Religiously: Maintain at least three current backup copies of important work, including the original unprocessed dataset and milestone versions of processed files.
  2. Prioritize Data Cleaning: Give adequate attention to cleaning errors from the dataset prior to analysis, but be careful to maintain legitimate outliers.
  3. Preserve Raw Data: Always keep an unaltered copy of the raw data. This allows you to start over if needed and aids in reproducibility.
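
One way to clean errors while keeping legitimate outliers is to separate values that are impossible from values that are merely unusual. The Python sketch below does this with made-up body temperature readings; the PLAUSIBLE and TYPICAL limits are assumptions chosen for illustration, not recommended thresholds.

```python
# Hypothetical measurements (body temperature in degrees Celsius).
readings = [36.6, 36.9, 37.1, 40.8, 366.0, -1.0, 36.4]

# Assumed limits: values outside PLAUSIBLE cannot be real measurements,
# while values outside TYPICAL are unusual but may be genuine.
PLAUSIBLE = (25.0, 45.0)
TYPICAL = (35.5, 38.0)

def triage(values, plausible=PLAUSIBLE, typical=TYPICAL):
    """Split values into kept, flagged-for-review outliers, and entry errors."""
    kept, flagged, errors = [], [], []
    for value in values:
        if not plausible[0] <= value <= plausible[1]:
            errors.append(value)       # impossible: treat as a data-entry error
        else:
            kept.append(value)         # plausible: stays in the cleaned data
            if not typical[0] <= value <= typical[1]:
                flagged.append(value)  # unusual: keep, but review by hand
    return kept, flagged, errors

kept, flagged, errors = triage(readings)
print(kept)
print(flagged)
print(errors)
```

The original readings list is never modified, so the raw data stays intact and the cleaning step can be rerun with different limits.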