Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Data cleaning is an important part of the RDM process. During data cleaning, mistakes in the data collection and processing phase can be fixed. In projects where data collection is not structured properly, data cleaning can take up a lot of time. Sometimes, it may even be quicker to restart data collection in order to avoid a messy dataset.

Therefore, it is important to think about your data collection strategy before collecting data: this can save you a lot of time. This is especially important if several people are collecting data.

If your data is in the form of spreadsheets, you may want to check out our chapter on Data Organisation in Spreadsheets.

How to clean data?

During data cleaning there are several steps to take into account:

1. Backup Your Raw Data

Always save an untouched copy of your dataset before you begin! If you make mistakes during the data cleaning process you can still go back to the original information. All changes to data need to happen in one data file; if you use multiple versions it will be a mess to identify the correct/latest version to work in. If it is important to keep track of multiple versions of the data, you can set up a version control strategy using Version Control, or ensure that you are using the right filename. Don’t forget to document all the changes you made - see step 7 below!

2. Understand/Validate the Data

Data validation is a process used to determine if data are inaccurate, incomplete, or unreasonable (Chapman (2005)).

To validate the data you can check if data are:

Complete

Accurate

Reasonable

Creating tables, plots, or calculating summaries and statistics are a helpful way to search for errors.

3. Remove unnecessary or sensitive data, duplicates and consider outliers

Each unnecessary entry can complicate your analyses and/or distort your findings. You should carefully consider if your dataset contains any sensitive data or Personal data that should not be shared. Delete draft or test entries and duplicates, after carefully reviewing if they are indeed test entries or duplicates. Consider unexpected data, such as unusually high or low values. These could be data entry errors, but they may also be the most interesting findings! Sum-to-total, range and validity checks can help you to find errors.

4. Standardize entries

Align all variables, by for example writing all dates out in a similar manner, and keeping capitalisation consistent. For example, "Yes,” “yes,” and “Y” are treated by computers as separate responses. Ensure that all variables are labeled and used in a consistent manner across different data formats. Use clear naming conventions for your files and folders. Standardising data entry can also help if you have to integrate different datasets resulting from the use of various methods and/or instruments.

5. One variable is one data entry

Each variable should only collect one piece of information (Wickham (2014)). See Data Organisation in Spreadsheets for more tips. Consider how to split variables into multiple variables if more than one thing is measured.

6. Handle Missing Data Thoughtfully

Decide how to manage missing values. Missing data can mean a variety of things: perhaps a participant did not know the answer, or they refused to answer, the question was accidentally skipped, the handwriting was illegible, or there was an error made in data entry, or the value should be blank on purpose. Should missing data be removed, imputed, or marked as “Unknown” or “Not Applicable”? Make sure that this decision is transparently documented somewhere. Deleting missing data entirely may not always make sense as missing data can provide insights on the data collection process.

7. Document Cleaning Steps

If your data is under version control, changes to the data will be automatically tracked. You can also keep a text file with the changes, such as a "Data Cleaning Log.” Record what you changed, when, and most importantly: why. This documentation increases transparency and by adding clear and detailed notes you will allow others to replicate your data cleaning process. Ensure that any old files are clearly labeled as such to avoid any confusion in the future. You can also delete older/messy data if you are sure that you no longer need this version of the data. Versioning files and updating documentation takes time and consideration - and is worthwhile in the long run to save future you and others time trying to retrace your steps.

8. Update the data collection process to prevent future mistakes

It is easier to not have to clean data - you can prevent data errors during data collection. Define range limits on your data where possible (for example, the score must be a value between 0 and 10 for household members; a birthday must be between certain dates). Prevent blank values by requiring data entry, or ensure that where blank values should be present they are intentional.

Sharing cleaned data

Once you are happy with your cleaned dataset, you can consider sharing the data on a data repository. Once this data is shared on a data repository, you can also consider deleting the data from your local storage if you require more space for the storage of other datasets. In any case, you should delete the older/messy datasets to avoid confusion and relinquish yourself from the dumpsterfire.

Data cleaning with a Team/Lab

Often data cleaning is done by one person, but it can be helpful to do data cleaning as a team. This may save time and will guarantee that at least one other person is able to retrace the data cleaning steps. Consider setting up a data cleaning day at the lab to hold each other accountable, and make sure that lab/data cleaning activities are equitably distributed. You can also outline data cleaning activities in the Team/Lab manual.

Resources

References
  1. Chapman, A. D. (2005). Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. 10.15468/doc.jrgg-a190
  2. Wickham, H. (2014). Tidy Data. Journal of Statistical Software. 10.18637/jss.v059.i10