Lessons Learnt from R Data Processing Workflow

Data Hygiene


Doing statistical analysis in my small research project has taught me a few lessons about how to organise data, so I have summarised some of what I have learnt here. These might not be best practice for everyone, but I like to think of them as my “good data hygiene”.

Lesson: Always keep your raw data separate… I mean really raw

No matter how simple your dataset is, always keep the raw format of your data and organise it in such a way that it cannot be tampered with. Sometimes I am tempted just to open the dataset in Excel and reformat the dates to make them look nicer, but once you have tampered with the data, you have lost the original. Six months down the track, you might think the data is the raw source, but it no longer is.

One way to tackle this temptation is to create a folder purely to store “raw” data. And if you are paranoid that you would tamper with it ‘unconsciously’, you can always set the folder to “Read-only”, which gives you another layer of protection/warning. Then create another folder just for processing, copy the raw data into it, and start modifying that copy to suit your needs.
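As a rough sketch in R (the folder and file names “data-raw”, “data-processing” and “survey.csv” are just examples, not part of any particular project), the setup could look like this:

# Two folders: one for untouched originals, one for working copies.
dir.create("data-raw", showWarnings = FALSE)
dir.create("data-processing", showWarnings = FALSE)

# Work on a copy, then make the raw file read-only as an extra warning
# against accidental edits.
file.copy("data-raw/survey.csv", "data-processing/survey.csv", overwrite = TRUE)
Sys.chmod("data-raw/survey.csv", mode = "0444")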

Lesson: Plan your data processing in multiple stages

“Plan your data processing” is probably common knowledge for everyone. However, planning often doesn’t work that well for me, especially when I haven’t decided concretely how I am going to analyse the data, and I don’t yet know how dirty or messy the data is. The lesson I learnt was to go through the data and do some very “hacky” interim analysis (not to be used in the final results, of course) to find out the nature of the data - call it ‘exploratory’ if you like. Here you can get a sense of the data quality.
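For what it’s worth, the “hacky” look I have in mind is just a handful of base R calls; the file and column names here are hypothetical:

raw <- read.csv("data-processing/survey.csv", stringsAsFactors = FALSE)

str(raw)       # types and a peek at the values
summary(raw)   # ranges, obvious outliers, NA counts
head(raw, 20)  # eyeball the formatting of dates, names, etc.
sapply(raw, function(x) sum(is.na(x) | x == ""))  # missing or blank values per column
table(raw$country)  # spot spelling variants in a categorical column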

Then you can start dividing the data processing into multiple stages (there are no set rules here). The first step I suggest is always to make your data as clean as you can, because of “garbage in, garbage out”: the cleanliness of the data will have a ripple effect down the track. By cleanliness, I don’t mean missing data, but the formatting, spelling errors, and so on in the values of the variables you are interested in.
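By way of example (the column names are made up for illustration), this kind of cleaning might look like:

clean <- raw
clean$name <- trimws(clean$name)  # strip stray whitespace
clean$country <- sub("^U\\.?K\\.?$", "United Kingdom", clean$country,
                     ignore.case = TRUE)  # unify a spelling variant
clean$visit_date <- as.Date(clean$visit_date, format = "%d/%m/%Y")  # one date format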

The next step could be data transformation. From the clean dataset, extract the variables of interest and, if you have multiple data sources, merge them into one large data table. The resulting data becomes your master data.
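A minimal sketch of this stage, assuming two hypothetical clean tables that share a “participant_id” key:

vars_of_interest <- c("participant_id", "visit_date", "score")
survey_small <- clean[, vars_of_interest]

# Merge in a second (assumed) source, keeping every survey row.
master <- merge(survey_small, demographics, by = "participant_id", all.x = TRUE)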

After you have your master data, you can create small extracts from it for specific needs, such as analysis or plotting. I find I like to use a subset of the master data for plotting, especially since ggplot2 often prefers the “long” table format rather than the “wide” one.
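For example (again with assumed column names), a plotting extract reshaped to “long” with tidyr might look like:

library(tidyr)
library(ggplot2)

plot_data <- master[, c("participant_id", "score_pre", "score_post")]
plot_long <- pivot_longer(plot_data,
                          cols = c("score_pre", "score_post"),
                          names_to = "timepoint",
                          values_to = "score")

ggplot(plot_long, aes(x = timepoint, y = score)) + geom_boxplot()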

Lesson: Save intermediate data files

This tip will most likely save you time, but don’t overdo it.

You could write your data script so that it performs the processing from the raw data every time you run a specific analysis. However, this will slow down your analysis, because the data processing is the same every run. So once you are happy with your master data, save a copy of the data structure to a file (serialise it) with save(object, file=filename) in R; then you can save time by just loading that file back and running your analysis, using load(file=filename).
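Concretely (with an example file name), the pattern is:

save(master, file = "data-processing/master.RData")  # serialise once
# ... later, in the analysis script:
load(file = "data-processing/master.RData")  # restores the 'master' object

saveRDS()/readRDS() is a common alternative that lets you choose the object name when loading back, e.g. master <- readRDS("data-processing/master.rds").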

What I mean by ‘overdo it’ is: if you have multiple stages, just store the main results of the stages that you think you won’t change often. Otherwise, you might lose track of which version of the data files was sourced from which, and end up having to start from scratch again.

Summary

So these are the three lessons I can think of so far, but I will keep posting whenever I come across more that could be useful for my future self - and for you, if you are reading and find them helpful.