Lessons Learnt from an R Data Processing Workflow
27 Sep 2019
(Image by Tony DePrato)
Doing statistical analysis for my small research project has taught me a few lessons about how to organise data, so I have summarised some of what I have learnt here. It might not be best practice for everyone, but I like to think of these as my "good data hygiene".
Lesson: Always keep your raw data separate... I mean really raw
No matter how simple your dataset is, always keep the raw format of your data and organise it in such a way that it cannot be tampered with. Sometimes I am tempted just to open the dataset in Excel and reformat the dates to make them nicer, but once you have tampered with the data, you have lost the original. Six months down the track, you might think the data is the raw source, but it is not anymore.
One way to tackle this temptation is to create a folder purely for storing "raw" data. And if you are paranoid that you might tamper with it "unconsciously", you can always set the folder to "Read-only", which gives you another layer of protection/warning. Then create another folder just for processing, copy the raw data into this folder, and start modifying it to suit your needs.
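As a minimal sketch (the folder layout and file name here are just examples, not a prescription), the setup could look like this in R:

# Create a folder for raw data and a separate one for processing
dir.create("data/raw", recursive = TRUE, showWarnings = FALSE)
dir.create("data/processing", recursive = TRUE, showWarnings = FALSE)

# Put the untouched source file into the raw folder once...
file.copy("survey_responses.csv", "data/raw/survey_responses.csv")

# ...then make the raw folder read-only as an extra guard (on Unix-like systems)
Sys.chmod("data/raw", mode = "0555")

# Always work on a copy in the processing folder, never on the raw file
file.copy("data/raw/survey_responses.csv", "data/processing/survey_responses.csv")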
Lesson: Plan your data processing in multiple stages
"Plan for data processing" is probably common knowledge for everyone. However, most of the time, planning doesn't work that well for me when it comes to data processing, especially when I haven't decided concretely how I am going to analyse the data, and I don't yet know how dirty or messy the data is. The lesson I learnt was to go through the data and do some very "hacky" interim analysis (not to be used in the final results, of course) to find out the nature of the data - call it "exploratory" if you like. Here you can get a sense of the data quality.
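For what it's worth, my "hacky" first pass is usually nothing more than a handful of base R calls like these (the file and column names are made up for illustration):

# Quick-and-dirty look at a freshly loaded dataset
dat <- read.csv("data/processing/survey_responses.csv", stringsAsFactors = FALSE)

str(dat)                                  # variable types at a glance
summary(dat)                              # ranges, obvious outliers, NA counts
sapply(dat, function(x) sum(is.na(x)))    # missing values per column
table(dat$site, useNA = "ifany")          # spot inconsistent category labels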
Then you can start dividing the data processing into multiple stages (no set rules here). But the first step I suggest is always to make your data as clean as you can, because of "garbage in, garbage out": the cleanliness of the data will have a ripple effect down the track. By cleanliness, I don't mean missing data, but formatting, spelling errors, etc. in the values of the variables that you are interested in.
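A rough cleaning pass in base R might look something like this (the column names and the specific fixes are purely illustrative):

clean <- dat
clean$name   <- trimws(clean$name)                        # strip stray whitespace
clean$gender <- tolower(clean$gender)                     # "Male"/"male" -> "male"
clean$gender[clean$gender == "m"] <- "male"               # harmonise inconsistent codes
clean$gender[clean$gender == "f"] <- "female"
clean$visit_date <- as.Date(clean$visit_date, format = "%d/%m/%Y")   # one date format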
The next step could be data transformation. From the clean dataset, extract the variables of interest and, if you have multiple data sources, merge them into one large data table. The resulting data becomes your master data.
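Continuing the illustration, two hypothetical cleaned sources could be joined on a shared id column like this:

demographics <- read.csv("data/processing/demographics_clean.csv")
measurements <- read.csv("data/processing/measurements_clean.csv")

# Keep only the variables of interest, then merge into one master table
master <- merge(
  demographics[, c("id", "age", "gender")],
  measurements[, c("id", "visit_date", "score")],
  by = "id"
)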
After you have your master data, you can create small extracts from it for specific needs, such as performing analysis or plotting. I find I like to use a subset of the master data for plotting, especially since ggplot often prefers the "long" table format rather than "wide".
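For example, a wide extract can be reshaped into long form before plotting; here I use tidyr's pivot_longer, which is one common way to do it (the data frame below is made up for illustration):

library(tidyr)
library(ggplot2)

# Hypothetical wide extract from the master data
wide_scores <- data.frame(id = 1:3,
                          score_before = c(10, 12, 9),
                          score_after  = c(14, 15, 13))

plot_long <- pivot_longer(wide_scores,
                          cols = c("score_before", "score_after"),
                          names_to = "timepoint",
                          values_to = "score")

ggplot(plot_long, aes(x = timepoint, y = score)) + geom_boxplot()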
Lesson: Save intermediate data files
This tip will most likely save you time, but don't overdo it.
You could write your data script so that it performs the data processing from the raw data every time you run a specific analysis. However, this will slow down your analysis, because the data processing is the same every time you run it. So once you are happy with your master data, save a copy of the data table structure to a file (serialise it) with save(object, file = filename) in R; then you can save time by just loading this file back and running your analysis, using load(file = filename).
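With the master data from the earlier sketch, that might look like this (the file name is just an example):

# Serialise the master table once the processing stage is done
save(master, file = "data/processing/master_data.RData")

# In the analysis script, skip the processing and restore the object by name
load(file = "data/processing/master_data.RData")   # brings 'master' back into the workspace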
What I mean by "overdo this" is: if you have multiple stages, just store the main results of the stages that you think you won't change often. Otherwise, you might lose track of which version of the data files was sourced from which, and end up having to start from scratch again.
Summary
So these are the three lessons that I can think of so far. I will keep posting whenever I think of more that could be useful for my future self, and for you, if you are reading this and find them helpful.