Tuesday, 19 February 2019

Upskilling: learning to code in R (Erika Duan)

Image by Erika Duan
I learnt to code in R out of necessity.

First, part of my research involved the exploration and visualisation of extremely large datasets.

Second, it never hurts to learn a new job skill.

R is a programming language used mostly by statisticians, bioinformaticians and data scientists (due to its vast library of statistical, data exploration and visualisation tools). Code is usually written in a program called RStudio. As a beginner, I found the RStudio environment very friendly, as it allows you to write code, test it in small segments (or ‘chunks’) and quickly visualise your results.

When I first started learning R, I had no experience with coding and was utterly clueless. My first coding session happened at an introductory workshop held by Research Bazaar (ResBaz) in 2016. ResBaz is an annual research fair, held early in the year at the University of Melbourne, which promotes computer literacy and digital research. Although I emerged from the workshop only slightly more knowledgeable than before, the biggest impact of attending ResBaz was its introduction to online research communities via Twitter. This allowed me to connect with different data science communities and researchers, and their posts have led me to useful resources for learning how to code.


But back to my coding struggles...

In retrospect, I held a misconception that learning to code was very similar to learning a new foreign language. I imagined I would first spend many months learning the basic principles of coding (i.e. specific words and how to join them), before slowly constructing simple but functional sentences. In practice, however, this approach would have delayed my own research analysis by many months, if not years!

The most efficient way to learn how to code, once you’ve attended a few introductory courses, is to Google your specific questions and practise on different datasets (especially your own research datasets). Regular practice is very helpful because coding packages evolve and supersede each other, so language mastery is a dynamic process rather than a static end goal.

You will never know how to code everything, but the more you practise, the more quickly you will pick up new things.

Since my early days as a clueless beginner, some of my important coding milestones have been:

Attendance at introductory classes

There are a lot of excellent introductory resources available online, including ones from Software Carpentry (https://swcarpentry.github.io/r-novice-gapminder/) and the Johns Hopkins data science lab (https://jhudatascience.org/chromebookdatascience/). The La Trobe University digital research team (https://www.latrobe.edu.au/research-infrastructure/digital-research) regularly hosts introductory coding sessions where you can code with your peers and, more importantly, access help from friendly instructors.

Learning how to find help

Google is the best resource when you want to learn how to write new lines of code. There are several avenues for getting help when you are stuck:
  • Googling will usually direct you to a solution via posts from online help forums like https://stackoverflow.com
  • Official help files and package documentation are also available within RStudio and provide a concise guide to using a function and its optional arguments.
  • R cheatsheets graphically summarise many commonly used coding packages, each in a single PDF, and these can also be accessed via RStudio or online (https://www.rstudio.com/resources/cheatsheets/).
  • Many helpful R tutorials exist in the form of blog posts and these are often aggregated by sites like https://www.r-bloggers.com/. These are particularly excellent sources of help if you are interested in using a specific new package.
When seeking help, it is important to learn how to write clear and specific coding questions. In general, simplifying your question (by removing research-specific jargon) and addressing it in reference to an example dataset can help others to clearly visualise your coding problem.
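
As a rough sketch of what such a question could look like, the example below rebuilds a hypothetical research question ("how do I calculate the mean of one column for each group?") around the built-in iris dataset. The choice of dataset, the question and the object names are assumptions made purely for illustration.

```r
# A small reproducible example for a help forum question.
# The built-in 'iris' dataset stands in for a research dataset (an assumption).

# Keep only the columns that matter for the question
example_data <- iris[, c("Species", "Sepal.Length")]

# Show others what has been tried so far: mean sepal length per species
mean_by_group <- aggregate(Sepal.Length ~ Species, data = example_data, FUN = mean)
print(mean_by_group)

# dput() prints a copy-and-paste version of a small dataset for sharing
dput(head(example_data, 3))

# Help files can be opened from the console (uncomment to run interactively)
# ?aggregate
# help("aggregate")
```

Pasting the output of dput() into a question lets other people recreate exactly the same data on their own computer.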

Learning how to read coding errors

As a beginner, encountering coding errors can feel daunting. To counter this, it is helpful to treat coding errors as friendly editing prompts. Error messages help pinpoint the location of the coding error and provide simple clues regarding the nature of the error (whether it is a syntax error, a data formatting error etc.). From experience, beginner-level coding errors are more likely to involve simple syntax mistakes that can be easily corrected (e.g. missing a bracket, misspelling a word or writing part of the code in reverse order).
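
To make this concrete, here is a minimal sketch that deliberately triggers three beginner-style mistakes and captures each error message as text, so they can be read side by side. The helper read_error() is a hypothetical name invented for this illustration and is not part of R.

```r
# Hypothetical helper: run some code and return its error message as text
# instead of stopping the script.
read_error <- function(expr) {
  tryCatch(
    {
      expr          # the code is only evaluated here, inside tryCatch()
      "no error"
    },
    error = function(e) conditionMessage(e)
  )
}

# Mistake 1: a misspelt function name ('maen' instead of 'mean')
msg_typo <- read_error(maen(c(1, 2, 3)))

# Mistake 2: a data formatting problem (adding a number to text)
msg_type <- read_error(1 + "2")

# Mistake 3: a missing closing bracket, checked by parsing the code as text
msg_bracket <- read_error(parse(text = "mean(c(1, 2, 3)"))

print(msg_typo)
print(msg_type)
print(msg_bracket)
```

Each message names the offending item or the place where R stopped reading, which is usually enough of a clue to find and fix the line.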

Learning how to use the tidyverse package

A package contains a collection of functions designed for a specific purpose, like calculating p values or drawing a scatter plot. Different packages are often used together to solve complex data analysis questions. One of my favourites is called tidyverse.

The tidyverse is built around the concept of tidy data, where each variable is stored in its own column, each observation is stored in its own row and each value is stored in its own cell. Using the tidyverse allows you to easily select specific variables, filter for specific types of observations, create new variables, and regroup and summarise existing subsets of data. The tidyverse also contains a data visualisation package, ggplot2, which is excellent for plotting graphs. Mastering tidyverse was my 'Eureka!' moment in coding and, today, tidyverse functions constitute a core component of my data analysis pipelines. The official guide to using tidyverse, called R for Data Science, is freely available at https://r4ds.had.co.nz.
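
The operations described above can be sketched in a few lines. This example assumes the dplyr and ggplot2 packages (both part of the tidyverse) are installed, and uses the built-in iris dataset as a stand-in for real research data. The column name sepal_ratio and the cut-off of 5 are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)

# 'iris' is already tidy: one flower per row, one measurement per column
iris_summary <- iris %>%
  select(Species, Sepal.Length, Sepal.Width) %>%          # select specific variables
  filter(Sepal.Length > 5) %>%                            # keep specific observations
  mutate(sepal_ratio = Sepal.Length / Sepal.Width) %>%    # create a new variable
  group_by(Species) %>%                                   # group observations by species
  summarise(mean_ratio = mean(sepal_ratio),               # summarise each group
            n_flowers = n())

print(iris_summary)

# Build a scatter plot with ggplot2; print(sepal_plot) would display it
sepal_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()
```

Each step is joined to the next with the pipe (%>%), so the code reads from top to bottom as a sequence of actions on the data.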

Learning how to organise coding projects

The biggest downfall in jumping straight into coding is that you miss out on learning about project organisation until much later (until you have five different data analysis folders in the same project and four different graphs labelled with a similar name...).

Gold standards for project organisation include assigning both computer and human friendly names to projects and files, maintaining separate folders for raw data (treated as non-writable files) and data analysis (treated as temporary or modifiable files) and utilising version control to track project changes. Excellent articles on this topic have been written by the RStudio software engineer and academic Jenny Bryan (@JennyBryan on Twitter; her website is here: https://jennybryan.org/).
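
One possible way to set up such a layout from within R is sketched below. The project name, folder names and file name are all hypothetical, and the project is created inside a temporary directory so that nothing on your computer is overwritten. Version control is not shown here.

```r
# Hypothetical project created in a temporary directory (for illustration only)
project_dir <- file.path(tempdir(), "2019-02_example-project")

# Separate folders for raw data, processed data, code and figures
folders <- c("data_raw", "data_processed", "scripts", "figures")
for (folder in folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# A file name that is both computer and human friendly:
# a date first, no spaces, and words separated by hyphens
raw_file <- file.path(project_dir, "data_raw", "2019-02-19_example-raw-data.csv")

# Save the raw data once, then mark the file as read-only
if (!file.exists(raw_file)) {
  write.csv(head(iris), raw_file, row.names = FALSE)
  Sys.chmod(raw_file, mode = "0444")  # has limited effect on Windows
}

print(list.files(project_dir))
```

Keeping the raw data read-only means any cleaning or analysis has to be written to a different folder, so the original files are never changed by accident.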

Overall, learning to code in R has been incredibly useful to me.

Today, we are surrounded by enormous amounts of data and the ability to analyse them can help generate very useful insights. As visual examples are helpful for learning to code, I’ve provided an example of how coding in R can be used to draw interesting insights from the 2018 NHMRC funding outcomes (https://github.com/erikaduan/R-tips/blob/master/NHMRC_analysis_2018.md).

---------------------------------------


Erika Duan is a postdoc in the Chen T-cell laboratory, Department of Biochemistry and Genetics, LIMS, La Trobe University. 

Her research utilises coding in R and 'big data' analysis workflows to understand how immune cells can detect and respond to pathogen invasions. 

Erika tweets at @ErikaDuan, blogs at https://inscientist.wordpress.com/ and can now be found writing R coding tips on https://github.com/erikaduan/R-tips
