|Year : 2020 | Volume
| Issue : 2 | Page : 107-112
Statistical corner: Using R to build, analyse and plot clinical neurological datasets
Mikko Jaakko Pyysalo1, Teemu Vesterinen2
1 City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic, Brain Pathology Research Group, University of Tampere, Tampere, Finland
2 Hemorrhagic, Brain Pathology Research Group, University of Tampere; Vestigo Data Mining, Development Engineer, Tampere, Finland
|Date of Submission||28-Nov-2020|
|Date of Acceptance||02-Dec-2020|
|Date of Web Publication||3-Feb-2021|
Dr. Mikko Jaakko Pyysalo
City of Tampere, Oral Health Services; Oral and Maxillofacial Unit, Tampere University Hospital; Hemorrhagic, Brain Pathology Research Group, University of Tampere, Tampere
Source of Support: None, Conflict of Interest: None
Introduction: In the field of medical research, large volumes of data need to be analysed accurately, and it is crucial to pre-process the data before it can be analysed. The 'R' environment is a programming language and environment for statistical computing and graphics suitable for the analysis of data sets.
Objectives: To provide examples on how to utilise the R language for data processing, and its usefulness for medical researchers.
Materials and Methods: Two real world datasets, ie, data for: 'Effect of Morning Blood Pressure Peak on Early Progressive Ischemic Stroke: A Prospective Clinical Study' and data for: 'Impact of early surgery of ruptured cerebral aneurysms on vasospasm and hydrocephalus after SAH: our preliminary series' have been used to present an example for two different approaches for the process of data analysis using R.
Results: Accurate and tidy data sets were obtained.
Conclusions: R is a reliable environment for the processing of large data sets.
Keywords: Analysis, data, neurology, statistics
|How to cite this article:|
Pyysalo MJ, Vesterinen T. Statistical corner: Using R to build, analyse and plot clinical neurological datasets. J Cerebrovasc Sci 2020;8:107-12
|How to cite this URL:|
Pyysalo MJ, Vesterinen T. Statistical corner: Using R to build, analyse and plot clinical neurological datasets. J Cerebrovasc Sci [serial online] 2020 [cited 2021 Sep 17];8:107-12. Available from: http://www.jcvs.com/text.asp?2020/8/2/107/308633
| Introduction|| |
In the field of medical research, datasets are getting bigger due to a variety of reasons. For example, modern sequencing technology can produce over 1000 gigabytes of genomic data in a day. Researchers need specific tools to be able to handle big datasets. It is crucial to perform data pre-processing before it can be analysed. In this article, we present an overview and examples of how to use one tool for big data analyses: the R environment. The example datasets presented here are rather small in size to allow performing the analyses in every laptop. We show how to combine a classical approach with modern wide data-screening procedures in the field of data analysis. All the code examples and plots are provided in the supplemental data.
R is a programming language and environment for statistical computing and graphics. It provides a wide variety of statistical and graphical techniques being also highly extensible. R can be downloaded to Windows, MacOS and Linux platforms from https://www.r-project.org/webpage. All the necessary basic information regarding downloading and usage are available at the webpage as well. RStudio is an integrated development environment for R. Free RStudio can be downloaded from https://rstudio.com/webpage. You can use R without RStudio, but there are some very useful properties in RStudio such as console; syntax-highlighting editor and tools for plotting, history, debugging and workspace management, which will help the user in the long run. Basic statistical analyses can be performed using base R, but there are extensions, called packages, to make more advanced analyses possible. R packages are collections of functions and data sets developed by the R community. For example, a great package for modern data science is tidyverse by Hadley Wickham. See https://www.tidyverse.org/for further information.
| Downloading Data to R|| |
First, you should tell R where the working directory is located. This is the directory, where the data can be read from and written to. You can set the working directory using RStudio functionalities or by executing command setwd(). My “R_projects” folder is placed on my desktop and the command I write is setwd ('~/username/desktop/R_projects'). If you use RStudio, you need to define this working directory (wd) only once.
R supports many different data formats. You can create a dataset as a simple text file (.doc .txt and many other formats) or using a spread sheet programs like Excel.
read.table() function reads datasets from .txt, .doc and .docx files. .csv files are read using read.csv() function. See further instructions on https://www.rdocumentation.org/packages/utils/versions/3.6.2/topics/read.table
Other than text, files can be read into R using functions (e.g., tidyverse package): read_sas() for SAS data, read_sav() for SPSS data, read_dta() for Stata data and read_excel() for Excel data.
| Creating a Dataset|| |
As an example, you can create a simple dataset: Write a .txt file:
sex; age; treatment; result
M; 65;operative; good
F; 39;endovascular; moderate
M; 79;operative; poor
and store it to your working directory. Name the file as 'test_data.txt'. Now you can read the dataset to R by a simple command:
test_data <- read.table('test_data.txt', header = TRUE, sep = ';')
Header = TRUE means that the first line is described as a header line. The header line represents column variable names and sep = ';' tells that variables are separated by a semicolon. Now, type test_data and you will see the data frame you created.
When naming variables, you should avoid arithmetic characters and commas in variable names (+, -,/, ^, etc.) as they might cause problems during analyses. In variable names, R accepts letters, numbers, '.'and '_'. If a variable name contains more than one word, snake case is a good option: 'age_in_years' is a good and clear variable name. It depends on your personal preferences which way to use: ageInYears, age.in.years are good options as well. My personal preference is to use snake case (age_in_years) because there will be enough space between the words and the readability is good. Different countries use different symbols to separate decimal numbers and thousands. R uses only decimal point, everything else ends up in error. Therefore, 3.0 is right and 3,0 is wrong.
Nowadays, it is more comfortable to create a dataset using Excel or any other spreadsheet platform than comma (tab and space are common separators, too) separated text files. This might look confusing especially in case of larger datasets. The same principles, regarding variable names and values, should be noticed from the very beginning of the dataset building to avoid subsequent difficulties. It is a good idea to keep all the variable names as informative (but short and clear) as possible. Remember to store cells containing numeric data as numeric cells and character data as character, etc.
In tidy dataset:
- Each variable forms a column
- Each observation forms a row
- Each type of observational unit forms a table.
When the structure of a dataset is tidy, all the necessary analyses can be performed with minimal extra effort. Cleaning a messy dataset is possible but it can be very time consuming. When analysing very large data sets, most of the time is consumed in data pre-processing. Tidyverse package contains tools for tidying the messy dataset. To avoid the time-consuming process of tidying, you should keep the principles of tidy data in mind when building up a dataset.
| Example Analyses|| |
As example analyses, we will use two real world datasets Data for: Effect of Morning Blood Pressure Peak on Early Progressive Ischemic Stroke: A Prospective Clinical Study and Data for: Impact of early surgery of ruptured cerebral aneurysms on vasospasm and hydrocephalus after SAH: our preliminary series. These datasets are open-access spread sheets. In the first section, we will perform some basic analyses and plotting using Guo et al. dataset. In the beginning, we need to install and then load required R packages. Command install.packages() downloads and installs the package into the use of R. This command does not load the package into the memory. It just makes the package available for R. All used packages must be loaded before using them using library() command.
install.packages('tidyverse'); install.packages('readxl'); install.packages('magrittr')
library (tidyverse); library (readxl); library (magrittr)
Store the dataset by Guo et al. to your R-projects directory and read the data into R.
test_data <- read_excel('Dataset.xls')
To see the overall structure of the dataset and variables, type
The easiest way to handle the variables of the dataset is to use attach() command. Type attach(test_data) and now you can access the variables just by typing a variable name. Type age and you´ll see all the age values. Type mean(age) and R will give you a mean age. Now you can explore individual variables easily using summary() command, for example summary(Age) gives you a brief summary of age distribution.
Then, we can start doing some plotting and analyses. First, we´ll find out if age explains body mass index (BMI). There are many ways for plotting in base R, but tidyverse package contains versatile and widely used functions for creating publicable plots. In the following examples, we will use pipe-operator %>% from magrittr package. First, let´s create a plot [Figure 1] using tidyverse functionalities.
First, transfer test_data dataset for the next function, then select sex, age and BMI variables and move for the next phase. Create an empty plot, where age is on x-axis and BMI on y-axis. + symbol tells R that next components are added to the plot as layers on top of each other. Add colour-coded (by sex) points and two linear model lines. Method = 'lm' means that line is drawn using linear model, further defined by formula= y ~ x. You can either draw standard error zone (se=T) or not (se=F) of which T represents TRUE and F FALSE. The line colours are defined by aes(colour=Sex) similarly with previous line. R default colours are good, but sometimes you need to choose the colours by yourself. The last line tells R to stain female points and lines black and male points and lines red.
select(Sex,Age, BMI) %>%
ggplot(mapping = aes(x=Age, y=BMI) ) +
geom_smooth(method = 'lm', se= F, formula= y ~ x, aes(colour=Sex)) +
scale_color_manual(values = c('F' = 'red', 'M' = 'black'))
After plotting, we will define Pearson's correlation coefficients for both groups. The first line tells R to transfer test_data dataset for further processing and store the final result as male_correlation. Then, we´ll select only the variables of interest and filter females (first example) and males (second example) away. Next, operator %$% is used when dataset is passed to further calculations. Cor() function (with default parameters) gives Pearson's correlation coefficient for BMI/age relationship. An argument use='complete.obs' tells R to exclude pairs with missing values from the analyses. A line starting with # is a comment line. Using comments helps other people to understand your analysis protocol.
select(Sex,Age, BMI) %>%
cor(BMI, Age, use='complete.obs')
select(Sex,Age, BMI) %>%
To see the coefficients, type
The same thing can be done faster using tidyverse functionality. The first two lines should be familiar by now. The third line groups the data into two subsets by sex. Finally, summarise() function performs correlation analyses simultaneously for both groups.
select(Sex,Age, BMI) %>%
summarise(cor(BMI, Age, use='complete.obs'))
In males, it seems that BMI decreases by age, but is it a significant finding? Next, we will filter females away from the dataset and perform a simple regression analysis to see the answer. General linear model glm() requires the name of the dataset and a formula (in this case BMI~Age, of which BMI is the variable to be explained by Age) to perform regression analysis.
select(Sex,Age, BMI) %>%
male_model <- glm(data = male_test_data, BMI~Age )
We show more advanced analyses using second dataset by D'Andrea et al.
New packages (GGally, car and rgl) are installed, and other necessary packages are loaded and datasets 'Foglio1' sheet is read into R from working directory:
install.packages('car'); install.package('GGally'); install.packages('rgl')
library(car); library(rgl); library(tidyverse); library(readxl); library(GGally)
data_x<-read_excel('dati ultra early aneurysm.xlsx', sheet='Foglio1')
Just like we mentioned earlier, most of the big data analysis projects contain data preparation. In this dataset, variable names contain other characters than letters and numbers. A good way is to allow only those in variable names to diminish the risk of subsequent problems. Next line shows, how gsub() function is used to replace everything else but alphabets and numbers ([^[:alnum]]) with '_' in variable names. [^[:alnum]] is a regular expression, which are sequences of characters that define a search pattern. They are usually used by string-searching algorithms in programming languages. See https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf for further instructions.
names(data_x) <- gsub('[^[:alnum:]]', '_', names(data_x))
In these data, some variable values that should be numeric are presented as character format. They should be converted to numeric.
Some string values need to be cleaned: value 'no' should be replaced with NA, which represents missing value in R.
data_x[data_x$hematoma_volume_mm_3 == 'no', 'hematoma_volume_mm_3'] <- NA
An example of R handling values in table format is shown above. We can alter values in table by defining these variables in square brackets. The first value inside the brackets indicates the row and the second value indicates the column [rows, columns]. The example finds from column 'hematoma_volume_mm_3' every value where column 'hematoma_volume_mm_3' has value 'no' in a row and replaces those values with NA.
After conversion from string to NA, it can be converted to numeric:
Now, the dataset is prepared for plotting and analyses. First, let us plot the relationship between the age and the timing of the surgery. The protocol is similar to the one described earlier. We select variables of interest and plot them grouped by sex [Figure 2].
select(Sex,Age, Timing_surgery) %>%
ggplot(mapping = aes(x=Age, y=Timing_surgery) ) +
geom_smooth(se= F, formula= y ~ x, aes(colour=Sex)) +
scale_color_manual(values = c('F' = 'red', 'M' = 'black'))
Then, we will create a new variable: categorical age. cut() function cuts a numeric variable into smaller fractions and gives each fraction a defined label. Again, summary() function gives us basic information (we need quartiles in this case) about the variable.
Then, we will create a categorical age variable (with four ticks) and plot a boxplot [Figure 3] that shows the relationship of the categorical age and the timing of the surgery. A function mutate() creates a new variable from the existing ones. In this example, we will tell R to take data_x dataset, create a variable called categorical age by cutting the existing Age variable into four pieces, and name them as '29_49', '49_62.5', '62.5_69' and '69_85'.
mutate(categorical_age = cut(Age, breaks = c(29,49,62.5,69,85), labels = c('29_49', '49_62.5', '62.5_69', '69_85'), include.lowest = T))
Creating a dichotomic age variable (categorical_age_binary) works similarly
mutate(categorical_age_binary = cut(Age, breaks = c(29,62.5,85), labels = c('young', 'old'), include.lowest = T))
We can perform a Mann–Whitney test to find out if older people have bigger aneurysms.
Nowadays, in the era of big data, exploring big datasets can be time consuming if we plot models one by one. R provides a very useful function ggpairs() to create bigger sets of plots on one sheet. In next examples, there is one new feature of R: it is possible to select specific rows and columns using [rows, columns] definition. In next examples, ggpairs(data_x(,c(4,5,6,8,9,11,12,14,17,21,22)) means that ggpairs() function is executed to data_x all rows (there is nothing before the comma) from columns 4, 5, 6, 8, 9, 11, 12, 14, 17, 21 and 22. If you would like to select rows 1 to 10 and columns 5 to 6, you would type data_x[c(1:10),c(5:6)]. In the following examples, data points are coloured according to the age or sex. All the figures are shown in the supplementary files.
Ggpairs() calculates correlations between given variables and creates scatter plots about them. By default, the correlations are presented in upper half, scatter plots are presented in lower half and a variable distribution in the diagonal.
It is easy to define some extra features, which help in the analysis. Just like we presented earlier, we can plot data with categorical groups. We are now plotting separate groups defined by sex. The variable distributions are presented with alpha level 0.4, which makes the lines transparent. We also want a regression line in the lower half with a variance. This can be done by lower=list(continuous='smooth'). Combo='dot' tells that the variable pair will be plotted with dots.
ggpairs(data_x(,c(4,5,6,8,9,11,12,14,17,21,22)), aes(color = data_x$Sex, alpha=0.4), lower=list(continuous='smooth', combo='dot'))[Supplementary Figure 1].
It is easy to find out variable dependencies by using different colour groups. Ggpairs() can plot in many different ways and you can define your own functions also.
ggpairs(data_x(,c(4,5,6,8,9,11,12,14,17,21,22)), aes(color = data_x$categorical_age, alpha=0.4), lower=list(continuous='smooth', combo='dot')) [Supplementary Figure 2]
ggpairs(data_x(,c(5,6,14,21,22)), aes(color = data_x$Sex, alpha=0.4), lower=list(continuous='smooth', combo='dot')) [Supplementary Figure 3]
ggpairs(data_x(,c(5,6,14,21,22)), aes(color = data_x$categorical_age, alpha=0.4), lower=list(continuous='smooth', combo='dot')) [Supplementary Figure 4]
When three-dimensional interactive plots are needed, scatter3d() function is executed. This function requires all the categorical variables as factors:
groups=data_x$Sex, fit = 'smooth')
scatter3d(x=data_x$Age, y=data_x$Fisher, z=data_x$aneurysm_volume_mm_3, groups=data_x$categorical_age, fit = 'smooth')
It is possible to zoom and see the plot from different angles.
| Conclusion|| |
In these examples, we showed two different approaches to data analysis. The first example was a rather straightforward protocol in which analyses could be started immediately. On the contrary, in the second protocol, more extensive data preparation was needed and plotting was done in larger scale. Nowadays in the era of big data, these data preparation and whole data-plotting procedures are often needed before analyses.
| References|| |
Yi G, Yitao H. Data for: Effect of Morning Blood Pressure Peak on Early Progressive Ischemic Stroke: A Prospective Clinical Study, Mendeley Data 2019. doi: 10.17632/r8j84ksgf9.1.
D'Andrea G. Data for: Impact of early surgery of ruptured cerebral aneurysms on vasospasm and hydrocephalus after SAH: our preliminary series, Mendeley Data 2020. doi: 10.17632/jwgcwj58j8.1.
[Figure 1], [Figure 2], [Figure 3]