Basic inference in R

Navigating R can be daunting, given the vast number of packages and options for carrying out analyses.

Here we have curated a few options for some simple inferential procedures (one- and two-sample inference for means) to help you get started. We assume you have some basic knowledge of R, and provide this as a quick reference for some common, simple methods of statistical inference.

Remember to first ensure you are in the correct working directory.
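As a minimal sketch of that check, the base R functions below report the current working directory and list the files in it. The path in the commented-out setwd() call is a hypothetical placeholder, not a real location.

```r
# Show the folder R is currently working in
getwd()

# List the files R can see there, to confirm the data file is present
list.files()

# Change the working directory if needed
# (hypothetical path; replace with your own before uncommenting)
# setwd("path/to/your/project")
```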

Load packages and read in data

library(tidyverse)
library(readxl)

By loading the tidyverse suite of packages, we can easily manipulate data and produce high-quality graphics using ggplot2. The readxl package is loaded separately so that Excel files can be read.

In this script, the data file is called MYDATA.RData. Replace this with the name of the file you wish to use.

load("MYDATA.RData")

Alternatively, you might read in data from, for example, an Excel file using the read_excel function, which is part of the readxl package.

MYDATA <- read_excel("MYDATA.xlsx")

One sample inference for the mean

In all of the code below, you will need to replace MYDATA with the name of your data frame.

You will also need to replace placeholders such as NUMERICAL_VARIABLE and CATEGORICAL_VARIABLE with the appropriate variable names from your data.

Start with an appropriate plot:

ggplot(MYDATA, aes(x=NUMERICAL_VARIABLE)) +
  geom_histogram() +
  labs(x="NICE AXIS LABEL")

ggplot(MYDATA, aes(x=NUMERICAL_VARIABLE)) +
  geom_dotplot() +
  labs(x="NICE AXIS LABEL") +
  scale_y_continuous(breaks=NULL)

ggplot(MYDATA, aes(x=NUMERICAL_VARIABLE)) +
  geom_boxplot() +
  labs(x="NICE AXIS LABEL") +
  scale_y_continuous(breaks=NULL)

Obtain the summary statistics:

MYDATA %>%
  summarise(Mean = mean(NUMERICAL_VARIABLE),
            SD = sd(NUMERICAL_VARIABLE),
            n = n())

If there are missing data in the numerical variable, then the following code will be required:

MYDATA %>%
  summarise(Mean = mean(NUMERICAL_VARIABLE, na.rm = TRUE),
            SD = sd(NUMERICAL_VARIABLE, na.rm = TRUE),
            n = sum(!is.na(NUMERICAL_VARIABLE)))

A simple confidence interval for a population mean:

t.test(MYDATA$NUMERICAL_VARIABLE, conf.level=0.95)

This assumes that the data arise from a random sample from a Normal distribution.
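To see what t.test is reporting, the sketch below rebuilds the 95% interval by hand as the sample mean plus or minus a t critical value times the standard error, and compares it with the conf.int element of the t.test result. The vector x holds made-up values used only for illustration.

```r
# Hypothetical measurements, made up purely for illustration
x <- c(5.1, 4.8, 6.2, 5.5, 5.9, 4.7, 5.3, 6.0)

n      <- length(x)
xbar   <- mean(x)
se     <- sd(x) / sqrt(n)         # standard error of the mean
t_star <- qt(0.975, df = n - 1)   # critical value for 95% confidence

# Interval computed by hand: mean +/- t* times the standard error
by_hand <- c(xbar - t_star * se, xbar + t_star * se)

# Interval reported by t.test(), stored in the conf.int element
from_t_test <- t.test(x, conf.level = 0.95)$conf.int

by_hand
from_t_test
```

The two intervals should agree up to rounding. The same $conf.int extraction works on any of the t.test calls in this document.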

Inference for the mean difference: Paired samples

If the differences are stored in the data frame, the code for one sample inference for a mean can be used.

Appropriate graphs are based on the difference scores. For example, if the paired data are in two separate columns:

MYDATA %>%
  mutate(Differences = Column1 - Column2) %>%
  ggplot(aes(x = Differences)) +
  geom_boxplot() +
  labs(x = "NICE AXIS LABEL") +
  scale_y_continuous(breaks = NULL)

The statistical inference can be carried out using the following code:

t.test(MYDATA$Column1, MYDATA$Column2, paired = TRUE, conf.level=0.95)

This assumes that the differences are a random sample from a Normal distribution.
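The paired analysis is the one-sample analysis applied to the differences. The sketch below illustrates this with made-up before and after scores (hypothetical values and variable names), running the test both ways and showing the two confidence intervals.

```r
# Hypothetical before/after scores for six subjects (made up for illustration)
before <- c(12.1, 10.4, 11.8, 13.0, 9.7, 12.5)
after  <- c(11.2, 10.1, 10.9, 12.2, 9.9, 11.4)

# Paired t-test on the two columns
paired_result <- t.test(before, after, paired = TRUE, conf.level = 0.95)

# One-sample t-test on the differences (before minus after)
diff_result <- t.test(before - after, conf.level = 0.95)

# Both approaches should give the same interval
paired_result$conf.int
diff_result$conf.int
```

Note that the differences are taken as the first argument minus the second, so the sign of the interval depends on the order of the two columns.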

Inference for the difference of means: Independent samples

Start with an appropriate plot:

ggplot(MYDATA, aes(y = CATEGORICAL_VARIABLE, x = NUMERICAL_VARIABLE)) +
  geom_dotplot(binaxis = 'x', dotsize = 0.5) +
  labs(y ="NICE Y-AXIS LABEL", x = "NICE X-AXIS LABEL")


ggplot(MYDATA, aes(y = CATEGORICAL_VARIABLE, x = NUMERICAL_VARIABLE)) +
  geom_boxplot(width = 0.4) +
  labs(y ="NICE Y-AXIS LABEL", x = "NICE X-AXIS LABEL")

Use the following code if the numerical measurements are stored in one column of the data frame and the group labels in a second column.
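If the two groups are instead stored in separate columns, one possible approach is to reshape the data first. The sketch below uses base R's stack() on a small made-up data frame. The names wide, long, GroupA and GroupB are hypothetical, and tidyr's pivot_longer() is an alternative for the same job.

```r
# Hypothetical wide-format data: one column per group (values made up)
wide <- data.frame(
  GroupA = c(23.1, 25.4, 22.8, 24.9, 26.0),
  GroupB = c(27.3, 26.1, 28.4, 25.9, 29.0)
)

# stack() puts all measurements in 'values' and the source column name in 'ind'
long <- stack(wide)
names(long) <- c("NUMERICAL_VARIABLE", "CATEGORICAL_VARIABLE")

# The reshaped data now has one measurement column and one grouping column
long

# A formula-based call can then be applied to the reshaped data
welch <- t.test(NUMERICAL_VARIABLE ~ CATEGORICAL_VARIABLE, data = long,
                conf.level = 0.95)
welch$conf.int
```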

Summary statistics:

MYDATA %>%
  group_by(CATEGORICAL_VARIABLE) %>%
  summarise(Mean = mean(NUMERICAL_VARIABLE, na.rm = TRUE),
            SD = sd(NUMERICAL_VARIABLE, na.rm = TRUE),
            n = sum(!is.na(NUMERICAL_VARIABLE)))

Confidence interval and t-test, without assuming equal variances:

t.test(NUMERICAL_VARIABLE ~ CATEGORICAL_VARIABLE, data = MYDATA, conf.level=0.95)

Confidence interval and t-test, assuming equal variances:

t.test(NUMERICAL_VARIABLE ~ CATEGORICAL_VARIABLE, data = MYDATA, var.equal=TRUE, conf.level=0.95)
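To illustrate how the two versions differ, the sketch below runs both on a small made-up data frame (example_data is a hypothetical name, and the values are invented) and compares the degrees of freedom and intervals. The equal-variance test uses n1 + n2 - 2 degrees of freedom, while the unequal-variance (Welch) test uses an approximation that is usually fractional and never larger.

```r
# Hypothetical long-format data (values made up for illustration)
example_data <- data.frame(
  NUMERICAL_VARIABLE   = c(23.1, 25.4, 22.8, 24.9, 26.0,
                           27.3, 21.1, 31.4, 24.9, 33.0),
  CATEGORICAL_VARIABLE = rep(c("A", "B"), each = 5)
)

# Without assuming equal variances (the default in t.test)
welch <- t.test(NUMERICAL_VARIABLE ~ CATEGORICAL_VARIABLE, data = example_data)

# Assuming equal variances
pooled <- t.test(NUMERICAL_VARIABLE ~ CATEGORICAL_VARIABLE, data = example_data,
                 var.equal = TRUE)

# Degrees of freedom used by each version
welch$parameter
pooled$parameter

# Confidence intervals for the difference of means
welch$conf.int
pooled$conf.int
```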

Both methods assume that data in each group arise