Week 5 Activities (Correlation in R)

In this week’s workshop, we will learn how to examine relationships between variables using correlation analysis. We will focus on both simple correlations (between two variables) and multiple correlations (between several variables), and we will learn how to:

- Run descriptive statistics and check assumptions using jmv

- Conduct a simple correlation in R.

- Conduct a multiple correlation R.

- Produce clear and informative correlation plots

- Conduct an a priori power analysis for correlations

As a reminder, there are three types of correlations, as illustrated in the image below.

Let’s Get Set Up

Let’s begin by ensuring your working environment is ready for today’s session. Open RStudio or Posit Cloud and complete the following tasks to set everything up.

Activity 1: Set Up Your Working Directory & R Script for this week

One of the first steps in each of these workshops is setting up your working directory. The working directory is the default folder where R will look to import files or save any files you export.

If you don’t set the working directory, R might not be able to locate the files you need (e.g., when importing a dataset) or you might not know where your exported files have been saved. Setting the working directory beforehand ensures that everything is in the right place and avoids these issues.

  1. Click:
    Session → Set Working Directory → Choose Directory

  2. Navigate to the folder you created for this course (this should be the same folder you used for previous workshops).

  3. Create a new folder called week5 inside this directory.

  4. Select the week5 folder and click Open.

Don’t forget to verify your working directory before we get started. You do this by typing in the following command into the console

getwd()
[1] "C:/Users/0131045s/Desktop/Legacy/Programming/R/Workshops/rintro/activities/week5"

As in previous weeks we will create an R script that we can use for today’s activities. This week we can call our script 05-correlations

  1. Go to the menu bar and select:
    File → New File → R Script

    This will open an untitled R script.

  2. To save and name your script, select:

    File→ Save As, then enter the name:

    05-correlations

    Click Save

To make the output of our results appear more intuitive, we are going to turn off scientific notation. Scientific notation is a way of minimising space when reporting results with multiple decimal places. For example, if p = 1.1e-7, this corresponds to p = 0.00000011 (i.e., the decimal point is moved seven places to the left). In psychological research, extremely small p-values are typically reported as: p < .001. Therefore, scientific notation is not necessary for our purposes. We can turn it off by running:

options(scipen = 999)

Activity 2: Installing / Loading R Packages

Today we will use the following packages: • jmv for descriptive statistics and assumption checks • pwr for power analysis • car for additional assumption testing • correlation for running correlation analyses • performance for diagnostic checks • psych, ggplot2, and ggcorrplot for visualisations

REMEMBER: If you see the error - “Error in library(package name): there is no packaged calledpackage name”, you’ll need to install the package first by editing the following for your console:

install.packages("package name") #replace "package name" with the actual package name

Now load the required packages:

library(jmv) # this will help us run descriptive statistics 
library(pwr) # this will enable us to conduct power analysis  
library(car) # this is for checking assumptions
library(correlation) #this runs our correlation
library(performance) # this helps with our assumption checks
library(psych) # this will be helpful for our multiple correlation graph
library(ggplot2) #for generating plots
library(ggcorrplot) # for generating matrix correlation plots

Activity 3: Download and load in your dataset

Today we will analyse data from a study on sleep quality, personality factors, and psychological wellbeing. You will find the dataset on Canvas under Week 5.

  • SleepQuality.csv → the dataset we will use today

Step 1: Download the file

  1. Download SleepQuality.csv from Canvas.
  2. Move it into your week5 folder (this should be your working directory).

Step 2: Load the dataset into R

Once the file is in your week5 folder, load it into R and save it as a dataframe called df_sleep.

df_sleep <- read.csv("SleepQuality.csv")

Correlations

This dataset comes from a cross-sectional study examining the relationship between sleep difficulty, anxiety, and depression.

Before doing any analysis, it is always good practice to quickly inspect your dataset. The head() function prints the first few rows:

head(df_sleep)
  Participant SleepDifficulty  Optimism Depression  Anxiety
1           1        12.32332  6.895218   25.73838 50.14666
2           2        12.03404 12.593687   26.36019 50.59385
3           3        12.05411  5.389429   27.75646 50.47726
4           4        13.47842  9.192558   25.01543 53.32908
5           5        16.32334  5.869162   32.23879 70.42495
6           6        18.55490  5.819719   36.75899 73.08169

Now that our environment is set up and our dataset is loaded, we are ready to run descriptive statistics and check our assumptions (using the same descriptives approach you learned last week).

Activity 4: Descriptive Statistics and Assumption Checks

We are interested in examining the relationship between depression and anxiety.

Specifically, we predict that higher depression scores will be associated with higher anxiety scores.

Hypotheses

H₀: There is no significant correlation between depression and anxiety.

H₁: Higher depression scores will be positively correlated with higher anxiety scores.

Because we are examining the relationship between two continuous variables, the appropriate analysis is a simple correlation.

Since our hypothesis specifies a direction (positive association), this is a directional hypothesis. Note that as we do specify the direction of the association / correlation that this is a directional hypothesis

Choosing the Correct Type of Correlation

There are two main types of correlation:

  • Pearson’s correlation – used when assumptions are met
  • Spearman’s correlation – used when assumptions are violated

For this workshop, we will use Pearson’s correlation, provided its assumptions are satisfied.

Assumptions of Pearson’s Correlation

Before running the correlation, we must check the following:

  1. Both variables are continuous (interval or ratio scale)

Interval data = Data that is measured on a numeric scale with equal distance between adjacent values, that does not have a true zero. This is a very common data type in research (e.g. Test scores, IQ etc).

Ratio data = Data that is measured on a numeric scale with equal distance between adjacent values, that has a true zero (e.g. Height, Weight etc).

Here we know our two variables are outcomes on a wellbeing test, as such they are interval data and this assumption has been met

  1. Each participant has a score on both variables (no problematic missing data)

A quick way to check this assumption is to see if there is any missing data. If there isn’t then this assumption is met. Note that number of missing data points is one of the values you automatically get when using the descriptives function.

  1. Both variables are approximately normally distributed
  2. The relationship between the variables is linear
  3. The data are homoscedastic

Checking Our Assumptions

1. Continuous Variables

Both depression and anxiety are measured using psychological scales. These are treated as interval-level data, meaning the distance between values is equal. This assumption is therefore met.

2. Missing Data

A quick way to check for missing data is via descriptive statistics. If there are no missing values for either variable, this assumption is satisfied.

descriptives(df_sleep, # our data
             vars = c("Depression", "Anxiety")# our two variables
)

 DESCRIPTIVES

 Descriptives                                     
 ──────────────────────────────────────────────── 
                         Depression    Anxiety    
 ──────────────────────────────────────────────── 
   N                            150         150   
   Missing                        0           0   
   Mean                    25.05236    50.30168   
   Median                  25.14149    50.21348   
   Standard deviation      6.173965    12.34318   
   Minimum                 8.203621    16.32001   
   Maximum                 40.04161    78.15740   
 ──────────────────────────────────────────────── 

You can see that no participants are identified as “missing”.

3. Normality

Normality can be assessed visually (histogram and Q–Q plot) and statistically (Shapiro–Wilk test).

We can obtain descriptive statistics and check normality simultaneously:

descriptives(df_sleep, # our data
             vars = c("Depression", "Anxiety"), # our two variables
             hist = TRUE, # this generates a histogram
             dens = TRUE, #draws a density plot over our histogram
             sw = TRUE, # this runs shapiro-wilks test
             qq=TRUE) # this generates a qq plot 

 DESCRIPTIVES

 Descriptives                                      
 ───────────────────────────────────────────────── 
                         Depression    Anxiety     
 ───────────────────────────────────────────────── 
   N                            150          150   
   Missing                        0            0   
   Mean                    25.05236     50.30168   
   Median                  25.14149     50.21348   
   Standard deviation      6.173965     12.34318   
   Minimum                 8.203621     16.32001   
   Maximum                 40.04161     78.15740   
   Shapiro-Wilk W         0.9942665    0.9914144   
   Shapiro-Wilk p         0.8206004    0.5009269   
 ───────────────────────────────────────────────── 

For the histograms, we are looking for a roughly “bell-shaped” distribution. It does not need to be perfect. Real data almost never are. What we are trying to avoid is:

  • Strong skew (lots of values piled up on one side)
  • Very long tails
  • Obvious extreme outliers

For the Q-Q plot, here’s what to look for:

  • If the dots fall roughly along the straight diagonal line → the data are approximately normal.
  • If the dots form a clear curve (like an S-shape) → there may be skew.
  • If the ends of the plot bend sharply away from the line → there may be heavy tails or outliers.

For both depression and anxiety, the points closely follow the diagonal line, with only very small deviations at the extremes. This suggests the distributions are very close to normal.

For the Shapiro-Wilk’s test, it gives us a p-value. That p-value tells us whether the data significantly deviates from a normal bell-shaped distribution.

  • If p > .05 → we do not have evidence that the data deviate from normality.
  • If p < .05 → the test detects a deviation.

Our result show that for both depression (W = 0.99, p = .82) and anxiety (W = 0.99, p = .50), both p-values are well above .05, which means there is no statistical evidence that either variable violates normality.

Looking at the histograms, the Q–Q plots, and the Shapiro–Wilk tests together, both depression and anxiety appear to be approximately normally distributed.

This means the normality assumption for Pearson’s correlation has been met, and we can proceed with the analysis.

4. Linearity

Pearson’s correlation assumes that the relationship between variables is linear.

We can examine this using a scatterplot:

plot(df_sleep$Depression, df_sleep$Anxiety)

The points should form a roughly straight-line pattern rather than a curved pattern. Which they do, so we have linearity.

e. Homoscedasticity

Homoscedasticity means that the variability of one variable is similar across levels of the other variable.

There are a number of ways to check this assumption: visually via plotting of the residuals (which we will cover in Week 7 for Regression), or via a test. Here we will use a Breusch-Pagan test (1979) for heteroscedasticity.

To do this we will create a linear model, and then use the function check_heteroscasticity from the performance package.

If we get a non-significant p-value then the data is homoscedastic and our assumption has been met.

model <- lm(Depression ~ Anxiety, data = df_sleep) # here we are creating an object "model" which contatins a linear model of our DV ~ IV 
check_heteroscedasticity(model) # this function performs a Breusch-Pagan test for our model
OK: Error variance appears to be homoscedastic (p = 0.771).

We can see that our data does display homoscedasticity. We have met all our assumptions now.

As such we can use a Pearson correlation to test our hypotheses.

Activity 5: Running the Simple Correlation

Now that we have checked our assumptions, we can run the correlation.

We use the cor.test() function in R. The basic structure and syntax of the code is:

#REMINDER: This is SYNTAX/PLACEHOLDER CODE - it will not work unless you substitute the values for your dataset

cor.test(
  DataframeName$Variable1,
  DataframeName$Variable2,
  method = c("pearson", "spearman"), #we choose just 1, e.g. method = "pearson", method = "spearman"
  alternative = c("two.sided", "less", "greater") #we choose one, e.g. alternative = "two.sided" or alternative = "less" or alternative = "greater"
)

What do these arguments mean?

  • method = "pearson" → used when assumptions are met.

  • method = "spearman" → used if assumptions are violated.

  • alternative controls whether the test is:

    • “two.sided” → testing for any relationship (positive or negative)

    • “greater” → testing for a positive relationship

    • “less” → testing for a negative relationship

Because our hypothesis predicts a positive relationship, we will run a one-sided test using “greater”.

As we have a directional hypothesis where we expect a positive correlation we will run a one-sided correlation. Let’s run this correlation for depression and anxiety on our df_sleep dataset:

cor.test(
  df_sleep$Depression,
  df_sleep$Anxiety,
  method = "pearson",
  alternative = "greater"
)

    Pearson's product-moment correlation

data:  df_sleep$Depression and df_sleep$Anxiety
t = 56.595, df = 148, p-value < 0.00000000000000022
alternative hypothesis: true correlation is greater than 0
95 percent confidence interval:
 0.9708085 1.0000000
sample estimates:
     cor 
0.977668 

The correlation results show a significant correlation between the depression and anxiety scores.

Here’s how we might write up the results in APA style:

A Pearson’s correlation was conducted to assess the relationship between depression (M= 25.05, SD= 6.17) and anxiety scores (M= 50.3, SD= 12.34). There was a significant large positive correlation between depression and anxiety, r(148) = 0.98, p < 0.01. This indicates that higher levels of depression are associated with higher levels of anxiety. As such we can reject the null hypothesis.

Plotting our Results

It is always a good idea to represent our results visually. For a simple correlation, this would mean a scatterplot. while we already generated a scatterplot earlier, we can make it more informative and nicer to look at. For example, the following code will generate a scatterplot between Depression and Anxiety, with a title, appropriate labels for the x and y-axis, and a regression line of best fit.

plot(
  x = df_sleep$Depression,
  y = df_sleep$Anxiety,
  main = "Scatterplot of Depression and Anxiety",
  xlab = "Depression",
  ylab = "Anxiety"
)

abline(lm(Anxiety ~ Depression, data = df_sleep), col = "blue", lwd = 2)

What each argument in plot() does

  • x = df_sleep$Depression → This sets the variable on the horizontal (x) axis.

  • y = df_sleep$Anxiety → This sets the variable on the vertical (y) axis.

  • main = → This controls the title of the graph.

  • xlab = and ylab = → These control the axis labels.

If you want to change variables, simply replace the x and y variables with different variables.

For example:

plot(
  x = df_sleep$SleepDifficulty,
  y = df_sleep$Depression,
  main = "Scatterplot of Sleep Difficulty and Depression",
  xlab = "Sleep Difficulty",
  ylab = "Depression"
)

What does the code abline(lm(Depression ~ SleepDifficulty, data = df_sleep), col = "blue", lwd = 2) do?

This adds a straight line showing the best linear fit, where

  • lm(Anxiety ~ Depression, data = df_sleep) → This creates a linear model predicting Anxiety from Depression.

  • abline() → Draws the regression line from that model onto the existing plot.

  • col = "blue" → Changes the colour of the line.

  • lwd = 2 → Controls the thickness of the line (higher = thicker).

If you change the variables in your plot, you must also change them here to match.

For example:

plot(
  x = df_sleep$SleepDifficulty,
  y = df_sleep$Depression,
  main = "Scatterplot of Sleep Difficulty and Depression",
  xlab = "Sleep Difficulty",
  ylab = "Depression"
)

abline(lm(Depression ~ SleepDifficulty, data = df_sleep), col = "red", lwd = 3)

Multiple Correlations

In the previous activity, we examined the relationship between two variables (depression and anxiety).

However, we are often interested in understanding how several variables relate to each other at the same time.

In this dataset, we have four continuous variables:

  • Depression
  • Anxiety
  • SleepDifficulty
  • Optimism

Hypotheses

Set 1: The Relationship with Sleep Difficulty with Depression and Anxiety

H₀: There is no significant correlation between sleep difficulty with depression, and anxiety.

H₁: Higher sleep difficulty will be associated with higher depression and anxiety.

Set 2: The Relationship with Optimism with Depression and Anxiety

H₀: There is no significant correlation between optimism with depression and anxiety.

H₁: Higher optimism will be associated with lower depression and anxiety

Instead of running multiple separate correlations one at a time, we can examine all relationships simultaneously using a multiple correlation matrix.

Why Use a Multiple Correlation?

A multiple correlation allows us to:

  • Examine all pairwise relationships between several continuous variables
  • See the direction (positive or negative)
  • See the strength of each relationship
  • Adjust p-values for multiple comparisons (e.g., Bonferroni correction)

The logic is the same as before, as we are still testing associations between continuous variables.

Activity 6: Descriptive statistics and assumption checking

Before running the multiple correlation, we must ensure that all four variables meet the assumptions of Pearson’s correlation.

We already checked depression and anxiety earlier. Now we must also check:

  • SleepDifficulty
  • Optimism

Assumptions to Check

For each variable:

  1. The variable is continuous
  2. There is no problematic missing data
  3. The variable is approximately normally distributed
  4. The relationships appear linear
  5. Homoscedasticity is reasonable

Using the same steps as for the simple correlation, check the assumptions for the variables SleepDifficulty and Optimism.

descriptives(df_sleep, # our data
             vars = c("Depression", "Anxiety", "SleepDifficulty", "Optimism"), # our four variables
             hist = TRUE, # this generates a histogram
             dens = TRUE,
             sw = TRUE, # this runs shapiro-wilks test
             qq=TRUE) # this generates a qq plot 

 DESCRIPTIVES

 Descriptives                                                                      
 ───────────────────────────────────────────────────────────────────────────────── 
                         Depression    Anxiety      SleepDifficulty    Optimism    
 ───────────────────────────────────────────────────────────────────────────────── 
   N                            150          150                150          150   
   Missing                        0            0                  0            0   
   Mean                    25.05236     50.30168           12.37026     10.57763   
   Median                  25.14149     50.21348           12.37819     10.76353   
   Standard deviation      6.173965     12.34318           2.736946     3.284548   
   Minimum                 8.203621     16.32001           6.599527    0.4951533   
   Maximum                 40.04161     78.15740           18.73516     18.83997   
   Shapiro-Wilk W         0.9942665    0.9914144          0.9846297    0.9912025   
   Shapiro-Wilk p         0.8206004    0.5009269          0.0937667    0.4788844   
 ───────────────────────────────────────────────────────────────────────────────── 

Tip

If you want to try the pairs.panel function it enables you to check the visual assumptions for all the variables at once.

pairs.panels(df_sleep)

The pairs.panels() output shows:

  • Histograms on the diagonal (distribution of each variable)
  • Correlation coefficients (i.e. strength and direction of the relationship) in the upper right triangle
  • Scatterplots with regression lines in the lower left triangle

When interpreting the scatterplots: - Look for a straight-line pattern (linearity) - Check that no extreme outliers dominate the relationship - Observe the direction of the red line (positive or negative)

The red ellipses in the scatterplots indicate strength: - Narrow diagonal ellipse = strong relationship - Wide circular shape = weak relationship

Activity 7: Running the multiple correlation

Now that we have checked the assumptions for all four variables, we can run a multiple correlation.

Instead of testing one pair of variables at a time, we will examine all pairwise relationships simultaneously.

We use the correlation() function from the correlation package.

# Remember you need to edit the specific names/variables below to make it work for our data and needs

correlation(DataframeName, # our data
            select = c("Variable1", "Variable2", "Variable3", "Variable4"), # our variables
             method = c("pearson", "spearman"), #we choose just 1, e.g. method = "pearson", method = "spearman"
            p_adjust = "bonferroni") # our bonferroni adjustment for multiple comparisons - see the associated textbook chapter for a discussion of this

# Note that we do not specify the direction of our predicted correlation here as some may be positive and others may be negative

What do these arguments mean?

  • select → specifies which variables to include.

  • method = "pearson" → used when assumptions are met.

  • method = "spearman" → used when assumptions are not met.

  • p_adjust = "bonferroni" → adjusts p-values to control for inflated Type I error when running multiple comparisons.

When we run multiple correlations at once, we increase the risk of false positives. The Bonferroni correction makes the test more conservative to reduce this risk.

Note that we do not specify "greater" or "less" here, because some relationships may be positive and others negative. Therefore, this is effectively a two-sided test for each pair.

The following code runs the multiple correlation and saves it to the variable mcor. We are saving this information to a variable as we will need it later to generate a graph.

# we are saving the results of our correlation as an object called "mcor"
mcor <- correlation(df_sleep, # our data
            select = c("Depression", "Anxiety", "SleepDifficulty", "Optimism"), # our variables
             method = "pearson",
            p_adjust = "bonferroni") # our bonferroni adjustment for multiple comparisons
mcor
# Correlation Matrix (pearson-method)

Parameter1      |      Parameter2 |     r |         95% CI | t(148) |         p
-------------------------------------------------------------------------------
Depression      |         Anxiety |  0.98 | [ 0.97,  0.98] |  56.60 | < .001***
Depression      | SleepDifficulty |  0.90 | [ 0.87,  0.93] |  25.65 | < .001***
Depression      |        Optimism | -0.16 | [-0.31,  0.00] |  -1.98 | 0.298    
Anxiety         | SleepDifficulty |  0.88 | [ 0.84,  0.91] |  22.47 | < .001***
Anxiety         |        Optimism | -0.14 | [-0.29,  0.02] |  -1.73 | 0.515    
SleepDifficulty |        Optimism | -0.10 | [-0.25,  0.06] |  -1.18 | > .999   

p-value adjustment method: Bonferroni
Observations: 150

Interpreting the Output

The output provides:

  • The correlation coefficient (r)

  • The p-value (adjusted using Bonferroni)

  • Confidence intervals

Key Findings

  • Depression and anxiety were very strongly positively correlated, r(148) = .98. p < .01.

  • Depression and sleep difficulty were strongly positively correlated, r(148) = .90, p < .01.

  • Anxiety and sleep difficulty were also strongly positively correlated, r(148) = .88, p < .01.

These indicate that higher sleep difficulty is associated with higher depression and anxiety, and that depression and anxiety are closely related.

In contrast:

  • Depression and optimism were weakly negatively correlated, r(148) = −.16, p = .30.

  • Anxiety and optimism were weakly negatively correlated, r(148) = −.14, p = .52.

  • Sleep difficulty and optimism were weakly negatively correlated, r(148) = −.10, p > .99.

These weak negative correlations were not statistically significant after Bonferroni correction.

How we might write up the results in APA style?

A Pearson’s multiple correlation test with bonferroni correction was conducted to assess the relationship between depression (M= 24.26, SD= 5.89), anxiety scores (M= 48.62, SD= 12.48), sleep difficulty (M= 12.37, SD= 2.74), and optimism (M= 10.58, SD= 3.28). The test showed that there was a significant positive correlation between sleep difficulty and depression, r(148) = 0.90, p < 0.01, and sleep difficulty and anxiety, r(148) = 0.88, p < 0.01. This result indicates that people experiencing poorer sleep quality are more likely to experience depression and anxiety. The test also showed that there was a significant positive correlation between depression and anxiety r(148) = 0.98, p < 0.01. Therefore one is able to reject the null hypothesis that there is no significant correlation between sleep difficulty with depression and anxiety.

It was also hypothesised that higher optimism will be associated with lower depression and anxiety. The test showed that there was no significant correlation between optimism and anxiety r(148) = -0.14, p = 0.52 or optimism and sleep difficulty r(148) = -0.1, p = >.99. Therefore, one is unable to reject the null hypothesis.

Activity 8: Graphs

When analysing multiple correlations, reporting every correlation in text can quickly become overwhelming. A correlation matrix provides a clear visual summary of:

  • The direction of each relationship (positive or negative)
  • The strength of each relationship
  • The overall pattern across variables

Today we will use the ggcorrplot() function to visualise the correlation matrix we created in Activity 7.

ggcorrplot(
  mcor,
  colors = c("red", "white", "blue")
)

What Does This Show?

  • Blue cells represent positive correlations.

  • Red cells represent negative correlations.

  • Darker colours indicate stronger relationships.

  • Lighter colours indicate weaker relationships.

This makes it much easier to see patterns across all variables at once.

Customising the Graph

The ggcorrplot() function has many optional arguments. If you type in ?ggcorrplot to your console you can see there are many optional arguments. Try make the following changes to your graph

  1. Add a meaningful title.
  2. Try changing the display type lower and upper, see what looks best.
  3. Display the correlation values (lab = TRUE).
  4. Change the colours of the correlations
  5. Change the method to “circle” instead of squares.
  6. Try some of the arguments, see what you think looks best
ggcorrplot(
  mcor,
  type = "lower",
  method = "circle",
  lab = TRUE,
  title = "Correlation Matrix of Psychological Variables",
  colors = c("red", "white", "green")
)

Saving the graph using `ggsave()

Rather than exporting manually from the Plots panel, it is better practice to save graphs directly using code. This improves reproducibility.

First, assign the plot to a variable name, then use ggsave() to export it.

corr_plot <- ggcorrplot(
  mcor,
  type = "lower",
  method = "circle",
  lab = TRUE,
  title = "Correlation Matrix of Psychological Variables",
  colors = c("red", "white", "blue")
)

ggsave(
  filename = "correlation_matrix.png",
  plot = corr_plot,
  width = 8,
  height = 6,
  dpi = 300
)

How ggsave() Works

  • filename → the name of the file to save.

  • plot → the name of the plot variable you want to save.

  • width and height → the size of the image (in inches by default).

  • dpi → resolution (300 is standard for reports and publications).

The file will be saved to your working directory unless you specify a different file path.

Saving the graph using RStudio functions

You may also want to save your graphs outside of R (e.g. to use in a report or paper). There are a few ways to do this.

  1. To simply export from the “Plots” panel in R (as below).

Activity 9: Power analyses!

Here we are going to learn about how to conduct a power analysis for a correlation.

As you may recall there are some key pieces of information we need for a power analysis, and some specifics that we need for a correlation:

  1. Alpha level (typically 0.05 in Psychology and the social sciences)

  2. The minimum correlation size of interest

  3. Our desired power

  4. If our test is one or two-tailed (i.e. do we have a directional or nondirectional hypothesis)

Reminder that this interactive visualization can be helpful in understanding how these things interact.

This is a good question! Correlation (r) values can be split into arbitrary bands as shown below (Cohen, 1988):

  • Small Effect (r = ~ 0.1)

  • Medium Effect (r = ~ 0.3)

  • Large Effect (r = ~ 0.5)

As we are aware it is easier to detect large effects. As such, if we really have no idea what effect size we should expect then we should power for small effects.

If however we are conducting a replication, or a very similar study to one that has been already done, then we can power for the correlation they report.

Again I refer to: A useful paper if you’re interested in learning more

Power analysis for a simple correlation

The syntax for conducting an apriori statistical power analysis for a simple correlation is the following:

# Conduct power analysis for a simple correlation
pwr.r.test(r = 0.2, # your expected correlation value
           sig.level = 0.05, # Significance level
           power = 0.80, # Desired power level
alternative = "two.sided") # Indicates a two-tailed test, #can be changed to less or greater

     approximate correlation power calculation (arctangh transformation) 

              n = 193.0867
              r = 0.2
      sig.level = 0.05
          power = 0.8
    alternative = two.sided
  1. Try running the above power-analyses again but for a one sided (directional) test. What does this do to our required sample size?

  2. What happens if we increase the correlation value we expect to find?

Activity 10: Choose and Run the Correct Correlation Analysis (Try on your own!)

In this final activity, you will be given a research scenario and a new dataset.

Your task is to:

  1. Decide what type of correlation analysis is appropriate.
  2. Carry out the full analysis following the same steps used earlier in this workshop.
  3. Write up your results clearly.

Research Scenario

A researcher is interested in understanding how different psychological factors relate to academic stress in university students.

They collected data from 180 students on the following variables:

  • Stress – overall academic stress score (higher = more stress)
  • SleepQuality – overall sleep quality score (higher = better sleep)
  • Optimism – optimism score (higher = more optimistic)
  • SocialMediaUse – average minutes per day spent on social media

The research question is:

How are sleep quality, optimism, and social media use related to academic stress?

You’ll find the dataset called student_stress.csv on Canvas for this week.

Step 1: Decide the correct analysis

Before running anything, decide:

  • Should you run a simple correlation or a multiple correlation?

Write one sentence explaining your decision.

Step 2: Load and inspect the dataset

  • Load student_stress.csv and save it as df_stress.
  • Inspect the datase to make sure you have loaded it correctly.

Step 3: Descriptive statistics

Run descriptive statistics for all relevant variables.

You should be able to report:

  • Sample size
  • Means
  • Standard deviations
  • Whether there is missing data

Step 4: Check assumptions

Using the same procedure as earlier in the workshop, evaluate whether Pearson’s correlation is appropriate.

Check:

  1. Are the variables continuous?
  2. Are the distributions approximately normal?
    • Histograms
    • Q–Q plots
    • Shapiro–Wilk tests
  3. Are the relationships linear?
    • Scatterplots or pairs panel
  4. Is homoscedasticity displayed?

If assumptions are not met, make a note of it and run Spearman’s correlation.

Step 5: Run the analysis

Run the appropriate correlation analysis based on your decision in Step 1.

If you are running multiple correlations:

  • Include all relevant variables.
  • Apply a correction for multiple comparisons (e.g., Bonferroni).

Step 6: Interpret the results

For each relevant relationship, interpret:

  • The direction (positive or negative)
  • The strength (weak, moderate, strong)
  • The statistical significance

Then answer the research question clearly in plain English.

Step 7: Create a visualisation

Create one clear figure that would help someone understand the results:

  • A scatterplot (if simple correlation), or
  • A correlation matrix (if multiple correlation).

Step 8: Write up your results

Write a short APA-style results paragraph including:

  • Descriptive statistics
  • The test conducted
  • Correlation coefficient(s)
  • p-value(s)
  • A clear conclusion linked back to the research question.