Week 7 Activities (Regression in R)

In this week’s workshop, we will learn how to conduct both simple and multiple regression in R.

Specifically, you will:

  • run a simple linear regression and a multiple regression,

  • check the assumptions of each model,

  • calculate and interpret effect sizes (Cohen’s f²), and

  • visualise your regression results.

Regression allows us to examine whether one (or more) variables significantly predict an outcome variable.

Let’s Get Set Up

Let’s begin by ensuring your working environment is ready for today’s session. Open RStudio or Posit Cloud and complete the following tasks to set everything up.

Activity 1: Set Up Your Working Directory & R Script for this week

One of the first steps in each of these workshops is setting up your *working directory*. The working directory is the default folder where R will look to import files or save any files you export.

If you don’t set the working directory, R might not be able to locate the files you need (e.g., when importing a dataset) or you might not know where your exported files have been saved. Setting the working directory beforehand ensures that everything is in the right place and avoids these issues.

  1. Click:
    Session → Set Working Directory → Choose Directory

  2. Navigate to the folder you created for this course (this should be the same folder you used for previous workshops).

  3. Create a new folder called week7 inside this directory.

  4. Select the week7 folder and click Open.

Don’t forget to verify your working directory before we get started.

You can always check your current working directory by typing in the following command in the console:

getwd()
[1] "C:/Users/0131045S/Desktop/R/rintro/activities/week7"
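If you prefer to set the working directory in code rather than through the menus, you can use setwd(). The path below is just an example taken from the output above; replace it with the path to your own week7 folder.

```r
# Set the working directory in code (example path - use your own week7 folder)
setwd("C:/Users/0131045S/Desktop/R/rintro/activities/week7")

# Confirm it worked
getwd()
```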

As in previous weeks, we will create an R script that we can use for today’s activities. This week we will call our script 07-regression.

  1. Go to the menu bar and select:
    File → New File → R Script

    This will open an untitled R script.

  2. To save and name your script, select:

File → Save As, then enter the name:

    07-regression

    Click Save

At the top of your R script type in and run the following code (this turns off scientific notation from your output):

options(scipen = 999)

Activity 2: Installing / Loading R Packages

We’ll use several R packages to make our analysis easier.

REMEMBER: If you encounter an error message like “Error in library(package name): there is no package called ‘package name’”, you’ll need to install the package first by running the following in your console:

install.packages("package name") # replace "package name" with the actual package name

Here are the packages we will be using today:

library(jmv) # this will help us run descriptive statistics 
library(car) #this will help to run our regression
library(see) # this is needed to check our assumptions
library(performance) # this is needed to check our assumptions
library(ggplot2) # we use this for our graphs
library(ggiraph) # we also use this for our graphs
library(ggiraphExtra) # once again.. graphs

Activity 3: Load in your datasets

This week we will use a database of movies, with variables on a film’s budget, runtime, popularity rating, genre, how much revenue it made, and its vote rating.

- movies.csv our dataset for our linear regression

- genres.csv our dataset for our multiple regression, this is a subset of the first dataset that only includes Action and Romance films

Download these files onto your computer and move them to your week7 folder.

Once this is done, load the datasets into R and save them as dataframes called “df_movies” and “df_genres”:

df_movies <- read.csv("movies.csv")
df_genres <- read.csv("genres.csv")

Linear Regression

Today we are going to analyse data from a database of movies, with variables on a film’s budget, runtime, popularity rating, genre, how much revenue it made, and its vote rating. Please note that the values for some of our measures have been normalized.

After loading the datasets, it’s always good practice to inspect them before doing any analyses. You can use the head() function to get an overview of the movies dataset.

head(df_movies)
     budget  revenue
1  24.43952 113.5532
2  41.16180 192.4267
3 108.15756 304.7547
4  53.95969 167.3981
5  64.55049 239.5998
6 148.82935 440.9591

Activity 4: Running a Simple Linear Regression

Let’s imagine we’re interested in investigating whether a film’s budget (how much was spent to make it) predicts its revenue (how much money it earned at the box office).

In this case:

  • Our predictor variable is: budget

  • Our outcome measure is: revenue

We could specify our hypothesis as such:

H₀: Budget does not significantly predict revenue.

H₁: Budget significantly predicts revenue.

As we are interested in whether one continuous variable predicts another continuous variable, this would be best addressed via a simple linear regression.

In contrast to previous tests, in R we run our regression before we check our assumptions, because the assumption checks are based on the fitted model’s residuals.

Running the Linear Regression

In R, regression is run using the lm() function.

# remember this is placeholder code; replace "Outcome", "Predictor", and "OurData" with the corresponding variables in your dataset

LR <- lm(Outcome ~ Predictor, data = OurData) # here we are creating an object "LR" which contains a linear model of our DV ~ IV 

Let’s run this linear regression to find out if budget significantly predicts revenue in our df_movies dataset:

LR <- lm(revenue ~ budget, data = df_movies)

To review the results of our linear regression we use the summary() function on the object LR we just created. We are going to save this as an object LR_summary, which will enable us to:

  1. Check the results by typing out and entering the variable LR_summary
  2. Use this object later on when we calculate our effect size
LR_summary <- summary(LR)

LR_summary

Call:
lm(formula = revenue ~ budget, data = df_movies)

Residuals:
     Min       1Q   Median       3Q      Max 
-115.921  -25.865   -1.969   27.229  125.937 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 57.08494    4.18869   13.63 <0.0000000000000002 ***
budget       2.41986    0.04998   48.41 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.25 on 798 degrees of freedom
Multiple R-squared:  0.746, Adjusted R-squared:  0.7457 
F-statistic:  2344 on 1 and 798 DF,  p-value: < 0.00000000000000022

What do these results mean?

The overall regression model was statistically significant:

  • F(1, 798) = 2344

  • p < .001

  • R² = .746

  • Adjusted R² = .746

This means that budget explains approximately 74.6% of the variance in revenue.

That is a very large proportion of variance explained for a single predictor, indicating that budget is a very strong predictor of box office revenue in this dataset.
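Since LR_summary is an ordinary R list, you can also extract these numbers directly rather than reading them off the printed output. For example:

```r
# Pull individual statistics out of the summary object
LR_summary$r.squared      # Multiple R-squared
LR_summary$adj.r.squared  # Adjusted R-squared (we use this later for the effect size)
LR_summary$fstatistic     # F value with its numerator and denominator df
LR_summary$coefficients   # table of estimates, SEs, t values and p values
```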

Interpreting the Coefficients

Intercept (57.08)

The intercept represents the predicted revenue when budget equals zero.

In practical terms, this would correspond to a film with a budget of $0 million, which is not realistic in practice, but mathematically it anchors the regression line.

The intercept is statistically significant (p < .001), although interpretation typically focuses more on the slope.

Budget (2.42)

The regression coefficient for budget is 2.42.

This means:

For every additional $1 million increase in budget, predicted revenue increases by approximately $2.42 million, on average.

This effect is statistically significant:

  • t = 48.41

  • p < .001

The very large t-value indicates a very strong relationship between budget and revenue.
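One way to see the slope in action is to generate predictions from the model with predict(). The budget values of 50 and 51 below are purely illustrative:

```r
# Predicted revenue for hypothetical budgets of $50M and $51M
predict(LR, newdata = data.frame(budget = c(50, 51)))
# because the model is linear, the two predictions differ by exactly
# the slope (approximately 2.42)
```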

Residual Standard Error

Residual standard error = 41.25

This indicates that, on average, the model’s predictions differ from observed revenue values by approximately $41 million.

Given that the mean revenue is about $247 million, this level of prediction error is reasonable.

Activity 5: Descriptive Statistics and Assumptions

Descriptive Statistics

We will get the descriptive statistics needed for our write-up, using the descriptives function.

descriptives(df_movies,
             vars = c("budget", "revenue"))

 DESCRIPTIVES

 Descriptives                                   
 ────────────────────────────────────────────── 
                         budget      revenue    
 ────────────────────────────────────────────── 
   N                          800         800   
   Missing                      0           0   
   Mean                  78.55675    247.1816   
   Median                79.02518    243.6045   
   Standard deviation    29.19317    81.79052   
   Minimum               5.945634    26.64607   
   Maximum               180.3367    475.0297   
 ────────────────────────────────────────────── 

Simple Linear Regression Assumptions

There are several key assumptions for conducting a linear regression. Conveniently, you will already be familiar with most of these, but there is one assumption we have not seen before:

a. The outcome / DV is continuous, and is either interval or ratio.

Interval data = Data that is measured on a numeric scale with equal distance between adjacent values, that does not have a true zero. This is a very common data type in research (e.g. Test scores, IQ etc).

Ratio data = Data that is measured on a numeric scale with equal distance between adjacent values, that has a true zero (e.g. Height, Weight etc).

Here we know our outcome is revenue (measured as money), which does have a true zero. As such it is ratio data, and this assumption has been met.

b. The predictor variable is interval or ratio or categorical (with two levels)

Here we know our predictor is budget (measured as money), which does have a true zero. As such it is ratio data, and this assumption has been met.

c. All values of the outcome variable are independent (i.e., each score should come from a different observation - participants, or in this case movie)

Each row in the dataset represents a different film. Therefore, each revenue value comes from a different observation. By inspecting our data, we can see that this assumption is met.

d. The relationship between outcome and predictor is linear

The relationship between the predictor and outcome should be linear (i.e., describable by a straight line). We will check this momentarily.

e. The residuals should be normally distributed

Regression assumes that the residuals (the differences between predicted and actual values) are normally distributed, not the raw variables themselves. Again, we will check this momentarily.

f. The assumption of homoscedasticity.

As we encountered with correlation, the residuals should have roughly uniform variance across different values of our predictor and outcome variables. We’ll check this next!

Checking Assumptions d-f:

These assumptions may all be checked visually for a regression, and conveniently using the function check_model.

check_model(LR)

If you cannot see your plots, try changing the size of the plots pane so it takes up a larger portion of your screen.

Alternatively, you can edit the code so that each plot is printed individually:

out <- plot(check_model(LR, panel = FALSE))
out

This prints each diagnostic plot in turn ($PP_CHECK, $NCV, $HOMOGENEITY, $OUTLIERS, and $QQ). You may also see a message asking you to install the qqplotr package for confidence bands; the plots will still display without it.

Interpreting Each Plot

Posterior Predictive Check

The density of the model-predicted revenue closely matches the density of the observed revenue.

The two curves overlap very closely, suggesting the model is capturing the overall distribution of revenue well.

This indicates good overall model fit.

Linearity (Residuals vs Fitted)

In the residuals vs fitted plot:

  • The points are randomly scattered.

  • The reference line is approximately flat and horizontal.

  • There is no clear curve or systematic pattern.

  • This suggests that the assumption of linearity is met. The relationship between budget and revenue can reasonably be described using a straight line.

Homoscedasticity (Scale–Location Plot)

In the homogeneity of variance plot:

  • The spread of residuals appears fairly consistent across fitted values.

  • There is no clear funnel shape.

  • Variance does not dramatically increase or decrease at higher predicted revenues.

  • This suggests the assumption of homoscedasticity is reasonably satisfied.

There may be very slight variation at the extreme high fitted values, but nothing concerning.

Influential Observations

In the leverage plot:

  • Nearly all points fall within the contour lines.

  • There are no extreme leverage points outside the boundaries.

    This suggests there are no highly influential observations disproportionately affecting the model.

Normality of Residuals (QQ Plot)

In the QQ plot:

  • Most points fall close to the reference line.

  • There are only minor deviations at the tails.

Given the large sample size (n = 800), small deviations at the extremes are not problematic.

The residuals appear approximately normally distributed.

g. The predictors have non-zero variance

This is a new assumption, but it just means that the predictor must vary across observations. If every film had the same budget, regression here would be meaningless.

We can visualise this using a scatterplot. We should see spread in both variables and a roughly linear trend.

plot(x=df_movies$budget,y=df_movies$revenue)

abline(lm(revenue ~ budget, data = df_movies), col = "blue", lwd = 2)

The scatterplot shows a clear positive linear relationship between budget and revenue. As budget increases, revenue increases in a roughly straight-line pattern.

There is some spread around the line (which is expected in real data), but the relationship is clearly linear and strong. There is no obvious curvature, clustering, or extreme outliers distorting the relationship.

This visually supports the regression result.

Effect sizes!

In regression, we often report Cohen’s f² as an effect size.

We calculate f² using adjusted R²:

f2 <- LR_summary$adj.r.squared/(1 - LR_summary$adj.r.squared)

f2
[1] 2.932069

This represents an extremely large effect size, indicating that budget is a very strong predictor of revenue in this dataset.

Because f² values above .35 are already considered large, a value of 2.93 suggests the predictor accounts for a very substantial proportion of variance.

Realistically, we are very unlikely to see this in real research!

Interpreting Cohen’s f²

There are several rules of thumb for interpretation of effect size. For f2 they are as follows:

Small = ~ 0.02

Medium = ~0.15

Large = ~0.35

Here’s how we might write up the results in APA style:

A simple linear regression was conducted to examine whether film budget predicted box office revenue. Budget (M = 78.56, SD = 29.19) and revenue (M = 247.18, SD = 81.79) were measured in millions of dollars (N = 800).

The overall regression model was statistically significant, F(1, 798) = 2344.00, p < .001, explaining 74.6% of the variance in revenue (Adjusted R² = .75). The effect size was extremely large (f² = 2.93). Budget was a significant positive predictor of revenue (β = 2.42, t = 48.41, p < .001), indicating that for every additional $1 million increase in budget, revenue increased by approximately $2.42 million on average.

These findings suggest that higher film budgets are strongly associated with higher box office revenue. Therefore, one can reject the null hypothesis.

Multiple Regression

Sometimes we’re interested in the impact of multiple predictors on an outcome variable.

In addition to our earlier prediction regarding budget and revenue we could also predict:

1) that a movie’s genre will predict its revenue.

In this case:

Our predictor variables are: budget, and genres

Our outcome measure is: revenue

As we are interested in the impact of two predictor variables on a continuous outcome variable this would be best addressed via a multiple regression.

A lot of the steps are very similar to a simple linear regression, so we can refer back to the sections above if we get unsure. As before, because the diagnostic checks are based on the fitted model, we run the regression before checking our assumptions.

Activity 6: Running the Multiple Regression

Once again we use the lm function to perform the regression. The syntax is:

MR <- lm(revenue ~ budget + genres, data = df_genres)


This model tests:

  • Whether budget predicts revenue

  • Whether genre predicts revenue

  • Whether each predictor explains unique variance in revenue when the other predictor is held constant

We then examine the results:

Once again we can use the summary() function to review the results of our multiple regression. We are going to save this as an object “MR_summary” which will enable us to:

  1. Check the results by typing MR_summary into the console
  2. Use this object later on when we calculate our effect size
MR_summary <- summary(MR)
MR_summary

Call:
lm(formula = revenue ~ budget + genres, data = df_genres)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0085 -0.2579  0.1101  0.3445  1.2325 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   7.635517   0.046310 164.879 <0.0000000000000002 ***
budget        0.723058   0.053843  13.429 <0.0000000000000002 ***
genresRomance 0.003978   0.063669   0.062                0.95    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5524 on 330 degrees of freedom
Multiple R-squared:  0.3718,    Adjusted R-squared:  0.368 
F-statistic: 97.65 on 2 and 330 DF,  p-value: < 0.00000000000000022

Interpretation of the Multiple Regression

Overall Model

The regression model was statistically significant:

F(2, 330) = 97.65, p < .001

This means that, taken together, budget and genre significantly predict revenue.

The model explains:

R² = 0.372

Adjusted R² = 0.368

So approximately 37% of the variance in revenue is explained by budget and genre combined. That is a substantial proportion.

Individual Predictors

Budget

Budget was a significant predictor:

β = 0.723
t = 13.43
p < .001

This means that, holding genre constant, higher budgets are associated with higher revenue.

Specifically, for every 1-unit increase in budget, revenue increases by approximately 0.72 units (in the scaled metric used in this dataset).

This is a strong and statistically robust effect.

Genre

Genre (Romance vs Action) was not a significant predictor:

β = 0.004
t = 0.062
p = .95

This indicates that, after controlling for budget, genre does not significantly predict revenue.

In practical terms: once you account for how much money was spent making the movie, whether it was Action or Romance does not meaningfully affect revenue in this dataset.
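You can see this in the model’s predictions: holding budget fixed, predicted revenue for the two genres is almost identical. A quick sketch (the budget value of 0 here is illustrative; this standardised predictor is roughly centred on zero):

```r
# Predicted (scaled) revenue for an Action vs a Romance film with the same budget
predict(MR, newdata = data.frame(budget = 0,
                                 genres = c("Action", "Romance")))
# the two predictions differ only by the tiny genresRomance
# coefficient (~0.004)
```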

Activity 7: Descriptives and Checking Assumptions

descriptives(data = df_genres,
             vars = c("revenue", "budget", "genres"))

 DESCRIPTIVES

 Descriptives                                                 
 ──────────────────────────────────────────────────────────── 
                         revenue      budget         genres   
 ──────────────────────────────────────────────────────────── 
   N                           333            333       333   
   Missing                       0              0         0   
   Mean                   7.569269    -0.09474513             
   Median                 7.664798     0.04210667             
   Standard deviation    0.6948411      0.5867119             
   Minimum                5.434315      -2.179742             
   Maximum                9.177897      0.9342013             
 ──────────────────────────────────────────────────────────── 
#since genres is categorical, we are going to check the frequency of each response using the following function
table(df_genres$genres)

 Action Romance 
    144     189 
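A quick note on how R handled the categorical predictor above: lm() automatically dummy-codes a character or factor predictor, using the first level (alphabetically, Action) as the reference. That is why the output shows a genresRomance coefficient. You can inspect, and if needed change, the levels with factor():

```r
# Make genres an explicit factor; "Action" comes first alphabetically,
# so it becomes the reference level in the regression
df_genres$genres <- factor(df_genres$genres)
levels(df_genres$genres)
```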

The assumptions for a multiple regression are the same as for a linear regression, but with one extra: multicollinearity. Simply put, this assumption means that none of our predictors can be too highly correlated with each other.

a. The outcome / DV is continuous, and is either interval or ratio.

Interval data = Data that is measured on a numeric scale with equal distance between adjacent values, that does not have a true zero. This is a very common data type in research (e.g. Test scores, IQ etc).

Ratio data = Data that is measured on a numeric scale with equal distance between adjacent values, that has a true zero (e.g. Height, Weight etc).

Here we know our outcome is revenue (measured as money) and as such does have a true zero. As such it is ratio data and this assumption has been met

b. The predictor variable is interval or ratio or categorical (with two levels)

c. All values of the outcome variable are independent (i.e., each score should come from a different observation - participants, or in this case movie)

d. The predictors have non-zero variance

e. The relationship between outcome and predictor is linear

f. The residuals should be normally distributed

g. The assumption of homoscedasticity.

h. The assumption of multicollinearity.

Assumptions e-h:

These assumptions may all be checked visually for a regression, and conveniently using the function check_model.

check_model(MR)

We showed above how to output these as individual plots; that approach may be helpful here as well.

Interpretation of Diagnostic Plots

The assumption checks look good.

1. Posterior Predictive Check

Observed and model-predicted distributions closely overlap, indicating good overall fit.

2. Linearity

Residuals are randomly scattered around zero with no visible curvature.
The linearity assumption appears satisfied.

3. Homoscedasticity

Residual spread is relatively constant across fitted values.
There is no clear funnel pattern.
Homoscedasticity appears met.

4. Normality of Residuals

Points in the QQ plot fall close to the line, with only minor tail deviation.

Given the sample size (n ≈ 333), this is more than acceptable.

5. Multicollinearity

The VIF plot shows values well below 5.
There is no evidence of problematic multicollinearity between budget and genre.
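If you want the VIF values as numbers rather than a plot, the car package (which we loaded earlier) provides the vif() function:

```r
# Variance inflation factors for each predictor;
# values well below 5 indicate no multicollinearity problem
vif(MR)
```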

Overall: assumptions are adequately met.

Effect sizes!

As before, we will calculate Cohen’s f² using the code below:

MR_f2 <- MR_summary$adj.r.squared/(1 - MR_summary$adj.r.squared)

MR_f2
[1] 0.5822341

An f² of 0.58 indicates a large effect size.

How we might write up the results in APA style?

A multiple linear regression was conducted to examine whether budget and genre predicted movie revenue. The overall model was statistically significant, F(2, 330) = 97.65, p < .001, explaining 37% of the variance in revenue (Adjusted R² = .37, f² = .58).

Budget was a significant positive predictor of revenue (β = 0.72, t = 13.43, p < .001), indicating that movies with larger budgets generated higher revenue, controlling for genre. Genre was not a significant predictor of revenue (β = 0.004, t = 0.06, p = .95).

These results suggest that budget is the primary driver of revenue in this dataset, whereas genre does not explain additional variance beyond budget.

NB: it might also be beneficial to report descriptive statistics here, such as means and standard deviations by genre.

Activity 8: Graphs

We need to visualize our data not only to check our assumptions but also to include in our write-up / results / dissertations. As you may see above the write-up for a multiple regression can be lengthy/confusing, and a good graphic can help your reader (and you) understand the results more easily. This is particularly true when we’re dealing with interactions.

Today we’ll be using the ggPredict function. We will learn a lot more about making visualizations in week 9, but for today we will learn how to quickly and clearly visualize our regression results.

ggPredict uses the following syntax:

# ggPredict(ModelName)

So now if we try this for our Linear Regression:

ggPredict(LR)
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggiraphExtra package.
  Please report the issue to the authors.

Next, let’s try this for our multiple regression. Now we can use ggPredict to create a visualisation for this new model:

ggPredict(MR)

Bonus activity!

  • For both of the above graphs try changing the code so that the graph is interactive, using the below syntax:

    ggPredict(ModelName, interactive = TRUE)

What does this do to your graphs?

It does what it says on the tin: it makes your graph interactive, so you can identify specific datapoints and the slope of the regression line. Cool, right?!

Activity 9: Multiple Regression with Two Continuous Predictors

Imagine you are working as a data analyst for a film production company. The company wants to better understand what drives box office success.

Up to this point, we have examined the role of budget and genre. However, the company now wants to know:

Does a film’s budget and its popularity rating both predict revenue?

In other words:

  • Do higher-budget films generate more revenue?

  • Do more popular films generate more revenue?

  • And when considered together, do both variables uniquely contribute to predicting revenue?

In this scenario:

Predictors: budget, popularity

Outcome: revenue

Because we are examining the influence of two continuous predictors on a continuous outcome variable, this is addressed using a multiple linear regression.

Your Steps

  1. Run the multiple regression model

    Specify revenue as the outcome and include both budget and popularity as predictors.

  2. Evaluate the model

    Is the overall model statistically significant?

    How much variance in revenue is explained (R² / Adjusted R²)?

    Are both predictors significant?

    Does each predictor explain unique variance in revenue?

  3. Check assumptions

  4. Visualise the model

    Use ggPredict() to generate predicted values and produce a clear visualisation of the regression model.

  5. Write up the results in APA style
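To get you started on step 1, the model follows the same pattern as before (this assumes the popularity column in df_movies is called popularity; check with head(df_movies)):

```r
# Multiple regression with two continuous predictors
MR2 <- lm(revenue ~ budget + popularity, data = df_movies)
summary(MR2)

# Then check assumptions and visualise, as in the earlier activities
check_model(MR2)
ggPredict(MR2)
```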