In this week’s workshop, we will learn how to conduct both simple and multiple regression in R.
Specifically, you will:
Run a simple linear regression
Run a multiple regression
Check the assumptions of regression
Create visualisations to support your interpretation
Regression allows us to examine whether one (or more) variables significantly predict an outcome variable.
Let’s begin by ensuring your working environment is ready for today’s session. Open RStudio or Posit Cloud and complete the following tasks to set everything up.
One of the first steps in each of these workshops is setting up your *working directory*. The working directory is the default folder where R will look to import files or save any files you export.
If you don’t set the working directory, R might not be able to locate the files you need (e.g., when importing a dataset) or you might not know where your exported files have been saved. Setting the working directory beforehand ensures that everything is in the right place and avoids these issues.
Click:
Session → Set Working Directory → Choose Directory
Navigate to the folder you created for this course (this should be the same folder you used for previous workshops).
Create a new folder called week7 inside this directory.
Select the week7 folder and click Open.
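If you prefer code over menus, the working directory can also be set with setwd(). This is just an alternative route to the same result; the path below is the example path from this handout, and yours will differ:

```r
# Code alternative to the menu route above; replace the path with the
# location of your own week7 folder before uncommenting and running it
# setwd("C:/Users/0131045S/Desktop/R/rintro/activities/week7")
getwd()  # confirm where R is currently looking
```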
Don’t forget to verify your working directory before we get started.
You can always check your current working directory by typing in the following command in the console:
getwd()
[1] "C:/Users/0131045S/Desktop/R/rintro/activities/week7"
As in previous weeks, we will create an R script for today’s activities. This week we will call our script 07-regression.
Go to the menu bar and select:
File → New File → R Script
This will open an untitled R script.
To save and name your script, select:
File → Save As, then enter the name:
07-regression
Click Save
At the top of your R script type in and run the following code (this turns off scientific notation from your output):
options(scipen = 999)
We’ll use several R packages to make our analysis easier.
REMEMBER: If you encounter an error message like “Error in library(package name): there is no package called ‘package name’”, you’ll need to install the package first by running the following in your console:
install.packages("package name") # replace "package name" with the actual package name
Here are the packages we will be using today:
library(jmv) # this will help us run descriptive statistics
library(car) #this will help to run our regression
library(see) # this is needed to check our assumptions
library(performance) # this is needed to check our assumptions
library(ggplot2) # we use this for our graphs
library(ggiraph) # we also use this for our graphs
library(ggiraphExtra) # once again... graphs
This week we will use a database of movies, with variables on a film’s budget, runtime, popularity rating, genre, how much revenue it made, and its vote rating.
- movies.csv → our dataset for our linear regression
- genres.csv → our dataset for our multiple regression, this is a subset of the first dataset that only includes Action and Romance films
Save both files into your week7 folder. Once this is done, load the datasets into R and save them as dataframes called “df_movies” and “df_genres”:
df_movies <- read.csv("movies.csv")
df_genres <- read.csv("genres.csv")
Please note that the values for some of our measures have been normalized.
After loading the datasets, it’s always good practice to inspect them before doing any analyses. You can use the head() function to get an overview of the movies dataset.
head(df_movies)
     budget  revenue
1  24.43952 113.5532
2  41.16180 192.4267
3 108.15756 304.7547
4  53.95969 167.3981
5  64.55049 239.5998
6 148.82935 440.9591
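Besides head(), a few other base R functions are useful for a first look at a data frame. They are shown here on a small toy data frame so the sketch is self-contained; run them on df_movies once it is loaded:

```r
# Toy data frame standing in for df_movies (run these on df_movies yourself)
toy <- data.frame(budget = c(24.4, 41.2, 108.2), revenue = c(113.6, 192.4, 304.8))

str(toy)      # variable names, types, and a preview of values
nrow(toy)     # number of rows (one per movie)
summary(toy)  # min / quartiles / mean / max for each numeric column
```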
Let’s imagine we’re interested in investigating whether a films budget (how much was spent to make it) predicts its revenue (how much money it earned at the box office).
In this case:
Our predictor variable is: budget
Our outcome measure is: revenue
We could specify our hypothesis as such:
H₀: Budget does not significantly predict revenue.
H₁: Budget significantly predicts revenue.
As we are interested in whether one continuous variable predicts another continuous variable, this is best addressed via a simple linear regression.
In contrast to previous tests, in R we run our regressions before we check all our assumptions.
In R, regression is run using the lm() function.
# remember this is placeholder code; you would need to replace "Outcome", "Predictor", and "OurData" with the corresponding variables in your dataset
LR <- lm(Outcome ~ Predictor, data = OurData) # here we are creating an object "LR" which contains a linear model of our DV ~ IV
Let’s run this linear regression to find out if budget significantly predicts revenue in our df_movies dataset:
LR <- lm(revenue ~ budget, data = df_movies)
To review the results of our linear regression we use the summary() function on the LR object we just created. We are going to save the result as an object called LR_summary, which will let us refer back to it later:
LR_summary <- summary(LR)
LR_summary
Call:
lm(formula = revenue ~ budget, data = df_movies)
Residuals:
Min 1Q Median 3Q Max
-115.921 -25.865 -1.969 27.229 125.937
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.08494 4.18869 13.63 <0.0000000000000002 ***
budget 2.41986 0.04998 48.41 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 41.25 on 798 degrees of freedom
Multiple R-squared: 0.746, Adjusted R-squared: 0.7457
F-statistic: 2344 on 1 and 798 DF, p-value: < 0.00000000000000022
What do these results mean?
The overall regression model was statistically significant:
F(1, 798) = 2344
p < .001
R² = .746
Adjusted R² = .746
This means that budget explains approximately 74.6% of the variance in revenue.
That is a very large proportion of variance explained for a single predictor, indicating that budget is a very strong predictor of box office revenue in this dataset.
The intercept represents the predicted revenue when budget equals zero.
In practical terms, this would correspond to a film with a budget of $0 million, which is not realistic in practice, but mathematically it anchors the regression line.
The intercept is statistically significant (p < .001), although interpretation typically focuses more on the slope.
The regression coefficient for budget is 2.42.
This means:
For every additional $1 million increase in budget, predicted revenue increases by approximately $2.42 million, on average.
This effect is statistically significant:
t = 48.41
p < .001
The very large t-value indicates a very strong relationship between budget and revenue.
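One way to see what the slope means in practice is to generate a prediction from the fitted model. The sketch below uses a toy model so it runs on its own; with the workshop objects the call would be predict(LR, newdata = data.frame(budget = 100)):

```r
# Toy stand-in for the workshop model (deliberately similar coefficients)
toy <- data.frame(budget = c(10, 50, 100, 150), revenue = c(80, 180, 300, 420))
toy_lm <- lm(revenue ~ budget, data = toy)

# Predicted revenue (in $ millions) for a film with a $100 million budget
predict(toy_lm, newdata = data.frame(budget = 100))
# With the workshop model: predict(LR, newdata = data.frame(budget = 100)),
# which works out to roughly 57.08 + 2.42 * 100, i.e. about $299 million
```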
Residual standard error = 41.25
This indicates that, on average, the model’s predictions differ from observed revenue values by approximately $41 million.
Given that the mean revenue is about $247 million, this level of prediction error is reasonable.
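If you are curious where this number comes from, the residual standard error is just the square root of the residual sum of squares divided by the residual degrees of freedom. A self-contained sketch on toy data (with the workshop model you would use residuals(LR) and df.residual(LR)):

```r
# Build a toy model so the sketch runs on its own
set.seed(7)
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30, sd = 5)
toy_lm <- lm(y ~ x, data = toy)

# Residual standard error = sqrt(sum of squared residuals / residual df)
sqrt(sum(residuals(toy_lm)^2) / df.residual(toy_lm))
# This matches summary(toy_lm)$sigma, just as 41.25 matched above
```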
We will get the descriptive statistics needed for our write-up, using the descriptives function.
descriptives(df_movies,
vars = c("budget", "revenue"))
DESCRIPTIVES
Descriptives
──────────────────────────────────────────────
budget revenue
──────────────────────────────────────────────
N 800 800
Missing 0 0
Mean 78.55675 247.1816
Median 79.02518 243.6045
Standard deviation 29.19317 81.79052
Minimum 5.945634 26.64607
Maximum 180.3367 475.0297
──────────────────────────────────────────────
There are several key assumptions for conducting a linear regression. Conveniently, you will already be familiar with several of these, but there is one assumption we have not seen before:
a. The outcome / DV is continuous, and is either interval or ratio.
Interval data = Data that is measured on a numeric scale with equal distance between adjacent values, that does not have a true zero. This is a very common data type in research (e.g. Test scores, IQ etc).
Ratio data = Data that is measured on a numeric scale with equal distance between adjacent values, that has a true zero (e.g. Height, Weight etc).
Here we know our outcome is revenue (measured as money), which has a true zero. As such it is ratio data, and this assumption has been met.
b. The predictor variable is interval or ratio or categorical (with two levels)
Here we know our predictor is budget (measured as money), which has a true zero. As such it is ratio data, and this assumption has been met.
c. All values of the outcome variable are independent (i.e., each score should come from a different observation - participants, or in this case movie)
Each row in the dataset represents a different film. Therefore, each revenue value comes from a different observation. By inspecting our data, we can see that this assumption is met.
d. The relationship between outcome and predictor is linear
The relationship between the predictor and outcome should be linear (i.e., describable by a straight line). We will check this momentarily.
e. The residuals should be normally distributed
Regression assumes that the residuals (the differences between predicted and actual values) are normally distributed — not the raw variables themselves. Again, we will check this momentarily.
f. The assumption of homoscedasticity.
As we encountered in correlation, points should have uniform variance across different values of our predictor and outcome variables. We’ll check this next!
Checking Assumptions d-f:
These assumptions may all be checked visually for a regression, and conveniently using the function check_model.
check_model(LR)
If you cannot see your plots, try changing the size of the plots pane so it takes up a larger portion of your screen.
Alternatively, you can edit the code so that each plot is printed individually.
out <- plot(check_model(LR, panel = FALSE))
out$PP_CHECK
The remaining plots are stored in the same list, so you can print them one at a time with out$NCV, out$HOMOGENEITY, out$OUTLIERS, and out$QQ. You may also see a console message suggesting you install the `qqplotr` package, which adds confidence bands to the QQ plot.
The density of the model-predicted revenue closely matches the density of the observed revenue.
The two curves overlap very closely, suggesting the model is capturing the overall distribution of revenue well.
This indicates good overall model fit.
In the residuals vs fitted plot:
The points are randomly scattered.
The reference line is approximately flat and horizontal.
There is no clear curve or systematic patterns
This suggests that the assumption of linearity is met. The relationship between budget and revenue can reasonably be described using a straight line.
In the homogeneity of variance plot:
The spread of residuals appears fairly consistent across fitted values.
There is no clear funnel shape.
Variance does not dramatically increase or decrease at higher predicted revenues.
This suggests the assumption of homoscedasticity is reasonably satisfied.
There may be very slight variation at the extreme high fitted values, but nothing concerning.
In the leverage plot:
Nearly all points fall within the contour lines.
There are no extreme leverage points outside the boundaries.
This suggests there are no highly influential observations disproportionately affecting the model.
In the QQ plot:
Most points fall close to the reference line.
There are only minor deviations at the tails.
Given the large sample size (n = 800), small deviations at the extremes are not problematic.
The residuals appear approximately normally distributed.
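If you would like a numeric check to complement the QQ plot, base R’s shapiro.test() can be applied to the residuals (note it only accepts sample sizes up to 5000). The sketch below uses a toy model so it is self-contained; with the workshop model the call would be shapiro.test(residuals(LR)):

```r
# Toy model standing in for LR
set.seed(1)
toy <- data.frame(x = 1:100)
toy$y <- 3 * toy$x + rnorm(100, sd = 10)

# Shapiro-Wilk test of normality on the model residuals;
# a small p-value (< .05) would suggest non-normal residuals
shapiro.test(residuals(lm(y ~ x, data = toy)))
```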
g. The predictors have non-zero variance
This is a new assumption, but it just means that the predictor must vary across observations. If every film had the same budget, regression here would be meaningless.
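There is also a quick numeric version of this check: the variance of the predictor simply needs to be greater than zero. Illustrated with toy values here; for the workshop data you would check var(df_movies$budget):

```r
# The predictor must vary: its variance should be greater than zero
var(c(24.4, 41.2, 108.2, 54.0))  # positive, so the assumption would be met
var(rep(50, 4))                  # zero variance: regression would be meaningless
```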
We can visualise this using a scatterplot. We should see spread in both variables and a roughly linear trend.
plot(x = df_movies$budget, y = df_movies$revenue)
abline(lm(revenue ~ budget, data = df_movies), col = "blue", lwd = 2)
The scatterplot shows a clear positive linear relationship between budget and revenue. As budget increases, revenue increases in a roughly straight-line pattern.
There is some spread around the line (which is expected in real data), but the relationship is clearly linear and strong. There is no obvious curvature, clustering, or extreme outliers distorting the relationship.
This visually supports the regression result.

In regression, we often report Cohen’s f² as an effect size.
We calculate f² using adjusted R²:
f2 <- LR_summary$adj.r.squared / (1 - LR_summary$adj.r.squared)
f2
[1] 2.932069
This represents an extremely large effect size, indicating that budget is a very strong predictor of revenue in this dataset.
Because f² values above .35 are already considered large, a value of 2.93 suggests the predictor accounts for a very substantial proportion of variance.
Realistically, we are very unlikely to see this in real research!
Interpreting Cohen’s f²
There are several rules of thumb for the interpretation of effect size. For f² they are as follows:
Small = ~ 0.02
Medium = ~0.15
Large = ~0.35
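These cut-offs can be wrapped in a small helper function. The function below is purely illustrative — interpret_f2 is not part of any package, just a sketch of the rules of thumb above:

```r
# Hypothetical helper: label an f2 value using the rules of thumb above
interpret_f2 <- function(f2) {
  if (f2 >= 0.35) "large"
  else if (f2 >= 0.15) "medium"
  else if (f2 >= 0.02) "small"
  else "negligible"
}

interpret_f2(2.93)  # "large"
```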
Here’s how we might write up the results in APA style:
A simple linear regression was conducted to examine whether film budget predicted box office revenue. Budget (M = 78.56, SD = 29.19) and revenue (M = 247.18, SD = 81.79) were measured in millions of dollars (N = 800).
The overall regression model was statistically significant, F(1, 798) = 2344.00, p < .001, explaining 74.6% of the variance in revenue (Adjusted R² = .75). The effect size was extremely large (f² = 2.93). Budget was a significant positive predictor of revenue (β = 2.42, t = 48.41, p < .001), indicating that for every additional $1 million increase in budget, revenue increased by approximately $2.42 million on average.
These findings suggest that higher film budgets are strongly associated with higher box office revenue. Therefore, one can reject the null hypothesis.
Sometimes we’re interested in the impact of multiple predictors on an outcome variable.
In addition to our earlier prediction regarding budget and revenue we could also predict:
1) that a movie’s genre will predict its revenue.
In this case:
Our predictor variables are: budget, and genres
Our outcome measure is: revenue
As we are interested in the impact of two predictor variables on a continuous outcome variable this would be best addressed via a multiple regression.
A lot of the steps are very similar to a simple linear regression, so we can refer to the sections above for help if we get unsure. Again, because of how the assumption-checking functions work in R, we run the regression before we check our assumptions.
Once again we use the lm function to perform the regression. The syntax is:
MR <- lm(revenue ~ budget + genres, data = df_genres)
This model tests:
Whether budget predicts revenue
Whether genre predicts revenue
Whether each predictor explains unique variance in revenue when the other predictor is held constant
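One way to test the last point directly is a nested model comparison: fit a budget-only model and ask whether adding genres significantly improves it. The sketch below uses toy data so it runs on its own; with the workshop objects the call would be anova(lm(revenue ~ budget, data = df_genres), MR):

```r
# Toy data standing in for df_genres
set.seed(2)
toy <- data.frame(budget = rnorm(60), genres = rep(c("Action", "Romance"), 30))
toy$revenue <- 0.7 * toy$budget + rnorm(60)

reduced <- lm(revenue ~ budget, data = toy)           # budget only
full    <- lm(revenue ~ budget + genres, data = toy)  # budget + genre
anova(reduced, full)  # F-test: does genre explain unique variance?
```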
We then examine the results:
Once again we can use the summary() function to review the results of our multiple regression. We are going to save this as an object called “MR_summary”, which will let us refer back to it later:
MR_summary <- summary(MR)
MR_summary
Call:
lm(formula = revenue ~ budget + genres, data = df_genres)
Residuals:
Min 1Q Median 3Q Max
-2.0085 -0.2579 0.1101 0.3445 1.2325
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.635517 0.046310 164.879 <0.0000000000000002 ***
budget 0.723058 0.053843 13.429 <0.0000000000000002 ***
genresRomance 0.003978 0.063669 0.062 0.95
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5524 on 330 degrees of freedom
Multiple R-squared: 0.3718, Adjusted R-squared: 0.368
F-statistic: 97.65 on 2 and 330 DF, p-value: < 0.00000000000000022
The regression model was statistically significant:
F(2, 330) = 97.65, p < .001
This means that, taken together, budget and genre significantly predict revenue.
The model explains:
R² = 0.372
Adjusted R² = 0.368
So approximately 37% of the variance in revenue is explained by budget and genre combined. That is a substantial proportion.
Budget was a significant predictor:
β = 0.723
t = 13.43
p < .001
This means that, holding genre constant, higher budgets are associated with higher revenue.
Specifically, for every 1-unit increase in budget, revenue increases by approximately 0.72 units (in the scaled metric used in this dataset).
This is a strong and statistically robust effect.
Genre (Romance vs Action) was not a significant predictor:
β = 0.004
t = 0.062
p = .95
This indicates that, after controlling for budget, genre does not significantly predict revenue.
In practical terms: once you account for how much money was spent making the movie, whether it was Action or Romance does not meaningfully affect revenue in this dataset.
descriptives(data = df_genres,
vars = c("revenue", "budget", "genres"))
DESCRIPTIVES
Descriptives
────────────────────────────────────────────────────────────
revenue budget genres
────────────────────────────────────────────────────────────
N 333 333 333
Missing 0 0 0
Mean 7.569269 -0.09474513
Median 7.664798 0.04210667
Standard deviation 0.6948411 0.5867119
Minimum 5.434315 -2.179742
Maximum 9.177897 0.9342013
────────────────────────────────────────────────────────────
#since genres is categorical, we are going to check the frequency of each response using the following function
table(df_genres$genres)
Action Romance
144 189
The assumptions for a multiple regression are the same as for a linear regression, but with one extra: no multicollinearity. Simply put, this assumption means that none of our predictors can be too highly correlated with each other.
a. The outcome / DV is continuous, and is either interval or ratio.
Interval data = Data that is measured on a numeric scale with equal distance between adjacent values, that does not have a true zero. This is a very common data type in research (e.g. Test scores, IQ etc).
Ratio data = Data that is measured on a numeric scale with equal distance between adjacent values, that has a true zero (e.g. Height, Weight etc).
Here we know our outcome is revenue (measured as money), which has a true zero. As such it is ratio data, and this assumption has been met.
b. The predictor variable is interval or ratio or categorical (with two levels)
c. All values of the outcome variable are independent (i.e., each score should come from a different observation - participants, or in this case movie)
d. The predictors have non-zero variance
e. The relationship between outcome and predictor is linear
f. The residuals should be normally distributed
g. The assumption of homoscedasticity.
h. The assumption of no multicollinearity.
Assumptions e-h:
These assumptions may all be checked visually for a regression, and conveniently using the function check_model.
check_model(MR)
We showed above how to output these as individual plots; that may be helpful here also.
The assumption checks look good.
Observed and model-predicted distributions closely overlap, indicating good overall fit.
Residuals are randomly scattered around zero with no visible curvature.
The linearity assumption appears satisfied.
Residual spread is relatively constant across fitted values.
There is no clear funnel pattern.
Homoscedasticity appears met.
Points in the QQ plot fall close to the line, with only minor tail deviation.
Given the sample size (n ≈ 333), this is more than acceptable.
The VIF plot shows values well below 5.
There is no evidence of problematic multicollinearity between budget and genre.
Overall: assumptions are adequately met.
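If you want the VIF values as numbers rather than a plot, the vif() function from the car package (loaded at the start of the session) prints them directly with vif(MR). The formula behind them can also be sketched by hand, as below on toy data:

```r
# What VIF measures: for each predictor, VIF = 1 / (1 - R^2) from regressing
# that predictor on the other predictors. Toy illustration with two predictors:
set.seed(3)
toy <- data.frame(x1 = rnorm(100))
toy$x2 <- 0.5 * toy$x1 + rnorm(100)  # moderately correlated predictors
toy$y  <- toy$x1 + toy$x2 + rnorm(100)

r2_x1 <- summary(lm(x1 ~ x2, data = toy))$r.squared
1 / (1 - r2_x1)  # the VIF for x1; values below 5 are generally fine
```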
As before, we will calculate Cohen’s f² using the code below:
MR_f2 <- MR_summary$adj.r.squared / (1 - MR_summary$adj.r.squared)
MR_f2
[1] 0.5822341
An f² of 0.58 indicates a large effect size.
How we might write up the results in APA style?
A multiple linear regression was conducted to examine whether budget and genre predicted movie revenue. The overall model was statistically significant, F(2, 330) = 97.65, p < .001, explaining 37% of the variance in revenue (Adjusted R² = .37, f² = .58).
Budget was a significant positive predictor of revenue (β = 0.72, t = 13.43, p < .001), indicating that movies with larger budgets generated higher revenue, controlling for genre. Genre was not a significant predictor of revenue (β = 0.004, t = 0.06, p = .95).
These results suggest that budget is the primary driver of revenue in this dataset, whereas genre does not explain additional variance beyond budget.
NB it might also be beneficial to see descriptive statistics reported here, such as means and standard deviations by genre.
We need to visualize our data not only to check our assumptions but also to include in our write-up / results / dissertations. As you may see above the write-up for a multiple regression can be lengthy/confusing, and a good graphic can help your reader (and you) understand the results more easily. This is particularly true when we’re dealing with interactions.
Today we’ll be using the ggPredict function. We will learn a lot more about making visualizations in week 9, but for today we will learn how to quickly and clearly visualize our regression results.
ggPredict uses the following syntax:
# ggPredict(ModelName)
So now if we try this for our linear regression:
ggPredict(LR)
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the ggiraphExtra package.
Please report the issue to the authors.
This is just a deprecation warning coming from the ggiraphExtra package; it can safely be ignored, and your plot will still appear.

Next, let’s try this for our multiple regression. Now we can use ggPredict to create a visualisation for this new model:
ggPredict(MR)
Bonus activity!
For both of the above graphs try changing the code so that the graph is interactive, using the below syntax:
ggPredict(ModelName, interactive = TRUE)
What does this do to your graphs?
It does what it says on the tin: it makes your graph interactive, so you can identify specific datapoints and the slope of the regression line. Cool, right?!

Imagine you are working as a data analyst for a film production company. The company wants to better understand what drives box office success.
Up to this point, we have examined the role of budget and genre. However, the company now wants to know:
Do a film’s budget and its popularity rating both predict revenue?
In other words:
Do higher-budget films generate more revenue?
Do more popular films generate more revenue?
And when considered together, do both variables uniquely contribute to predicting revenue?
In this scenario:
Predictors: budget, popularity
Outcome: revenue
Because we are examining the influence of two continuous predictors on a continuous outcome variable, this is addressed using a multiple linear regression.
Specify revenue as the outcome and include both budget and popularity as predictors.
Evaluate the model
Is the overall model statistically significant?
How much variance in revenue is explained (R² / Adjusted R²)?
Are both predictors significant?
Does each predictor explain unique variance in revenue?
Check assumptions
Visualise the model
Use ggPredict() to generate predicted values and produce a clear visualisation of the regression model.
Write up the results in APA style
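As a starting point, the model for this exercise can be specified just like the earlier ones. Note that “popularity” is assumed here to be the column name in df_movies (check with names(df_movies)); the sketch below uses toy data so it runs on its own:

```r
# Starter sketch for the exercise, shown on toy data
set.seed(4)
toy <- data.frame(budget = rnorm(50), popularity = rnorm(50))
toy$revenue <- 2 * toy$budget + toy$popularity + rnorm(50)

MR2 <- lm(revenue ~ budget + popularity, data = toy)
summary(MR2)
# With the real data: MR2 <- lm(revenue ~ budget + popularity, data = df_movies)
```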