Notes on Udemy Class: R, ggplot, and Simple Linear Regression
Working through a free course at Udemy R, ggplot, and Simple Linear Regression. My notes will follow
I am posting this tutorial as I learn R. I will respond to feedback for errata in the comments.
I will not address any issues that we have covered in the getting started section or the Kaggle Titanic tutorial follow-along.
To follow along with this blog post you can head to Udemy R, ggplot, and Simple Linear Regression. Register and and follow along with the free class.
Getting Started
Introduction
Dr. Charles Redmond has a series of R videos at Udemy. In the introduction he gives a motivation that he wants to bring data science under other disciplines. His goal is understanding linear regression. I am already familiar with linear regression, but a refresher is good.
This course is provided for free and I don’t want to be too harsh, but I will point out shortcomings and errors in the tutorial. Anyone that makes a course and provides it for free needs to be commended, even if the course or material is not perfect. This is why I always ask for feedback. I will use the feedback and adjust and fix my material as needed
Installing R and R-Studio
Some basics of R and the interface R-Studio. Nothing new for now. Getting R at http://wwww.r-project.org (I usually go to https://cran.r-project.org/ directly). He recommends the base distribution. Also recommends R-Studio at http://www.rstudio.com.
A Tour of R-Studio
Talking about the panels in R-Studio and how to use the “restore-down” button on the console to review an empty script panel. I’ve always just clicked new, didn’t even know there was a ready text panel.
He likes to use the console to do his interactive work and the top panel for scripting. He mentions that he will rarely use the “Environment” panel. I actually use it quite a bit to quickly inspect values. Its quite handy to use it to click a dataframe and view the content in a nice spreadsheet-like table.
He also mentions the Ctrl-L hotkey for clearing the console. Expanding the menus at the top will reveal other shortcuts. I accidentaly found Ctrl-L trying to delete a line in a script file. Many text editors - including Notepad++ - uses that for clearing a single line.
Vectors
I already covered vectors in Getting Started with R - Part 2: Vector Basics and Getting Started with R - Part 3: Vector Operations. Notice that he typed X<-3
in the console, this is not acceptable formatting because it can be confused for “Is x less than minus 3?”.
He shows how to use the combine function to create a vector using c()
and basic indexing with the []
. His showing of slicing with :
operator hides the real dynamic of the the :
operator. This operator is really a range generator :- essentially creating a series and that the series is actually then the vector for indexing. I covered this in Getting Started with R - Part 4: Vector Extracting, Replacing and Excluding. For example: -2:-6
is really just c(-2, -3, -4, -5, -6)
. Since indices with a -
are simply excluded.
This lesson glosses over important concepts of the []
Extract\Replace operator. Please understand how it works early on to save you headaches later
Data Frames
He skips lists and matrices and move to data frames. Nothing much new here. Immediatly jumps into indexing by name with the $
operator.
Style
A note on the lecturer starting a data frame with a capital letter. Dataframes are identifiers and therefore I think it is odd to use a different style over any other identifier.
There is no universal R Style guide and even Google’s own style guide to some extent does not play well with much of the CRAN Repository. In case you wonder; Google’s style guide prefers .
between lowercase words in an identifier like my.vector.of.temps
over camelCase (myVectorOfTemps
) and advises against underscores my_vector_of_temps
. It also recommends function names be in PascalCase like this MyFunction
. Other styleguides (like Hadley Wickham’s) recommend undescores between words in identifiers the way you see me use it, but he advises against dots in function names. However, the CRAN is littered with lots of libraries that use .
.
I don’t think the Google styleguide is not a good choice, I am confused that they did not go with Wicham’s style. But even a bad choice can help a team work in a consistent style. So your team will need to decide on a style. I think Wickham’s style guide is superior, but even I cherry pick what I want from it.
There are a few things that are almost universal required for basic R style:
- Spaces either sides of binary operators (with the exception of the
:
operator), - spaces after commas. Also,
- you want to add an space between anything that is not a function (keyword, unary operator, identifier) and an opening paren like this
if (a_logical)
, because otherwise it looks like a function.
Working with ggplot
installing ggplot2
He shows how to install a package using the “Tools, Install Packages…” menu and then typing the name. I usually just type something like install.packages("ggplot2")
in the console. Why? Because when you work on most projects the package is often included in the script (sometimes the install.packages()
command is just commented out). So usually copy and paste or uncomment the line in the script and be done.
He mentions that once the cursor stops you know the package is installed. There is an easier way to see if the package is installed. Look to the right panel where it shows “Files, Plots, Packages, Help, viewer”. Simply see if the package is listed under the Packages tab.
He does mention the library()
function we saw in the Kaggle Titanic tutorial follow-along. In this case its a trivial library(ggplot2)
.
Just a note on CRAN. I’ve had trouble getting packages to install. Sometimes the servers are just too busy, and often you just have to try at a later time. I don’t know of any trustworth repositories outside of CRAN.
Another way to get ggplot2 along with the rest of a family of packages call the “tidyverse”; you can simply install using install.packages('tidyverse')
. The tidyverse is a bit more than you need for this tutorial, but is very handy for data science in general
Plotting a point with ggplot
Yes, the package is ggplot2
and the function is ggplot
.
The ggplot()
constructs the inital plot object often given a default dataset and aesthetics. The function is almost always followed by the +
operator to add components to the plot. The sample created here was just a simple skeletal plot to which a geometry point layer was added.
# code like this: spaces on either side of infix operators (=, <-, +) and spaces after commas
ggplot() + geom_point(data = dat, mapping = aes(x = x, y = y), color = "blue")
You’ll notice that he is supplying the aesthetics opject without a parameter name. Its actually the second parameter so if he wanted to omit the name he should have had this format, otherwise specifify the name “mapping”
ggplot() + geom_point( aes(x = x, y = y), data = dat, color = "blue")
The ggplot
constructor has three basic forms.
ggplot(data, mapping)
, useful if all layers will use the same data and aestheticsggplot(data)
, when layers share data, but vary in aesthetics; andggplot()
, where layers control data and aesthetics
So the following will yield the exact same result
ggplot(dat, aes(x = x, y = y)) + geom_point(size = 5, color = "blue")
The data is the first parameter in ggplot, and mapping is the second so I can supply dat
and aes
, without variable names. Just note that the two parameters are reversed in the geom_point
and geom_line
functions. So most samples in this Udemy class are technically wrong since the aesthetics object is passed without a parameter name ever time as the second parameter into these functions. There may be an error handling mechanism that correctly assings the aesthetics option to the “mapping” variable, but technically these calls should have produced erros.
His explanation of assigning x to x and y to y is a bit brief. Here is deal: the =
operator assign at current context and lower, while <-
assigns in the global context. Here the x
on the left is in defined inside the context of the function and the x
on the right is on the global context. In this case the x on the right will reference dat$x
. Since the console context cannot see what is assigned and cannot even see the x
on the left this is a perfectly fine assignment. This type of assignment is most commonly used to pass parameters by name.
Just a note. Since the first two parameters of the aes
objects are so commonly used it is common to drop the parameter names. Pretty much everyone knows they are x and y.
Controlling axis properties
In this section he creates another empty ggplot and adds the geometry points to the plot like this
ggplot() + geom_point(data = dat, mapping = aes(x = x, y = y), size = 5, color = "blue" )
If you read my notes on the previous section you’ll know that if he wanted to ommit the mapping =
then the parameter should have been the first one in geom_point
. The documentation has more on this
ggplot() + geom_point(data = dat, mapping = aes(x = x, y = y), size = 5, color = "blue" )
# or
ggplot() + geom_point(aes(x = x, y = y), data = dat, size = 5, color = "blue" )
# or simplified
ggplot() + geom_point(aes(x, y), dat, size = 5, color = "blue" )
# or prefereably having the data and aesthethics at the highest level possible
ggplot(data = dat, mapping = aes(x, y)) + geom_point(size = 5, color = "blue" )
He mentions that he has already loaded ggplot2 into the session. An easy way to do that is to go to the packages tab on the right and check the box next to ggplot. That will automatically execute library("ggplot2")
scale_*_continuous
and other continuous scales
scale_x_continuous
and scale_y_continuous
(scale_*_continuous
). Are the default linear continuous scales, there are other continous scales :
scale_*_log10()
scale_x_reverse()
scale_x_sqrt()
Beside allowing for the limits, the scales allow labels and positioning of the scales on the plot, among other features.
One important note. If you plot say a line with geom_line
and one of the points is outside the scale then the line will not be drawn.
seq()
Remember the :
operator? Well 3:17
is equivalent to seq(3, 17)
. The seq()
function allows more control, including step size or data sized based on a repeat pattern. In this case seq(0, 15, 1)
is the same as seq(0, 15)
, which is the same as 0:15
More with color and shape
No need to highlight before ctrl-enter. It executes code line by line. It understands continued lines of code and will excute them as one and move the cursor one position down so repeatedly pressing ctrl-enter is like stepping through as it executes.
For reference:
- The color chart and
- the shape chart he showed
Graphing lines with ggplot
In this case he uses geom_line
. If he created the ggplot
object earlier with the data and aesthetics he could have subsituted the points for lines with the by switching the function. This is a big reason the first form of ggplot(data, mapping)
is generally prefered if data and aesthetics are common between all layers
More with lines
If you remember my post on vector operations (and the illustration of the rack-and-pinion gear) you know how a smaller vector gets applied in multiples over the larger vector. It should be no surprise that you can assign the vector y based on math of the applied to another.
Sampling form populations
Normal populations
Explains that samples typically near the mean with equal probability of a sample to be on either size of the mean. Explains that standard deviation is how wide the range is around the mean.
rnorm
has the following signature rnorm(n, mean = 0, sd = 1)
where n
is the number of observations.
Plotting a vertical sample
You can also create the dataframe in a single statement like this
dat <- data.frame(x = 1, y = rnorm(100, 50, 10))
No need to use rep since the larger vector “y” will dictate the size and since x = 1 is a vector of single element it is applied accross the entire dataframe.
mean
is actually the name of a built in function so its a bad identifier name.
Another way to output this data would be as follows:
dat_mean <- data.frame(x = 1, y = 50)
dat <- data.frame(x = 1, y = rnorm(100, 50, 10))
ggplot(data = dat, mapping = aes(x, y) ) +
geom_point() + # inherits data and aesthetics
geom_point(data = dat_mean, size = 7, color = "red") # overrides data, but uses inherited aesthetics
Notice how we specify the aesthetics only once, since the dat_mean frame has the same same aesthetics.
ggplot( mapping = aes(x, y) ) + # aesthetics are common
geom_point(data = dat) +
geom_point(data = dat_mean, size = 7, color = "red")
We have to specify the parameter name for the aesthetics (mapping), because it is the second param after data in the ggplot function. The x and y parameters in the aesthetics function are so commonly used they are often not named. Also we could have passed 1 as x in the aesthetics, without repeating in our data frame
dat_mean <- data.frame(y = 50)
dat <- data.frame(y = rnorm(100, 50, 10))
ggplot(data = dat, mapping = aes(1, y)) +
geom_point() + #Inherits the data and aesthetics
geom_point(data = dat_mean, size = 7, color = "red") #overrides the data and inherits aesthetics
Plotting multiple vertical samples
A shorthand for the vector he is creating here would be
rep(c(1, 9, 15), 1, each = 100)
This will repeat the vector 1, 9, 15 one time, but each element repeated 100 times. If you know of other short ways to do this please comment below.
Another way to plot the data without adding the x values into the data frame is to split into separate point layers like this with different x specified in the mapping (aesthetics).
dat <- data.frame(y = c(rnorm(100, 50, 10),
rnorm(100, 30, 10),
rnorm(100, 78, 10)))
dat_means <- data.frame(x = c(1, 9, 15), y = c(50, 30, 78))
ggplot() +
geom_point(mapping = aes(1, y), data = dat[1:100, , drop = F]) +
geom_point(mapping = aes(9, y), data = dat[101:200, , drop = F]) +
geom_point(mapping = aes(15, y), data = dat[201:300, , drop = F]) +
geom_point(mapping = aes(x, y), data = dat_means, size = 7, color = "red")
I know there must be ways to facet the data in other ways using ggplot, but I’m still learning and may update later
Samples along a line
You already saw how to do the simple repeats above or how to avoid using them at all using the aesthetics.
sapply
sapply()
is the more friendly version of lapply()
a utility for applying an function over a list or a vector. sapply()
returns a vector or matrix where possible.
In many case there is no need to specify function(x) call_fun(x)
. You can simply supply call_fun
for instance. He could have done this:
x <- c(2, 4, 9, 15)
sapply(x, sqrt)
As long as the elements in the supplied data can be passed to the function as FUN(x[[index]], ...)
most of the time that would be fine, but as the documentation points out is.numeric
(and potentially other is.*
functions), may give you different results if you expected the indexing to be simple []
indexing instead of [[]]
(indexing with tag removal)
A note on infix operators for instance x ^ 2
, this can be rewritten as `^`(x, 2)
so in that case
x <- c(2, 4, 9, 15)
sapply(x, `^`, 2)
Note that I just added the second parameter to ^
as an additional parameter. It will automatically be placed by index in the function after the indexed value in the input. The nice thing about this is that the operator written as a parameter in the form (x, `^`, 2)
, looks surprisingly like the normal infix notation x ^ 2
.
Now, in the case where he calculates rnorm we are a bit out of luck as far as I know. I think in that case we have to use the function(x) rnorm(1, x, 10)
format. If you know of an elegant alternative then please comment below
Cloud of points
The point of this lesson is near the end, where he tries to show if we base a population off a line and that we will use regression to determine the line. He is essentially engineering a “random” set based off a line. I assume he will show that linear regression will fit a line to the data and match the “source” line.
He creates a normal population and plots it along a line, then takes one sample offset in y to add to the plot.
Once you grasp the step by steps of the this lesson here is a shorter way to create the pointcloud along the line. I’ve elminated the temporary vectors
data_scatter <- data.frame(x = rnorm(100, 10, 5))
data_scatter$y <- sapply(data_scatter$x, function(x) rnorm(1, 3 * x + 1, 10))
trend_line <- data.frame(x = c(-5, 25))
trend_line$y <- 3 * trend_line$x + 1
ggplot(data = data_scatter, mapping = aes(x, y) )+
geom_point() + # inherits data and aesthetics
geom_line(data = trend_line) + # inherits aesthetics
scale_x_continuous(limits = c(-10, 30)) +
scale_y_continuous(limits = c(-20, 80))
Note that the standard deviation in x and y are different. also note that if you place the endpoints of the trend line outside the scale limits the line is not drawn.
Father and son heights
I prefer str()
over head()
. It still shows a few values for each variable (column), but in a transposed output.
Equation of a line
Simple refresher on getting a line using the slope-point method
Residual visualizations
One way to create the two data frames is just to use assign
to make clone of the dataframe and add the extra column.
assign("dat", father.son)
colnames(dat) <- c("x", "y")
dat$group <- 1:nrow(father.son)
assign("means", dat)
means$y <- means$x + 3
In this lesson he explains that residuals are the deltas in Y between measured and trend line. Notes on squaring residuals: It emphasizes large differences and ensures a positive metric
Sum of squared residuals
Pretty straightforward explanation of the sum of squares.
Least Squars Line
lm
or Fit Linear Model function takes a formula as its first parameter. We saw formulas in Kaggle Titanic tutorial follow-along where we used them as parameters to the tree and random forest generator functions.
In this case we have y ~ model
where we are attempting to predict y based off a model of other driving fields separated by +
. We saw that in the Kaggle Titanic Turional. Note that there are advanced operators such as *
for factor crossing, but that is an advanced topic. In this case we have a simple linear form so we are tracking the dependenct of y on x so the model is just x
so the formula is simply y ~ x
.
If you’d like to fully qualify the formula parameter simply add formula=
in front of the formula definition
Once you get the slope and intercept there is no need to recreate the means dataframe. You can simply update the y’s without dissecting and reassembling the dataframe
means$y <- slope * means$x + intercept
Doing this preserves the group data as well.
Predicition
Not much here except that the regressed line does a best (least squares best) approximation of population regressed to a linear function.
Readin in Excel Files
Saving as csv. Pretty much what you should know from the Kaggle Titanic tutorial follow-along. R-Studio has the ability to bring in Excel files directly if you select that from the “Import Dataset” menu (using the readxl
package). Also, note that csv is also included under the “Import Dataset”. In that case the read_csv
package will be used to read the csv, that package has a slightly different syntax than the built in read.csv
Conclusion
This course is just OK. I already knew how linear regression worked, but maybe you found value in this course
Leave a Comment