Getting Started with R - Part 7: Factors - Levels and Labels
Factors are like vectors but with values classed into levels. When vectors have a limited number of repeated values they fit the bill.
I am posting this tutorial as I learn R. I will respond to feedback for errata in the comments.
How to create a factor
If we have a vector of values where the values can be only one of a few values it becomes a good candidate for a factor. In essence we will convert a vector into a new kind of vector we call a factor that internally has integer values and labels associated with each integer value.
repeat_vector <- c('I', 'often', 'repeat', 'repeat', 'myself', 'I', 'often', 'repeat', 'repeat') # Jack Prelutsky
repeat_factor <- factor(repeat_vector)
repeat_factor
yields
[1] I often repeat repeat myself I often repeat repeat
Levels: I myself often repeat
In this case there were four levels for the vector strings: I
, myself
, often
and repeat
. The levels were simply created by taking the unique strings in the vector, and then sorting them. We could have also provided levels
during the factor’s construction. A little later I will show you how to change the ordinal positions of the levels. To see the levels of a factor you can simply call levels(afactor)
to get the list of factors in order.
Notice! The levels
are used to class the values into factors. We can also pass labels
to the factor constructor, if you do this and query the levels
of the resulting factor you will see the values you passed in labels
. During construction levels
are used for classing, labels
are for naming the levels in the resulting factor differently than from the levels used for mapping during construction.
Let us take a look at the attributes of the factor object
attributes(repeat_factor)
yields
$levels
[1] "I" "myself" "often" "repeat"
$class
[1] "factor"
I’ll get to these attributes in a moment, but first let us use labels
to change the level names in our new factor. We will also pass the associated levels
for classing the input. Again, please do not confuse these two parameters. Levels are matched against input to categorize, it will also be used to name the levels except if we pass labels
. Once the vector is constructed with lablels
the levels names passed in constructor are the names of my levels.
repeat_factor_labeled <- factor(repeat_vector, levels = c("I", "often", "repeat", "myself"),
labels = c("Jack", "frequently", "repeats", "himself" ) )
repeat_factor_labeled
returns this
[1] Jack frequently repeats repeats himself Jack frequently
[8] repeats repeats
Levels: Jack frequently repeats himself
You may have noticed that when I listed the levels I swapped "repeat"
and "myself"
for the previously sorted order (look at how I ordered them in the levels=
). You can see that the levels in the new factor are also listed in the new order, but with the new names. Run attributes(repeat_factor_labeled)
and you will see that the difference. Compare the result of the attributes
function to when you ran it before against repeat_factor
.
Changing levels
You can set the levels labels after constructing a factor. This would be similar to passing in the labels
parameter. We can pass a full new vector or just labels the labels of the levels selectively. Let us just change factor label 1 from “Jack” to “Mr. Prelutsky”.
levels(repeat_factor_labeled)[1] <- "Mr. Prelutsky"
repeat_factor_labeled
There are some advanced actions you can take to combine levels in a factor using the levels()
function. See Cleaning up factor levels (collapsing multiple levels/labels) for some advanced factor level cleanup.
Ordered factors
We may have factors where the ordinal positition of the values are not important. In the example above the factors for the words above do not have a specific order to them. We can change that by adding the order=TRUE
optional parameter to the constructor. First let us find a factor where we have order
days_of_week = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
days_of_week_factor <- factor(days_of_week, order=TRUE, levels=days_of_week)
days_of_week_factor
shows us that levels have order, by displaying <
between them
[1] Sunday Monday Tuesday Wednesday Thursday Friday Saturday
Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < Friday < Saturday
Since my vector was already unique and in the correct order I could use it for my levels. If I didn’t my levels would just be ordered alphabetically. If I need to take a subset of a factor I can do the same as with vectors
tuesday <- days_of_week_factor[3]
tuesday
days_of_week_factor[c(2:7, 1)] # days of the week starting with Monday instead of Sunday
Ordered vectors allow us to do inequalities in expressions
days_of_week_factor[3] > days_of_week_factor[1]
sunday <- days_of_week_factor[1]
days_of_week_factor[days_of_week_factor > sunday]
Try the same code above but omit the order=TRUE
, you’ll get a vector of NA
s. To understand why compare two days days_of_week_factor[2] > days_of_week_factor[1]
. There is no order so >
has no meaning leading to an NA
result
Factors and matrices
All is nice in our factor space, but let us try and use factors in our matrix
#sunday <- days_of_week_factor[1]
month_names <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
calendar_matrix <- matrix(nrow=31, ncol=12, dimnames = list(1:31, month_names))
months_31days <- c(1,3,5,7,8,10,12)
months_30days <- c(4,6,9,11)
calendar_matrix[1:31, months_31days] <- sunday
calendar_matrix[1:30, months_30days] <- sunday
calendar_matrix[1:28, 'Feb'] <- sunday
# Invalid calendar positions are now NA, valid ones are are the factor value of Sunday
oldw <- getOption("warn")
# We have to turn off a warning because the days of the week do not even divide - there is a remainder of 1
options(warn = -1)
calendar_matrix[!is.na(calendar_matrix)] <- days_of_week_factor[c(2:7, 1)] # assuming Jan 1 fell on a Monday
options(warn = oldw)
calendar_matrix
Did you get what you expected? You see that the days of the week are populated by the ordinal values. This gives us a little peak of what is really going on inside of the factor: We have the integer values that simply index their names. Matrices are really not factors and giving them factors makes them store ordinal values. We will learn of another structure called a dataframe later that will make some of these complex scenarios simpler
What if we wanted to display our matrix with the labels of the factor? That is a bit tricky so here is the step by step breakdown. We can get our level names of our factor like this: levels(days_of_week_factor)
and we can index the values
levels(days_of_week_factor)[2] #Gives us "Monday"
levels(days_of_week_factor)[calendar_matrix] #Gives use the strings of the calendar_matrix, but as a vector
So we can get the calendar’s values in 1-dimensional form so how can we get it back as a matrix. Simply we assign it to matrix with the same shape.
matrix(levels(days_of_week_factor)[calendar_matrix], nrow=31, ncol=12, dimnames = list(1:31, month_names))
We could have also made a copy of our other calendar matrix and assign, like this
calendar_matrix_names <- calendar_matrix
calendar_matrix_names[] <- levels(days_of_week_factor)[calendar_matrix]
calendar_matrix_names
Leave a Comment