Getting Started with R - Part 8: Data frames - Construction, Subsetting and Structure

Vectors and matrices are designed to hold a single data type. When we need to mix data types in tabular form, we can use a data frame.

I am posting this tutorial as I learn R. I will respond to feedback for errata in the comments.

What is a data frame?

A data frame allows us to mix different types of values in a tabular form, similar to what you might find in a database table. In a statistical context the columns are “variables” and the rows are “observations”. If we think of a data frame as a database table, then the columns are our “fields” and the rows are our “records”. Each column can have a different data type.

Constructing a data frame

Most often data frames are imported from CSV (comma-separated values) files, but you can also create data frames inside of R. Here is a data frame with some basic types:

city_name <- c( "Columbus", "Cleveland", "Cincinnati")
latitude <- c(39.98, 41.48, 39.14)
longitude <-  c(-82.99, -81.68, -84.51)
population <- c(860090, 385809, 298800)
nickname <- c( "The Arch City", "America's North Coast", "The Queen City")
on_state_border <- c(FALSE, TRUE, TRUE)

oh_city_df <- data.frame(city_name, longitude, latitude, population, nickname, on_state_border)
oh_city_df

returns this nice data frame:

   city_name longitude latitude population              nickname on_state_border
1   Columbus    -82.99    39.98     860090         The Arch City           FALSE
2  Cleveland    -81.68    41.48     385809 America's North Coast            TRUE
3 Cincinnati    -84.51    39.14     298800        The Queen City            TRUE
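
To check that each column really keeps its own type, a minimal sketch along these lines may help. It rebuilds a smaller stand-in for the table so it runs on its own, and it passes stringsAsFactors = FALSE explicitly (my assumption, not part of the call above) so the result is the same on every R version.

```r
# Smaller stand-in for oh_city_df so the snippet is self-contained
cities <- data.frame(
  city_name = c("Columbus", "Cleveland", "Cincinnati"),
  latitude = c(39.98, 41.48, 39.14),
  on_state_border = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE  # assumption: keep strings as character
)

dim(cities)                            # number of rows, then number of columns
column_types <- sapply(cities, class)  # one class per column
column_types
```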

Indexing a data frame

Subsetting and selecting work very much like what we saw with matrices, with a few exceptions I’ll cover in a moment

oh_city_df[,1:3] # Select city names and location
oh_city_df[3,] # Select all columns for observation 3 (Cincinnati)

However, when we switch to single indexing we are indexing by column. You’ll remember that a matrix indexed with a single value is treated as one long vector and traversed column by column to locate the indexed value.

oh_city_df["nickname"] #output the city nicknames
oh_city_df[1] # returns column 1 the city_name
oh_city_df[1:3] #returns city names and location
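
A small sketch of how single indexing differs between a matrix and a data frame. The objects m and df are throwaway examples of mine, not part of the city data.

```r
# A matrix is stored column by column: columns are (1, 2), (3, 4), (5, 6)
m <- matrix(1:6, nrow = 2)
m_third <- m[3]    # third element walking down the columns: row 1, column 2

# A data frame with a single index selects whole columns instead
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
df_third <- df[3]  # the third column, still wrapped in a data frame

m_third
df_third
```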

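Try an out-of-bounds index. Notice how it returns an error, unlike a vector or a single-indexed matrix, which returns NA. Here you will receive an error saying “undefined columns selected”. Single-dimension indexing [index] selects the same columns as [,index]; the difference is that [index] always returns a data frame, while [,index] with a single column returns a plain vector.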
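
If you want to see both behaviours without stopping your script, something like the following should work. The names df and v are illustrative, and tryCatch() is only there to capture the error message instead of halting.

```r
df <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))
v <- c(10, 20, 30)

v_out <- v[5]  # a vector quietly gives NA for an out-of-bounds index

# The data frame raises an error; tryCatch() turns it into a message string
df_out <- tryCatch(df[5], error = function(e) conditionMessage(e))

one_bracket <- df[1]    # data frame with one column
comma_form  <- df[, 1]  # plain vector, because drop defaults to TRUE here

v_out
df_out
```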

We can also select our data as vectors from our dataset using the following syntax

oh_city_df$nickname
oh_city_df$population
oh_city_df$on_state_border

Try an invalid column name. Did you get what you expected? In this case the $ does not throw an invalid column name error, but returns NULL.
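
Because $ fails silently, it can be worth checking for the column yourself. A minimal sketch, with no_such_column as a deliberately made-up name:

```r
df <- data.frame(population = c(860090, 385809, 298800))

missing_col <- df$no_such_column  # no error is raised
is.null(missing_col)

# One way to guard against typos is to test the name first
has_col <- "no_such_column" %in% names(df)
has_col
```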

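BTW, did you notice something about the nickname vector’s output? If not, run this: is.factor(oh_city_df$nickname). In versions of R before 4.0.0 it was converted to a factor automatically! Since R 4.0.0 the default is stringsAsFactors=FALSE, so the column stays a character vector unless you ask for factors.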
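
Whether strings become factors depends on the stringsAsFactors argument of data.frame(), whose default changed from TRUE to FALSE in R 4.0.0. Setting it explicitly, as in this sketch, shows both outcomes regardless of your R version (as_factor and as_text are names I made up for the comparison).

```r
nickname <- c("The Arch City", "America's North Coast", "The Queen City")

as_factor <- data.frame(nickname, stringsAsFactors = TRUE)
as_text   <- data.frame(nickname, stringsAsFactors = FALSE)

is.factor(as_factor$nickname)   # strings were converted to a factor
is.character(as_text$nickname)  # strings were left as they are
levels(as_factor$nickname)      # the distinct values the factor can take
```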

Naming rows

Looking at our data frame, the row numbers seem like a waste; we’d rather have our city names instead of numbers in those positions.

We can construct our data frame by telling it that one of the columns actually holds our row names.

oh_city_df <- data.frame(city_name, longitude, latitude, population, nickname, on_state_border, row.names=1)
oh_city_df

gives us this more readable dataframe:

           longitude latitude population              nickname on_state_border
Columbus      -82.99    39.98     860090         The Arch City           FALSE
Cleveland     -81.68    41.48     385809 America's North Coast            TRUE
Cincinnati    -84.51    39.14     298800        The Queen City            TRUE

Careful! Selecting a column as the row names removes it from the data frame’s columns. If you coded using numeric indexes, your indexes may have shifted

oh_city_df["nickname"] #output the city nicknames
oh_city_df[1] # returns column 1 the longitude
oh_city_df[1:3] #returns city location and population

This is one reason it may be better to use name-based indexing: it prevents inadvertent errors with numeric indexes

oh_city_df["Cincinnati", "nickname"]
oh_city_df["Columbus", c("longitude", "latitude", "on_state_border")]
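
To confirm what row.names=1 did to the columns, you could inspect the names directly. This sketch uses a cut-down copy of the data (cities is my own stand-in) so it runs by itself.

```r
cities <- data.frame(
  city_name = c("Columbus", "Cleveland", "Cincinnati"),
  latitude = c(39.98, 41.48, 39.14),
  population = c(860090, 385809, 298800),
  row.names = 1  # column 1 (city_name) becomes the row names
)

rownames(cities)  # the city names now label the rows
names(cities)     # city_name is no longer listed as a column

# Name-based lookup does not care how many columns came before
cleveland_pop <- cities["Cleveland", "population"]
cleveland_pop
```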

As with matrices, we have the full power of logical matching at our disposal

oh_city_df[ oh_city_df$on_state_border,  ]

Gives us the cities on the state border. This simple logic works because oh_city_df$on_state_border is a vector of type logical, and it works as a row selector similar to what we saw in my earlier post on vectors. Let’s try some of the other fields

oh_city_df[ oh_city_df$latitude < 40,  ]

Returns all the fields for the cities south of 40 degrees latitude.

oh_city_df[ (oh_city_df$population > 300000) & (oh_city_df$population < 600000)
            , "population", drop=F] 

Returns the city that has a population between 300,000 and 600,000. Notice the optional drop parameter: I pass drop=F to prevent the result from being converted to a vector. Try it without that to see what happens.
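
Here is a side-by-side sketch of the two results, using a cut-down copy of the data. The names cities, mid_sized, kept and dropped are mine.

```r
cities <- data.frame(
  population = c(860090, 385809, 298800),
  latitude = c(39.98, 41.48, 39.14),
  row.names = c("Columbus", "Cleveland", "Cincinnati")
)

# Logical vector marking the rows that satisfy both conditions
mid_sized <- cities$population > 300000 & cities$population < 600000

kept    <- cities[mid_sized, "population", drop = FALSE]  # data frame, row name kept
dropped <- cities[mid_sized, "population"]                # bare numeric vector

kept
dropped
```

With drop = FALSE you keep the city name as a row label; without it you only get the number.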

Structure information

Often, data frames are much larger than this dataset and viewing the data can be difficult. Often we just want to see a basic summary. To get a general overview of a data frame you can use the str() (structure) function

str(oh_city_df)

shows us the structure of our dataframe

'data.frame':	3 obs. of  5 variables:
 $ longitude      : num  -83 -81.7 -84.5
 $ latitude       : num  40 41.5 39.1
 $ population     : num  860090 385809 298800
 $ nickname       : Factor w/ 3 levels "America's North Coast",..: 2 1 3
 $ on_state_border: logi  FALSE TRUE TRUE

Observations are our rows, variables are our columns. We can see all our columns with a sample of their data (our data set is small, so all of it is shown, except for nickname, which shows its levels and integer codes). Let’s say we want to change nickname back to the “character” type and population to an integer. How can we do that? First, let’s find the source of the problem by examining our original vectors

class(population)
class(nickname)

population was numeric to begin with, but nickname was converted from character to a factor by the constructor. So doing this

population <- as.integer(population) #convert from numeric to integer
oh_city_df <- data.frame(city_name, longitude, latitude, population, nickname, 
                          on_state_border, row.names=1)

str(oh_city_df)

This fixes population, but as.character(nickname) does not change our input at all (it is already a character vector), so calling it in the constructor will not help. To fix this we need to pass stringsAsFactors=F to data.frame() (this is the default since R 4.0.0).
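
Putting that together, a sketch of two ways to end up with a character nickname column: asking the constructor not to convert, or converting an existing factor column afterwards. The names fixed_df and factor_df are mine, and the data is trimmed to three columns to keep it short.

```r
city_name <- c("Columbus", "Cleveland", "Cincinnati")
population <- as.integer(c(860090, 385809, 298800))
nickname <- c("The Arch City", "America's North Coast", "The Queen City")

# Option 1: ask the constructor to leave strings alone
fixed_df <- data.frame(city_name, population, nickname,
                       row.names = 1, stringsAsFactors = FALSE)

# Option 2: convert an existing factor column back to character
factor_df <- data.frame(city_name, population, nickname,
                        row.names = 1, stringsAsFactors = TRUE)
factor_df$nickname <- as.character(factor_df$nickname)

str(fixed_df)
```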

Beware! Passing as.integer(population) directly as a column does not produce a column called population: the constructor derives the name from the expression, giving as.integer.population. instead. This is why I converted population before passing it to the data frame constructor
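
A quick way to see what the constructor does with an inline expression, and how naming the argument avoids the problem. The names unnamed and named are just labels I chose for the two results.

```r
population <- c(860090, 385809, 298800)

# Inline expression: the column name is built from the expression text
unnamed <- data.frame(as.integer(population))
names(unnamed)

# Naming the argument gives the column the name we actually want
named <- data.frame(population = as.integer(population))
names(named)
class(named$population)
```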

Ordering Data Frames

At the end of Part 3: Vector Operations I briefly discussed using the order() function on a vector to get a vector of indices in order. We can apply what we have learned there to data frames as well

oh_city_df[ order(oh_city_df$latitude), ]

Gives us our cities ordered by latitude. This works simply because the row selector receives the ordered indices. If it’s not clear, run order(oh_city_df$latitude) and notice that it returns the indices 3, 1, 2
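
The same idea extends to descending order through the decreasing argument of order(). A minimal sketch on a cut-down copy of the data (cities, south_to_north and largest_first are names I picked):

```r
cities <- data.frame(
  latitude = c(39.98, 41.48, 39.14),
  population = c(860090, 385809, 298800),
  row.names = c("Columbus", "Cleveland", "Cincinnati")
)

lat_order <- order(cities$latitude)  # row indices from south to north
lat_order

south_to_north <- cities[lat_order, ]
largest_first  <- cities[order(cities$population, decreasing = TRUE), ]

south_to_north
largest_first
```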
