3 Data Structures
A data structure in R is an object that holds one or more data objects, each of which has a data type such as those encountered in section 1 (numeric, character, etc.). In this script we introduce vectors, factors, matrices, data frames and lists. The examples and exercises should help you to understand better how R holds and manages data.
3.1 Vectors
A vector is a series of homogeneous values of a variable (e.g. foods from an HCES survey). The easiest way to form a vector of values in R is with the "combine" function c(). An example of a vector of character values (food_names) is shown below, followed by vectors of numeric, logical and mixed values:
# Create a vector of character values
food_names <- c(
  "Rice",
  "Maize",
  "Beans",
  "Cassava",
  "Potatoes",
  "Sweet potatoes",
  "Wheat"
)
# Create a vector of numeric values
consumption <- c(0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01)
# Create a vector of logical values
is_staple <- c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
# Create a vector of mixed values (R coerces them all to character)
mixture <- c(5.2, TRUE, "CA")
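Because a vector is homogeneous, c() converts mixed inputs to a single type. The following is a small sketch of that coercion using the mixture values; num_mix is a name introduced here only for illustration.

```r
# c() coerces mixed inputs to the most flexible type present
mixture <- c(5.2, TRUE, "CA")
class(mixture)   # "character"
print(mixture)   # every element is now a string: "5.2" "TRUE" "CA"

# Without the string, the logical value is coerced to numeric instead
num_mix <- c(5.2, TRUE)
class(num_mix)   # "numeric"
print(num_mix)   # TRUE has become 1
```

If values of different types need to be kept together without coercion, a list or a data frame is the usual choice.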
3.2 Data frames vs tibbles
In R, data frames and tibbles are two common data structures used to store tabular data. While they are similar in many ways, there are some important differences to keep in mind.
3.2.1 Data Frames
Data frames are a built-in R data structure used to store tabular data. They are similar to matrices, but with the added ability to store columns of different data types. Data frames are created using the data.frame() function, and can be manipulated using a variety of built-in R functions.
3.2.2 Tibbles
Tibbles are a newer data structure provided by the tibble package, which is part of the tidyverse collection of packages. They are similar to data frames, but with some important differences. Tibbles are created using the tibble() function, and can be manipulated using a variety of tidyverse functions as well as most functions that work on data frames.
One of the main differences between data frames and tibbles is how they handle column names. In both, column names are stored as a character vector and columns can be accessed using the $ operator. However, data.frame() by default alters names that are not syntactically valid (for example, it replaces a space with a dot), and $ on a data frame accepts an unambiguous partial name. A tibble keeps column names exactly as given, and $ requires the full name, returning NULL with a warning otherwise.
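The difference in column-name handling can be seen in a short sketch. The column name `food name` and the objects df and tb are hypothetical, chosen only to illustrate the behaviour; the tibble package is assumed to be installed.

```r
library(tibble)

# Hypothetical two-row tables with a column name containing a space
df <- data.frame(`food name` = c("Rice", "Maize"), consumption = c(0.5, 0.4))
tb <- tibble(`food name` = c("Rice", "Maize"), consumption = c(0.5, 0.4))

names(df)   # "food.name" "consumption": the space was replaced by a dot
names(tb)   # "food name" "consumption": the name is kept as typed

# $ on a data frame accepts an unambiguous prefix of a column name
df$cons     # returns the consumption column

# $ on a tibble needs the full name; a prefix gives NULL (and a warning)
suppressWarnings(tb$cons)
```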
Another difference between data frames and tibbles is how they handle subsetting. In a data frame, subsetting using the [ ] operator can sometimes lead to unexpected results: selecting a single column returns a vector rather than a data frame. In a tibble, subsetting is more consistent and predictable: [ ] always returns a tibble, and a single column is extracted as a vector using the [[ ]] or $ operator. Rows and columns can also be selected with user-friendly dplyr functions, e.g. filter() and select().
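A minimal sketch of these subsetting rules follows, assuming the tibble and dplyr packages are installed. The objects df, tb, high_tb and names_tb are illustrative names, not ones used elsewhere in this section.

```r
library(tibble)
library(dplyr)

df <- data.frame(
  food_names = c("Rice", "Maize", "Beans"),
  consumption = c(0.5, 0.4, 0.3)
)
tb <- as_tibble(df)

# Selecting one column with [ ] drops a data frame to a plain vector
class(df[, "consumption"])   # "numeric"

# The same call on a tibble still returns a tibble
class(tb[, "consumption"])   # "tbl_df" "tbl" "data.frame"

# [[ ]] extracts the column as a vector from a tibble
tb[["consumption"]]

# dplyr verbs: filter() keeps rows, select() keeps columns
high_tb <- filter(tb, consumption > 0.3)
names_tb <- select(tb, food_names)
```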
Overall, while data frames and tibbles are similar in many ways, tibbles offer some important advantages over data frames, especially when working with the tidyverse packages.
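Let us make a data frame using the data.frame() function. We use values like those in the vectors created above as the columns, with "Maize" appearing twice in place of "Sweet potatoes". Note that the columns should all be of the same length: data.frame() recycles a shorter vector only when its length divides evenly into that of the longest one, and otherwise stops with an error rather than filling the gap with NA values. Setting stringsAsFactors = TRUE stores the food_names column as a factor.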
# Create a data frame
food_df <- data.frame(
  food_names = c(
    "Rice",
    "Maize",
    "Beans",
    "Cassava",
    "Potatoes",
    "Maize",
    "Wheat"
  ),
  consumption = c(0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01),
  is_staple = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE),
  stringsAsFactors = TRUE
)
# Print the data frame
print(food_df)
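To check how data.frame() stored each column, and what happens when column lengths do not match, a short sketch such as the following may help. It uses a smaller illustrative data frame (small_df, a name introduced here) so that it runs on its own.

```r
small_df <- data.frame(
  food_names = c("Rice", "Maize", "Beans"),
  consumption = c(0.5, 0.4, 0.3),
  stringsAsFactors = TRUE
)

# str() shows the type of every column
str(small_df)

# With stringsAsFactors = TRUE the character column becomes a factor
class(small_df$food_names)    # "factor"
levels(small_df$food_names)   # "Beans" "Maize" "Rice"

# Lengths 3 and 2 do not divide evenly, so data.frame() raises an error;
# tryCatch() captures the message instead of stopping the script
mismatch <- tryCatch(
  data.frame(a = 1:3, b = 1:2),
  error = function(e) conditionMessage(e)
)
print(mismatch)
```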
Let us make a tibble using the tibble() function, with the same values as the columns. Note that the columns must all be of the same length: tibble() recycles only vectors of length one, and stops with an error for any other mismatch rather than filling the gap with NA values.
# Create a tibble
food_tb <- tibble::tibble(
  food_names = c(
    "Rice",
    "Maize",
    "Beans",
    "Cassava",
    "Potatoes",
    "Maize",
    "Wheat"
  ),
  consumption = c(0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01),
  is_staple = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)
# Print the tibble
print(food_tb)
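The sketch below illustrates two tibble behaviours that differ from data.frame(): character columns stay as character, and only length-one vectors are recycled. The tibble small_tb and its survey column are hypothetical and serve only as an illustration; the tibble package is assumed to be installed.

```r
library(tibble)

small_tb <- tibble(
  food_names = c("Rice", "Maize", "Beans"),
  consumption = c(0.5, 0.4, 0.3),
  survey = "HCES"   # hypothetical length-one value, recycled to every row
)

# Character columns are not converted to factors
class(small_tb$food_names)   # "character"
small_tb$survey              # "HCES" "HCES" "HCES"

# data.frame() recycles 1:2 to fill four rows ...
nrow(data.frame(a = 1:4, b = 1:2))   # 4

# ... but tibble() refuses any mismatch other than length one
tb_result <- tryCatch(
  tibble(a = 1:4, b = 1:2),
  error = function(e) "error"
)
print(tb_result)   # "error"
```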
3.3 Factors
Note that a factor is actually a vector, but with an associated set of levels, which by default are presented in sorted (alphanumeric) order. Factors are used by R functions such as lm(), which fits linear models, including the analysis of variance. We shall see below how factors can be used in data frames.
Let us create a factor from a vector of character values. We can do this using the factor() function. The first argument is the vector of character values, and the second (levels) is the set of levels. If we don't specify the levels, R will use the unique values in the vector, sorted in alphabetical order.
3.3.1 Coercing a vector to a factor
Example of converting the food_names vector to a factor:
# Create a factor without providing the levels argument
food_names_factor_1 <- factor(food_names)
# Print the factor
print(food_names_factor_1)
# Create a factor from a vector of character values, with explicit levels
food_names_factor_2 <- factor(
  food_names,
  levels = c(
    "Rice",
    "Maize",
    "Beans",
    "Cassava",
    "Potatoes",
    "Sweet potatoes",
    "Wheat"
  )
)
# Print the factor
print(food_names_factor_2)
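The two factors hold the same values but order their levels differently. A brief sketch of how to inspect this is given below; default_f, custom_f and partial_f are names introduced here for illustration, and the food_names vector is repeated so that the example runs on its own.

```r
food_names <- c("Rice", "Maize", "Beans", "Cassava",
                "Potatoes", "Sweet potatoes", "Wheat")

default_f <- factor(food_names)                      # levels sorted by R
custom_f  <- factor(food_names, levels = food_names) # levels in the given order

levels(default_f)   # starts "Beans" "Cassava" "Maize" ...
levels(custom_f)    # starts "Rice" "Maize" "Beans" ...

# Internally each value is stored as the position of its level
as.integer(default_f)   # 5 3 1 2 4 6 7
as.integer(custom_f)    # 1 2 3 4 5 6 7

# A value that is missing from the supplied levels becomes NA
partial_f <- factor(c("Rice", "Yam"), levels = "Rice")
is.na(partial_f)        # FALSE TRUE
```

The level order matters in practice because modelling and plotting functions generally treat the first level as the reference or first category.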
3.3.2 Coercing a vector to a factor in a data frame
Example of converting the food_names column of a data frame (here the tibble food_tb) to a factor:
library(dplyr)
# Use the food_tb tibble created above and convert the food_names column to a factor
food_tb <- mutate(food_tb, food_names = factor(food_names))
# Print the tibble
print(food_tb)
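mutate() replaces a column when the name on the left-hand side already exists, and adds a new column otherwise. The following is a minimal sketch of both cases, assuming dplyr and tibble are installed; demo_tb and the column name food_factor are hypothetical and used only for illustration.

```r
library(dplyr)
library(tibble)

demo_tb <- tibble(
  food_names = c("Rice", "Maize", "Beans"),
  consumption = c(0.5, 0.4, 0.3)
)

# Reusing an existing name replaces that column with the factor version
overwritten <- mutate(demo_tb, food_names = factor(food_names))
ncol(overwritten)               # 2
class(overwritten$food_names)   # "factor"

# A new name adds a column and leaves the original untouched
added <- mutate(demo_tb, food_factor = factor(food_names))
ncol(added)                     # 3
class(added$food_names)         # "character"
class(added$food_factor)        # "factor"
```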
3.4 Summary
There are other data structures in R, e.g. matrices and lists, but those covered here are the most common. In future sections we will look at some of the operations we can perform on vectors and data frames.
But first, note that we introduced the dplyr package above. This package provides a set of functions for manipulating data frames, and we will use it extensively in this book. We used its mutate() function, which adds a new column to a data frame or replaces an existing one. In this case the result is named food_names, so it replaces the existing food_names column with a factor version of itself. This means we introduced a new function, mutate(), and a new package, dplyr.
In the next section we define what packages and functions are.