Data type and structure

Eduardo Grajales
Tuesday, Apr 6, 2021

You have already seen how to work in R and RStudio: you have learned to set your working directory, to install and load libraries, and what scripts are and how useful they are for working in R.

In other words, you have taken your first steps in R: you have increased your fitness by 1% and you have 1% more probability in your favor. If you have not read that post yet, here I leave it … and now, a bit of theory.

A promise is a debt. In the post titled What do I need to started in R?, you saw that there are 7 types of data (integer, numeric, logical, character, factor, NA and NULL). In this post we will start by checking data types, that is, you will find out which types of data R is reading.

To do this you need a dataset to study. R ships with several datasets for practice; this time we will use iris, which contains the length and width of the sepals and petals of three iris species. Later you will learn how to load your own file into R.

Data exploration

We will start with a short summary of data exploration, but keep in mind that we will go deeper in a future post. Base R has its own functions for this task, such as head() and tail(), among others, and there are many more functions for exploring and transforming data properly.

Let’s start by exploring the iris dataset a bit and see what it is made of.

Note: R skips everything that comes after the # sign in our code, just like when your sweetie skips your good morning messages.

head(iris) # Used to display first 6 rows of the data.

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(iris) # Shows the last 6 rows instead

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

We also have the View() function, which is quite useful. In fact, I think it should be the first function you use after loading data into R, because it opens the data in a new window where you can make a first inspection. Let’s do the next exercise and use this function:

iris
View(iris)

Did you notice the difference between calling only iris and using View(iris)? I suppose you could see how much easier it is to inspect data in an RStudio window than in the console.
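Besides head(), tail() and View(), base R has a few more functions that help with a first inspection. The lines below are only a small sketch of some of them, using the same iris dataset:

```r
dim(iris)      # number of rows and columns: 150 and 5
names(iris)    # names of the columns
str(iris)      # compact overview: type and first values of each column
summary(iris)  # basic statistics for each column
```

You do not need all of them every time; pick the ones that answer the question you have about your data.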

At this point I would like to clarify something: there are also libraries specialized in data exploration.

For example, the tidyverse, which in turn contains a complete set of libraries including dplyr, is one of the most used by data scientists because it is easy to handle and because of what they call its “learning curve”, which here refers to the amount of information available and the number of people studying it.

Another very useful function for finding out which types of data our dataset contains is class(). It prints the type of an object in the console, which is hard to tell at first glance. If we run this check before starting our analysis, we will avoid problems with the many R libraries that are strict about the type of data they accept.

Attention attention!!!
We use the <- symbol to assign variables, or in other words, to name things. If you press Alt + - in RStudio, this symbol will appear.

number <- 357 # It is a number, so we expect to get "numeric"
class(number)

## [1] "numeric"

name <- "Aotus" # Aotus is in quotes, so we expect R to read it as character
class(name)

## [1] "character"

combined <- c(2, 1.62, "allele", "HardyWeinberg")
# The c() function is used to combine (group) values
class(combined)

## [1] "character"

What happened to our variable called combined?

Here we can see what happened: we entered different types of data, clearly numeric (2 and 1.62) and character (“allele”, “HardyWeinberg”), but the function reported that they were all character.

Why is this happening?

This happens implicitly in R through something called coercion; that is, data is forced to change into another type of data, or as the old saying goes: “Whoever walks among honey, something sticks.”

This coercion follows a hierarchy, that is, R transforms our data according to an established order, not at random. We can see the order below:

logical -> integer -> numeric -> character
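To see this order in action, here is a minimal sketch that mixes two types at a time inside c() and asks for the class of the result. The values are just illustrative:

```r
class(c(TRUE, 2L))        # logical and integer   give "integer"
class(c(2L, 1.62))        # integer and numeric   give "numeric"
class(c(1.62, "allele"))  # numeric and character give "character"

# Mixing everything ends at the last step of the hierarchy
c(TRUE, 2L, 1.62, "allele")
```

In each case the result takes the type that sits further to the right in the hierarchy, which is why a single piece of text turns a whole vector into character.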

Now that you know what happened and that it can happen in your projects, you can see why it is important to review your data before running code from a library. Keep in mind that this coercion happens without us asking for it, but that is not always the case: we can also transform data ourselves, as long as it makes sense to do so, using the as() family (we will see it later).

Continuing with our exercises, here is one more example using the class() function, this time together with lapply() (we will look at it in depth in another post; if you want to know how it works, type ?lapply in your console), which applies class() to each column of the dataset. The goal is to explore which types of data we have in our data frame.

lapply(iris, class)

## $Sepal.Length
## [1] "numeric"
## 
## $Sepal.Width
## [1] "numeric"
## 
## $Petal.Length
## [1] "numeric"
## 
## $Petal.Width
## [1] "numeric"
## 
## $Species
## [1] "factor"

As you can see, with these two functions working hand in hand we can find out which types of data our data frame contains, which greatly facilitates our workflow.
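If the list printed by lapply() feels long, sapply() is a close relative that tries to simplify the result. The sketch below stores the result in col_types, a name chosen only for this example:

```r
# sapply() returns a named character vector here, one entry per column
col_types <- sapply(iris, class)
col_types

# Names of the columns that R reads as numeric
names(col_types)[col_types == "numeric"]
```

Both functions give the same information; the compact form is simply easier to scan when a dataset has many columns.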

Data transformation or forced coercion

Now, with everything you already know, you will understand data transformation better through examples, so let’s get to it, little friend:

Attention attention!!!
We use the $ symbol in R to “call” a column of our dataset.

This example will make it a little clearer. We have a dataset called Project in which the places where the work was done were numbered instead of named, what a problem! First we look at the first rows, and then we check the class of the Place column, as in this code:

head(Project)

##   Place Sepal.Width Petal.Length Petal.Width Species
## 1     5         3.5          1.4         0.2  setosa
## 2     5         3.0          1.4         0.2  setosa
## 3     5         3.2          1.3         0.2  setosa
## 4     5         3.1          1.5         0.2  setosa
## 5     5         3.6          1.4         0.2  setosa
## 6     5         3.9          1.7         0.4  setosa

class(Project$Place)

## [1] "numeric"

As you can see, R identifies this column as numeric, but we already know that these values are not numbers but site names. For that reason we have to transform them so that R does not treat them as quantities and perhaps produce an error in our analysis. To do this we use the as.character() function, as in the next example:

char <- as.character(Project$Place)
class(char)

## [1] "character"

And that’s it, mission accomplished!
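The code above stores the converted values in a separate object. If you would rather change the column inside the data frame, you can assign the result back with $. The sketch below uses project_demo, a small made-up data frame that only stands in for the Project data:

```r
# Hypothetical stand-in for the Project data: sites coded as numbers
project_demo <- data.frame(Place = c(5, 5, 7),
                           Petal.Length = c(1.4, 1.3, 4.7))

class(project_demo$Place)   # "numeric"

# Overwrite the column with its character version
project_demo$Place <- as.character(project_demo$Place)

class(project_demo$Place)   # "character"
project_demo$Place          # "5" "5" "7"
```

Whether you overwrite the column or keep a separate copy is up to you; overwriting keeps everything in one place, while a copy leaves the original data untouched.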

But… so I can only transform from numeric to character?

No, you can perform other types of coercion too, depending on what you need, using the functions in the next table:

Function	Resulting type
as.integer()	Integer
as.numeric()	Numeric
as.factor()	Factor
as.logical()	Logical
as.null()	Null

How do you use them? Exactly as we did with as.character().
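Here is a short sketch of a few of these functions at work. The input values are made up for illustration, and the last line shows what happens when a conversion does not make sense:

```r
as.numeric("3.14")           # text that looks like a number becomes 3.14
as.integer(3.9)              # decimals are dropped, not rounded: 3
as.logical(c("TRUE", "no"))  # "no" is not recognised: TRUE NA
as.factor(c("a", "b", "a"))  # a factor with levels a and b
as.numeric("Aotus")          # cannot be converted: NA, with a warning
```

The NA in the last two cases is R telling you that it could not make sense of the value, so it is worth checking for NA after any forced coercion.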

Conclusion on data type and transformation.

So far you have learned to identify data types using the class() function, and you have seen the coercion that R applies automatically when you create a vector from different types of data. You also learned about the as() family, which lets you force coercion, that is, transform data into the type you need.

Additionally, you learned about basic data exploration using the View(), head() and tail() functions, which will save you a lot of trouble with R by showing you how your dataset is organized. You also saw that in R there are many ways to do the same thing, so do not worry: stick with the one you like best.

Data structure

Data structures are objects that contain one or several types of data. They differ in their number of dimensions and in whether they are homogeneous (a single data type) or heterogeneous (several data types).

Homogeneous	Characteristic	Heterogeneous	Characteristic
Vector (one dimension)	Collection of one or more values of the same type. Its length is the number of values it contains.	List (one dimension)	Like a vector, it has only one dimension, but its elements can be of different types and can even be other structures.
Matrix (two dimensions)	A vector arranged in two dimensions, rows and columns. It can only contain data of a single type.	Data frame (two dimensions)	It has two dimensions and can contain data of different types. It is the most common structure for data analysis (a more flexible version of a matrix).
Array (n dimensions)	Same characteristics as a matrix, but it can have more than two dimensions.

Now, we will see how each of these structures works with useful examples.

Vectors

Vectors are the simplest data structure in R. As we saw in the previous table, the length of a vector is the number of values it contains, and it can hold only one data type.

How can you tell that you have a vector and how can you create vectors?

You already know the class() function, which tells us the data type and sometimes the structure of an object. However, the is() family of functions is better suited to checking the data structure, as in the next code:

3

## [1] 3

is.vector(3)

## [1] TRUE

class(3)

## [1] "numeric"

As you can see, the two functions give different results: class() reports the type of data, while is.vector() returns a logical value telling us whether the object is a vector (TRUE) or not (FALSE). This kind of answer is known as Boolean. Notice also that in R even a single number is a vector of length one.

Another example:

three <- 3
class(three)

## [1] "numeric"

is.vector(three)

## [1] TRUE
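The is() family has one member for almost every type and structure. As a small sketch, here are a few of them applied to a vector named value (a name used only for this example) and to the iris data frame:

```r
value <- c(2, 1.62)

is.numeric(value)     # TRUE
is.character(value)   # FALSE
is.vector(value)      # TRUE
is.vector(iris)       # FALSE: iris is a data frame, not a plain vector
is.data.frame(iris)   # TRUE
```

Each call answers a single yes or no question, which makes these functions handy for checking your data before passing it to another function.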

Creating a vector is very simple; in fact, you have been seeing it all along. We need the <- operator and, if you are going to combine several values of the same type, the c() function. There is also something very important to learn: how to name your data structures. The important thing is that the names mean something to you and that someone else can understand them.

More examples:

num <- c(1233,555,88,99,17)
class(num)

## [1] "numeric"

is.vector(num)

## [1] TRUE

nam <- c("Emma", "Charlotte", "Sophia", 
          "Aaron", "Caleb", "Wallace")
class(nam)

## [1] "character"

is.vector(nam)

## [1] TRUE

Now it’s your turn to practice: create different vectors with different types of data and explore everything you can do. For example, try adding two numeric vectors and see what happens, as in the next example:

ex1 <- c(112,667,99,56,47,12)
ex2 <- c(22,65,23.8,99,101,41)

total <- ex1 + ex2 # named total to avoid confusion with the built-in sum() function
total

## [1] 134.0 732.0 122.8 155.0 148.0  53.0

What happened? It is up to you to describe what happened as well as you can. And what if the vectors do not have the same number of values? Tell us!
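If you want to check your answer to that last question, here is a small sketch. The names long and short are made up for this example:

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 20)

# R repeats (recycles) the shorter vector until it matches the longer one
long + short   # 11 22 13 24
```

This works silently because 4 is a multiple of 2. If the length of the longer vector is not a multiple of the shorter one, R still recycles, but it prints a warning so that you can check whether that is what you wanted.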

Matrices and arrays

These structures are used in mathematics and statistics. What’s more, matrices are one of the most common structures required by many of the packages we usually use in biology, and also one of the reasons why we despair with R and want to hit the keyboard like crazy.

Now, how to recognize and create matrices?

It is as easy as writing matrix(). Besides the data, this function accepts two arguments, nrow and ncol, to indicate the number of rows and columns of the matrix you are going to generate.

We can also call it without specifying the number of rows and columns. For this example you are going to generate a matrix with the numbers from 1 to 12, written as 1:12:

matrix(1:12)

##       [,1]
##  [1,]    1
##  [2,]    2
##  [3,]    3
##  [4,]    4
##  [5,]    5
##  [6,]    6
##  [7,]    7
##  [8,]    8
##  [9,]    9
## [10,]   10
## [11,]   11
## [12,]   12

Here R automatically gives us a matrix with 12 rows and one column. But suppose you want 6 rows and 2 columns; for that we write the next code:

matrix(1:12, nrow = 6, ncol = 2)

##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
## [3,]    3    9
## [4,]    4   10
## [5,]    5   11
## [6,]    6   12
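Notice that R filled the matrix column by column. As a small sketch, the byrow argument of matrix() changes that, and dim() lets you confirm the size of the result. The name m is used only for this example:

```r
# Fill the matrix row by row instead of column by column
m <- matrix(1:12, nrow = 6, ncol = 2, byrow = TRUE)
m

dim(m)    # 6 2: six rows and two columns
m[2, 1]   # value in row 2, column 1: 3
```

Which filling order you want depends on how your numbers were written down, so it is worth printing the matrix once to make sure the values landed where you expected.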

Now create a matrix with the numbers from 1 to 12 (1:12) with an nrow of 6 and an ncol of 4, and see what happens.

We can also create matrices by joining vectors with the cbind() (join as columns) and rbind() (join as rows) functions, as follows:

vec1 <- c(1:5)
vec2 <- c(10:15)
vec3 <- c(20:25)
vec4 <- c(35:40)

matrix1 <- cbind(vec1, vec2, vec3, vec4)
matrix1

##      vec1 vec2 vec3 vec4
## [1,]    1   10   20   35
## [2,]    2   11   21   36
## [3,]    3   12   22   37
## [4,]    4   13   23   38
## [5,]    5   14   24   39
## [6,]    1   15   25   40

matrix2 <- rbind(vec1, vec2, vec3, vec4)
matrix2

##      [,1] [,2] [,3] [,4] [,5] [,6]
## vec1    1    2    3    4    5    1
## vec2   10   11   12   13   14   15
## vec3   20   21   22   23   24   25
## vec4   35   36   37   38   39   40

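As you can see, these functions are very useful and let us carry out specific tasks. Notice, though, that vec1 has only five values while the other vectors have six, so R recycled its first value (1) to fill the sixth position and printed a warning about it.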

With matrices you can also perform arithmetic operations.

matrix1 + 1

##      vec1 vec2 vec3 vec4
## [1,]    2   11   21   36
## [2,]    3   12   22   37
## [3,]    4   13   23   38
## [4,]    5   14   24   39
## [5,]    6   15   25   40
## [6,]    2   16   26   41

matrix1 * 2

##      vec1 vec2 vec3 vec4
## [1,]    2   20   40   70
## [2,]    4   22   42   72
## [3,]    6   24   44   74
## [4,]    8   26   46   76
## [5,]   10   28   48   78
## [6,]    2   30   50   80

matrix2 ^ 3

##       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
## vec1     1     8    27    64   125     1
## vec2  1000  1331  1728  2197  2744  3375
## vec3  8000  9261 10648 12167 13824 15625
## vec4 42875 46656 50653 54872 59319 64000

As you can see, the arithmetic operation is applied to each value in the matrix.
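Arrays work the same way as matrices, only with more dimensions. As a minimal sketch, the array() function takes the data and a dim argument with the size of each dimension. The name arr and the sizes are chosen only for this example:

```r
# 24 values arranged as 2 rows, 3 columns and 4 layers
arr <- array(1:24, dim = c(2, 3, 4))

dim(arr)       # 2 3 4
arr[, , 1]     # the first layer, a 2 x 3 matrix
arr[2, 3, 4]   # row 2, column 3, layer 4: 24
class(arr)     # "array"
```

You can think of this array as four matrices stacked one behind the other, each one reached through the third index.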

Data frames

Now you will see the data structure that we really use the most in biology: data frames. But… why is it the most used? Because it is the structure that best suits our needs. It is also the way in which we usually save our data in Excel and then load it into R as a .csv file.

Generally, rows in a data frame represent individuals or observations, and columns represent attributes, traits or variables.

What does a data frame look like?

The iris data frame that we have been using is a clear example of the organization and data types that a data frame can contain.

But, how can I create a data frame in RStudio?

Quickly and easily: we use the data.frame() function and give it a series of vectors that will make up our data frame. In the next example we enter 4 vectors of different data types; this is one of the advantages of a data frame, it can hold different types of data.

Attention attention!!!
Remember that data frames require variables of the same length. For this reason, we have to make sure that every vector has the same number of values.

my_df <- data.frame(
  "integer" = 1:4, 
  "factor" = c("a", "b", "c", "d"), #  Note that characters must be between ""
  "number" = c(1.2, 3.4, 4.5, 5.6),
  "chain" = as.character(c("a", "b", "c", "d")))
  
my_df

##   integer factor number chain
## 1       1      a    1.2     a
## 2       2      b    3.4     b
## 3       3      c    4.5     c
## 4       4      d    5.6     d

So our data frame has been created successfully. Again, we invite you to play with these functions and become better friends with them.

Finally, we can use the str() function to see the structure of the data in a data frame, as shown below:

str(my_df)

## 'data.frame':    4 obs. of  4 variables:
##  $ integer: int  1 2 3 4
##  $ factor : chr  "a" "b" "c" "d"
##  $ number : num  1.2 3.4 4.5 5.6
##  $ chain  : chr  "a" "b" "c" "d"

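Now you can see the structure of our data: 4 observations of 4 variables, with the data type of each column or variable. Note that the column named factor is stored as chr (character), not as a factor: since R 4.0.0, data.frame() no longer converts text to factors by default, so you would need as.factor() for that.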
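If you do want a text column to behave as a factor, you can convert it after creating the data frame. The sketch below uses df_demo, a made-up data frame for illustration, and sets stringsAsFactors = FALSE explicitly so that it behaves the same in older versions of R:

```r
df_demo <- data.frame(id = 1:4,
                      group = c("a", "b", "a", "b"),
                      stringsAsFactors = FALSE)

class(df_demo$group)    # "character"

# Convert the text column into a factor
df_demo$group <- as.factor(df_demo$group)

class(df_demo$group)    # "factor"
levels(df_demo$group)   # "a" "b"
str(df_demo)
```

After the conversion, str() shows the column as a Factor with 2 levels, which is what many statistical functions expect for grouping variables.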

Lists

To end today’s nutritious post, we will talk about lists. Lists are one-dimensional data structures: they only have length. Unlike vectors, each of their elements can be of a different type or even a different class, which is why they are heterogeneous structures. A list can contain atomic data, vectors, matrices, arrays and data frames, and its length is equal to the number of elements it contains, regardless of their type or class.

To create a list we use the list() function, which takes the elements we want to include. For this structure, the dimensions or length of the elements do not matter, as we see below:

my_vector <- 1:10
my_matrix <- matrix(1:4, nrow = 2)
my_df     <- data.frame("num" = 1:3, "letter" = c("a", "b", "c"))

my_list <- list("a_vector" = my_vector, "a_matrix" = my_matrix, "a_df" = my_df)

my_list

## $a_vector
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $a_matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## $a_df
##   num letter
## 1   1      a
## 2   2      b
## 3   3      c

As we can see, our data has been stored in the list, which shows us how it is organized. Now we will call a single element of the list by typing the name of the list followed by the $ (dollar sign) operator, as in the next code:

my_list$a_vector # We call the "a_vector" element

##  [1]  1  2  3  4  5  6  7  8  9 10

Attention attention!!!
Please note: arithmetic operations are not vectorized over a list! For example, my_list * 2 returns an error.
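There are two common ways around this limitation: take one element out of the list and operate on it, or use lapply() to apply the operation to every element. The following is only a sketch, and number_list is a made-up list for this example:

```r
number_list <- list(first = 1:3, second = c(10, 20))

# Option 1: extract one element with [[ ]] (same result as $) and operate on it
number_list[["first"]] * 2   # 2 4 6

# Option 2: apply the operation to every element; the result is again a list
doubled <- lapply(number_list, function(x) x * 2)
doubled$second               # 20 40

# number_list * 2 would stop with an error
```

The second option only makes sense when every element of the list accepts the operation, which is the case here because both elements are numeric.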

Summary

Ready!!!

We have finished this interesting and nutritious post, full of things to learn and many others to digest. We went from a brief look at data exploration (which we will discuss more thoroughly in our next installments) to data types and structures. You now know that there are numeric and character data types, among others, and that data is organized in structures such as vectors, matrices, data frames and lists.

With this information we hope you will get to know your data better and see how R helps us handle it. Do not be afraid to play with the functions and data frames that R provides. It does not matter if you get errors, that is normal, we all go through them, and if you need something, we will be here!!!
