3 Enter The Tidyverse: What’s Next?

3.1 Refresher

Previously, we learned what a tibble is and how it stores data in observations/rows and variables/columns. We also learned about seven fundamental data manipulation functions that allow us to edit these tibbles.

  • filter() - selects observations based on a criterion/predicate
  • slice() - selects observations based on their position (row number)
  • arrange() - re-orders rows based on a variable
  • select() - picks certain variables
  • mutate() - changes variables or creates new ones
  • summarise() - aggregates variables using summary functions
  • group_by() - groups a dataset by one or more variables
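
As a quick refresher of how these combine, here's a small illustrative pipeline (just a sketch using dplyr's built-in starwars dataset rather than one of our own, and assuming the tidyverse is already loaded as in the previous lesson):

starwars %>%
  filter(!is.na(mass)) %>%                   # filter(): keep rows with a recorded mass
  select(name, species, height, mass) %>%    # select(): pick the variables we need
  mutate(height_m = height / 100) %>%        # mutate(): create a new variable
  group_by(species) %>%                      # group_by(): group the data by species
  summarise(mean_mass = mean(mass)) %>%      # summarise(): aggregate within each group
  arrange(desc(mean_mass))                   # arrange(): heaviest species first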

3.2 Pivoting

In general, datasets come in one of two forms: long or wide. Long data has many rows and fewer columns, and so it looks longer (vertically) if shown in a spreadsheet. Wide data has more columns and fewer rows (hence, it is wider). The same data can be represented either way, and each form has its advantages.
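
For a tiny made-up illustration (the names and numbers here are invented), the same measurements can be stored either way:

# Wide: one row per person, one column per measurement
wide <- tibble(person = c("Ana", "Bo"), height = c(165, 180), weight = c(60, 75))

# Long: one row per person-and-measurement pair
long <- tibble(
  person  = c("Ana", "Ana", "Bo", "Bo"),
  measure = c("height", "weight", "height", "weight"),
  value   = c(165, 60, 180, 75)
)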

This terminology might still seem a little abstract, so we're going to jump into an example dataset which records how songs moved through the charts in the year 2000. We're given the date when the song entered the chart (which is always a Saturday) and then its ranking within the charts for the next 76 weeks. This data is in wide format.

billboard
## # A tibble: 317 x 79
##    artist track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>  <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac  Baby~ 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+h~ The ~ 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doo~ Kryp~ 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doo~ Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 B~ Wobb~ 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0   Give~ 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Tee~ Danc~ 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliy~ I Do~ 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliy~ Try ~ 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams~ Open~ 2000-08-26      76    76    74    69    68    67    61    58
## # ... with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## #   wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## #   wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## #   wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## #   wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>,
## #   wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>

Each row represents one song, so our observations are individual songs along with their chart history. The position in the charts for each week since the song entered is stored in a separate variable (wk1, wk2, and so on).

3.2.1 pivot_longer()

But we might want each observation to represent one week for a track, so we can turn this data from wide to long format using the pivot_longer() function. The data currently has 317 rows and 76 week columns, so the resulting dataset should have 317 x 76 = 24,092 rows, since each track will now appear across 76 rows.

billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week_num",
    values_to = "position"
  )
## # A tibble: 24,092 x 5
##    artist track                   date.entered week_num position
##    <chr>  <chr>                   <date>       <chr>       <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1            87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2            82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3            72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4            77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5            87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6            94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7            99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8            NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9            NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10           NA
## # ... with 24,082 more rows

So we used the pivot_longer() function with four arguments. Remember that the first argument is our dataset being piped in. Then we've given a name to the new variable that will hold the previous variable names (names_to), and similarly for the values we are extracting (values_to). The remaining argument, cols, uses a tidyselect helper called starts_with(). Remember the everything() helper from last time when we were using the select() function? This works similarly: it says that the columns we want to pivot are the ones whose names start with "wk".
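
These tidyselect helpers aren't specific to pivot_longer(); just as a sketch, the same helper works inside select() too:

billboard %>%
  select(artist, track, starts_with("wk"))   # keep the song info plus every wk* column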

This dataset looks much bigger because it has more observations, but it's still representing the same data. pivot_longer() has a few other tricks up its sleeve. For example, there are now a lot of NA values in the position variable; these represent weeks when the track wasn't in the top 100. We could use filter() to get rid of these (see the sketch below), or we can tell pivot_longer() to do it for us.
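
For comparison, the filter() route would look roughly like this, and keeps the same rows as the values_drop_na approach shown next:

billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week_num",
    values_to = "position"
  ) %>%
  filter(!is.na(position))   # drop the weeks where the track wasn't in the chart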

billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week_num",
    values_to = "position",
    values_drop_na = TRUE
  )
## # A tibble: 5,307 x 5
##    artist  track                   date.entered week_num position
##    <chr>   <chr>                   <date>       <chr>       <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1            87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2            82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3            72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4            77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5            87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6            94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7            99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1            91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2            87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3            92
## # ... with 5,297 more rows

We’ve not got much less rows since we’ve gotten rid of the NA values from the position variable. The values_drop_na argument does exactly what the name suggests. Should we drop the NAs or keep them? If we want to work with this dataset though, we’d prefer that week_num column to be a number, at the minute it’s character, which means we can’t do any maths on it. pivot_longer() can get rid of the "wk" part of this and convert it to a number using two more arguments:

billboard_long <- billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week_num",
    values_to = "position",
    values_drop_na = TRUE,
    names_prefix = "wk",
    names_transform = list(week_num = as.integer)
  )
billboard_long
## # A tibble: 5,307 x 5
##    artist  track                   date.entered week_num position
##    <chr>   <chr>                   <date>          <int>    <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26          1       87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26          2       82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26          3       72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26          4       77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26          5       87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26          6       94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26          7       99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02          1       91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02          2       87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02          3       92
## # ... with 5,297 more rows

The names_prefix argument told pivot_longer() to strip out the "wk" part of the week_num variable and then the names_transform applied the as.integer() function to these values. This one function just did quite a lot of work for us!
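
For reference, here is roughly the same clean-up done by hand after a plain pivot; this sketch uses readr::parse_number() to pull the digits out of "wk1", "wk2", and so on:

billboard %>%
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week_num",
    values_to = "position",
    values_drop_na = TRUE
  ) %>%
  mutate(week_num = as.integer(readr::parse_number(week_num)))   # "wk10" -> 10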

Bear in mind that the arguments above can be given to the function in any order, as long as they are named.

3.2.2 pivot_wider()

But what if we want to turn the data back? Or, more generally, what if we want to convert a long dataset into a wide one? Well, pivot_longer() has a twin called pivot_wider(). Its arguments look very similar, and so does its usage:

billboard_long %>%
  pivot_wider(
    names_from = "week_num",
    names_prefix = "wk",
    values_from = "position"
  )
## # A tibble: 317 x 68
##    artist track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>  <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac  Baby~ 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+h~ The ~ 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doo~ Kryp~ 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doo~ Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 B~ Wobb~ 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0   Give~ 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Tee~ Danc~ 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliy~ I Do~ 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliy~ Try ~ 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams~ Open~ 2000-08-26      76    76    74    69    68    67    61    58
## # ... with 307 more rows, and 57 more variables: wk9 <dbl>, wk10 <dbl>,
## #   wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## #   wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## #   wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## #   wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## #   wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## #   wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## #   wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## #   wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## #   wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## #   wk65 <dbl>

This is pretty simple to do, and it gets the data back to how it was previously. It can often be quite useful to pivot data one way, work on it, and then return it to its original form.

A reason we might want to do this is to edit all of the week data at once. In the billboard dataset, the position in the charts is indexed by the number of weeks since the track entered the chart. So 2 Pac's Baby Don't Cry in the first row entered the chart about six months before 2Ge+her's The Hardest Part Of Breaking Up, but one week after each of their chart entries they were in positions 87 and 91 respectively. What if we wanted to know where they were in the actual weeks of the year?

First we’re going to load up another package called lubridate which makes working with dates a lot smoother.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

And then we can figure out how far into the year each song entered the chart (note that date.entered is always a Saturday and the year 2000 started on a Saturday). To do this, we'll subtract 1st January 2000 from date.entered to get the number of days between them, and then divide by 7 to get the number of weeks. By adding week_num - 1, we can see which week of the year each position is from, and we can overwrite week_num with this new value.

billboard_long %>%
  mutate(week_num = as.integer((date.entered - dmy("01/01/2000"))/7) + week_num - 1)
## # A tibble: 5,307 x 5
##    artist  track                   date.entered week_num position
##    <chr>   <chr>                   <date>          <dbl>    <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26          8       87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26          9       82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26         10       72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26         11       77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26         12       87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26         13       94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26         14       99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02         35       91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02         36       87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02         37       92
## # ... with 5,297 more rows
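
We can sanity-check that arithmetic for the first row by hand: 2 Pac's track entered the chart on 2000-02-26, which is 56 days, or 8 whole weeks, after 1st January.

dmy("26/02/2000") - dmy("01/01/2000")                     # a difference of 56 days
as.integer((dmy("26/02/2000") - dmy("01/01/2000")) / 7)   # 8, matching week_num above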

Next we can filter() out the weeks that fall outside the year 2000 by getting rid of week_num values that are less than 0 or greater than 52. Then, by sorting the data by week_num, we'll get the columns in the right order when we pivot_wider() back around.

billboard_long %>%
  mutate(week_num = as.integer((date.entered - dmy("01/01/2000"))/7) + week_num - 1) %>%
  filter(between(week_num,0,52)) %>%
  arrange(week_num) %>%
  pivot_wider(
    names_from = week_num,
    names_prefix = "wk",
    values_from = position
  )
## # A tibble: 317 x 56
##    artist track date.entered   wk0   wk1   wk2   wk3   wk4   wk5   wk6   wk7
##    <chr>  <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Aguil~ What~ 1999-11-27      13    11     1     1     2     2     3     3
##  2 Amber  Sexu~ 1999-07-17      69    59    58    58    49    44    42    46
##  3 Backs~ Show~ 2000-01-01      74    62    55    25    16    14    12    10
##  4 Blaque Brin~ 1999-10-23       9     9    10     5     5     6     8     9
##  5 Blige~ Deep~ 1999-11-13      64    65    67    63    67    75    85    94
##  6 Blink~ All ~ 1999-12-04      59    51    50    35    26    15     7     6
##  7 Brock~ A Co~ 2000-01-01      93    75    92    NA    NA    NA    NA    NA
##  8 Carey~ Than~ 1999-12-11      50    41    37    26    22    22     2     1
##  9 Count~ Hang~ 1999-11-06      35    32    29    29    28    32    38    48
## 10 Creed  High~ 1999-09-11      61    57    60    61    NA    NA    NA    NA
## # ... with 307 more rows, and 45 more variables: wk8 <dbl>, wk9 <dbl>,
## #   wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>,
## #   wk16 <dbl>, wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>,
## #   wk22 <dbl>, wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>,
## #   wk28 <dbl>, wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,
## #   wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>,
## #   wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>,
## #   wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>,
## #   wk52 <dbl>

3.3 Two Table Verbs

In the previous lesson we looked at the verbs associated with a single table. However, a lot of the time our data doesn't come in a nice and neat single spreadsheet, so we might have to combine multiple datasets into a single tibble. There are various ways to do this kind of combining, depending on what the data looks like and what we intend to do with it.

3.3.1 bind_rows()

If we have two tables that are (more or less) the same, we can just stick them together with one on top of the other.

data1 <- tibble(x=1:3,y=c("red","green","blue"))
data2 <- tibble(x=4:6,y=c("orange","purple","yellow"))

data1
## # A tibble: 3 x 2
##       x y    
##   <int> <chr>
## 1     1 red  
## 2     2 green
## 3     3 blue
data2
## # A tibble: 3 x 2
##       x y     
##   <int> <chr> 
## 1     4 orange
## 2     5 purple
## 3     6 yellow

To do this, we just pass both tables to the bind_rows() function, which will essentially stack them on top of each other:

bind_rows(data1,data2)
## # A tibble: 6 x 2
##       x y     
##   <int> <chr> 
## 1     1 red   
## 2     2 green 
## 3     3 blue  
## 4     4 orange
## 5     5 purple
## 6     6 yellow

This function can take in as many tables as we want (so far, we've just put in two). If we want to bind more together, we just pass them as extra arguments:

bind_rows(
  data1,
  data2,
  tibble(x=7,y="black"),
  tibble(x=8,y="white"),
)
## # A tibble: 8 x 2
##       x y     
##   <dbl> <chr> 
## 1     1 red   
## 2     2 green 
## 3     3 blue  
## 4     4 orange
## 5     5 purple
## 6     6 yellow
## 7     7 black 
## 8     8 white

This is a very simple procedure; however, we can add a bit of complexity if we want to keep track of where our data came from. For example, you might record a lot of data and so have a spreadsheet for each participant. Using the .id argument, we can add a new column that identifies which tibble each row came from:

bind_rows(
  set1 = data1,
  set2 = data2,
  .id = "set"
)
## # A tibble: 6 x 3
##   set       x y     
##   <chr> <int> <chr> 
## 1 set1      1 red   
## 2 set1      2 green 
## 3 set1      3 blue  
## 4 set2      4 orange
## 5 set2      5 purple
## 6 set2      6 yellow
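
It's worth knowing that bind_rows() will also accept a (named) list of tibbles, which is handy if each participant's spreadsheet has been read into one element of a list. A quick sketch, using a hypothetical list called participant_data:

participant_data <- list(p1 = data1, p2 = data2)   # hypothetical: one tibble per participant
bind_rows(participant_data, .id = "participant")   # .id picks up the list's names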

bind_rows() even works if your tables aren't exactly the same; it'll just put NA wherever data is missing:

bind_rows(
  data1,
  tibble(x=8)
)
## # A tibble: 4 x 2
##       x y    
##   <dbl> <chr>
## 1     1 red  
## 2     2 green
## 3     3 blue 
## 4     8 <NA>

3.3.2 bind_cols()

Just as select() and filter() work on columns and rows respectively, bind_rows() has a twin called bind_cols(), which sticks the tables together next to one another:

data3 <- tibble(z=c("cat","dog","mouse"),w=100:102)
bind_cols(data1,data3)
## # A tibble: 3 x 4
##       x y     z         w
##   <int> <chr> <chr> <int>
## 1     1 red   cat     100
## 2     2 green dog     101
## 3     3 blue  mouse   102

3.3.3 joins

The above two methods of combining data don't do an awful lot; they just stick the data together without changing much. However, we might want to match our data together based on a common identifier. For this, we're going to use some data on flights leaving New York City, which is stored in a package called nycflights13:

library(nycflights13)
flights2 <- select(flights,year,month,day,hour,origin,dest,tailnum,carrier)
flights2
## # A tibble: 336,776 x 8
##     year month   day  hour origin dest  tailnum carrier
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
##  1  2013     1     1     5 EWR    IAH   N14228  UA     
##  2  2013     1     1     5 LGA    IAH   N24211  UA     
##  3  2013     1     1     5 JFK    MIA   N619AA  AA     
##  4  2013     1     1     5 JFK    BQN   N804JB  B6     
##  5  2013     1     1     6 LGA    ATL   N668DN  DL     
##  6  2013     1     1     5 EWR    ORD   N39463  UA     
##  7  2013     1     1     6 EWR    FLL   N516JB  B6     
##  8  2013     1     1     6 LGA    IAD   N829AS  EV     
##  9  2013     1     1     6 JFK    MCO   N593JB  B6     
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
## # ... with 336,766 more rows

As well as the actual flight information, this package also contains data about the airlines, which we can see here:

airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

How would we go about sticking these together? We can't just use bind_cols(), since we want to match rows based on the value of carrier (and the two tables don't even have the same number of rows). To do this, we can use left_join():

flights2 %>%
  left_join(airlines,
            by="carrier")
## # A tibble: 336,776 x 9
##     year month   day  hour origin dest  tailnum carrier name                    
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>                   
##  1  2013     1     1     5 EWR    IAH   N14228  UA      United Air Lines Inc.   
##  2  2013     1     1     5 LGA    IAH   N24211  UA      United Air Lines Inc.   
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      American Airlines Inc.  
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      JetBlue Airways         
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Delta Air Lines Inc.    
##  6  2013     1     1     5 EWR    ORD   N39463  UA      United Air Lines Inc.   
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      JetBlue Airways         
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      ExpressJet Airlines Inc.
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      JetBlue Airways         
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      American Airlines Inc.  
## # ... with 336,766 more rows

As well as supplying the two datasets, we've also told R which variable we want to match on, via the by argument. If we leave by out, R will just look at which variables appear in both datasets and match on all of those. And yes, you can match on multiple variables, like so:

flights2 %>%
  left_join(
    select(weather,origin,year,month,day,hour,temp)
  )
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 9
##     year month   day  hour origin dest  tailnum carrier  temp
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl>
##  1  2013     1     1     5 EWR    IAH   N14228  UA       39.0
##  2  2013     1     1     5 LGA    IAH   N24211  UA       39.9
##  3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0
##  4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0
##  5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9
##  6  2013     1     1     5 EWR    ORD   N39463  UA       39.0
##  7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9
##  8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9
##  9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9
## # ... with 336,766 more rows
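
If we'd rather be explicit, we can spell out those matching variables ourselves; this sketch is equivalent to the join above:

flights2 %>%
  left_join(
    select(weather, origin, year, month, day, hour, temp),
    by = c("year", "month", "day", "hour", "origin")
  )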

Above we used the left_join() function, but there are actually four main types of *_join() functions in R. They each match the data in slightly different ways.

3.3.3.1 full_join()

full_join() will keep every row from both of your datasets, matching them up wherever it can.

data1 <- tibble(id = 1:4,x=runif(4))
data2 <- tibble(id = 2:7,y=runif(6))
full_join(data1,data2,by="id")
## # A tibble: 7 x 3
##      id      x       y
##   <int>  <dbl>   <dbl>
## 1     1  0.565 NA     
## 2     2  0.140  0.229 
## 3     3  0.254  0.152 
## 4     4  0.154  0.669 
## 5     5 NA      0.0199
## 6     6 NA      0.756 
## 7     7 NA      0.740

The value id = 5 isn't present in data1, so there's no x value associated with it, and R just puts an NA there. The same goes for the other NA values: those ids aren't present in both datasets. full_join() returns every row from both datasets being joined together.

3.3.3.2 inner_join()

inner_join(data1,data2,by="id")
## # A tibble: 3 x 3
##      id     x     y
##   <int> <dbl> <dbl>
## 1     2 0.140 0.229
## 2     3 0.254 0.152
## 3     4 0.154 0.669

The inner_join() function returns only the rows whose id appears in both datasets. It doesn't introduce any NAs, and returns a relatively small dataset.

3.3.3.3 left_join() & right_join()

The left_join() and right_join() functions will keep everything from either the first or the second dataset being supplied.

left_join(data1,data2,by="id")
## # A tibble: 4 x 3
##      id     x      y
##   <int> <dbl>  <dbl>
## 1     1 0.565 NA    
## 2     2 0.140  0.229
## 3     3 0.254  0.152
## 4     4 0.154  0.669

Here, we don’t keep the rows from y that are not in x (i.e. id = 5,6,7).

right_join(data1,data2,by="id")
## # A tibble: 6 x 3
##      id      x      y
##   <int>  <dbl>  <dbl>
## 1     2  0.140 0.229 
## 2     3  0.254 0.152 
## 3     4  0.154 0.669 
## 4     5 NA     0.0199
## 5     6 NA     0.756 
## 6     7 NA     0.740

This time we keep everything from data2, but we've lost the row for id = 1 from data1, because that id doesn't appear in data2 (and x is NA for id = 5, 6, 7, which only appear in data2).

3.4 Decisions

Not all tasks in R require us to do maths or logic. Sometimes, we might have to make a choice about what to return based on an input. Here we'll have a look at some of the options available when we need to make decisions like this.

3.4.1 if_else()

The if_else() function takes in three arguments, which should all be vectors of either the same length or length 1. The first argument should be (or evaluate to) a logical vector (TRUEs and FALSEs). Wherever this logical vector is TRUE, if_else() returns the corresponding value from the second argument, and wherever it is FALSE, it returns the corresponding value from the third argument.

Let’s see it in action:

logical_vector <- c(T,T,F,F,T)
true_response <- letters[1:5]
false_response <- letters[22:26]
if_else(
  logical_vector,
  true_response,
  false_response
)
## [1] "a" "b" "x" "y" "e"

So, since our logical vector has TRUE for its first two values, the first two values from true_response are returned, followed by two values from false_response, and then finally one more from true_response.

This is a very quick way to make a vectorised choice in R. And don't forget, this function can be used inside mutate() to edit our datasets:

flights2 %>%
  mutate(am_pm = if_else(hour < 12, "am","pm")) %>%
  slice_sample(n=10)
## # A tibble: 10 x 9
##     year month   day  hour origin dest  tailnum carrier am_pm
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>
##  1  2013    10    18    16 LGA    MSP   N944DL  DL      pm   
##  2  2013     9    16    15 LGA    ORF   N724EV  EV      pm   
##  3  2013    12    30    15 LGA    MIA   N3AYAA  AA      pm   
##  4  2013     8    27    21 LGA    DTW   N1EAMQ  MQ      pm   
##  5  2013     2    26     9 JFK    BUF   N599JB  B6      am   
##  6  2013     8     6     6 EWR    IND   N11199  EV      am   
##  7  2013    11    14    10 LGA    CLT   N713UW  US      am   
##  8  2013    10    27    12 EWR    DAY   N13913  EV      pm   
##  9  2013     4    21    18 LGA    XNA   N717MQ  MQ      pm   
## 10  2013     6    16    13 EWR    STL   N12563  EV      pm

3.4.2 recode()

The if_else() function is useful for making a decision between two options, but that situation doesn't arise very often. Sometimes we might need to replace each of several specific values with something else, and the recode() function is useful for that:

flights2 %>%
  mutate(month_string = recode(month,
                               `1`="January",
                               `2`="February",
                               `3`="March",
                               `4`="April",
                               `5`="May",
                               `6`="June",
                               `7`="July",
                               `8`="August",
                               `9`="September",
                               `10`="October",
                               `11`="November",
                               `12`="December",
                               )) %>%
  slice_sample(n=10)
## # A tibble: 10 x 9
##     year month   day  hour origin dest  tailnum carrier month_string
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>       
##  1  2013     2     7    16 JFK    CLT   N249JB  B6      February    
##  2  2013     6    17    15 LGA    DFW   N3JRAA  AA      June        
##  3  2013    12    25    10 LGA    CLT   N770UW  US      December    
##  4  2013     5    29    16 EWR    SFO   N38467  UA      May         
##  5  2013     2    20     8 LGA    DEN   N562UA  UA      February    
##  6  2013    12    12    16 EWR    RDU   N13978  EV      December    
##  7  2013     4    15     5 LGA    IAH   N578UA  UA      April       
##  8  2013     7     2    18 EWR    SFO   N509UA  UA      July        
##  9  2013     4    21     6 LGA    MKE   N410WN  WN      April       
## 10  2013     6    25    18 JFK    PBI   N559JB  B6      June

3.4.3 case_when()

However, the recode() function isn't too flexible: it'll only look at a single variable and then decide what to do based on that value. We might want to check something more complicated, and for things like that we can use the case_when() function, which has its own unique syntax.

flights2 %>%
  mutate(season = case_when(
    month < 2 | month == 12 ~ "Winter",
    month < 5 ~ "Spring",
    month < 8 ~ "Summer",
    T ~ "Autumn"
  ))
## # A tibble: 336,776 x 9
##     year month   day  hour origin dest  tailnum carrier season
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr> 
##  1  2013     1     1     5 EWR    IAH   N14228  UA      Winter
##  2  2013     1     1     5 LGA    IAH   N24211  UA      Winter
##  3  2013     1     1     5 JFK    MIA   N619AA  AA      Winter
##  4  2013     1     1     5 JFK    BQN   N804JB  B6      Winter
##  5  2013     1     1     6 LGA    ATL   N668DN  DL      Winter
##  6  2013     1     1     5 EWR    ORD   N39463  UA      Winter
##  7  2013     1     1     6 EWR    FLL   N516JB  B6      Winter
##  8  2013     1     1     6 LGA    IAD   N829AS  EV      Winter
##  9  2013     1     1     6 JFK    MCO   N593JB  B6      Winter
## 10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Winter
## # ... with 336,766 more rows
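
Each argument to case_when() is a two-sided formula: the left-hand side is a condition and the right-hand side is the value to return when it holds. The conditions are checked in order and the first TRUE wins, which is why the final T ~ "Autumn" acts as a catch-all for the remaining months. A tiny standalone sketch:

x <- c(-3, 0, 7)
case_when(
  x < 0  ~ "negative",
  x == 0 ~ "zero",
  TRUE   ~ "positive"   # catch-all for everything left over
)
# returns "negative" "zero" "positive"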