3 Enter The Tidyverse: What’s Next?
3.1 Refresher
Previously, we learned what a tibble
is and how it stores data in observations/rows and variables/columns. We also learned about 7 fundamental data manipulation functions that allow us to edit these tibbles:
- filter() - selects observations based on a criterion/predicate
- slice() - selects observations based on row number
- arrange() - re-arranges rows based on a variable
- select() - picks certain variables
- mutate() - changes variables or creates new ones
- summarise() - aggregates variables based on some functions
- group_by() - groups data sets by variables
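As a refresher, here's a small made-up example chaining several of these verbs together (the tibble and its columns are invented for illustration):

```r
library(dplyr)

# An invented tibble for illustration
scores <- tibble(
  student = c("Ana", "Ben", "Cal", "Dee"),
  class   = c("A", "A", "B", "B"),
  mark    = c(72, 58, 91, 64)
)

scores %>%
  filter(mark >= 60) %>%              # keep passing marks
  mutate(percent = mark / 100) %>%    # create a new variable
  arrange(desc(mark)) %>%             # sort by mark, highest first
  group_by(class) %>%                 # group by class...
  summarise(avg_mark = mean(mark))    # ...then aggregate within groups
```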
3.2 Pivoting
In general, datasets come in two forms, long or wide. Long data has many rows and fewer columns, and so it looks longer (vertically) if shown in a spreadsheet. Wide data has more columns and fewer rows (hence, it is wider). Data can be represented in both ways and each has advantages.
This terminology might seem confusing for now, so let's jump into an example dataset which records how songs moved through the charts in the year 2000. We're given the date when each song entered the chart (which is always a Saturday) and then its ranking within the charts for the following 76 weeks. This data is in wide format.
## # A tibble: 317 x 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby~ 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+h~ The ~ 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doo~ Kryp~ 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doo~ Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 B~ Wobb~ 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give~ 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Tee~ Danc~ 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliy~ I Do~ 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliy~ Try ~ 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams~ Open~ 2000-08-26 76 76 74 69 68 67 61 58
## # ... with 307 more rows, and 68 more variables: wk9 <dbl>, wk10 <dbl>,
## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## # wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## # wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## # wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## # wk65 <dbl>, wk66 <lgl>, wk67 <lgl>, wk68 <lgl>, wk69 <lgl>, wk70 <lgl>,
## # wk71 <lgl>, wk72 <lgl>, wk73 <lgl>, wk74 <lgl>, wk75 <lgl>, wk76 <lgl>
Each row represents one song, so our observations are individual songs and their journeys through the charts. The position in the charts for each week since its date of entry is represented by a separate variable.
3.2.1 pivot_longer()
But we might want an observation to represent each week for a track, and so we can turn this data from wide to long format using the pivot_longer()
function. The data currently has 317 rows and 76 week columns, so the resulting dataset should have 317 × 76 = 24,092 rows, since each track will now appear across 76 rows.
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week_num",
values_to = "position"
)
## # A tibble: 24,092 x 5
## artist track date.entered week_num position
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
## # ... with 24,082 more rows
So we used the pivot_longer()
function with 4 arguments. Remember that the first argument is our dataset being piped in. Then we’ve given a name to the new variable that takes the previous variable names (names_to
), similarly for the values we are extracting (values_to
). The other argument uses a tidyselect helper
called starts_with()
. Remember the everything()
helper from last time when we were using the select()
function? This works similarly. It’s saying that the columns we want to pivot are the ones that start with "wk"
.
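starts_with() is one of a family of tidyselect helpers, and the same helpers also work inside select(). For example (assuming the billboard dataset from tidyr is loaded):

```r
library(tidyr)    # provides the billboard dataset
library(dplyr)

# Keep the identifying columns plus every week column
billboard %>% select(artist, track, starts_with("wk"))

# Other helpers work the same way, e.g. matching the end of a name
billboard %>% select(artist, ends_with("ed"))   # picks date.entered
```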
This dataset looks much bigger because it has more observations, but it’s still representing the same data. pivot_longer()
has a few other tricks that we can do. For example, there are now a lot of NA
values in the position
variable; these represent weeks when the track fell out of the top 100. We could use filter()
to get rid of these, or we can tell pivot_longer()
to do it for us.
billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week_num",
values_to = "position",
values_drop_na = TRUE
)
## # A tibble: 5,307 x 5
## artist track date.entered week_num position
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ... with 5,297 more rows
We’ve not got much less rows since we’ve gotten rid of the NA
values from the position
variable. The values_drop_na
argument does exactly what the name suggests. Should we drop the NA
s or keep them? If we want to work with this dataset though, we’d prefer that week_num
column to be a number; at the moment it's a character vector, which means we can't do any maths on it. pivot_longer()
can get rid of the "wk"
part of this and convert it to a number using two more arguments:
billboard_long <- billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week_num",
values_to = "position",
values_drop_na = TRUE,
names_prefix = "wk",
names_transform = list(week_num = as.integer)
)
billboard_long
## # A tibble: 5,307 x 5
## artist track date.entered week_num position
## <chr> <chr> <date> <int> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
## # ... with 5,297 more rows
The names_prefix
argument told pivot_longer()
to strip out the "wk"
part of the week_num
variable and then the names_transform
applied the as.integer()
function to these values. This one function just did quite a lot of work for us!
Bear in mind that the arguments above can be given to the function in any order, as long as they are named.
3.2.2 pivot_wider()
But what if we want to turn the data back? Or, in general, we want to convert a long dataset into a wide one? Well pivot_longer()
has a twin called pivot_wider()
. The arguments look very similar and so is their usage:
billboard_long %>%
pivot_wider(
names_from = "week_num",
names_prefix = "wk",
values_from = "position"
)
## # A tibble: 317 x 68
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby~ 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+h~ The ~ 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doo~ Kryp~ 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doo~ Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 B~ Wobb~ 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give~ 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Tee~ Danc~ 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliy~ I Do~ 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliy~ Try ~ 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams~ Open~ 2000-08-26 76 76 74 69 68 67 61 58
## # ... with 307 more rows, and 57 more variables: wk9 <dbl>, wk10 <dbl>,
## # wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>,
## # wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>,
## # wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>,
## # wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>,
## # wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>,
## # wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>,
## # wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>, wk52 <dbl>,
## # wk53 <dbl>, wk54 <dbl>, wk55 <dbl>, wk56 <dbl>, wk57 <dbl>, wk58 <dbl>,
## # wk59 <dbl>, wk60 <dbl>, wk61 <dbl>, wk62 <dbl>, wk63 <dbl>, wk64 <dbl>,
## # wk65 <dbl>
This is pretty simple to do and it gets the data back to how it was previously. Oftentimes it can be quite useful to pivot data around and then return it to its original form.
A reason we might want to do this is to edit all of the week data at once. In the billboard
dataset, the position in the charts is based on the number of weeks since the track entered the chart. So 2 Pac's Baby Don't Cry, in the first row, entered the chart a few months before 3 Doors Down's Kryptonite, but a week after each of their entries they were at chart positions 87 and 81 respectively. What if we wanted to know where they were for the actual weeks of the year?
First we’re going to load up another package called lubridate
which makes working with dates a lot smoother.
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
And then we can figure out how far into the year each song was released (note that the date.entered
is a Saturday and the year 2000 started on a Saturday). To do this, we’ll subtract the 1st January from the date.entered
to get the number of days between them and then divide it by 7 to get the number of weeks. By adding the week_num - 1
, we can see which week each position is from, and we can overwrite the week_num
value with this new value.
billboard_long %>%
mutate(week_num = as.integer((date.entered - dmy("01/01/2000"))/7) + week_num - 1)
## # A tibble: 5,307 x 5
## artist track date.entered week_num position
## <chr> <chr> <date> <dbl> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 8 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 9 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 10 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 11 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 12 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 13 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 14 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 35 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 36 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 37 92
## # ... with 5,297 more rows
Next we can filter()
out the weeks from outside of the year 2000 by getting rid of week_num
that are less than 0 or greater than 52, and then by sorting the data by this week_num
, we’ll get the columns in the right order when we come to pivot_wider()
back around.
billboard_long %>%
mutate(week_num = as.integer((date.entered - dmy("01/01/2000"))/7) + week_num - 1) %>%
filter(between(week_num,0,52)) %>%
arrange(week_num) %>%
pivot_wider(
names_from = week_num,
names_prefix = "wk",
values_from = position
)
## # A tibble: 317 x 56
## artist track date.entered wk0 wk1 wk2 wk3 wk4 wk5 wk6 wk7
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Aguil~ What~ 1999-11-27 13 11 1 1 2 2 3 3
## 2 Amber Sexu~ 1999-07-17 69 59 58 58 49 44 42 46
## 3 Backs~ Show~ 2000-01-01 74 62 55 25 16 14 12 10
## 4 Blaque Brin~ 1999-10-23 9 9 10 5 5 6 8 9
## 5 Blige~ Deep~ 1999-11-13 64 65 67 63 67 75 85 94
## 6 Blink~ All ~ 1999-12-04 59 51 50 35 26 15 7 6
## 7 Brock~ A Co~ 2000-01-01 93 75 92 NA NA NA NA NA
## 8 Carey~ Than~ 1999-12-11 50 41 37 26 22 22 2 1
## 9 Count~ Hang~ 1999-11-06 35 32 29 29 28 32 38 48
## 10 Creed High~ 1999-09-11 61 57 60 61 NA NA NA NA
## # ... with 307 more rows, and 45 more variables: wk8 <dbl>, wk9 <dbl>,
## # wk10 <dbl>, wk11 <dbl>, wk12 <dbl>, wk13 <dbl>, wk14 <dbl>, wk15 <dbl>,
## # wk16 <dbl>, wk17 <dbl>, wk18 <dbl>, wk19 <dbl>, wk20 <dbl>, wk21 <dbl>,
## # wk22 <dbl>, wk23 <dbl>, wk24 <dbl>, wk25 <dbl>, wk26 <dbl>, wk27 <dbl>,
## # wk28 <dbl>, wk29 <dbl>, wk30 <dbl>, wk31 <dbl>, wk32 <dbl>, wk33 <dbl>,
## # wk34 <dbl>, wk35 <dbl>, wk36 <dbl>, wk37 <dbl>, wk38 <dbl>, wk39 <dbl>,
## # wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, wk43 <dbl>, wk44 <dbl>, wk45 <dbl>,
## # wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, wk49 <dbl>, wk50 <dbl>, wk51 <dbl>,
## # wk52 <dbl>
3.3 Two Table Verbs
In the previous lesson we looked at the verbs associated with a single table. However, a lot of the time our data doesn't come in a nice, neat single spreadsheet, so we might have to combine multiple datasets into a single tibble. There are various ways to do this kind of combining, depending on what the data looks like and our intentions for it.
3.3.1 bind_rows()
If we have two tables that are (more or less) the same, we can just stick them together with one on top of the other.
data1 <- tibble(x=1:3,y=c("red","green","blue"))
data2 <- tibble(x=4:6,y=c("orange","purple","yellow"))
data1
## # A tibble: 3 x 2
## x y
## <int> <chr>
## 1 1 red
## 2 2 green
## 3 3 blue
## # A tibble: 3 x 2
## x y
## <int> <chr>
## 1 4 orange
## 2 5 purple
## 3 6 yellow
To do this, we just pass both tables to the bind_rows()
function, which will essentially stack them on top of each other:
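Using the data1 and data2 defined above:

```r
bind_rows(data1, data2)
```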
## # A tibble: 6 x 2
## x y
## <int> <chr>
## 1 1 red
## 2 2 green
## 3 3 blue
## 4 4 orange
## 5 5 purple
## 6 6 yellow
This function can take in as many tables as we want (above, we just put in two). If we want to bind more together, we just pass them as extra arguments:
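For example, with a third tibble (data3 is invented here, chosen to match the output below):

```r
data3 <- tibble(x = c(7, 8), y = c("black", "white"))
bind_rows(data1, data2, data3)
```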
## # A tibble: 8 x 2
## x y
## <dbl> <chr>
## 1 1 red
## 2 2 green
## 3 3 blue
## 4 4 orange
## 5 5 purple
## 6 6 yellow
## 7 7 black
## 8 8 white
This is a very simple procedure; however, we can add a bit of complexity if we want to keep track of where our data is coming from. For example, you might record a lot of data and so have a spreadsheet for each participant. We can add a new column that identifies which tibble each row came from:
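One way to produce the output below is to name the tables being bound and use bind_rows()'s .id argument, which stores those names in a new column:

```r
bind_rows(set1 = data1, set2 = data2, .id = "set")
```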
## # A tibble: 6 x 3
## set x y
## <chr> <int> <chr>
## 1 set1 1 red
## 2 set1 2 green
## 3 set1 3 blue
## 4 set2 4 orange
## 5 set2 5 purple
## 6 set2 6 yellow
bind_rows()
even works if your tables aren't exactly the same; it'll just put NA
where there was missing data:
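For example, binding on a one-row tibble that only has an x column:

```r
bind_rows(data1, tibble(x = 8))   # y is filled with NA for the new row
```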
## # A tibble: 4 x 2
## x y
## <dbl> <chr>
## 1 1 red
## 2 2 green
## 3 3 blue
## 4 8 <NA>
3.3.2 bind_cols()
Just like with select()
and filter()
which work on columns & rows respectively, bind_rows()
has a twin called bind_cols()
and this will stick the tables together side by side:
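For example (the z and w columns are invented here to match the output below):

```r
data4 <- tibble(z = c("cat", "dog", "mouse"), w = 100:102)
bind_cols(data1, data4)
```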
## # A tibble: 3 x 4
## x y z w
## <int> <chr> <chr> <int>
## 1 1 red cat 100
## 2 2 green dog 101
## 3 3 blue mouse 102
3.3.3 joins
The above two methods of combining data don't do an awful lot; they just stick the data together without changing much. However, we might want to match our data together based on a common identifier. For this, we're going to use some data on flights leaving New York City, which is stored in a package called nycflights13.
library(nycflights13)
flights2 <- select(flights,year,month,day,hour,origin,dest,tailnum,carrier)
flights2
## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # ... with 336,766 more rows
As well as the actual flight information, this package also contains data about the airlines, which we can see here
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
How would we go about sticking these together? We can't just bind_cols()
since we want to match them based on the value in carrier
. To do this, we can use left_join()
:
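A call along these lines, joining flights2 to the airlines table by the shared carrier column, produces the output below:

```r
flights2 %>%
  left_join(airlines, by = "carrier")
```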
## # A tibble: 336,776 x 9
## year month day hour origin dest tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 LGA IAH N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 JFK MIA N619AA AA American Airlines Inc.
## 4 2013 1 1 5 JFK BQN N804JB B6 JetBlue Airways
## 5 2013 1 1 6 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 2013 1 1 5 EWR ORD N39463 UA United Air Lines Inc.
## 7 2013 1 1 6 EWR FLL N516JB B6 JetBlue Airways
## 8 2013 1 1 6 LGA IAD N829AS EV ExpressJet Airlines Inc.
## 9 2013 1 1 6 JFK MCO N593JB B6 JetBlue Airways
## 10 2013 1 1 6 LGA ORD N3ALAA AA American Airlines Inc.
## # ... with 336,766 more rows
As well as supplying the two datasets, we’ve also told R which variable we want to match on. By default, R will just look at which variables appear in both datasets and match on those. And yes, you can match on multiple variables, like so.
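For instance, one way to produce the output below is to join on the weather table from nycflights13, keeping just its temperature column alongside the shared key columns; with no by argument, left_join() matches on every common variable and reports which ones it used:

```r
flights2 %>%
  left_join(select(weather, origin, year, month, day, hour, temp))
```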
## Joining, by = c("year", "month", "day", "hour", "origin")
## # A tibble: 336,776 x 9
## year month day hour origin dest tailnum carrier temp
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9
## # ... with 336,766 more rows
Above we used the left_join()
function, but there are actually four main types of *_join()
functions in dplyr. They each match the data in slightly different ways.
3.3.3.1 full_join()
full_join()
keeps every row from both datasets, matching them up wherever it can.
data1 <- tibble(id = 1:4,x=runif(4))
data2 <- tibble(id = 2:7,y=runif(6))
full_join(data1,data2,by="id")
## # A tibble: 7 x 3
## id x y
## <int> <dbl> <dbl>
## 1 1 0.565 NA
## 2 2 0.140 0.229
## 3 3 0.254 0.152
## 4 4 0.154 0.669
## 5 5 NA 0.0199
## 6 6 NA 0.756
## 7 7 NA 0.740
The value id = 5 isn't present in data1, so there is no row to match and R just puts an NA in the x column. Similarly for the other NA values: those ids aren't present in both datasets. full_join() will return every row from both datasets being joined together.
3.3.3.2 inner_join()
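Using the same data1 and data2 as above:

```r
inner_join(data1, data2, by = "id")
```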
## # A tibble: 3 x 3
## id x y
## <int> <dbl> <dbl>
## 1 2 0.140 0.229
## 2 3 0.254 0.152
## 3 4 0.154 0.669
The inner_join() function returns only the rows that are present in both datasets. It doesn't give us any NA values, and returns a relatively small dataset.
3.3.3.3 left_join() & right_join()
The left_join()
and right_join()
functions will keep everything from either the first or the second dataset being supplied.
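First, left_join() keeps every row of the first dataset, data1:

```r
left_join(data1, data2, by = "id")
```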
## # A tibble: 4 x 3
## id x y
## <int> <dbl> <dbl>
## 1 1 0.565 NA
## 2 2 0.140 0.229
## 3 3 0.254 0.152
## 4 4 0.154 0.669
Here, we don’t keep the rows from y
that are not in x
(i.e. id = 5,6,7
).
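Conversely, right_join() keeps every row of the second dataset, data2:

```r
right_join(data1, data2, by = "id")
```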
## # A tibble: 6 x 3
## id x y
## <int> <dbl> <dbl>
## 1 2 0.140 0.229
## 2 3 0.254 0.152
## 3 4 0.154 0.669
## 4 5 NA 0.0199
## 5 6 NA 0.756
## 6 7 NA 0.740
This time we have everything from data2, but we've lost the row for id = 1 because it isn't present in the data2 dataset.
3.4 Decisions
Not all tasks in R require us to do maths or logic. Sometimes, we might have to make a choice of what to return based on an input. Here we'll have a look at some of the options available if we need to make decisions like this.
3.4.1 if_else()
The if_else()
function takes in three arguments, which should all be either vectors of the same length or of length 1. The first argument should be a logical vector (TRUE
s and FALSE
s). Wherever this logical vector is TRUE
, it will return the appropriate value from the second argument, and wherever it is FALSE
, it will return the appropriate value from the third argument.
Let’s see it in action:
logical_vector <- c(T,T,F,F,T)
true_response <- letters[1:5]
false_response <- letters[22:26]
if_else(
logical_vector,
true_response,
false_response
)
## [1] "a" "b" "x" "y" "e"
So, since our logical vector has TRUE for the first two values, the first two values from true_response are given here, followed by the third and fourth values from false_response, and then finally the fifth value from true_response.
This is a very quick way to make a vectorised choice in R. And don’t forget, this function can be used in a mutate()
function to edit our datasets:
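For example, a call along these lines (slice_sample() just displays 10 random rows) produces the output below:

```r
flights2 %>%
  mutate(am_pm = if_else(hour < 12, "am", "pm")) %>%
  slice_sample(n = 10)
```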
## # A tibble: 10 x 9
## year month day hour origin dest tailnum carrier am_pm
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013 10 18 16 LGA MSP N944DL DL pm
## 2 2013 9 16 15 LGA ORF N724EV EV pm
## 3 2013 12 30 15 LGA MIA N3AYAA AA pm
## 4 2013 8 27 21 LGA DTW N1EAMQ MQ pm
## 5 2013 2 26 9 JFK BUF N599JB B6 am
## 6 2013 8 6 6 EWR IND N11199 EV am
## 7 2013 11 14 10 LGA CLT N713UW US am
## 8 2013 10 27 12 EWR DAY N13913 EV pm
## 9 2013 4 21 18 LGA XNA N717MQ MQ pm
## 10 2013 6 16 13 EWR STL N12563 EV pm
3.4.2 recode()
The if_else()
function is useful for making a decision between two options, but that situation doesn’t arise very often. Sometimes, we might need to replace a certain value and so the recode()
function is useful for that:
flights2 %>%
mutate(month_string = recode(month,
`1`="January",
`2`="February",
`3`="March",
`4`="April",
`5`="May",
`6`="June",
`7`="July",
`8`="August",
`9`="September",
`10`="October",
`11`="November",
`12`="December"
)) %>%
slice_sample(n=10)
## # A tibble: 10 x 9
## year month day hour origin dest tailnum carrier month_string
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013 2 7 16 JFK CLT N249JB B6 February
## 2 2013 6 17 15 LGA DFW N3JRAA AA June
## 3 2013 12 25 10 LGA CLT N770UW US December
## 4 2013 5 29 16 EWR SFO N38467 UA May
## 5 2013 2 20 8 LGA DEN N562UA UA February
## 6 2013 12 12 16 EWR RDU N13978 EV December
## 7 2013 4 15 5 LGA IAH N578UA UA April
## 8 2013 7 2 18 EWR SFO N509UA UA July
## 9 2013 4 21 6 LGA MKE N410WN WN April
## 10 2013 6 25 18 JFK PBI N559JB B6 June
3.4.3 case_when()
However, the recode()
function isn’t too flexible, it’ll only look at a single variable and then decide what to do based on that value. We might want to check something more complicated, and so for things like that we can use the case_when()
function, which has a unique syntax.
flights2 %>%
mutate(season = case_when(
month < 2 | month == 12 ~ "Winter",
month < 5 ~ "Spring",
month < 8 ~ "Summer",
TRUE ~ "Autumn"
))
## # A tibble: 336,776 x 9
## year month day hour origin dest tailnum carrier season
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA Winter
## 2 2013 1 1 5 LGA IAH N24211 UA Winter
## 3 2013 1 1 5 JFK MIA N619AA AA Winter
## 4 2013 1 1 5 JFK BQN N804JB B6 Winter
## 5 2013 1 1 6 LGA ATL N668DN DL Winter
## 6 2013 1 1 5 EWR ORD N39463 UA Winter
## 7 2013 1 1 6 EWR FLL N516JB B6 Winter
## 8 2013 1 1 6 LGA IAD N829AS EV Winter
## 9 2013 1 1 6 JFK MCO N593JB B6 Winter
## 10 2013 1 1 6 LGA ORD N3ALAA AA Winter
## # ... with 336,766 more rows