Google Data Analytics Capstone : Complete Case Study (My very First Project with Large Dataset)

Cyclistic’s Historical Trip using Divvy Datasets of First Quarter of 2019 and 2020

Since I decided to use R for Data Exploration, Cleaning, Analyzing, Visualization and Interpretation, I need to install some R packages to work with. Which i did during the learning phase. I can install packages using install.packages(dplyr) and install.packages(ggplot2) and load them using library(dplyr) and library(ggplot2) functions to use in the session.

Setting up RMarkdown with Second Chunk of Codes, (First one is used for heading and information above.) This is to load ‘dplyr’ and ‘ggplot2’ in RMarkdown environment.

So far I don’t know how the dataset looks like. I decided to use RMarkdown to view the dataset and also be able to keep track of entire data cleaning and visualization process. I just finished asking questions and understanding business needs better even before seeing the data and information it contains. With the clear understandin of the objectives and requirements, I come uo with three major questions.

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence casual riders to become members?

Loading Dataset for 2019

I already downloaded Divvy_Trips_2019/2020_q1.csv datasets, copied into main RDirectory extracted .zip file and renamed them as Divy2019 and Divy2020, which is located at the main folder. Lets import and rename it as ‘Divy2019’ in RMarkdown.

Divy2019 <- read.csv("Divy2019.csv")

Viewing the summary of Datasets for First Quarter of 2019

summary(Divy2019)
##     trip_id          start_time          end_time             bikeid    
##  Min.   :21742443   Length:365069      Length:365069      Min.   :   1  
##  1st Qu.:21848765   Class :character   Class :character   1st Qu.:1777  
##  Median :21961829   Mode  :character   Mode  :character   Median :3489  
##  Mean   :21960872                                         Mean   :3429  
##  3rd Qu.:22071823                                         3rd Qu.:5157  
##  Max.   :22178528                                         Max.   :6471  
##                                                                         
##  tripduration       from_station_id from_station_name  to_station_id  
##  Length:365069      Min.   :  2.0   Length:365069      Min.   :  2.0  
##  Class :character   1st Qu.: 76.0   Class :character   1st Qu.: 76.0  
##  Mode  :character   Median :170.0   Mode  :character   Median :168.0  
##                     Mean   :198.1                      Mean   :198.6  
##                     3rd Qu.:287.0                      3rd Qu.:287.0  
##                     Max.   :665.0                      Max.   :665.0  
##                                                                       
##  to_station_name      usertype            gender            birthyear    
##  Length:365069      Length:365069      Length:365069      Min.   :1900   
##  Class :character   Class :character   Class :character   1st Qu.:1975   
##  Mode  :character   Mode  :character   Mode  :character   Median :1985   
##                                                           Mean   :1982   
##                                                           3rd Qu.:1990   
##                                                           Max.   :2003   
##                                                           NA's   :18023

Loading Dataset for 2020

Divy2020 <- read.csv("Divy2020.csv")

Viewing the summary of Datasets for First Quarter of 2020

summary(Divy2020)
##    ride_id          rideable_type       started_at          ended_at        
##  Length:426887      Length:426887      Length:426887      Length:426887     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  start_station_name start_station_id end_station_name   end_station_id 
##  Length:426887      Min.   :  2.0    Length:426887      Min.   :  2.0  
##  Class :character   1st Qu.: 77.0    Class :character   1st Qu.: 77.0  
##  Mode  :character   Median :176.0    Mode  :character   Median :175.0  
##                     Mean   :209.8                       Mean   :209.3  
##                     3rd Qu.:298.0                       3rd Qu.:297.0  
##                     Max.   :675.0                       Max.   :675.0  
##                                                         NA's   :1      
##    start_lat       start_lng         end_lat         end_lng      
##  Min.   :41.74   Min.   :-87.77   Min.   :41.74   Min.   :-87.77  
##  1st Qu.:41.88   1st Qu.:-87.66   1st Qu.:41.88   1st Qu.:-87.66  
##  Median :41.89   Median :-87.64   Median :41.89   Median :-87.64  
##  Mean   :41.90   Mean   :-87.64   Mean   :41.90   Mean   :-87.64  
##  3rd Qu.:41.92   3rd Qu.:-87.63   3rd Qu.:41.92   3rd Qu.:-87.63  
##  Max.   :42.06   Max.   :-87.55   Max.   :42.06   Max.   :-87.55  
##                                   NA's   :1       NA's   :1       
##  member_casual     
##  Length:426887     
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Observing the glimpses of Divvy_Trips_2019_Q1 public dataset for 2019 Q1 and Divvy_Trips_2020_q1 for 2020 Q1, i think we can find values like trip length, days of the week and distance for further analysis. Then we can work on cleaning process like convert columns to represent correct data type, rename, add and delete columns to be consistent with data from another year. Some incomplete rows probably needs to be excluded in analysis. I am assigned with the First Question to analyze the bike use patterns by casual riders and members. Observing the data from 2019, the column name tripduration is shown in seconds and doesn’t look consistant, so finding out new column name trip_time with the difference from start_time and end_time.

Data Cleaning (2019)

First let’s see if we can find existing tripduration in decending order and be able to view the higher second amount first.

Divy2019 <- Divy2019 %>% arrange(desc(tripduration))

Since it is still showing data between 999.0 and 1.000 only, i tried to convert seconds into minutes. Which endup creating an error.

Create whole new column ‘trip_time’ in minutes from the difference of start_time and end_time column of the Table Divy2019 using mutate function. To use the function we must make sure both data type is formatted as date, which is done using the code ’as.POSIXct and must have same format as “%Y-%m-%d %H:%M:%S”.

Divy2019 <- Divy2019 %>%
  mutate(
    start_time = as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S"),
    end_time = as.POSIXct(end_time, format = "%Y-%m-%d %H:%M:%S"),
    trip_time = as.numeric(difftime(end_time, start_time, units = "mins"))
  )

Now to view three columns start_time, trip_time and usertype using head function. .

head(Divy2019 %>%
  select(start_time, trip_time, usertype))
##            start_time trip_time   usertype
## 1 2019-01-02 06:54:00     16.65 Subscriber
## 2 2019-01-03 07:12:47     16.65 Subscriber
## 3 2019-01-03 14:14:15     16.65 Subscriber
## 4 2019-01-04 16:59:58     16.65 Subscriber
## 5 2019-01-06 00:53:23     16.65 Subscriber
## 6 2019-01-08 08:31:25     16.65 Subscriber

lets change the column ‘usertpe’ to ‘member_casual’ to make consistent with the column from 2020.

Divy2019 <- Divy2019 %>%
  rename(member_casual = usertype)

Now changing all Customer to casual

Divy2019 <- Divy2019 %>%
  mutate(member_casual = ifelse(member_casual == "Customer", "casual",
  member_casual))

and all Subscriber to member to match with the data from 2020.

Divy2019 <- Divy2019 %>% 
  mutate(member_casual = ifelse(member_casual == "Subscriber", "member",
  member_casual))

Now we would like to find the day of the week the trip was started. Lets use mutate functions to get the information.

Divy2019 <- Divy2019 %>%
  mutate(
    start_time = as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S"),
    start_day = weekdays(start_time)
  )

I observed some discrepancies in other column name for 2019 and 2020, lets change the column name trip_id in 2019 to ride_id same way like 2020 for consistency. Lets change other columns in 2019 to match with those at 2020.

Divy2019 <- Divy2019 %>%
  rename(
    ride_id = trip_id,
    started_at = start_time,
    ended_at = end_time,
    start_station_name = from_station_name,
    end_station_name = to_station_name,
    start_station_id = from_station_id,
    end_station_id = to_station_id,
  )

Exclude time part from column start_time, (only shows the date)

Divy2019 <- Divy2019 %>%
  mutate(start_date = as.Date(started_at))

Remove bikeid from Divy2019 because there is no same information in Divy2020

Divy2019 <- Divy2019 %>%
  select(-bikeid)

Remove tripduration also, because we already calculated new trip_time in minutes.

Divy2019 <- Divy2019 %>%
  select(-tripduration)

Now let’s see count, sum and average trip_time by member type.

trip_summary <- Divy2019 %>%
  group_by(member_casual) %>%
  summarise(total_trip_time = sum(trip_time, na.rm = TRUE),
            trip_number = n(),
            average_trip_time = mean(trip_time, na.rm = TRUE),
            )
print(trip_summary)
## # A tibble: 2 × 4
##   member_casual total_trip_time trip_number average_trip_time
##   <chr>                   <dbl>       <int>             <dbl>
## 1 casual               1434461.       23163              61.9
## 2 member               4749456.      341906              13.9

remove trip_time decimals to 2 digit.

Divy2019 <- Divy2019 %>%
  mutate(trip_time = round(trip_time, 2))

Data Cleaning (2020)

Looking at the Data for 2020 Q1, there is some errors, inconsistencies and missing values that we need to fix before analyzing. Divy2020 needs tripduration calculated in minutes based on started_at and ended_at time. Calculate the start of the weekday for further analysis. Lets begin with calculation of trip_time in minutes. Calculation of trip_time in minutes, like above in 2019.

Divy2020 <- Divy2020 %>%
  mutate(
    started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S"),
    ended_at = as.POSIXct(ended_at, format = "%Y-%m-%d %H:%M:%S"),
    trip_time = as.numeric(difftime(ended_at, started_at, units = "mins"))
  )

Now lets remove the negative and values less than 1 minute.

Divy2020 <- Divy2020 %>%
  filter(!is.na(trip_time) & trip_time >= 1)

Limit trip_time to 2 digits after decimals.

Divy2020 <- Divy2020 %>%
  mutate(trip_time = round(trip_time, 2))

Now we need to find the day of the week trip was started, lets find out by using some functions. as.PPSIXct function is used to assign the value as date and time for calculation.

Divy2020 <- Divy2020 %>%
  mutate(
    started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S"),
    start_day = weekdays(started_at)
  )

Find only the date from started_at column and remove time part.

Divy2020 <- Divy2020 %>%
  mutate(start_date = as.Date(started_at))

Lets remove the few columns that we will not need.

Divy2020 <- Divy2020 %>%
  select(-rideable_type)

It would be cool if we can also find the distance in miles based on Longitude and Lattitude of the start and end stations. Found out we need to install and load library(geosphere) for the calculation. SORRY !!! I was not able to show the firs part of code (to convert latitude, longitude into miles) due to some error on Knit process1. But here is the result with Miles.

head(Divy2020 %>%
  select(start_lat, start_lng, end_lat, end_lng))
##   start_lat start_lng end_lat  end_lng
## 1   41.9665  -87.6884 41.9671 -87.6674
## 2   41.9616  -87.6660 41.9542 -87.6644
## 3   41.9401  -87.6455 41.9402 -87.6530
## 4   41.8846  -87.6319 41.8918 -87.6206
## 5   41.8856  -87.6418 41.8899 -87.6343
## 6   41.8899  -87.6343 41.8846 -87.6446

Limit two digits after decimal. I did this but unable to show the code, probably linked to how i was unable to show the distance based on the start and end longitude, latitude.

Let’s try finding distance for the Divy2019 but this time with the physical address.

This looks like a little too advance requiring APIs and plugins.

Now to calculate the distance! Sorry!!! i have to practice little more to get to this point!

Visualization

I tried to combine the two datasets first but got the error on rider_id types, one is integer and another is character. Lets chance Divy2019 rider_id to char.

Divy2019$ride_id <- as.character(Divy2019$ride_id)

Keeping the business needs in mind to solve the problem, lets find the insights of data. First of all lets combine both datasets.

Divy2019 <- Divy2019 %>% mutate(dataset = "Divy2019")
Divy2020 <- Divy2020 %>% mutate(dataset = "Divy2020")

Creating a new dataset called DivyCombined with both datasets together.

DivyCombined <- bind_rows(Divy2019, Divy2020)

Create another table, DivyTripsDays with on days of the week and trip_time by member type!!

DivyTripDays <- DivyCombined %>%
  group_by(start_day, member_casual) %>%
  summarise(
    total_trip_time = sum(trip_time, na.rm = TRUE),
    max_trip_time = max(trip_time, na.rm = TRUE),
    avg_trip_time = mean(trip_time, na.rm = TRUE),  
    count_trips = n()
  ) %>%
  arrange(member_casual)
## `summarise()` has grouped output by 'start_day'. You can override using the
## `.groups` argument.
print(DivyTripDays)
## # A tibble: 14 × 6
## # Groups:   start_day [7]
##    start_day member_casual total_trip_time max_trip_time avg_trip_time
##    <chr>     <chr>                   <dbl>         <dbl>         <dbl>
##  1 Friday    casual                813170.       117263.         102. 
##  2 Monday    casual                442705.        39328.          79.3
##  3 Saturday  casual               1110728.        65602.          82.5
##  4 Sunday    casual               1572698.       143877.          84.6
##  5 Thursday  casual               1006545.       177140.         141. 
##  6 Tuesday   casual                555723.        69445.          76.2
##  7 Wednesday casual                574220.        74703.          75.0
##  8 Friday    member               1528997.        65448.          13.3
##  9 Monday    member               1513081.        94103.          13.8
## 10 Saturday  member                964068.        79158.          16.3
## 11 Sunday    member                975177.        43215.          16.3
## 12 Thursday  member               1475820.        12659.          11.8
## 13 Tuesday   member               1640952.       101607.          12.9
## 14 Wednesday member               1446324.        11097.          11.9
## # ℹ 1 more variable: count_trips <int>

Lets create first visuals as bar chart from the DivyTripDats table we just created.

custom_colors <- c("member" = "#7bebd8", "casual" = "#8335e5")
ggplot(DivyTripDays, aes(x = start_day, y = total_trip_time, fill = member_casual)) +
  geom_bar(stat = "identity", fun = "mean", position = "dodge") +
  scale_y_continuous(labels = scales::label_number(scale = 1)) +
  scale_fill_manual(values = custom_colors) +
  labs(title = "Trip Duration by Member Type",
       x = "Start Day of the Trip",
       y = "Total Trip Duration in Minutes",
       fill = "Member Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "identity", fun = "mean", position = "dodge"):
## Ignoring unknown parameters: `fun`

We can also check, if there is any relationship between membership type and daverage trip time in minutes.

custom_colors <- c("member" = "#7bebd8", "casual" = "#8335e5")
ggplot(DivyTripDays, aes(x = start_day, y = avg_trip_time, fill = member_casual)) +
  geom_bar(stat = "identity", fun = "mean", position = "dodge") +
  scale_y_continuous(labels = scales::label_number(scale = 1)) +
  scale_fill_manual(values = custom_colors) +
  labs(title = "Average Trip Time by Member Type",
       x = "Start Day of the Trip",
       y = "Average Trips Time in Minutes",
       fill = "Member Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "identity", fun = "mean", position = "dodge"):
## Ignoring unknown parameters: `fun`

We can also check, if there is any relationship between membership type and Number of trips they complete.

custom_colors <- c("member" = "#7bebd8", "casual" = "#8335e5")
ggplot(DivyTripDays, aes(x = start_day, y = count_trips, fill = member_casual)) +
  geom_bar(stat = "identity", fun = "mean", position = "dodge") +
  scale_y_continuous(labels = scales::label_number(scale = 1)) + 
  scale_fill_manual(values = custom_colors) +
  labs(title = "Number of Trip by Member Type",
       x = "Start Day of the Trip",
       y = "Number of Trips",
       fill = "Member Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "identity", fun = "mean", position = "dodge"):
## Ignoring unknown parameters: `fun`

Like to observe how the member and casual riders use bicycle from and to locations. Longer the distance the bicycle used, probably less chance the bike can be used by others.

Viewing the line chart based on customer type.

ggplot(DivyTripDays, aes(x = start_day, y = count_trips, color = member_casual, group = member_casual)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  scale_color_manual(values = c("member" = "#7bebd8", "casual" = "#8335e5")) +
  labs(title = "Trip Counts Over Days by Member Type",
       x = "Start Day",
       y = "Total Trips",
       color = "Member Type") +
  theme_minimal() 
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Summary <- DivyCombined %>%
  group_by(member_casual) %>%
  summarise(average_time = mean(trip_time, na.rm = TRUE))

ggplot(Summary, aes(x = "", y = average_time, fill = member_casual)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Average Trip Time vs Member Type",
       x = "",
       y = "",
       fill = "Member Type") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  scale_fill_manual(values = c("member" = "#7bebd8", "casual" = "#8335e5"))

Now to explore more data stories. Lets find Count of rides and sum of trip_time by member type!!

DivyTripMember <- DivyCombined %>%
  group_by(member_casual) %>%
  summarise(
    trip_count = n(),
    total_trip_time = sum(trip_time, na.rm = TRUE),
    maxl_trip_time = max(trip_time, na.rm = TRUE),
    average_trip_time = mean(trip_time, na.rm = TRUE)
  )
print(DivyTripMember)
## # A tibble: 2 × 5
##   member_casual trip_count total_trip_time maxl_trip_time average_trip_time
##   <chr>              <int>           <dbl>          <dbl>             <dbl>
## 1 casual             67692        6075789.        177140.              89.8
## 2 member            716593        9544417.        101607.              13.3

Male VS Female in 2019 (Membership Type)

gender_count <- DivyCombined %>%
  filter(gender %in% c("Male", "Female")) %>%
  group_by(gender, member_casual) %>%
  summarise(count = n(), .groups = "drop")
custom_colors <- c("Male" = "#7bebd8", "Female" = "#8335e5")
ggplot(gender_count, aes(x = member_casual, y = count, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.6) + 
  geom_text(aes(label = count), 
  position = position_dodge(width = 1), vjust = -.1, size = 5,color = "#6929c4") +
  scale_y_continuous(labels = scales::label_number(scale = 1)) + 
  scale_fill_manual(values = custom_colors) + 
  labs(title = "Count of Gender by Member Type 2019",
       x = "Gender",
       y = "Count of Gender",
       fill = "Member Type") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12))

Now to view the distance based behaviour of user type, i couldn’t get it by the code inside RMarkdown.
avg_trip_distance <- DivyCombined %>% group_by(member_casual) %>% summarise(avg_trip_miles = mean(trip_miles, na.rm = TRUE)) custom_colors <- c(“member” = “#0072c3”, “casual” = “#007d79”) ggplot(avg_trip_distance, aes(x = member_casual, y = avg_trip_miles, fill = member_casual)) + geom_bar(stat = “identity”, width = 0.6) + scale_fill_manual(values = custom_colors) + labs(title = “2020 Average Trip Distance by Member Type”, x = “Member Type”, y = “Avg Trip Distance (miles)”, fill = “Member Type”) + theme_minimal() + theme(axis.text.x = element_text(size = 12), axis.text.y = element_text(size = 12))

Kept getting the error!

processing file: DataAnyliticsCapstone.Rmd Error in summarise(): ℹ In argument: avg_trip_miles = mean(trip_miles, na.rm = TRUE). ℹ In group 1: member_casual = "casual". Caused by error: ! object ‘trip_miles’ not found Backtrace:

I am uploading the picture i made from R Script. R-Script Generated Image

Findings

Limitations

There were some limitations on data analysis. - Distance is based solely on the data from 2020 and gender is based on 2019 data. - Trip time includes the time as little as 1 minutes and those rows with negative time is completely removed for analysis. - Data is used only for First quarters of 2019 and 2020, which doesn’t show the trend in summer, which might be busier. - Data is limited to a single city of the US. - Distance for 2020 is calculated based on straight line between two sets of Longitude and Latitude, not the actual length of the road which mught be more.

Analysis

Action (Recommendation)

Download Complete PDF Report