Since I decided to use R for Data Exploration, Cleaning, Analyzing, Visualization and Interpretation, I need to install some R packages to work with. Which i did during the learning phase. I can install packages using install.packages(dplyr) and install.packages(ggplot2) and load them using library(dplyr) and library(ggplot2) functions to use in the session.
Setting up RMarkdown with Second Chunk of Codes, (First one is used for heading and information above.) This is to load ‘dplyr’ and ‘ggplot2’ in RMarkdown environment.
So far I don’t know how the dataset looks like. I decided to use RMarkdown to view the dataset and also be able to keep track of entire data cleaning and visualization process. I just finished asking questions and understanding business needs better even before seeing the data and information it contains. With the clear understandin of the objectives and requirements, I come uo with three major questions.
I already downloaded Divvy_Trips_2019/2020_q1.csv datasets, copied into main RDirectory extracted .zip file and renamed them as Divy2019 and Divy2020, which is located at the main folder. Lets import and rename it as ‘Divy2019’ in RMarkdown.
Divy2019 <- read.csv("Divy2019.csv")
summary(Divy2019)
## trip_id start_time end_time bikeid
## Min. :21742443 Length:365069 Length:365069 Min. : 1
## 1st Qu.:21848765 Class :character Class :character 1st Qu.:1777
## Median :21961829 Mode :character Mode :character Median :3489
## Mean :21960872 Mean :3429
## 3rd Qu.:22071823 3rd Qu.:5157
## Max. :22178528 Max. :6471
##
## tripduration from_station_id from_station_name to_station_id
## Length:365069 Min. : 2.0 Length:365069 Min. : 2.0
## Class :character 1st Qu.: 76.0 Class :character 1st Qu.: 76.0
## Mode :character Median :170.0 Mode :character Median :168.0
## Mean :198.1 Mean :198.6
## 3rd Qu.:287.0 3rd Qu.:287.0
## Max. :665.0 Max. :665.0
##
## to_station_name usertype gender birthyear
## Length:365069 Length:365069 Length:365069 Min. :1900
## Class :character Class :character Class :character 1st Qu.:1975
## Mode :character Mode :character Mode :character Median :1985
## Mean :1982
## 3rd Qu.:1990
## Max. :2003
## NA's :18023
Divy2020 <- read.csv("Divy2020.csv")
summary(Divy2020)
## ride_id rideable_type started_at ended_at
## Length:426887 Length:426887 Length:426887 Length:426887
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_name start_station_id end_station_name end_station_id
## Length:426887 Min. : 2.0 Length:426887 Min. : 2.0
## Class :character 1st Qu.: 77.0 Class :character 1st Qu.: 77.0
## Mode :character Median :176.0 Mode :character Median :175.0
## Mean :209.8 Mean :209.3
## 3rd Qu.:298.0 3rd Qu.:297.0
## Max. :675.0 Max. :675.0
## NA's :1
## start_lat start_lng end_lat end_lng
## Min. :41.74 Min. :-87.77 Min. :41.74 Min. :-87.77
## 1st Qu.:41.88 1st Qu.:-87.66 1st Qu.:41.88 1st Qu.:-87.66
## Median :41.89 Median :-87.64 Median :41.89 Median :-87.64
## Mean :41.90 Mean :-87.64 Mean :41.90 Mean :-87.64
## 3rd Qu.:41.92 3rd Qu.:-87.63 3rd Qu.:41.92 3rd Qu.:-87.63
## Max. :42.06 Max. :-87.55 Max. :42.06 Max. :-87.55
## NA's :1 NA's :1
## member_casual
## Length:426887
## Class :character
## Mode :character
##
##
##
##
Observing the glimpses of Divvy_Trips_2019_Q1 public dataset for 2019 Q1 and Divvy_Trips_2020_q1 for 2020 Q1, i think we can find values like trip length, days of the week and distance for further analysis. Then we can work on cleaning process like convert columns to represent correct data type, rename, add and delete columns to be consistent with data from another year. Some incomplete rows probably needs to be excluded in analysis. I am assigned with the First Question to analyze the bike use patterns by casual riders and members. Observing the data from 2019, the column name tripduration is shown in seconds and doesn’t look consistant, so finding out new column name trip_time with the difference from start_time and end_time.
First let’s see if we can find existing tripduration in decending order and be able to view the higher second amount first.
Divy2019 <- Divy2019 %>% arrange(desc(tripduration))
Since it is still showing data between 999.0 and 1.000 only, i tried to convert seconds into minutes. Which endup creating an error.
Create whole new column ‘trip_time’ in minutes from the difference of start_time and end_time column of the Table Divy2019 using mutate function. To use the function we must make sure both data type is formatted as date, which is done using the code ’as.POSIXct and must have same format as “%Y-%m-%d %H:%M:%S”.
Divy2019 <- Divy2019 %>%
mutate(
start_time = as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S"),
end_time = as.POSIXct(end_time, format = "%Y-%m-%d %H:%M:%S"),
trip_time = as.numeric(difftime(end_time, start_time, units = "mins"))
)
Now to view three columns start_time, trip_time and usertype using head function. .
head(Divy2019 %>%
select(start_time, trip_time, usertype))
## start_time trip_time usertype
## 1 2019-01-02 06:54:00 16.65 Subscriber
## 2 2019-01-03 07:12:47 16.65 Subscriber
## 3 2019-01-03 14:14:15 16.65 Subscriber
## 4 2019-01-04 16:59:58 16.65 Subscriber
## 5 2019-01-06 00:53:23 16.65 Subscriber
## 6 2019-01-08 08:31:25 16.65 Subscriber
lets change the column ‘usertpe’ to ‘member_casual’ to make consistent with the column from 2020.
Divy2019 <- Divy2019 %>%
rename(member_casual = usertype)
Now changing all Customer to casual
Divy2019 <- Divy2019 %>%
mutate(member_casual = ifelse(member_casual == "Customer", "casual",
member_casual))
and all Subscriber to member to match with the data from 2020.
Divy2019 <- Divy2019 %>%
mutate(member_casual = ifelse(member_casual == "Subscriber", "member",
member_casual))
Now we would like to find the day of the week the trip was started. Lets use mutate functions to get the information.
Divy2019 <- Divy2019 %>%
mutate(
start_time = as.POSIXct(start_time, format = "%Y-%m-%d %H:%M:%S"),
start_day = weekdays(start_time)
)
I observed some discrepancies in other column name for 2019 and 2020, lets change the column name trip_id in 2019 to ride_id same way like 2020 for consistency. Lets change other columns in 2019 to match with those at 2020.
Divy2019 <- Divy2019 %>%
rename(
ride_id = trip_id,
started_at = start_time,
ended_at = end_time,
start_station_name = from_station_name,
end_station_name = to_station_name,
start_station_id = from_station_id,
end_station_id = to_station_id,
)
Exclude time part from column start_time, (only shows the date)
Divy2019 <- Divy2019 %>%
mutate(start_date = as.Date(started_at))
Remove bikeid from Divy2019 because there is no same information in Divy2020
Divy2019 <- Divy2019 %>%
select(-bikeid)
Remove tripduration also, because we already calculated new trip_time in minutes.
Divy2019 <- Divy2019 %>%
select(-tripduration)
Now let’s see count, sum and average trip_time by member type.
trip_summary <- Divy2019 %>%
group_by(member_casual) %>%
summarise(total_trip_time = sum(trip_time, na.rm = TRUE),
trip_number = n(),
average_trip_time = mean(trip_time, na.rm = TRUE),
)
print(trip_summary)
## # A tibble: 2 × 4
## member_casual total_trip_time trip_number average_trip_time
## <chr> <dbl> <int> <dbl>
## 1 casual 1434461. 23163 61.9
## 2 member 4749456. 341906 13.9
remove trip_time decimals to 2 digit.
Divy2019 <- Divy2019 %>%
mutate(trip_time = round(trip_time, 2))
Looking at the Data for 2020 Q1, there is some errors, inconsistencies and missing values that we need to fix before analyzing. Divy2020 needs tripduration calculated in minutes based on started_at and ended_at time. Calculate the start of the weekday for further analysis. Lets begin with calculation of trip_time in minutes. Calculation of trip_time in minutes, like above in 2019.
Divy2020 <- Divy2020 %>%
mutate(
started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S"),
ended_at = as.POSIXct(ended_at, format = "%Y-%m-%d %H:%M:%S"),
trip_time = as.numeric(difftime(ended_at, started_at, units = "mins"))
)
Now lets remove the negative and values less than 1 minute.
Divy2020 <- Divy2020 %>%
filter(!is.na(trip_time) & trip_time >= 1)
Limit trip_time to 2 digits after decimals.
Divy2020 <- Divy2020 %>%
mutate(trip_time = round(trip_time, 2))
Now we need to find the day of the week trip was started, lets find out by using some functions. as.PPSIXct function is used to assign the value as date and time for calculation.
Divy2020 <- Divy2020 %>%
mutate(
started_at = as.POSIXct(started_at, format = "%Y-%m-%d %H:%M:%S"),
start_day = weekdays(started_at)
)
Find only the date from started_at column and remove time part.
Divy2020 <- Divy2020 %>%
mutate(start_date = as.Date(started_at))
Lets remove the few columns that we will not need.
Divy2020 <- Divy2020 %>%
select(-rideable_type)
It would be cool if we can also find the distance in miles based on Longitude and Lattitude of the start and end stations. Found out we need to install and load library(geosphere) for the calculation. SORRY !!! I was not able to show the firs part of code (to convert latitude, longitude into miles) due to some error on Knit process1. But here is the result with Miles.
head(Divy2020 %>%
select(start_lat, start_lng, end_lat, end_lng))
## start_lat start_lng end_lat end_lng
## 1 41.9665 -87.6884 41.9671 -87.6674
## 2 41.9616 -87.6660 41.9542 -87.6644
## 3 41.9401 -87.6455 41.9402 -87.6530
## 4 41.8846 -87.6319 41.8918 -87.6206
## 5 41.8856 -87.6418 41.8899 -87.6343
## 6 41.8899 -87.6343 41.8846 -87.6446
Limit two digits after decimal. I did this but unable to show the code, probably linked to how i was unable to show the distance based on the start and end longitude, latitude.
Let’s try finding distance for the Divy2019 but this time with the physical address.
This looks like a little too advance requiring APIs and plugins.
Now to calculate the distance! Sorry!!! i have to practice little more to get to this point!
I tried to combine the two datasets first but got the error on rider_id types, one is integer and another is character. Lets chance Divy2019 rider_id to char.
Divy2019$ride_id <- as.character(Divy2019$ride_id)
Keeping the business needs in mind to solve the problem, lets find the insights of data. First of all lets combine both datasets.
Divy2019 <- Divy2019 %>% mutate(dataset = "Divy2019")
Divy2020 <- Divy2020 %>% mutate(dataset = "Divy2020")
Creating a new dataset called DivyCombined with both datasets together.
DivyCombined <- bind_rows(Divy2019, Divy2020)
Create another table, DivyTripsDays with on days of the week and trip_time by member type!!
DivyTripDays <- DivyCombined %>%
group_by(start_day, member_casual) %>%
summarise(
total_trip_time = sum(trip_time, na.rm = TRUE),
max_trip_time = max(trip_time, na.rm = TRUE),
avg_trip_time = mean(trip_time, na.rm = TRUE),
count_trips = n()
) %>%
arrange(member_casual)
## `summarise()` has grouped output by 'start_day'. You can override using the
## `.groups` argument.
print(DivyTripDays)
## # A tibble: 14 × 6
## # Groups: start_day [7]
## start_day member_casual total_trip_time max_trip_time avg_trip_time
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Friday casual 813170. 117263. 102.
## 2 Monday casual 442705. 39328. 79.3
## 3 Saturday casual 1110728. 65602. 82.5
## 4 Sunday casual 1572698. 143877. 84.6
## 5 Thursday casual 1006545. 177140. 141.
## 6 Tuesday casual 555723. 69445. 76.2
## 7 Wednesday casual 574220. 74703. 75.0
## 8 Friday member 1528997. 65448. 13.3
## 9 Monday member 1513081. 94103. 13.8
## 10 Saturday member 964068. 79158. 16.3
## 11 Sunday member 975177. 43215. 16.3
## 12 Thursday member 1475820. 12659. 11.8
## 13 Tuesday member 1640952. 101607. 12.9
## 14 Wednesday member 1446324. 11097. 11.9
## # ℹ 1 more variable: count_trips <int>
Lets create first visuals as bar chart from the DivyTripDats table we just created.
custom_colors <- c("member" = "#7bebd8", "casual" = "#8335e5")
ggplot(DivyTripDays, aes(x = start_day, y = total_trip_time, fill = member_casual)) +
geom_bar(stat = "identity", fun = "mean", position = "dodge") +
scale_y_continuous(labels = scales::label_number(scale = 1)) +
scale_fill_manual(values = custom_colors) +
labs(title = "Trip Duration by Member Type",
x = "Start Day of the Trip",
y = "Total Trip Duration in Minutes",
fill = "Member Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "identity", fun = "mean", position = "dodge"):
## Ignoring unknown parameters: `fun`
We can also check, if there is any relationship between membership type and daverage trip time in minutes.
custom_colors <- c("member" = "#7bebd8", "casual" = "#8335e5")
ggplot(DivyTripDays, aes(x = start_day, y = avg_trip_time, fill = member_casual)) +
geom_bar(stat = "identity", fun = "mean", position = "dodge") +
scale_y_continuous(labels = scales::label_number(scale = 1)) +
scale_fill_manual(values = custom_colors) +
labs(title = "Average Trip Time by Member Type",
x = "Start Day of the Trip",
y = "Average Trips Time in Minutes",
fill = "Member Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "identity", fun = "mean", position = "dodge"):
## Ignoring unknown parameters: `fun`
We can also check, if there is any relationship between membership type and Number of trips they complete.
custom_colors <- c("member" = "#7bebd8", "casual" = "#8335e5")
ggplot(DivyTripDays, aes(x = start_day, y = count_trips, fill = member_casual)) +
geom_bar(stat = "identity", fun = "mean", position = "dodge") +
scale_y_continuous(labels = scales::label_number(scale = 1)) +
scale_fill_manual(values = custom_colors) +
labs(title = "Number of Trip by Member Type",
x = "Start Day of the Trip",
y = "Number of Trips",
fill = "Member Type") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "identity", fun = "mean", position = "dodge"):
## Ignoring unknown parameters: `fun`
Like to observe how the member and casual riders use bicycle from and to locations. Longer the distance the bicycle used, probably less chance the bike can be used by others.
Viewing the line chart based on customer type.
ggplot(DivyTripDays, aes(x = start_day, y = count_trips, color = member_casual, group = member_casual)) +
geom_line(size = 1) +
geom_point(size = 2) +
scale_color_manual(values = c("member" = "#7bebd8", "casual" = "#8335e5")) +
labs(title = "Trip Counts Over Days by Member Type",
x = "Start Day",
y = "Total Trips",
color = "Member Type") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Summary <- DivyCombined %>%
group_by(member_casual) %>%
summarise(average_time = mean(trip_time, na.rm = TRUE))
ggplot(Summary, aes(x = "", y = average_time, fill = member_casual)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(title = "Average Trip Time vs Member Type",
x = "",
y = "",
fill = "Member Type") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
scale_fill_manual(values = c("member" = "#7bebd8", "casual" = "#8335e5"))
Now to explore more data stories. Lets find Count of rides and sum of trip_time by member type!!
DivyTripMember <- DivyCombined %>%
group_by(member_casual) %>%
summarise(
trip_count = n(),
total_trip_time = sum(trip_time, na.rm = TRUE),
maxl_trip_time = max(trip_time, na.rm = TRUE),
average_trip_time = mean(trip_time, na.rm = TRUE)
)
print(DivyTripMember)
## # A tibble: 2 × 5
## member_casual trip_count total_trip_time maxl_trip_time average_trip_time
## <chr> <int> <dbl> <dbl> <dbl>
## 1 casual 67692 6075789. 177140. 89.8
## 2 member 716593 9544417. 101607. 13.3
Male VS Female in 2019 (Membership Type)
gender_count <- DivyCombined %>%
filter(gender %in% c("Male", "Female")) %>%
group_by(gender, member_casual) %>%
summarise(count = n(), .groups = "drop")
custom_colors <- c("Male" = "#7bebd8", "Female" = "#8335e5")
ggplot(gender_count, aes(x = member_casual, y = count, fill = gender)) +
geom_bar(stat = "identity", position = "dodge", width = 0.6) +
geom_text(aes(label = count),
position = position_dodge(width = 1), vjust = -.1, size = 5,color = "#6929c4") +
scale_y_continuous(labels = scales::label_number(scale = 1)) +
scale_fill_manual(values = custom_colors) +
labs(title = "Count of Gender by Member Type 2019",
x = "Gender",
y = "Count of Gender",
fill = "Member Type") +
theme_minimal() +
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12))
Now to view the distance based behaviour of user type, i couldn’t get
it by the code inside RMarkdown.
avg_trip_distance <- DivyCombined %>% group_by(member_casual)
%>% summarise(avg_trip_miles = mean(trip_miles, na.rm = TRUE))
custom_colors <- c(“member” = “#0072c3”, “casual” = “#007d79”)
ggplot(avg_trip_distance, aes(x = member_casual, y = avg_trip_miles,
fill = member_casual)) + geom_bar(stat = “identity”, width = 0.6) +
scale_fill_manual(values = custom_colors) + labs(title = “2020 Average
Trip Distance by Member Type”, x = “Member Type”, y = “Avg Trip Distance
(miles)”, fill = “Member Type”) + theme_minimal() + theme(axis.text.x =
element_text(size = 12), axis.text.y = element_text(size = 12))
Kept getting the error!
processing file: DataAnyliticsCapstone.Rmd Error in
summarise()
: ℹ In argument:
avg_trip_miles = mean(trip_miles, na.rm = TRUE)
. ℹ In group
1: member_casual = "casual"
. Caused by error: ! object
‘trip_miles’ not found Backtrace:
I am uploading the picture i made from R Script.
In average, Casual customers ride the bike 4.5 times longer per trip, compared to members.
Member exceeds the total trip time everyday to the casual customers. Casual customer exceeds number of trips on weekends.
Members are using the ride frequently making higher numbers of trips but for short period of time.
2019 Data shows, Male to Female ratio of Members is 4.2:1 and casual customer males are little over double the female riders.
Based on 2020 Longitide and Latitude information, In average, Casual customers travel longer distance compared to Members.
There were some limitations on data analysis. - Distance is based solely on the data from 2020 and gender is based on 2019 data. - Trip time includes the time as little as 1 minutes and those rows with negative time is completely removed for analysis. - Data is used only for First quarters of 2019 and 2020, which doesn’t show the trend in summer, which might be busier. - Data is limited to a single city of the US. - Distance for 2020 is calculated based on straight line between two sets of Longitude and Latitude, not the actual length of the road which mught be more.
Based on the business needs and questions to be answered, I can observe from the data that the Members prefer to use the bike for work. looks like they use bike service to go and come back from the work.
Members are using the bike in consistent basis every weekday but drops during weekends. Casual customers doesn’t have specific riding pattern other than Monday being least used.
When bike is taken away for longer time and to the farthest stations there might be less chance of being used because of limited number of customers in farthest destinations.
There is an opportunity to increase Female membership since they are only 19% of the Male Members, where casual female riders are 32%.