One of the blind spots of the public bicycle policy is that passengers do not drive round-trip, but mostly one-way. New Yorkers often use shared bicycles during rush hour. After people use bicycles a lot, it often happens that there is no bicycle for others to use at the bicycle stop or there is no empty stand to return bicycle. In order to minimize the occurrence of this, various strategies are needed to redistribute bicycles at the right place and time. The first step to accomplish this mission is to accurately predict the demand for each stand. Due to physical limitations in calculating New York City’s annual bicycle usage, this study was limited to only for five weeks from September to December.
The demand for shared bicycles varies greatly depending on the time of the day. Therefore, this study focused on predicting the demand for bicycles used at stops every hour. The unit of 1 hour is not a big problem in conducting this study as a framework, but it is also a long time to accurately predict the peak of bicycle use during rush hours. Since most of New York City’s can be reached anywhere by bicycle in less than 15 minutes, subdivision of this 1 hour unit into 15 minutes will result in a higher resolution of predictions.
This is section for libraries, functions and overall styling settings.
# Rmarkdown global setting
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(fig.align = 'center')
knitr::opts_chunk$set(results = 'hide')
knitr::opts_chunk$set(fig.keep = 'all')
#----------------------------------------------------------------------------------------------------------
# Import libraries
library(tidyverse)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)# plot correlation plot
library(corrplot)
library(corrr) # another way to plot correlation plot
library(kableExtra)
library(jtools) # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr) # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(knitr)
library(rmarkdown)
library(RSocrata)
library(viridis)
library(ggplot2)
library(stargazer)
library(XML)
library(data.table)
library(ggpmisc)
library(patchwork)
library(spatstat)
library(raster)
library(classInt) # for KDE and ML risk class intervals
library(tableHTML)
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(tigris)
library(riem)
library(RSocrata)
library(gganimate)
library(gifski)
#----------------------------------------------------------------------------------------------------------
# Temp
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"
# Etc
options(scipen=999)
options(tigris_class = "sf")
#----------------------------------------------------------------------------------------------------------
# functions
st_c <- st_coordinates
st_coid <- st_centroid
mapThememin <- function(base_size = 9, title_size = 12, small_size = 8) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size, colour = "black", hjust = 0.5),
plot.subtitle=element_text(size = base_size, colour = "black", hjust = 0.5, face="italic"),
plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
strip.text.x = element_text(size = small_size, face="italic"),
strip.text.y = element_text(size = small_size, face="italic"),
strip.background = element_rect(colour="transparent", fill="transparent"),
legend.title = element_text(size = small_size),
legend.text = element_text(size = small_size),
legend.key.size = unit(0.4, "cm"))
}
mapThememin2 <- function(base_size = 8, title_size = 10, small_size = 6) {
theme(
text = element_text( color = "black"),
plot.title = element_text(size = title_size, colour = "black", hjust = 0.5),
plot.subtitle=element_text(size = base_size, colour = "black", hjust = 0.5, face="italic"),
plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
axis.ticks = element_blank(),
axis.title = element_blank(),
axis.text = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
panel.background = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
strip.text.x = element_text(size = base_size),
strip.text.y = element_text(size = base_size),
strip.background = element_rect(colour="transparent", fill="transparent"),
legend.title = element_text(size = small_size),
legend.text = element_text(size = small_size),
legend.key.size = unit(0.25, "cm"))
}
corTheme <- function(base_size = 10, title_size = 12, small_size = 8){
theme(axis.text = element_blank(),
axis.ticks = element_blank(),
text = element_text(size = 10),
panel.background = element_rect(fill = greyPalette5[1]),
axis.title.x = element_text(size = small_size),
axis.title.y = element_text(size = small_size),
plot.subtitle = element_text(hjust = 0.5, size = base_size),
plot.title = element_text(hjust = 0.5, size = title_size),
plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
strip.background = element_rect(colour="transparent", fill="transparent"))
}
corTheme2 <- function(base_size = 10, title_size = 12, small_size = 8){
theme(axis.text = element_text(size = small_size),
text = element_text(size = 10),
panel.background = element_rect(fill = greyPalette5[1]),
axis.title.x = element_text(size = small_size),
axis.title.y = element_text(size = small_size),
plot.subtitle = element_text(hjust = 0.5, size = base_size, face="italic"),
plot.title = element_text(hjust = 0.5, size = title_size),
plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
strip.background = element_rect(colour="transparent", fill="transparent"),
strip.text.x = element_text(size = small_size, face="italic"),
strip.text.y = element_text(size = small_size, face="italic"),
legend.text = element_text(size = small_size),
legend.title = element_text(size = small_size),
legend.key.size = unit(0.4, "cm"))
}
corTheme3 <- function(base_size = 9, title_size = 11, small_size = 7){
theme(axis.text = element_text(size = small_size),
text = element_text(size = 10),
panel.background = element_rect(fill = greyPalette5[1]),
axis.title.x = element_text(size = small_size),
axis.title.y = element_text(size = small_size),
plot.subtitle = element_text(hjust = 0.5, size = base_size, face="italic"),
plot.title = element_text(hjust = 0.5, size = title_size),
plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
legend.text = element_text(size = small_size))
}
corTheme4 <- function(base_size = 9, title_size = 11, small_size = 7){
theme(axis.text = element_text(size = small_size),
text = element_text(size = 10),
panel.background = element_rect(fill = greyPalette5[1]),
axis.title.x = element_text(size = small_size),
axis.title.y.right = element_text(size = small_size),
plot.subtitle = element_text(hjust = 0.5, size = base_size, face="italic"),
plot.title = element_text(hjust = 0.5, size = title_size),
plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5))
}
q5 <- function(variable) {as.factor(ntile(variable, 5))}
q <- function(variable) {as.factor(ntile(variable, 5))}
qBr <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]],0),
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]],
c(.01,.2,.4,.6,.8), na.rm=T),
digits = 3))
}
}
qBr2 <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(quantile(round(df[[variable]]*100,0)/100,
c(.01,.2,.4,.6,.8), na.rm=T))
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]],
c(.01,.2,.4,.6,.8), na.rm=T),
digits = 3))
}
}
qBr3 <- function(df, variable, rnd) {
if (missing(rnd)) {
as.character(round(quantile(round(df[[variable]]*1000000,0),
c(.01,.2,.4,.6,.8), na.rm=T)),0)
} else if (rnd == FALSE | rnd == F) {
as.character(formatC(quantile(df[[variable]],
c(.01,.2,.4,.6,.8), na.rm=T),
digits = 3))
}
}
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
nn_function <- function(measureFrom,measureTo,k) {
measureFrom_Matrix <- as.matrix(measureFrom)
measureTo_Matrix <- as.matrix(measureTo)
nn <- get.knnx(measureTo, measureFrom, k)$nn.dist
output <- as.data.frame(nn) %>%
rownames_to_column(var = "thisPoint") %>%
gather(points, point_distance, V1:ncol(.)) %>%
arrange(as.numeric(thisPoint)) %>%
group_by(thisPoint) %>%
summarize(pointDistance = mean(point_distance)) %>%
arrange(as.numeric(thisPoint)) %>%
dplyr::select(-thisPoint) %>%
pull()
return(output)
}
myCrossValidate <- function(dataset, id, dependentVariable, indVariables) {
allPredictions <- data.frame()
cvID_list <- unique(dataset[[id]])
for (i in cvID_list) {
thisFold <- i
cat("This hold out fold is", thisFold, "\n")
fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
fold.test <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>%
dplyr::select(id, geometry, indVariables, dependentVariable)
regression <-
glm(countMVTheft ~ ., family = "poisson",
data = fold.train %>%
dplyr::select(-geometry, -id))
thisPrediction <-
mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
allPredictions <-
rbind(allPredictions, thisPrediction)
}
return(st_sf(allPredictions))
}
#----------------------------------------------------------------------------------------------------------
# Colors ("https://coolors.co/gradient-palette/a8f368-f9035e?number=7")
bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
blue2Palette5 <- c("#08519c","#3182bd","#6baed6","#bdd7e7","#eff3ff")
H_bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#FF440C")
orangePalette5 <- c("#FFF2E8","#FFD6B6","#FEB984","#FE9D51","#FD801F")
orange2Palette5 <- c("#FFDFD0","#FFB89F","#FF926E","#FF6B3D","#FF440C")
greyPalette5 <- c("#f7f7f7","#cccccc","#969696","#636363","#252525")
greenPalette5 <- c("#edf8e9","#bae4b3","#74c476","#31a354","#006d2c")
purplePalette5 <- c("#f2f0f7","#cbc9e2","#9e9ac8","#756bb1","#54278f")
dotwPalette7 <- c("#FD801F","#9ecae1","#6baed6","#4292c6","#2171b5","#084594","#FEB984")
#----------------------------------------------------------------------------------------------------------
# LoadAPI(Min's key)
census_api_key("4bbe4bead4e5817f6a6b79e62c5bea69e77f1887", overwrite = TRUE)##-------------------- NYC Bike trip data-----------------------------##
Bike_trip_data_202209 <- read.csv("C:/Users/vestalk/Desktop/00_Upenn/20.Fall/04.MUSA 5080 Public Policy Analytics/Assignment/06.Bike Share Prediction/00.DATA/202209_biketrip.csv") %>% dplyr::select(-X)
Bike_trip_data_202210 <- read.csv("C:/Users/vestalk/Desktop/00_Upenn/20.Fall/04.MUSA 5080 Public Policy Analytics/Assignment/06.Bike Share Prediction/00.DATA/202210_biketrip.csv") %>% dplyr::select(-X)
Bike_trip_data_202209_10 <- rbind(Bike_trip_data_202209, Bike_trip_data_202210)
# sampling Data 40%
Bike_trip_data_202209_10_sample <- sample_n(Bike_trip_data_202209_10, length(Bike_trip_data_202209_10$ride_id)*0.4)
Bike_trip_data_20220903_20221007 <- Bike_trip_data_202209_10_sample %>% mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
day = floor_date(ymd_hms(started_at), unit = "day"),
week = week(interval60),
dotw = wday(interval60, Sys.setenv("LANGUAGE" = "es"))) %>%
filter(week == 36 | week == 37 | week == 38 | week == 39 | week == 40) %>% arrange(started_at)
Bike_trip_data_20220903_20221007_full <- Bike_trip_data_202209_10 %>% mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
day = floor_date(ymd_hms(started_at), unit = "day"),
week = week(interval60),
dotw = wday(interval60, Sys.setenv("LANGUAGE" = "es"))) %>%
filter(week == 36 | week == 37 | week == 38 | week == 39 | week == 40) %>% arrange(started_at)
glimpse(Bike_trip_data_20220903_20221007)
length(Bike_trip_data_20220903_20221007)
rm(Bike_trip_data_202209)
rm(Bike_trip_data_202210)
rm(Bike_trip_data_202209_10)
rm(Bike_trip_data_202209_10_sample)##-------------------- NYC Census data-----------------------------##
NewyorkCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2019,
state = 36,
geometry = TRUE,
county=c("61"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
NewyorkTracts <-
NewyorkCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
dplyr::select(GEOID, geometry) %>%
st_sf %>%
st_transform('EPSG:2263')##--------------------NYC Bike trip data combined with census tracts-----------------------------##
## study code later in detail ##
Bike_trip_data_20220903_20221007_tracts_full <- st_join(Bike_trip_data_20220903_20221007_full %>%
filter(is.na(start_lng) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lng) == FALSE) %>% st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4236)%>%
st_transform('EPSG:2263'),
NewyorkTracts%>% st_transform('EPSG:2263'),
join=st_intersects) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lng = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)%>%
st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4236)%>% st_transform('EPSG:2263') %>%
st_join(., NewyorkTracts %>% st_transform(crs=2263),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(end_lng = unlist(map(geometry, 1)),
end_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry) %>% na.omit()
Bike_trip_data_20220903_20221007_tracts <- st_join(Bike_trip_data_20220903_20221007 %>%
filter(is.na(start_lng) == FALSE &
is.na(start_lat) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lng) == FALSE) %>% st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4236)%>%
st_transform('EPSG:2263'),
NewyorkTracts%>% st_transform('EPSG:2263'),
join=st_intersects) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lng = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry)%>%
st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4236)%>% st_transform('EPSG:2263') %>%
st_join(., NewyorkTracts %>% st_transform(crs=2263),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(end_lng = unlist(map(geometry, 1)),
end_lat = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
dplyr::select(-geometry) %>% na.omit()
station_number <- nrow(Bike_trip_data_20220903_20221007_tracts_full[!duplicated(Bike_trip_data_20220903_20221007_tracts_full$start_station_id),])
station_trip_sum <- Bike_trip_data_20220903_20221007_tracts_full %>% group_by(start_station_id) %>% summarise(total_trip = n(),
start_lng = first(start_lng),
start_lat = first(start_lat),
total_trip_week = n()/5) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')
station_number <- nrow(Bike_trip_data_20220903_20221007_tracts_full[!duplicated(Bike_trip_data_20220903_20221007_tracts_full$start_station_id),])
sum(station_trip_sum$total_trip)
mean(station_trip_sum$total_trip_week)
ggplot() +
geom_sf(data = NewyorkTracts, fill = greyPalette5[2], color = "transparent") +
geom_sf(data = station_trip_sum, aes(color = q5(total_trip)), size=0.6) +
scale_color_brewer(palette = "Blues",labels = round(as.numeric(qBr(station_trip_sum, "total_trip")))) +
labs(title= "Bike share trips of Manhattan",
subtitle = "674 stations \n 2,563,358 Trips \n 2022-09-03 ~ 2022-10-07",
caption = "Figure 2-1",
color = 'Total Trips\n(quantile)') +
mapThememin()+
theme(panel.background = element_blank(),
legend.key.size = unit(0.25, "cm"),
legend.key = element_rect(colour = NA, fill = NA)) Figure 2-1-a shows total bike share trips of each station during five weeks long research period, 2022/09/03~2022/10/07. We can see clear spatially clustered pattern of bike share usage showing high usage rate on center area, on the other hand, the northern part of Manhattan shows significantly low usage rate for some reasons.
ggplot(station_trip_sum)+
geom_histogram(aes(total_trip_week), binwidth =100, fill = bluePalette5[3], color = "white", size = 2)+
geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
labs(title="Bike share trips per week of each station",
subtitle = "average 760 trips per week",
x="Trip Counts",
y="Number of Stations",
caption = "Figure 2-1-b") +
corTheme2()+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(linetype = "dashed", size = 0.25, color = greyPalette5[2]))rm(Bike_trip_data_20220903_20221007_tracts_full)
rm(Bike_trip_data_20220903_20221007_full)
rm(Bike_trip_data_202209_10)Looking at Figure 2-1-b, which shows the distribution of bicycle stops by the average trip counts per week, it is skewed to the left. This indicates that a considerable number of bicycle stops are not being used. In other words, some of bicycle stops quite well used. Our aim is to redistribute the appropriate level of bicycles to each stop so that shared bicycles can be used well. If the prediction for bicycle stops with Trip Counts above the average is accurate, it can be said that a reasonable model was designed.
NY_bike_lane <-
st_read("https://data.cityofnewyork.us/api/geospatial/7vsa-caz7?method=export&format=GeoJSON") %>%
st_transform('EPSG:2263') %>% st_as_sf() %>% filter(allclasses == "I")
MH_bike_lane <- st_intersection(NY_bike_lane, NewyorkTracts %>% st_transform('EPSG:2263'))
summary(MH_bike_lane)
sum(as.numeric(MH_bike_lane$shape_le_1))
MH_bike_lane_Bf <-
st_union(st_buffer(MH_bike_lane, 820)) %>% #250m buffer
st_sf() %>%
st_transform('EPSG:2263')
ggplot() +
geom_sf(data = NewyorkTracts, fill = bluePalette5[1], color = "transparent") +
geom_sf(data = station_trip_sum, color = bluePalette5[5], fill = bluePalette5[4], size =0.3) +
geom_sf(data = MH_bike_lane, color = bluePalette5[3], size=0.8) +
labs(title= "Protected Bike Lane of Manhattan",
subtitle = "Total 801,178 feet",
caption = "Figure 2-2") +
mapThememin()+
theme(panel.background = element_blank(),
legend.key.size = unit(0.25, "cm"),
legend.key = element_rect(colour = NA, fill = NA)) Bike_trip_data_20220903_20221007_tracts <-
rbind(st_as_sf(Bike_trip_data_20220903_20221007_tracts, coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')[MH_bike_lane_Bf,] %>%
st_drop_geometry() %>% left_join(Bike_trip_data_20220903_20221007_tracts) %>% mutate(BikeLane = "Near250m"),
st_as_sf(Bike_trip_data_20220903_20221007_tracts, coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')[MH_bike_lane_Bf, op = st_disjoint] %>%
st_drop_geometry() %>% left_join(Bike_trip_data_20220903_20221007_tracts) %>% mutate(BikeLane = "Beyond250m")) Figure 2-2 shows the current status of Protected Lane on New York City bike lanes. A total of 801,178 ft is currently installed, and you can see that most of the Protected Lane is concentrated south of Manhattan.
New York climate data was collected using Riem_measures. The weather station we used is located in Central Park and have observed climate data since 1943. More information is available at https://mesonet.agron.iastate.edu/request/download.phtml?network=NY_ASOS.
##--------------------NYC weather Data-----------------------------##
weather.Panel <-
riem_measures(station = "NYC", date_start = "2022-09-03", date_end = "2022-10-07") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
glimpse(weather.Panel)
weather.Panel.gather <- weather.Panel %>% gather(Variable, Value, -interval60)
ggplot(weather.Panel.gather, aes(interval60, Value)) + geom_line(color = bluePalette5[4], size = 0.5) +
geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
facet_wrap(~Variable, ncol = 1, scales = "free")+
labs(title = "New York Weather Data",
subtitle = "2022-09-03 ~ 2022-10-07",
x = "",
y = "",
caption = "Figure 2-2") +
corTheme2()+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(linetype = "dashed", size = 0.25, color = greyPalette5[2]))rm(weather.Panel.gather)The Space Time Panel is one of the most important parts of this study. Simply put, it’s an hourly table of how many bicycles were used for 24 hours at each bicycle stop. There are a total of 660 bicycle stops, and a total of 840 (35*24) time matrices are created if a total of 35 days are counted as each time for five weeks. Multiplying this by 660 makes up a Space-Time Panel consisting of a total of 554,400 rows. The basic structure of predicting what the demand for bicycles will be for each hour and for each bicycle stop is completed. In addition to this, we will configure Time-lag Feature and use it for the prediction model.
##--------------------Build Space-Time Panel-----------------------------##
## study code later in detail ##
study.panel <-
expand.grid(interval60=unique(Bike_trip_data_20220903_20221007_tracts$interval60),
start_station_id = unique(Bike_trip_data_20220903_20221007_tracts$start_station_id)) %>%
left_join(., Bike_trip_data_20220903_20221007_tracts %>%
dplyr::select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
distinct() %>%
group_by(start_station_id) %>%
slice(1))
length(unique(Bike_trip_data_20220903_20221007_tracts$start_station_id))
length(unique(Bike_trip_data_20220903_20221007_tracts$interval60))
length(unique(Bike_trip_data_20220903_20221007_tracts$day))*24
nrow(study.panel)
Bikelane_stationid <- Bike_trip_data_20220903_20221007_tracts %>% dplyr::select(start_station_id, BikeLane)
Bikelane_stationid <- Bikelane_stationid[!duplicated(Bikelane_stationid$start_station_id),]
ride.panel <-
Bike_trip_data_20220903_20221007_tracts %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station_id) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE) %>%
left_join(Bikelane_stationid, by = "start_station_id")
ride.panel <-
left_join(ride.panel, NewyorkCensus %>% dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))
##--------------------add time lags-----------------------------##
ride.panel <-
ride.panel %>%
arrange(start_station_id, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 248,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
holidayLag = replace_na(holidayLag, "0")) %>%
mutate(hour = hour(interval60))
length(unique(ride.panel$start_station_id))
#yday(as.Date("2022-09-05")) labor day##--------------------Time lagged features correlation-----------------------------##
plotData.lag <-
filter(as.data.frame(ride.panel), week == 38) %>%
dplyr::select(starts_with("lag"), Trip_Count) %>%
gather(Variable, Value, -Trip_Count) %>%
mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
"lag4Hours","lag12Hours","lag1day"))
correlation.lag <-
group_by(plotData.lag, Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2))
ggplot(plotData.lag, aes(Value, Trip_Count)) +
geom_point(size = 0.3, color = bluePalette5[3]) +
geom_text(data = correlation.lag, aes(label = paste("r =", round(correlation, 2))),
x=-Inf, y=Inf, vjust = 1.5, hjust = -.2, size = 3, color = bluePalette5[5]) +
geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 0.6) +
facet_wrap(~Variable, ncol = 7, scales = "free") +
labs(title = "Bike share trips as a function of time lags",
subtitle = "One week in September, 2022",
caption = "Figure 3-2",
x = "Lagged Trip Counts",
y = "Trip Counts") +
corTheme()##--------------------Spatial autocorrelation-----------------------------##
Week_trips <- ride.panel %>% group_by(week, Origin.Tract) %>%
summarise(Sum_Trip_Count = sum(Trip_Count)) %>% ungroup() %>% left_join(NewyorkTracts, by = c("Origin.Tract" = "GEOID")) %>%
st_sf()
dotw_trips <- ride.panel %>% group_by(dotw, Origin.Tract) %>%
summarise(Sum_Trip_Count = sum(Trip_Count)) %>% ungroup() %>% left_join(NewyorkTracts, by = c("Origin.Tract" = "GEOID")) %>%
st_sf()
ggplot() + geom_sf(data = Week_trips, aes(fill = q5(Sum_Trip_Count)), color = greyPalette5[1], size = 0.2) +
scale_fill_manual(values = bluePalette5,
labels = qBr(Week_trips, "Sum_Trip_Count"),
name = "Trip Count") +
facet_wrap(~week, ncol = 5) +
labs(title = "Sum of Bike Share Trips",
subtitle = "by week",
caption = "Figure 3-3-a") +
corTheme()+
theme(panel.background = element_blank(),
legend.position = "bottom",
legend.key.size = unit(0.3, "cm")) rename <- c(`일` = "SUN",`월` = "MON",`화`="TUE",`수`="WED",`목`="THU",`금`="FRI",`토`="SAT")
ggplot() + geom_sf(data = dotw_trips, aes(fill = q5(Sum_Trip_Count)), color = greyPalette5[1], size = 0.2) +
scale_fill_manual(values = bluePalette5,
labels = qBr(dotw_trips, "Sum_Trip_Count"),
name = "Trip Count") +
facet_wrap(~dotw, ncol = 7, labeller = as_labeller(rename)) +
labs(title = "Sum of Bike Share Trips",
subtitle = "by day of the week",
caption = "Figure 3-3-b") +
corTheme()+
theme(panel.background = element_blank(),
legend.position = "bottom",
legend.key.size = unit(0.3, "cm")) rm(Week_trips)
rm(dotw_trips)Figure 3-3-a, b shows the spatial autocorrelation of the Bike Trip. The 40th week has a slightly different aspect, but you can see that most of the trips are concentrated in the east and west of Manhattan. There is not much difference in these usage patterns even if you check them by day of the week. This may be related to the distribution of New York’s bicycle infrastructure.
##-----------animated map------------##
week38 <-
filter(Bike_trip_data_20220903_20221007_tracts, week == 38 & dotw == "월")
week38.panel <-
expand.grid(
interval15 = unique(week38$interval15),
Origin.Tract = unique(Bike_trip_data_20220903_20221007_tracts$Origin.Tract))
ride.animation.data <-
mutate(week38, Trip_Counter = 1) %>%
right_join(week38.panel) %>%
group_by(interval15, Origin.Tract) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
left_join(NewyorkTracts, by=c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
Trip_Count > 10 ~ "11+ trips")) %>%
mutate(Trips = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
"7-10 trips","10+ trips"))
rideshare_animation <-
ggplot() +
geom_sf(data = ride.animation.data, aes(fill = Trips), color = "transparent") +
scale_fill_manual(values = bluePalette5) +
labs(title = "Bike Trips for one day in Sep 2022",
subtitle = "15 minute intervals: {current_frame}") +
transition_manual(interval15) +
corTheme()+
theme(panel.background = element_blank(),
legend.position = "bottom",
legend.key.size = unit(0.3, "cm"))
rideshare_animation_plot <- animate(rideshare_animation, duration=30, renderer = gifski_renderer())##--------------------Average count of bike trips by precipitation-----------------------------##
st_drop_geometry(ride.panel) %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Precipitation = first(Precipitation)) %>%
mutate(isPercip = ifelse(Precipitation > 0,"Rain/Snow", "None")) %>%
group_by(isPercip) %>%
summarize(Mean_Trip_Count = mean(Trip_Count)) %>% na.omit() %>%
ggplot(aes(isPercip, Mean_Trip_Count)) + geom_bar(stat = "identity", width = 0.2, aes(fill = isPercip)) +
geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
scale_fill_manual(values = c(bluePalette5[2],bluePalette5[4]),name = "precipitation")+
labs(title="Average count of bike trips by precipitation",
subtitle ="Manhatten 2022.09.03-2022.10.07",
x="Precipitation",
y="Mean Trip Count",
caption = "Figure 3-4-a") +
corTheme2()+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))Figure 3-4-a is a diagram showing the relationship between the weather and bicycle usage. It is very obvious that usage falls below 50% during the rainy or snowy period.
##--------------------Correlation btw mean trip count and temperature-----------------------------##
st_drop_geometry(ride.panel) %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Temperature = first(Temperature)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, Trip_Count)) +
geom_point(size = 0.3, color = bluePalette5[3]) +
geom_smooth(method = "lm", se= FALSE, colour = bluePalette5[5], size = 0.6) +
facet_wrap(~week, ncol=8, scales = "free") +
labs(title="Trip Count as a fuction of Temperature by week",
subtitle = "Manhatten 2022.09.03-2022.10.07",
x="Temperature",
y="Mean Trip Count",
caption = "Figure 3-4-b")+
corTheme2()Figure 3-4-b shows the relationship between temperature and Bike share Trip. During the study period, we can confirm that the higher the temperature, the higher the number of bicycle rentals.
unique_bike_station <- Bike_trip_data_20220903_20221007_tracts[!duplicated(Bike_trip_data_20220903_20221007_tracts$start_station_id),] %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')
ggplot() +
geom_sf(data = NewyorkTracts, fill = greyPalette5[2], color = "transparent") +
geom_sf(data = MH_bike_lane_Bf, fill = greyPalette5[1], color = greyPalette5[3], size = 0.1) +
geom_sf(data = unique_bike_station, aes(color = BikeLane), size =0.3) +
scale_colour_manual(values = c(bluePalette5[1],bluePalette5[4]),labels=c(">250m", "<250m"),name="Distance from\nProtected Bike Lane")+
labs(title= "Protected Bike Lane of Manhattan",
subtitle = "White : 250m buffer from protected bike lane",
caption = "Figure 3-5-a") +
mapThememin()+
theme(panel.background = element_blank(),
legend.key.size = unit(0.25, "cm"),
legend.key = element_rect(colour = NA, fill = NA)) st_drop_geometry(ride.panel) %>%
group_by(BikeLane) %>%
summarize(Mean_Trip_Count = mean(Trip_Count)) %>% na.omit() %>%
ggplot(aes(BikeLane, Mean_Trip_Count)) + geom_bar(stat = "identity", width = 0.2, aes(fill = BikeLane)) +
geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
scale_fill_manual(values = c(bluePalette5[2],bluePalette5[4]),labels=c("> 250m", "< 250m"), name = "Distance from\nProtected Bike Lane")+
labs(title="Average count of bike trips per hours by distance from Pro",
subtitle ="by distance from Protected bike lane",
x="Distance from Bike Lane",
y="Mean Trip Count",
caption = "Figure 3-4-b") +
corTheme2()+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))A total of four machine learning models were created using the various variables explored in the previous chapter. In this process, since bicycle stops are used as features, features of average trips of five nearest stations were excluded. The first model uses time, day, and temperature, the second model uses space(station), time, day, temperature, and the third model uses space(station), time, day, temperature, and precipitation, and the fourth model adds time lag characteristics to the third model. The last fifth model is added bike lane proximity feature.
##--------------------Space-Time regression model-----------------------------##
# week: 36, 37, 38, 39, 40
ride.panel <- ride.panel %>% mutate(BikeLane = case_when(BikeLane=="Near250m" ~ 0,
BikeLane=="Beyond250m" ~ 1))
ride.Train <- filter(ride.panel, week >= 38)
ride.Test <- filter(ride.panel, week < 38)
reg1 <- lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.Train)
reg2 <- lm(Trip_Count ~ start_station_name + dotw + Temperature, data=ride.Train)
reg3 <- lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation, data=ride.Train)
reg4 <- lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, data=ride.Train)
reg5 <- lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day + BikeLane, data=ride.Train)
#reg5 <- lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
#lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, data=ride.Train)##--------------------Space-Time regression model prediction-----------------------------##
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_BikeLane = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions##--------------------Space-Time regression model prediction-----------------------------##
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity", width = 0.7) +
geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
scale_fill_manual(values = blue2Palette5) +
labs(title = "Mean Absolute Errors by model specification and week",
subtitle = "Manhatten 2022.09.23-2022.10.07",
caption = "Figure 4-2-a") +
corTheme2()+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2])) Figure 4-2-a is a histogram showing the weekly Mean Absolute Error of each model. Among the five models, the fourth and fifth model showed the highest Accuracy. The fourth and fifth models do not differ significantly, we can assume that fifth model is simply added features related to space. If we can add features that change over time, such as ridership of the nearest subway station by time, would affect the prediction model.
##--------------------Space-Time regression model prediction-----------------------------##
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id)) %>%
dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
scale_color_manual(values = c(bluePalette5[3], bluePalette5[4])) +
geom_line(size = 0.8) +
geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted vs Observed bike share by models",
subtitle = "Manhatten 2022.09.23-2022.10.07",
x = "Hour",
y= "Station Trips",
caption = "Figure 4-2-b") +
corTheme2()+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]),
legend.key = element_rect(colour = NA, fill = NA)) Figure 4-2-b is a graph comparing the prediction by hour using each model. The fourth model showed the most accurate figure, but it shows some error in the prediction when bike usage is peak. It seems that the above problem can be solved by weighting the time variable or by predicting the number of bicycles in more detail in 15 minutes rather than in an hour unit, but it is not applied in this study.
##--------------------Space-Time regression model prediction-----------------------------##
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
start_lat = map(data, pull, start_lat),
start_lng = map(data, pull, start_lng),
dotw = map(data, pull, dotw)) %>%
dplyr::select(interval60, start_station_id, start_lng,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("토", "일"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction), size = 0.3, color = bluePalette5[3])+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, colour = bluePalette5[5], size = 0.6)+
geom_abline(slope = 1, intercept = 0, color = greyPalette5[3], linetype = 2)+
facet_grid(time_of_day ~ weekend)+
labs(title="Observed vs Predicted",
subtitle = "Grey dashed line : Prefect line",
x="Observed trips",
y="Predicted trips",
caption = "Figure 4-2-c") +
corTheme2()##--------------------Space-Time regression model prediction-----------------------------##
week_predictions_D <- week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
start_lat = map(data, pull, start_lat),
start_lng = map(data, pull, start_lng),
dotw = map(data, pull, dotw) ) %>%
dplyr::select(interval60, start_station_id, start_lng,
start_lat, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_timeLags")%>%
mutate(weekend = ifelse(dotw %in% c("토", "일"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush")) %>%
group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>% st_as_sf(., coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')
ggplot()+
geom_sf(data = NewyorkTracts, fill = greyPalette5[1], color = "transparent")+
geom_sf(data = week_predictions_D, aes(color = q5(MAE)), alpha = 0.6, size = 0.3) +
scale_colour_manual(values = bluePalette5, labels=qBr(week_predictions_D,"MAE"), name="MAE\n(quantile)")+
ylim(min(Bike_trip_data_20220903_20221007_tracts$start_lat), max(Bike_trip_data_20220903_20221007_tracts$start_lat))+
xlim(min(Bike_trip_data_20220903_20221007_tracts$start_lng), max(Bike_trip_data_20220903_20221007_tracts$start_lng))+
facet_grid(weekend ~ time_of_day)+
labs(title="MAE(Mean Absolute Error) of prediction",
subtitle="Manhatten 2022.09.23-2022.10.07",
caption = "Figure 4-2-d")+
mapThememin()+
theme(panel.background = element_blank(),
legend.key.size = unit(0.3, "cm"),
legend.key = element_rect(colour = NA, fill = NA)) ##--------------------Space-Time regression model prediction-----------------------------##
ride.panel_sample <- sample_n(ride.panel, 100000) %>% na.omit()
fitControl <- trainControl(method = "cv", number = 100)
set.seed(123)
reg.cv <- train(Trip_Count ~ ., data = ride.panel_sample %>% dplyr::select(Trip_Count, start_station_name, hour, dotw, Temperature, Precipitation, lagHour,
lag2Hours, lag3Hours, lag12Hours, lag1day),
method = "lm", trControl = fitControl, na.action = na.pass)
MAE.cv = data.table(MAE.cv= reg.cv$resample[,3])
MAE.sd <- as.character(round(sd(reg.cv$resample[,3]),2))
MAE.mean <- as.character(round(mean(reg.cv$resample[,3]),2))
caption <- paste("Sd =",MAE.sd, " Mean =",MAE.mean, " K = 100")
ggplot(MAE.cv, aes(x = MAE.cv))+
geom_histogram(fill=bluePalette5[3], color = "white", size = 1)+
geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
labs(title = "Cross-Validation MAE histogram",
subtitle = caption,
x="MAE",
y="Count",
caption = "Figure 4-3") +
corTheme2()+
theme(panel.background = element_blank(),
panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2])) Figure 5-3 is a histogram using the results of cross-validation by sampling only 100,000 records out of the total data of 551,880. We divided 100,000 data into 100 Folds and trained them to the model that showed the best results. The above histogram was obtained by counting according to the Mead of Absolute Error of each Fold. The more MAEs of each Fold are concentrated in, the better the generalizability of the model. So, it is difficult to say that it has secured a high level of generalizability mathematically as the minimum MAE and the maximum MAE differ by about 0.15. However, since the purpose of this study is to predict the Bike trip, the deviation between models of about 0.15 may not be significantly important.
The purpose of this study is to accurately predict the demand for bicycles at specific stops at specific times. Although the study is limited to 5 weeks, it has shown good results in terms of Accuracy. However, considering the fact that most of the problems with shared bicycles that occur in reality are during peak hours, and there is a part that needs to be improved in that most inaccurate predictions occurred during peak hours. Applying a time weight for peak hours or reducing the combined bike usage data to 15 minutes per hour will improve the accuracy of the forecast and improve the accuracy of the peak area. In addition, in order to improve the accuracy of this study, it is necessary to adopt empirical knowledge of human resources directly intervening in actual bicycle operations. Given the current trend of bicycle usage, there are significantly less-used stops, and if you check the actual field opinion on this part and exclude the stops judged as Outliers from the analysis, this will also be a good way to improve the model.