• 0. Introduction
  • 1. Setup
  • 2. Data Wrangling
    • 2-1. New York Bike Trip Data
    • 2-2. New York Bike Lane
    • 2-3. New York Weather Data
    • 2-4. Space-Time Panel
      • 3-1. Bike share trips
      • 3-2. Time lag features
      • 3-3. Spatial Autocorrelation
      • 3-4. weather correlation
      • 3-5. Bike Lane Correlation
        • 4-1. Space-Time Model
        • 4-2. Examine models
        • 4-3. Cross validation
  • 3. Explore the Data
    • 3-1. Bike share trips
    • 3-2. Time lag features
    • 3-3. Spatial Autocorrelation
    • 3-4. weather correlation
    • 3-5. Bike Lane Correlation
  • 4. Space-Time Model
    • 4-1. Space-Time Model
    • 4-2. Examine models
    • 4-3. Cross validation
  • 5. Conclusion

0. Introduction

One of the blind spots of the public bicycle policy is that passengers do not drive round-trip, but mostly one-way. New Yorkers often use shared bicycles during rush hour. After people use bicycles a lot, it often happens that there is no bicycle for others to use at the bicycle stop or there is no empty stand to return bicycle. In order to minimize the occurrence of this, various strategies are needed to redistribute bicycles at the right place and time. The first step to accomplish this mission is to accurately predict the demand for each stand. Due to physical limitations in calculating New York City’s annual bicycle usage, this study was limited to only for five weeks from September to December.

The demand for shared bicycles varies greatly depending on the time of the day. Therefore, this study focused on predicting the demand for bicycles used at stops every hour. The unit of 1 hour is not a big problem in conducting this study as a framework, but it is also a long time to accurately predict the peak of bicycle use during rush hours. Since most of New York City’s can be reached anywhere by bicycle in less than 15 minutes, subdivision of this 1 hour unit into 15 minutes will result in a higher resolution of predictions.

1. Setup

This is section for libraries, functions and overall styling settings.

# Rmarkdown global setting
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(fig.align = 'center')
knitr::opts_chunk$set(results = 'hide')
knitr::opts_chunk$set(fig.keep = 'all')

#----------------------------------------------------------------------------------------------------------  

# Import libraries
library(tidyverse)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)# plot correlation plot
library(corrplot)
library(corrr)      # another way to plot correlation plot
library(kableExtra)
library(jtools)     # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr)    # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(knitr)
library(rmarkdown)
library(RSocrata)
library(viridis)
library(ggplot2)
library(stargazer)
library(XML)
library(data.table)
library(ggpmisc)
library(patchwork)
library(spatstat)
library(raster)
library(classInt)   # for KDE and ML risk class intervals
library(tableHTML)
library(pscl)
library(plotROC)
library(pROC)
library(lubridate)
library(tigris)
library(riem)
library(RSocrata)
library(gganimate)
library(gifski)

#----------------------------------------------------------------------------------------------------------  

# Temp
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

# Etc
options(scipen=999)
options(tigris_class = "sf")

#----------------------------------------------------------------------------------------------------------  

# functions

st_c    <- st_coordinates
st_coid <- st_centroid

mapThememin <- function(base_size = 9, title_size = 12, small_size = 8) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5),
    plot.subtitle=element_text(size = base_size, colour = "black", hjust = 0.5, face="italic"),
    plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.text.x = element_text(size = small_size, face="italic"),
    strip.text.y = element_text(size = small_size, face="italic"),
    strip.background = element_rect(colour="transparent", fill="transparent"),
    legend.title = element_text(size = small_size),
    legend.text = element_text(size = small_size),
    legend.key.size = unit(0.4, "cm"))
}

mapThememin2 <- function(base_size = 8, title_size = 10, small_size = 6) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5),
    plot.subtitle=element_text(size = base_size, colour = "black", hjust = 0.5, face="italic"),
    plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.text.x = element_text(size = base_size),
    strip.text.y = element_text(size = base_size),
    strip.background = element_rect(colour="transparent", fill="transparent"),
    legend.title = element_text(size = small_size),
    legend.text = element_text(size = small_size),
    legend.key.size = unit(0.25, "cm"))
}


corTheme <- function(base_size = 10, title_size = 12, small_size = 8){
  theme(axis.text =  element_blank(), 
        axis.ticks = element_blank(), 
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
        strip.background = element_rect(colour="transparent", fill="transparent"))
}


corTheme2 <- function(base_size = 10, title_size = 12, small_size = 8){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
        strip.background = element_rect(colour="transparent", fill="transparent"),
        strip.text.x = element_text(size = small_size, face="italic"),
        strip.text.y = element_text(size = small_size, face="italic"),
        legend.text = element_text(size = small_size),
        legend.title = element_text(size = small_size),
        legend.key.size = unit(0.4, "cm"))
}

corTheme3 <- function(base_size = 9, title_size = 11, small_size = 7){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
        legend.text = element_text(size = small_size))
}

corTheme4 <- function(base_size = 9, title_size = 11, small_size = 7){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y.right = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5))
}


q5 <- function(variable) {as.factor(ntile(variable, 5))}

q <- function(variable) {as.factor(ntile(variable, 5))}

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

qBr2 <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]]*100,0)/100,
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

qBr3 <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(round(quantile(round(df[[variable]]*1000000,0),
                          c(.01,.2,.4,.6,.8), na.rm=T)),0)
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}


nn_function <- function(measureFrom,measureTo,k) {
  measureFrom_Matrix <- as.matrix(measureFrom)
  measureTo_Matrix <- as.matrix(measureTo)
  nn <- get.knnx(measureTo, measureFrom, k)$nn.dist
  output <- as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  return(output)  
}


myCrossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, geometry, indVariables, dependentVariable)
    
    regression <-
      glm(countMVTheft ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-geometry, -id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(st_sf(allPredictions))
}

#----------------------------------------------------------------------------------------------------------  

# Colors ("https://coolors.co/gradient-palette/a8f368-f9035e?number=7")
bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
blue2Palette5 <- c("#08519c","#3182bd","#6baed6","#bdd7e7","#eff3ff")
H_bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#FF440C")
orangePalette5 <- c("#FFF2E8","#FFD6B6","#FEB984","#FE9D51","#FD801F")
orange2Palette5 <- c("#FFDFD0","#FFB89F","#FF926E","#FF6B3D","#FF440C")
greyPalette5 <- c("#f7f7f7","#cccccc","#969696","#636363","#252525")
greenPalette5 <- c("#edf8e9","#bae4b3","#74c476","#31a354","#006d2c")
purplePalette5 <- c("#f2f0f7","#cbc9e2","#9e9ac8","#756bb1","#54278f")
dotwPalette7 <- c("#FD801F","#9ecae1","#6baed6","#4292c6","#2171b5","#084594","#FEB984")


#----------------------------------------------------------------------------------------------------------  

# LoadAPI(Min's key)
census_api_key("4bbe4bead4e5817f6a6b79e62c5bea69e77f1887", overwrite = TRUE)

2. Data Wrangling

2-1. New York Bike Trip Data

##-------------------- NYC Bike trip data-----------------------------##

Bike_trip_data_202209 <- read.csv("C:/Users/vestalk/Desktop/00_Upenn/20.Fall/04.MUSA 5080 Public Policy Analytics/Assignment/06.Bike Share Prediction/00.DATA/202209_biketrip.csv") %>% dplyr::select(-X)

Bike_trip_data_202210 <- read.csv("C:/Users/vestalk/Desktop/00_Upenn/20.Fall/04.MUSA 5080 Public Policy Analytics/Assignment/06.Bike Share Prediction/00.DATA/202210_biketrip.csv") %>% dplyr::select(-X)


Bike_trip_data_202209_10 <- rbind(Bike_trip_data_202209, Bike_trip_data_202210)
  
# sampling Data 40%

Bike_trip_data_202209_10_sample <- sample_n(Bike_trip_data_202209_10, length(Bike_trip_data_202209_10$ride_id)*0.4)

Bike_trip_data_20220903_20221007 <- Bike_trip_data_202209_10_sample %>%  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
                                                                                interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
                                                                                day = floor_date(ymd_hms(started_at), unit = "day"),
                                                                                week = week(interval60),
                                                                                dotw = wday(interval60, Sys.setenv("LANGUAGE" = "es"))) %>%
  filter(week == 36 | week == 37 | week == 38 | week == 39 | week == 40) %>% arrange(started_at)

Bike_trip_data_20220903_20221007_full <- Bike_trip_data_202209_10 %>%  mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
                                                                                interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
                                                                                day = floor_date(ymd_hms(started_at), unit = "day"),
                                                                                week = week(interval60),
                                                                                dotw = wday(interval60, Sys.setenv("LANGUAGE" = "es"))) %>%
  filter(week == 36 | week == 37 | week == 38 | week == 39 | week == 40) %>% arrange(started_at)


glimpse(Bike_trip_data_20220903_20221007)

length(Bike_trip_data_20220903_20221007)

rm(Bike_trip_data_202209)
rm(Bike_trip_data_202210)
rm(Bike_trip_data_202209_10)
rm(Bike_trip_data_202209_10_sample)
##-------------------- NYC Census data-----------------------------##

NewyorkCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2019,
          state = 36, 
          geometry = TRUE, 
          county=c("61"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

NewyorkTracts <- 
  NewyorkCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>% 
  st_sf %>% 
  st_transform('EPSG:2263')
##--------------------NYC Bike trip data combined with census tracts-----------------------------##

## study code later in detail ##

Bike_trip_data_20220903_20221007_tracts_full <- st_join(Bike_trip_data_20220903_20221007_full %>%
                                                     filter(is.na(start_lng) == FALSE &
                                                            is.na(start_lat) == FALSE &
                                                            is.na(end_lat) == FALSE &
                                                            is.na(end_lng) == FALSE) %>% st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4236)%>%
                                                     st_transform('EPSG:2263'),
                                                   NewyorkTracts%>% st_transform('EPSG:2263'), 
                                                   join=st_intersects) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4236)%>% st_transform('EPSG:2263') %>%
  st_join(., NewyorkTracts %>% st_transform(crs=2263),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lng = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry) %>% na.omit()


Bike_trip_data_20220903_20221007_tracts <- st_join(Bike_trip_data_20220903_20221007 %>%
                                                     filter(is.na(start_lng) == FALSE &
                                                              is.na(start_lat) == FALSE &
                                                              is.na(end_lat) == FALSE &
                                                              is.na(end_lng) == FALSE) %>% st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4236)%>%
                                                     st_transform('EPSG:2263'),
                                                   NewyorkTracts%>% st_transform('EPSG:2263'), 
                                                   join=st_intersects) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lng = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lng", "end_lat"), crs = 4236)%>% st_transform('EPSG:2263') %>%
  st_join(., NewyorkTracts %>% st_transform(crs=2263),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lng = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry) %>% na.omit()

station_number <- nrow(Bike_trip_data_20220903_20221007_tracts_full[!duplicated(Bike_trip_data_20220903_20221007_tracts_full$start_station_id),])

station_trip_sum <- Bike_trip_data_20220903_20221007_tracts_full %>% group_by(start_station_id) %>% summarise(total_trip = n(),
                                                                                                              start_lng = first(start_lng),
                                                                                                              start_lat = first(start_lat),
                                                                                                              total_trip_week = n()/5) %>%
   st_as_sf(., coords = c("start_lng", "start_lat"), crs = 'EPSG:2263') 

station_number <- nrow(Bike_trip_data_20220903_20221007_tracts_full[!duplicated(Bike_trip_data_20220903_20221007_tracts_full$start_station_id),])
sum(station_trip_sum$total_trip)

mean(station_trip_sum$total_trip_week)

ggplot() + 
  geom_sf(data = NewyorkTracts, fill = greyPalette5[2], color = "transparent") +
  geom_sf(data = station_trip_sum, aes(color = q5(total_trip)), size=0.6) +
  scale_color_brewer(palette = "Blues",labels = round(as.numeric(qBr(station_trip_sum, "total_trip")))) +
  labs(title= "Bike share trips of Manhattan",
       subtitle = "674 stations \n 2,563,358 Trips \n 2022-09-03 ~ 2022-10-07",
       caption = "Figure 2-1", 
       color = 'Total Trips\n(quantile)') +
  mapThememin()+
  theme(panel.background = element_blank(),
        legend.key.size = unit(0.25, "cm"),
        legend.key = element_rect(colour = NA, fill = NA)) 

Figure 2-1-a shows total bike share trips of each station during five weeks long research period, 2022/09/03~2022/10/07. We can see clear spatially clustered pattern of bike share usage showing high usage rate on center area, on the other hand, the northern part of Manhattan shows significantly low usage rate for some reasons.

ggplot(station_trip_sum)+
  geom_histogram(aes(total_trip_week), binwidth =100, fill = bluePalette5[3], color = "white", size = 2)+
  geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  labs(title="Bike share trips per week of each station",
       subtitle = "average 760 trips per week",
       x="Trip Counts", 
       y="Number of Stations",
       caption = "Figure 2-1-b") +
      corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "dashed", size = 0.25, color = greyPalette5[2]))

rm(Bike_trip_data_20220903_20221007_tracts_full)
rm(Bike_trip_data_20220903_20221007_full)
rm(Bike_trip_data_202209_10)

Looking at Figure 2-1-b, which shows the distribution of bicycle stops by the average trip counts per week, it is skewed to the left. This indicates that a considerable number of bicycle stops are not being used. In other words, some of bicycle stops quite well used. Our aim is to redistribute the appropriate level of bicycles to each stop so that shared bicycles can be used well. If the prediction for bicycle stops with Trip Counts above the average is accurate, it can be said that a reasonable model was designed.

2-2. New York Bike Lane

NY_bike_lane <- 
  st_read("https://data.cityofnewyork.us/api/geospatial/7vsa-caz7?method=export&format=GeoJSON") %>% 
  st_transform('EPSG:2263') %>% st_as_sf() %>% filter(allclasses == "I")

MH_bike_lane <- st_intersection(NY_bike_lane, NewyorkTracts %>% st_transform('EPSG:2263'))

summary(MH_bike_lane)
sum(as.numeric(MH_bike_lane$shape_le_1))

MH_bike_lane_Bf <- 
  st_union(st_buffer(MH_bike_lane, 820)) %>% #250m buffer
  st_sf() %>% 
  st_transform('EPSG:2263')

ggplot() + 
  geom_sf(data = NewyorkTracts, fill = bluePalette5[1], color = "transparent") +
  geom_sf(data = station_trip_sum, color = bluePalette5[5], fill = bluePalette5[4], size =0.3) +
  geom_sf(data = MH_bike_lane, color = bluePalette5[3], size=0.8) +
  labs(title= "Protected Bike Lane of Manhattan",
       subtitle = "Total 801,178 feet",
       caption = "Figure 2-2") +
  mapThememin()+
  theme(panel.background = element_blank(),
        legend.key.size = unit(0.25, "cm"),
        legend.key = element_rect(colour = NA, fill = NA)) 

Bike_trip_data_20220903_20221007_tracts <- 
    rbind(st_as_sf(Bike_trip_data_20220903_20221007_tracts, coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')[MH_bike_lane_Bf,] %>%
            st_drop_geometry() %>% left_join(Bike_trip_data_20220903_20221007_tracts) %>%  mutate(BikeLane = "Near250m"),
          st_as_sf(Bike_trip_data_20220903_20221007_tracts, coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')[MH_bike_lane_Bf, op = st_disjoint] %>%
            st_drop_geometry() %>% left_join(Bike_trip_data_20220903_20221007_tracts) %>% mutate(BikeLane = "Beyond250m")) 

Figure 2-2 shows the current status of Protected Lane on New York City bike lanes. A total of 801,178 ft is currently installed, and you can see that most of the Protected Lane is concentrated south of Manhattan.

2-3. New York Weather Data

New York climate data was collected using Riem_measures. The weather station we used is located in Central Park and have observed climate data since 1943. More information is available at https://mesonet.agron.iastate.edu/request/download.phtml?network=NY_ASOS.

##--------------------NYC weather Data-----------------------------##

weather.Panel <- 
  riem_measures(station = "NYC", date_start = "2022-09-03", date_end = "2022-10-07") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

glimpse(weather.Panel)

weather.Panel.gather <- weather.Panel %>% gather(Variable, Value, -interval60)

ggplot(weather.Panel.gather, aes(interval60, Value)) + geom_line(color = bluePalette5[4], size = 0.5) +
  geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
    facet_wrap(~Variable, ncol = 1, scales = "free")+
      labs(title = "New York Weather Data",
           subtitle = "2022-09-03 ~ 2022-10-07",
           x = "",
           y = "",
           caption = "Figure 2-2") +
      corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "dashed", size = 0.25, color = greyPalette5[2]))

rm(weather.Panel.gather)

2-4. Space-Time Panel

The Space Time Panel is one of the most important parts of this study. Simply put, it’s an hourly table of how many bicycles were used for 24 hours at each bicycle stop. There are a total of 660 bicycle stops, and a total of 840 (35*24) time matrices are created if a total of 35 days are counted as each time for five weeks. Multiplying this by 660 makes up a Space-Time Panel consisting of a total of 554,400 rows. The basic structure of predicting what the demand for bicycles will be for each hour and for each bicycle stop is completed. In addition to this, we will configure Time-lag Feature and use it for the prediction model.

##--------------------Build Space-Time Panel-----------------------------##

## study code later in detail ##


study.panel <- 
  expand.grid(interval60=unique(Bike_trip_data_20220903_20221007_tracts$interval60), 
              start_station_id = unique(Bike_trip_data_20220903_20221007_tracts$start_station_id)) %>%
  left_join(., Bike_trip_data_20220903_20221007_tracts %>%
              dplyr::select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat)%>%
              distinct() %>%
              group_by(start_station_id) %>%
              slice(1))


length(unique(Bike_trip_data_20220903_20221007_tracts$start_station_id))
length(unique(Bike_trip_data_20220903_20221007_tracts$interval60))
length(unique(Bike_trip_data_20220903_20221007_tracts$day))*24

nrow(study.panel)     

Bikelane_stationid <- Bike_trip_data_20220903_20221007_tracts %>% dplyr::select(start_station_id, BikeLane)

Bikelane_stationid <- Bikelane_stationid[!duplicated(Bikelane_stationid$start_station_id),]

ride.panel <- 
  Bike_trip_data_20220903_20221007_tracts %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station_id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE) %>%
  left_join(Bikelane_stationid, by = "start_station_id")

ride.panel <- 
  left_join(ride.panel, NewyorkCensus %>% dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))

##--------------------add time lags-----------------------------##

ride.panel <- 
  ride.panel %>% 
  arrange(start_station_id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 248,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
          holidayLag = replace_na(holidayLag, "0")) %>%
  mutate(hour = hour(interval60))

length(unique(ride.panel$start_station_id))

#yday(as.Date("2022-09-05")) labor day

3. Explore the Data

3-1. Bike share trips

Bike_trips_per_hour <- Bike_trip_data_20220903_20221007_tracts %>% group_by(interval60) %>% tally()

Bike_trips_per_day <- Bike_trip_data_20220903_20221007_tracts %>% group_by(day) %>% tally()
mean(Bike_trips_per_day$n)

Bike_trips_per_week <- Bike_trip_data_20220903_20221007_tracts %>% group_by(week) %>% tally()
mean(Bike_trips_per_week$n)


mondays <- mutate(ride.panel, monday = ifelse(dotw == "월" & hour(interval60) == 1, interval60, 0)) %>% filter(monday != 0) 

## Bike share Trips per hour

gp31 <- ggplot(Bike_trips_per_hour)+
  geom_line(aes(x = interval60, y = n), color = bluePalette5[4], size = 0.5)+
  geom_vline(data = mondays, aes(xintercept = monday), linetype = "longdash", size = 0.35, color = greyPalette5[2]) +
  labs(title="Bike share trips per hour",
       subtitle = "Manhatten 2022.09.03-2022.10.07",
       x="hour", 
       y="Number of trips",
       caption = "Figure 3-1-a")+
  geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))  

## Bike share Trips per date

gp32 <- ggplot(Bike_trips_per_day)+
  geom_line(aes(x = day, y = n), color = bluePalette5[4], size = 0.5)+
  geom_hline(yintercept = 29311, color = greyPalette5[3], size = 0.4, linetype = "longdash")+
  geom_vline(data = mondays, aes(xintercept = monday), linetype = "longdash", size = 0.35, color = greyPalette5[2]) +
  labs(title="Bike share trips per day",
       subtitle = "Average per day : 29,311 trips",
       x="date", 
       y="Number of trips",
       caption = "Figure 3-1-b")+
  geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))  

## Bike share Trips per week

gp33 <- ggplot(Bike_trips_per_week)+
  geom_line(aes(x = week, y = n), color = bluePalette5[4], size = 0.5)+
  geom_hline(yintercept = 205183, color = greyPalette5[3], size = 0.4, linetype = "longdash")+
  scale_y_continuous(position = "right")+
  labs(title="Bike share trips per week",
       subtitle =  "Average per week : 205,183 trips",
       x="week", 
       y="Number of trips",
       caption = "Figure 3-1-c")+
  geom_hline(yintercept = 100000, size = 0.25, color = greyPalette5[3], linetype = "longdash") +
  corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]),
        panel.grid.major.x = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))  

grid.arrange(gp31, arrangeGrob(gp32, gp33, ncol=2), nrow = 2)

Figures 3-1-a, b, and c show the pattern of bike share trips by time, day, and week. Figure 3-1-a shows that the Bike share trip has a peak time on a daliy basis. Figures 3-1-b,c did not show a particular pattern. This is because the current survey period is limited to 5 weeks, but if this is changed to a wider period, a meaningful pattern may come out.

Bike_trip_data_20220903_20221007_tracts <- Bike_trip_data_20220903_20221007_tracts %>% mutate(hour = hour(started_at))

gp34 <- ggplot(Bike_trip_data_20220903_20221007_tracts)+
  geom_freqpoly(aes(hour, colour = dotw), binwidth = 1, size = 0.5)+
  scale_color_manual(values = dotwPalette7,
                     labels = c("SUN","MON","TUE","WED","THU","FRI","SAT"),
                     name = "Day of the week")+
  labs(title="Bike share trips : day of the week",
       x="Hour", 
       y="Trip Counts")+
  geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]),
        legend.key = element_rect(colour = NA, fill = NA),
        axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank(),
        plot.title = element_text(size = 11, colour = "black", hjust = 0.5))  

#dotw_test <- Bike_trip_data_20220903_20221007_tracts %>% select()

table(Bike_trip_data_20220903_20221007_tracts['dotw'])

Bike_trip_data_20220903_20221007_tracts <- Bike_trip_data_20220903_20221007_tracts  %>% mutate(weekend = ifelse(dotw %in% c("토", "일"), "Weekend", "Weekday"))

hour_breaks <- data.frame(start = c(-1, 7, 10, 15, 19),  # Create data with breaks
                          end = c(7, 10, 15, 19, 24),
                          colors =c("Overnight", "AM Rush", "Mid Day", "PM Rush", "Overnight"))

gp35 <- ggplot(Bike_trip_data_20220903_20221007_tracts)+                      
  geom_rect(data = hour_breaks,  aes(xmin = start, xmax = end, ymin = - Inf, ymax = Inf, fill = colors),alpha = 0.2)+
  scale_fill_manual(values = c(bluePalette5[1],bluePalette5[2],bluePalette5[2],bluePalette5[1],bluePalette5[2]),
                    name = "Time of the day")+
  geom_freqpoly(aes(hour, color = weekend), binwidth = 1, size = 0.5)+
  scale_color_manual(values = c("#084594","#FD801F"))+
  labs(title="Bike share trips : weekend vs weekday",
       x="Hour", 
       y="Trip Counts",
       caption = "Figure 3-1-d") +
  geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  corTheme2() +
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]),
        legend.key = element_rect(colour = NA, fill = NA),
        plot.title = element_text(size = 11, colour = "black", hjust = 0.5)) 

grid.arrange(gp34, gp35, nrow = 2)

Figure 3-1-d is a significant graph showing that the purpose of bicycle use differs significantly on weekdays and weekends. During the week, many New Yorkers commute by bicycles, on the contrary during weekends, most of bicycles are used for leisure purposes on Mid-day.

time_of_day_trips <- 
  Bike_trip_data_20220903_20221007_tracts %>% group_by(start_station_id) %>%
  mutate(start_lat = first(start_lat),
         start_lng = first(start_lng)) %>% ungroup() %>%
  mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush")) %>%
  group_by(start_station_id, start_lat, start_lng, weekend, time_of_day) %>%
  tally() %>% st_as_sf(., coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')

time_of_day_trips$time_of_day <- factor(time_of_day_trips$time_of_day, levels = c("AM Rush", "Mid-Day", "PM Rush", "Overnight"))

ggplot()+
  geom_sf(data = NewyorkTracts, fill = greyPalette5[1], color = "transparent")+
  geom_sf(data = time_of_day_trips, aes(color = q5(n)), alpha = 0.6, size = 0.3) +
  scale_colour_manual(values = H_bluePalette5,labels=round(as.numeric(qBr(time_of_day_trips,"n"))),name="Trips\n(quantile)")+
  ylim(min(Bike_trip_data_20220903_20221007_tracts$start_lat), max(Bike_trip_data_20220903_20221007_tracts$start_lat))+
  xlim(min(Bike_trip_data_20220903_20221007_tracts$start_lng), max(Bike_trip_data_20220903_20221007_tracts$start_lng))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per time of the day",
       caption = "Figure 3-1-e")+
  mapThememin()+
  theme(panel.background = element_blank(),
        legend.key.size = unit(0.3, "cm"),
        legend.key = element_rect(colour = NA, fill = NA)) 

glimpse(Bike_trip_data_20220903_20221007_tracts)

Figure 3-1-e describes how many trips occurred at each bicycle stop based on Time of the day and Weekend. In Figure 3-1-d, only the correlation with time was confirmed, and in Figure 3-1-e, it is possible to confirm how it is spatially related.

3-2. Time lag features

##--------------------Time lagged features correlation-----------------------------##

plotData.lag <-
  filter(as.data.frame(ride.panel), week == 38) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                          "lag4Hours","lag12Hours","lag1day"))

correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2)) 

ggplot(plotData.lag, aes(Value, Trip_Count)) +
  geom_point(size = 0.3, color = bluePalette5[3])  +
  geom_text(data = correlation.lag, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.2, size = 3, color = bluePalette5[5]) +
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 0.6) +
  facet_wrap(~Variable, ncol = 7, scales = "free") +
  labs(title = "Bike share trips as a function of time lags",
       subtitle = "One week in September, 2022",
       caption = "Figure 3-2",
       x = "Lagged Trip Counts",
       y = "Trip Counts") +
  corTheme()

3-3. Spatial Autocorrelation

##--------------------Spatial autocorrelation-----------------------------##

Week_trips <- ride.panel %>% group_by(week, Origin.Tract) %>%
  summarise(Sum_Trip_Count = sum(Trip_Count)) %>% ungroup() %>% left_join(NewyorkTracts,  by = c("Origin.Tract" = "GEOID")) %>%
  st_sf()

dotw_trips <- ride.panel %>% group_by(dotw, Origin.Tract) %>%
  summarise(Sum_Trip_Count = sum(Trip_Count)) %>% ungroup() %>% left_join(NewyorkTracts,  by = c("Origin.Tract" = "GEOID")) %>%
  st_sf()


ggplot() + geom_sf(data = Week_trips, aes(fill = q5(Sum_Trip_Count)), color = greyPalette5[1], size = 0.2) +
    scale_fill_manual(values = bluePalette5,
                      labels = qBr(Week_trips, "Sum_Trip_Count"),
                      name = "Trip Count") +
    facet_wrap(~week, ncol = 5) +
    labs(title = "Sum of Bike Share Trips",
         subtitle = "by week",
         caption = "Figure 3-3-a") +
  corTheme()+
  theme(panel.background = element_blank(),
        legend.position = "bottom",
        legend.key.size = unit(0.3, "cm")) 

rename <- c(`` = "SUN",`` = "MON",``="TUE",``="WED",``="THU",``="FRI",``="SAT")

ggplot() + geom_sf(data = dotw_trips, aes(fill = q5(Sum_Trip_Count)), color = greyPalette5[1], size = 0.2) +
    scale_fill_manual(values = bluePalette5,
                      labels = qBr(dotw_trips, "Sum_Trip_Count"),
                      name = "Trip Count") +
    facet_wrap(~dotw, ncol = 7, labeller = as_labeller(rename)) +
    labs(title = "Sum of Bike Share Trips",
         subtitle = "by day of the week",
         caption = "Figure 3-3-b") +
  corTheme()+
  theme(panel.background = element_blank(),
        legend.position = "bottom",
        legend.key.size = unit(0.3, "cm")) 

rm(Week_trips)
rm(dotw_trips)

Figure 3-3-a, b shows the spatial autocorrelation of the Bike Trip. The 40th week has a slightly different aspect, but you can see that most of the trips are concentrated in the east and west of Manhattan. There is not much difference in these usage patterns even if you check them by day of the week. This may be related to the distribution of New York’s bicycle infrastructure.

##-----------animated map------------##

week38 <-
  filter(Bike_trip_data_20220903_20221007_tracts, week == 38 & dotw == "월")

week38.panel <-
  expand.grid(
    interval15 = unique(week38$interval15),
    Origin.Tract = unique(Bike_trip_data_20220903_20221007_tracts$Origin.Tract))

ride.animation.data <-
  mutate(week38, Trip_Counter = 1) %>%
    right_join(week38.panel) %>% 
    group_by(interval15, Origin.Tract) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    left_join(NewyorkTracts, by=c("Origin.Tract" = "GEOID")) %>%
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
                             Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
                             Trip_Count > 10 ~ "11+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
                                       "7-10 trips","10+ trips"))

rideshare_animation <-
  ggplot() +
    geom_sf(data = ride.animation.data, aes(fill = Trips), color = "transparent") +
    scale_fill_manual(values = bluePalette5) +
    labs(title = "Bike Trips for one day in Sep 2022",
         subtitle = "15 minute intervals: {current_frame}") +
    transition_manual(interval15) +
  corTheme()+
  theme(panel.background = element_blank(),
        legend.position = "bottom",
        legend.key.size = unit(0.3, "cm")) 

rideshare_animation_plot <- animate(rideshare_animation, duration=30, renderer = gifski_renderer())

3-4. weather correlation

##--------------------Average count of bike trips by precipitation-----------------------------##

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(isPercip = ifelse(Precipitation > 0,"Rain/Snow", "None")) %>%
  group_by(isPercip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>% na.omit() %>%
    ggplot(aes(isPercip, Mean_Trip_Count)) + geom_bar(stat = "identity", width = 0.2, aes(fill = isPercip)) +
    geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  scale_fill_manual(values = c(bluePalette5[2],bluePalette5[4]),name = "precipitation")+
  labs(title="Average count of bike trips by precipitation",
       subtitle ="Manhatten 2022.09.03-2022.10.07",
       x="Precipitation",
       y="Mean Trip Count",
       caption = "Figure 3-4-a") +
  corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))

Figure 3-4-a is a diagram showing the relationship between the weather and bicycle usage. It is very obvious that usage falls below 50% during the rainy or snowy period.

##--------------------Correlation btw mean trip count and temperature-----------------------------##

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point(size = 0.3, color = bluePalette5[3]) +
    geom_smooth(method = "lm", se= FALSE, colour = bluePalette5[5], size = 0.6) +
    facet_wrap(~week, ncol=8, scales = "free") + 
    labs(title="Trip Count as a fuction of Temperature by week",
         subtitle = "Manhatten 2022.09.03-2022.10.07",
         x="Temperature",
         y="Mean Trip Count",
         caption = "Figure 3-4-b")+
  corTheme2()

Figure 3-4-b shows the relationship between temperature and Bike share Trip. During the study period, we can confirm that the higher the temperature, the higher the number of bicycle rentals.

3-5. Bike Lane Correlation

unique_bike_station <- Bike_trip_data_20220903_20221007_tracts[!duplicated(Bike_trip_data_20220903_20221007_tracts$start_station_id),] %>%
  st_as_sf(., coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')

ggplot() + 
  geom_sf(data = NewyorkTracts, fill = greyPalette5[2], color = "transparent") +
  geom_sf(data = MH_bike_lane_Bf, fill = greyPalette5[1], color = greyPalette5[3], size = 0.1) +
  geom_sf(data = unique_bike_station, aes(color = BikeLane), size =0.3) +
  scale_colour_manual(values = c(bluePalette5[1],bluePalette5[4]),labels=c(">250m", "<250m"),name="Distance from\nProtected Bike Lane")+
  labs(title= "Protected Bike Lane of Manhattan",
       subtitle = "White : 250m buffer from protected bike lane",
       caption = "Figure 3-5-a") +
  mapThememin()+
  theme(panel.background = element_blank(),
        legend.key.size = unit(0.25, "cm"),
        legend.key = element_rect(colour = NA, fill = NA)) 

st_drop_geometry(ride.panel) %>%
  group_by(BikeLane) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>% na.omit() %>%
    ggplot(aes(BikeLane, Mean_Trip_Count)) + geom_bar(stat = "identity", width = 0.2, aes(fill = BikeLane)) +
    geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  scale_fill_manual(values = c(bluePalette5[2],bluePalette5[4]),labels=c("> 250m", "< 250m"), name = "Distance from\nProtected Bike Lane")+
  labs(title="Average count of bike trips per hours by distance from Pro",
       subtitle ="by distance from Protected bike lane",
       x="Distance from Bike Lane",
       y="Mean Trip Count",
       caption = "Figure 3-4-b") +
  corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))

4. Space-Time Model

4-1. Space-Time Model

A total of four machine learning models were created using the various variables explored in the previous chapter. In this process, since bicycle stops are used as features, features of average trips of five nearest stations were excluded. The first model uses time, day, and temperature, the second model uses space(station), time, day, temperature, and the third model uses space(station), time, day, temperature, and precipitation, and the fourth model adds time lag characteristics to the third model. The last fifth model is added bike lane proximity feature.

##--------------------Space-Time regression model-----------------------------##

# week: 36, 37, 38, 39, 40

ride.panel <- ride.panel %>% mutate(BikeLane = case_when(BikeLane=="Near250m" ~ 0,
                                                         BikeLane=="Beyond250m" ~ 1))

ride.Train <- filter(ride.panel, week >= 38)
ride.Test <- filter(ride.panel, week < 38)

reg1 <- lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- lm(Trip_Count ~  start_station_name + dotw + Temperature,  data=ride.Train)

reg3 <- lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation, data=ride.Train)

reg4 <- lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
             lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, data=ride.Train)

reg5 <- lm(Trip_Count ~  start_station_name +  hour(interval60) + dotw + Temperature + Precipitation +
           lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day + BikeLane, data=ride.Train)

#reg5 <- lm(Trip_Count ~  start_station_name + hour(interval60) + dotw + Temperature + Precipitation +
             #lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, data=ride.Train)

4-2. Examine models

##--------------------Space-Time regression model prediction-----------------------------##


ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_BikeLane = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
##--------------------Space-Time regression model prediction-----------------------------##

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity", width = 0.7) +
    geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
    scale_fill_manual(values = blue2Palette5) +
    labs(title = "Mean Absolute Errors by model specification and week",
         subtitle = "Manhatten 2022.09.23-2022.10.07",
         caption = "Figure 4-2-a") +
  corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))  

Figure 4-2-a is a histogram showing the weekly Mean Absolute Error of each model. Among the five models, the fourth and fifth model showed the highest Accuracy. The fourth and fifth models do not differ significantly, we can assume that fifth model is simply added features related to space. If we can add features that change over time, such as ridership of the nearest subway station by time, would affect the prediction model.

##--------------------Space-Time regression model prediction-----------------------------##

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id)) %>%
    dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) +
    scale_color_manual(values = c(bluePalette5[3], bluePalette5[4])) +
      geom_line(size = 0.8) + 
      geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted vs Observed bike share by models",
           subtitle = "Manhatten 2022.09.23-2022.10.07",
           x = "Hour",
           y= "Station Trips",
           caption = "Figure 4-2-b") +
      corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]),
        legend.key = element_rect(colour = NA, fill = NA))  

Figure 4-2-b is a graph comparing the prediction by hour using each model. The fourth model showed the most accurate figure, but it shows some error in the prediction when bike usage is peak. It seems that the above problem can be solved by weighting the time variable or by predicting the number of bicycles in more detail in 15 minutes rather than in an hour unit, but it is not applied in this study.

##--------------------Space-Time regression model prediction-----------------------------##

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station_id = map(data, pull, start_station_id), 
           start_lat = map(data, pull, start_lat), 
           start_lng = map(data, pull, start_lng),
           dotw = map(data, pull, dotw)) %>%
    dplyr::select(interval60, start_station_id, start_lng, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("토", "일"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))%>%
  ggplot()+
    geom_point(aes(x= Observed, y = Prediction), size = 0.3, color = bluePalette5[3])+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, colour = bluePalette5[5], size = 0.6)+
    geom_abline(slope = 1, intercept = 0, color = greyPalette5[3], linetype = 2)+
  facet_grid(time_of_day ~ weekend)+
  labs(title="Observed vs Predicted",
       subtitle = "Grey dashed line : Prefect line",
       x="Observed trips", 
       y="Predicted trips",
       caption = "Figure 4-2-c") +
  corTheme2()

##--------------------Space-Time regression model prediction-----------------------------##

week_predictions_D <- week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station_id = map(data, pull, start_station_id), 
         start_lat = map(data, pull, start_lat), 
         start_lng = map(data, pull, start_lng),
         dotw = map(data, pull, dotw) ) %>%
  dplyr::select(interval60, start_station_id, start_lng, 
         start_lat, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "DTime_Space_FE_timeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("토", "일"), "Weekend", "Weekday"),
       time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                               hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                               hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                               hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush")) %>%
  group_by(start_station_id, weekend, time_of_day, start_lng, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE)) %>% st_as_sf(., coords = c("start_lng", "start_lat"), crs = 'EPSG:2263')


ggplot()+
  geom_sf(data = NewyorkTracts, fill = greyPalette5[1], color = "transparent")+
  geom_sf(data = week_predictions_D, aes(color = q5(MAE)), alpha = 0.6, size = 0.3) +
  scale_colour_manual(values = bluePalette5, labels=qBr(week_predictions_D,"MAE"), name="MAE\n(quantile)")+
  ylim(min(Bike_trip_data_20220903_20221007_tracts$start_lat), max(Bike_trip_data_20220903_20221007_tracts$start_lat))+
  xlim(min(Bike_trip_data_20220903_20221007_tracts$start_lng), max(Bike_trip_data_20220903_20221007_tracts$start_lng))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="MAE(Mean Absolute Error) of prediction",
       subtitle="Manhatten 2022.09.23-2022.10.07",
       caption = "Figure 4-2-d")+
  mapThememin()+
  theme(panel.background = element_blank(),
        legend.key.size = unit(0.3, "cm"),
        legend.key = element_rect(colour = NA, fill = NA)) 

4-3. Cross validation

##--------------------Space-Time regression model prediction-----------------------------##


ride.panel_sample <- sample_n(ride.panel, 100000) %>% na.omit()

fitControl <- trainControl(method = "cv", number = 100)

set.seed(123)

reg.cv <- train(Trip_Count ~ ., data = ride.panel_sample %>% dplyr::select(Trip_Count, start_station_name, hour, dotw, Temperature,  Precipitation, lagHour,
                                                                    lag2Hours, lag3Hours, lag12Hours, lag1day),
                method = "lm", trControl = fitControl, na.action = na.pass)


MAE.cv = data.table(MAE.cv= reg.cv$resample[,3])


MAE.sd <- as.character(round(sd(reg.cv$resample[,3]),2))
MAE.mean <- as.character(round(mean(reg.cv$resample[,3]),2))

caption <- paste("Sd =",MAE.sd, "  Mean =",MAE.mean, " K = 100")

ggplot(MAE.cv, aes(x = MAE.cv))+
  geom_histogram(fill=bluePalette5[3], color = "white", size = 1)+
  geom_hline(yintercept = 0, size = 0.25, color = greyPalette5[3]) +
  labs(title = "Cross-Validation MAE histogram",
       subtitle = caption,
       x="MAE",
       y="Count",
       caption = "Figure 4-3") +
      corTheme2()+
  theme(panel.background = element_blank(),
        panel.grid.major.y = element_line(linetype = "longdash", size = 0.25, color = greyPalette5[2]))  

Figure 5-3 is a histogram using the results of cross-validation by sampling only 100,000 records out of the total data of 551,880. We divided 100,000 data into 100 Folds and trained them to the model that showed the best results. The above histogram was obtained by counting according to the Mead of Absolute Error of each Fold. The more MAEs of each Fold are concentrated in, the better the generalizability of the model. So, it is difficult to say that it has secured a high level of generalizability mathematically as the minimum MAE and the maximum MAE differ by about 0.15. However, since the purpose of this study is to predict the Bike trip, the deviation between models of about 0.15 may not be significantly important.

5. Conclusion

The purpose of this study is to accurately predict the demand for bicycles at specific stops at specific times. Although the study is limited to 5 weeks, it has shown good results in terms of Accuracy. However, considering the fact that most of the problems with shared bicycles that occur in reality are during peak hours, and there is a part that needs to be improved in that most inaccurate predictions occurred during peak hours. Applying a time weight for peak hours or reducing the combined bike usage data to 15 minutes per hour will improve the accuracy of the forecast and improve the accuracy of the peak area. In addition, in order to improve the accuracy of this study, it is necessary to adopt empirical knowledge of human resources directly intervening in actual bicycle operations. Given the current trend of bicycle usage, there are significantly less-used stops, and if you check the actual field opinion on this part and exclude the stops judged as Outliers from the analysis, this will also be a good way to improve the model.