• Ⅰ. Introduction
  • Ⅱ. Setup
  • Ⅲ. Data
    • 3-1. Data Wrangling
    • 3-2. Summary Tables
    • 3-3. Correlation Matrix
    • 3-4. Correlation Scatterplots
    • 3-5. Map of Mean House Price
    • 3-6. Map of Independent variables
    • 3-7. Maps of Interest
  • Ⅳ. Methods
  • Ⅴ. Results
    • 5-1. Table of results: training set
    • 5-2. Table of goodness of fit: test set
    • 5-3. Cross-Validation MAE histogram
    • 5-4. Predicted prices graph
    • 5-5. Map of residuals
    • 5-6. Spatial lag and Moran’s I test
    • 5-7. Map of Predicted House Price
    • 5-8. Map of MAPE by Neighborhood
    • 5-9. Scatterplot of MAPE
    • 5-10. Split by Census Groups
  • Ⅵ. Discussion
    • 6-1. Accuracy
    • 6-2. Generalizability
  • Ⅶ. Conclusion

Ⅰ. Introduction

  1. What is the purpose of this project?

The purpose of this project is to predict house prices in Mecklenburg County, NC, and to build a model that can be generalized to other datasets. Drawing on local knowledge, our group specified a housing price model with about 40 variables. These variables describe amenities with an assumed relationship to housing price. After filtering out negligible variables, we generate a correlation matrix and a hedonic model.

  2. Why should we care about it?

Housing prices impact communities across space. Despite variability in local amenities, housing prices can segregate residents by class and race. Federal subsidies for suburbanization and the socio-economic climate of the U.S. have made home ownership and investment central to notions of success. The public sector must also be careful, because many home-related taxes, such as property and acquisition taxes, are levied on predicted prices. For example, if the price of a house in a low-income community is overestimated, the resulting tax burden can have significant cascading social side effects. Real estate projections of housing prices therefore influence the cultural and socio-political character of neighborhoods.

  3. What makes this a difficult exercise?

Data engineering is a complex task for data scientists. Determining which variables to include in the model is the first obstacle to accurately predicting housing prices. Most predictive models used in statistics do not reflect geographic factors, yet real estate prices are strongly affected by community context, that is, by proximity. Learning the boundaries of a community, a housing group recognized as a cluster, requires a wide range of expertise and experience. As community contexts shift between locales, the strength of correlations may diminish, which limits generalizability across specific communities.

  4. What is your overall modeling strategy?

Our overall modeling strategy follows a workflow of downloading and processing Mecklenburg open data, testing the associations of key variables with housing price, testing the spatial variation of the filtered variables, engineering features based on average nearest neighbor distances, and performing diagnostic tests to assess the accuracy and generalizability of our model.

  5. Briefly summarize your results.

Our model’s R² is 0.70, which tells us that our model can explain about 70% of the variation in price. Second, the Mean Absolute Error (MAE) is $115,126, meaning that the average difference between the predicted and actual price is $115,126. Finally, the Mean Absolute Percent Error (MAPE) is 0.28, suggesting that the model has an error of about 28%.


Ⅱ. Setup

This section contains the code for the libraries, functions, and settings required for the analysis. A number of packages were used to build the tests and visualizations in our model: tidyverse, tidycensus, and sf to clean and revise the datasets; caret, ckanr, FNN, and grid for modeling and layout utilities; and ggcorrplot to create correlation plots. kableExtra and jtools provide efficient tables and regression model plots, ggstance and ggpubr complement ggplot visuals with R-squared annotations, and broom.mixed supports effect plots. Other packages include knitr, rmarkdown, RSocrata, viridis, stargazer, and ggplot2.

# Rmarkdown global setting
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_chunk$set(fig.align = 'center')

# Import libraries
library(tidyverse)
library(tidycensus)
library(sf)
library(spdep)
library(caret)
library(ckanr)
library(FNN)
library(grid)
library(gridExtra)
library(ggcorrplot)# plot correlation plot
library(corrplot)
library(corrr)      # another way to plot correlation plot
library(kableExtra)
library(jtools)     # for regression model plots
library(ggstance) # to support jtools plots
library(ggpubr)    # plotting R^2 value on ggplot point scatter
library(broom.mixed) # needed for effects plots
library(knitr)
library(rmarkdown)
library(RSocrata)
library(viridis)
library(ggplot2)
library(stargazer)
library(XML)
library(data.table)
library(ggpmisc)
library(patchwork)


# External helper functions and data root
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")
root.dir = "https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/"

# Etc
options(scipen=999)
options(tigris_class = "sf")

# functions

mapThememin <- function(base_size = 10, title_size = 12, small_size = 8) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = title_size, colour = "black", hjust = 0.5),
    plot.subtitle=element_text(size = base_size, colour = "black", hjust = 0.5, face="italic"),
    plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5),
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.background = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_blank(),
    strip.text.x = element_text(size = base_size),
    strip.text.y = element_text(size = base_size),
    strip.background = element_rect(colour="transparent", fill="transparent"),
    legend.title = element_text(size = small_size),
    legend.text = element_text(size = small_size),
    legend.key.size = unit(0.4, "cm"))
}


corTheme <- function(base_size = 10, title_size = 12, small_size = 8){
  theme(axis.text =  element_blank(), 
        axis.ticks = element_blank(), 
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5))
}

corTheme2 <- function(base_size = 10, title_size = 12, small_size = 8){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5))
}

corTheme3 <- function(base_size = 9, title_size = 11, small_size = 7){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5))
}

corTheme4 <- function(base_size = 9, title_size = 11, small_size = 7){
  theme(axis.text =  element_text(size = small_size),
        text = element_text(size = 10),
        panel.background = element_rect(fill = greyPalette5[1]),
        axis.title.x = element_text(size = small_size),
        axis.title.y.right = element_text(size = small_size),
        plot.subtitle = element_text(hjust = 0.5, size = base_size,  face="italic"),
        plot.title = element_text(hjust = 0.5, size = title_size),
        plot.caption=element_text(size = small_size, colour = "black", hjust = 0.5))
}


q5 <- function(variable) {as.factor(ntile(variable, 5))}

qBr <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]],0),
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

qBr2 <- function(df, variable, rnd) {
  if (missing(rnd)) {
    as.character(quantile(round(df[[variable]]*100,0)/100,
                          c(.01,.2,.4,.6,.8), na.rm=T))
  } else if (rnd == FALSE | rnd == F) {
    as.character(formatC(quantile(df[[variable]],
                                  c(.01,.2,.4,.6,.8), na.rm=T),
                         digits = 3))
  }
}

substrRight <- function(x, n){
  substr(x, nchar(x)-n+1, nchar(x))
}

# Colors ("https://coolors.co/gradient-palette/a8f368-f9035e?number=7")
bluePalette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
blue2Palette5 <- c("#08519c","#3182bd","#6baed6","#bdd7e7","#eff3ff")
orangePalette5 <- c("#FFFBCB","#FFDCA0","#FEBE75","#FE9F4A","#FD801F")
greyPalette5 <- c("#f7f7f7","#cccccc","#969696","#636363","#252525")
greenPalette5 <- c("#edf8e9","#bae4b3","#74c476","#31a354","#006d2c")
purplePalette5 <- c("#f2f0f7","#cbc9e2","#9e9ac8","#756bb1","#54278f")


# LoadAPI(Min's key)
census_api_key("4bbe4bead4e5817f6a6b79e62c5bea69e77f1887", overwrite = TRUE)
#census_api_key("155cc525674a0d27c98afbb7030d0802e9bb5543", overwrite = TRUE)

Ⅲ. Data

3-1. Data Wrangling

We used data on the characteristics of houses sold, Census data, the Mecklenburg Quality of Life Explorer, and the City of Charlotte Open Data Portal. The distribution of current house prices in Figure 3-1 shows that a significant share of sales is spatially clustered. To increase the accuracy of prediction, it is necessary to closely examine the features related to this clustering.
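
The wrangling code itself is not reproduced in this report. As a hedged sketch only, the block below illustrates how the two base layers used throughout (the sale points Mklg.sf and the tract polygons Nhoods) could be assembled; the file paths and join columns are hypothetical placeholders, not the actual sources.

# Sketch (not evaluated): assembling the base layers used below.
# File paths are hypothetical placeholders.

# Home sales with house characteristics
Mklg.sf <- st_read("data/mecklenburg_home_sales.geojson") %>%
  st_transform(2264)   # EPSG:2264, NC State Plane (US ft)

# Tract polygons joined to Quality of Life indicators
Nhoods <- st_read("data/mecklenburg_tracts.geojson") %>%
  left_join(read_csv("data/quality_of_life.csv"), by = "GEOID") %>%
  st_transform(2264)

# Attach tract identifiers and context to each sale point
Mklg.sf <- st_join(Mklg.sf, Nhoods %>% dplyr::select(GEOID, MedHHInc, pctWhite))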

ggplot() +
  geom_sf(data = Nhoods, fill = greyPalette5[2], size = 0.3, color = greyPalette5[1]) +
  geom_sf(data = Mklg.sf, aes(colour = q5(price)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "House Price Distribution Map", 
       subtitle = "Mecklenburg County",
       caption = 'Figure 3-1')+
  scale_colour_manual(values = bluePalette5,labels=qBr(Mklg.sf,"price"),name="House Price($)\n(Quintile)") +
  mapThememin()

3-2. Summary Tables

In this study, a hedonic model was used to predict housing prices, and the characteristics bundled into housing products were divided into three categories. The first is internal characteristics, such as age, area, and number of bedrooms. The second is public services/amenities, which includes crime and schools. The third is the spatial process of price, which corresponds to spatially clustered characteristics.
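
Formally, the hedonic model treats each sale price as a linear bundle of these three groups of characteristics. A standard formulation (our notation, not taken from the source data) is:

$$\text{Price}_i = \beta_0 + \sum_j \beta_j\,\text{Internal}_{ij} + \sum_k \gamma_k\,\text{Amenity}_{ik} + \sum_l \delta_l\,\text{Spatial}_{il} + \varepsilon_i$$

where each sum runs over the variables in one category and $\varepsilon_i$ is the error term.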

#Present a table of summary statistics with variable descriptions. Sort these variables by their category (internal characteristics, amenities/public services or spatial structure). Check out the `stargazer` package for this


ic = c("House Area","Bedrooms","House Age","Heated Area","Full baths","Fire places" )
am = c("Grocery/Shopping center","Parks","College/School/Daycare","Wifi Hotspot/Electric car charger" ,"Transit Stations" , "Incidents/311 request")
sp = c("Neighborhood", "Median Household Income ","Percent of White", "Mean of Test Proficiency", "Mean Age of Death", "Mean of Energy Consumption")


table1 <- as.data.frame((data.frame("internal characteristics" = ic, "amenities/public service" = am, "spatial structure" = sp))) 

colnames(table1) = c("Internal Characteristic", "Amenities / Public service", "Spatial Structure")

kable(table1, caption = "<span style='font-size:14px'><b>Table 3-2-1. Independent Variables by category</b></span>") %>% 
  kable_minimal(full_width = T, html_font = "Arial", font_size = 14) %>%
  footnote(general = "", fixed_small_size = T, general_title = "")
Table 3-2-1. Independent Variables by category

| Internal Characteristic | Amenities / Public Service        | Spatial Structure          |
|-------------------------|-----------------------------------|----------------------------|
| House Area              | Grocery/Shopping center           | Neighborhood               |
| Bedrooms                | Parks                             | Median Household Income    |
| House Age               | College/School/Daycare            | Percent of White           |
| Heated Area             | Wifi Hotspot/Electric car charger | Mean of Test Proficiency   |
| Full baths              | Transit Stations                  | Mean Age of Death          |
| Fire places             | Incidents/311 request             | Mean of Energy Consumption |
variables <- select_if(st_drop_geometry(Mklg.sf), is.numeric) %>%
  na.omit() %>%
  select(Price = price, House_Area = shape_Area, Median_House_Income = MedHHInc,
         Percent_of_White = pctWhite, Number_of_Bedrooms = bedrooms, House_Age = Age,
         Heated_Area = heatedarea, Number_of_Full_Baths = fullbaths,
         Neighborhood = neighborhood, Percent_of_TreeCanopy = treeCanopy,
         Number_of_FirePlace = numfirepla, Elementary_School_Proficiency = testElementary,
         Percent_of_African_American = raceAf, Housing_Size = housingSize,
         Food_Service = foodService, Violent = violent, Mean_Age_of_Death = ageDeath,
         Grocery, Parks, College = college, Wifi = wifi, Daycare = daycare,
         Fire_Station = Firestaion, Electric_Car_Charger = charger,
         Building_Age = buildingAge, Internet = internet,
         School_Attendance = schoolAttendance, Pedestrian_Signal = pedSignal,
         Bus_Stop = busStop, Lots = lots, Transit_Stop = transit,
         Shopping_Center = shopping, Goldline_Stop = goldLine, Library = library,
         ROW = row, Art = art, Storm = storm, Historic_Area = historic2,
         Bicycle = bicycle, Driving_Commuters = drivingCommuters,
         Impervious_Surface = impSurface, Recreation = recreation,
         Middle_School_Proficiency = middleProficiency, Percent_of_Voters = voters,
         Landuse, Incidents, Request_of_311 = request311,
         Percent_of_Workable_Age = pctWorkableAge)

stargazer(variables, title = "Table 3-2-2. Summary Statistics of Variables", style = "apsr", type = "html", single.row = TRUE,  notes.align = "l", notes.label = "")
Table 3-2-2. Summary Statistics of Variables
Statistic N Mean St. Dev. Min Max
Price 44,270 448,609.400 336,475.900 0 5,000,000
House_Area 44,270 15,567.480 29,584.090 1,139.637 2,547,073.000
Median_House_Income 44,270 84,065.750 36,079.120 17,685 238,750
Percent_of_White 44,270 55.489 26.246 1 99
Number_of_Bedrooms 44,270 3.481 0.828 0 9
House_Age 44,270 28.834 24.297 0 194
Heated_Area 44,270 2,312.757 1,020.259 0.000 13,580.000
Number_of_Full_Baths 44,270 2.251 0.800 0 8
Neighborhood 44,270 4,841.597 1,616.028 102 6,411
Percent_of_TreeCanopy 44,270 53.041 8.268 9.565 70.023
Number_of_FirePlace 44,270 0.778 0.466 0 7
Elementary_School_Proficiency 44,270 50.532 15.560 19.834 82.903
Percent_of_African_American 44,270 28.130 17.961 3.232 76.762
Housing_Size 44,270 2,183.040 605.124 1,092.161 3,992.991
Food_Service 44,270 13.947 9.741 0.947 48.442
Violent 44,270 4.102 4.864 0.000 25.956
Mean_Age_of_Death 44,270 68.388 6.000 32.856 80.935
Grocery 44,270 7,220.817 4,143.838 742.377 29,896.560
Parks 44,270 5,655.340 2,678.314 295.369 16,925.850
College 44,270 24,785.810 10,195.780 2,041.614 57,784.700
Wifi 44,270 31,970.290 16,520.330 506.659 79,583.070
Daycare 44,270 4,046.637 2,798.538 194.812 22,071.700
Fire_Station 44,270 16,647.880 10,849.420 3,617.908 66,256.660
Electric_Car_Charger 44,270 11,098.700 5,612.350 518.012 40,472.540
Building_Age 44,270 31.230 10.037 7.600 60.833
Internet 44,270 79.155 9.818 47.065 93.638
School_Attendance 44,270 74.508 11.516 31.095 94.371
Pedestrian_Signal 44,270 17,570.080 14,128.510 446.025 69,222.200
Bus_Stop 44,270 4,718.594 4,553.700 145.797 32,479.790
Lots 44,270 12,897.230 5,393.791 1,738.079 40,707.510
Transit_Stop 44,270 11,090.210 6,377.707 811.751 36,962.670
Shopping_Center 44,270 7,184.185 4,196.688 484.963 32,116.810
Goldline_Stop 44,270 20,679.760 9,537.689 2,084.405 54,004.040
Library 44,270 15,822.210 11,452.440 212.879 55,903.450
ROW 44,270 6,025.538 9,543.410 4.370 55,072.190
Art 44,270 15,549.960 8,285.736 251.496 49,156.520
Storm 44,270 11,424.200 11,170.950 78.052 60,158.620
Historic_Area 44,270 41,399.940 20,356.730 3,095.319 96,365.510
Bicycle 44,270 0.938 0.521 0.000 2.350
Driving_Commuters 44,270 85.004 6.524 62.006 94.399
Impervious_Surface 44,270 17.485 6.979 3.698 41.939
Recreation 44,270 51.631 23.828 5.006 100.000
Middle_School_Proficiency 44,270 46.286 17.971 15.754 84.350
Percent_of_Voters 44,270 72.011 7.187 50.244 83.765
Landuse 44,270 102.410 28.496 0 700
Incidents 44,270 1,010.485 1,713.503 34.294 13,931.240
Request_of_311 44,270 169.475 258.592 0.175 3,973.719
Percent_of_Workable_Age 44,270 6.890 2.729 0 17


3-3. Correlation Matrix

One important assumption of multiple linear regression is that the independent variables should not be strongly correlated with one another. To check this, a correlation matrix was prepared. In the matrix below, only one of any two variables with r > 0.9 (or r < -0.9) was selected and used for model training.

#Present a correlation matrix

# Numeric variable selecting for correlation matrix 

numericVars <- 
  select_if(st_drop_geometry(Mklg.sf), is.numeric) %>% na.omit()

# Correlation matrix

testRes = cor.mtest(numericVars, conf.level = 0.95)

c2 <- numericVars %>% dplyr::select(Price = price, HouseArea = shape_Area, HouseIncome  = MedHHInc, PctWhite = pctWhite, Bedroom = bedrooms, HouseAge = Age,
                                                HeatedArea  = heatedarea, FullBaths= fullbaths, Neighborhood = neighborhood, treeCanopy, FirePlace =  numfirepla,
                                                  testElementary, raceAf, housingSize, foodService,
                                                  violent, ageDeath, Grocery,
                                                  Parks, college, wifi, daycare, Firestaion, charger,
                                                  buildingAge, internet, schoolAttendance,
                                                  pedSignal,busStop, lots, transit,shopping, goldLine,library,
                                                  row, art, storm, historic2,
                                                  bicycle, drivingCommuters, impSurface, recreation,
                                                  middleProficiency, voters, Landuse , Incidents, request311, pctWorkableAge) %>%
  cor() %>%
  corrplot(sub = "Figure 3-3. Correlation Matrix of Variables", method = 'circle', p.mat = testRes$p, insig='blank', col=COL1('Blues', 10), order = 'FPC', type = 'lower', tl.cex = 0.5, tl.col = greyPalette5[4],  addgrid.col = 'transparent', diag = TRUE, tl.srt = 45, cl.cex = 0.6, cl.offset = 0.2, bg = "transparent", sig.level = 0.1)
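
As a hedged alternative to inspecting the matrix by eye, the screening described above can be automated with caret's findCorrelation(), which flags one variable from each highly correlated pair at a given cutoff (its choices may differ from our manual selection):

# Sketch: flag one variable from each pair with |r| > 0.9 (caret is loaded in Setup)
corMatrix <- cor(numericVars, use = "complete.obs")
highCorr <- findCorrelation(corMatrix, cutoff = 0.9, names = TRUE)
highCorr   # candidate columns to drop before model training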

3-4. Correlation Scatterplots

Presented are four correlation scatterplots for variables of interest. We expected the correlation between house area and price to be high, but it turned out to be very low at 0.04. A natural-log transformation appears necessary here, although it was omitted from our regression modeling. Price decreased as house age increased, but that correlation was also very low. On the other hand, bedrooms and household income were found to have a significant correlation with house price.

#Present 4 home price correlation scatter plots that you think are of interest. I’m going to look for interesting open data that you’ve integrated with the home sale observations.

#log transform

Mklg.sf <- 
  Mklg.sf %>%
  mutate(lg_shape_Area = log(shape_Area+1), lg_Age = log(Age+1), lg_MedHHInc = log(MedHHInc+1))


css1 <-
  ggplot(Mklg.sf, aes(x = price, y = lg_shape_Area))+
  geom_point(colour = greyPalette5[5], size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 1)+
  stat_poly_eq(label.y = 0.95, label.x = 0.05, size = 3, color = bluePalette5[5])+
  labs(subtitle = "Plot1 : logged House Area",
       x = "Price",
       y = "logged House Area") +
  corTheme() 

css2 <-
  ggplot(Mklg.sf, aes(x = price, y = lg_Age))+
  geom_point(colour = greyPalette5[5], size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 1)+
  stat_poly_eq(label.y = 0.95,label.x = 0.05, size = 3, color = bluePalette5[5])+
  labs(subtitle = "Plot2 : logged House Age",
       x = "Price",
       y = "logged House Age") +
  corTheme()

css3 <-
  ggplot(Mklg.sf, aes(x = price, y = bedrooms))+
  geom_point(colour = greyPalette5[5], size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 1) +
  stat_poly_eq(label.y = 0.95, label.x = 0.05, size = 3, color = bluePalette5[5])+
  labs(subtitle = "Plot3 : Number of Bedrooms",
       x = "Price",
       y = "Bedrooms") +
  corTheme()

css4 <-
  ggplot(Mklg.sf, aes(x = price, y = MedHHInc))+
  geom_point(colour = greyPalette5[5], size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[5], size = 1)+
  stat_poly_eq(label.y = 0.95, label.x = 0.05, size = 3 ,color = bluePalette5[5]) +
  labs(subtitle = "Plot4 : Median Household Income",
       x = "Price",
       y = "Household Income") +
  corTheme()

# Display the four scatterplots together (patchwork)
css1 + css2 + css3 + css4 + plot_layout(ncol = 2)

3-5. Map of Mean House Price

Mean house price is visualized by averaging roughly 46,000 actual sales by census tract. The map displays obvious clustering; it is also intended to support feature selection by comparison with the tract-level averages of the independent variables.

# Calculate tract's average value of dependent variable

Mklg.sf.DV <- Mklg.sf %>%
  group_by(GEOID) %>%
  summarise(Avg.HP = mean(price),
            AVg.AG = mean(Age),
            Avg.AR = mean(shape_Area)) %>%
  st_drop_geometry() %>%
  mutate_all(~replace(., is.na(.), 0))

Mklg.Nhoods <- Nhoods %>%
  left_join(Mklg.sf.DV, by = "GEOID")
#Develop 1 map of your dependent variable (sale price)

M35 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(Avg.HP)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = purplePalette5,
                    labels = round(as.numeric(qBr(Mklg.Nhoods, "Avg.HP"))),
                    name = "House Price($)\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Map of Mean House Price", 
       subtitle = "Mecklenburg County",
       caption = 'Figure 3-5') +
  mapThememin()

M35

3-6. Map of Independent variables

Nine of the independent variables were summarized by tract and mapped, alongside mean price for reference, to intuitively grasp their correlation with the distribution of current house prices. Most of the features used here are spatially clustered. If a feature's clustering resembles the clustering pattern of current house prices, spatial autocorrelation between them is likely.

#Develop 3 maps of 3 of your most interesting independent variables.


M352 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(Avg.HP)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = purplePalette5,
                    labels = round(as.numeric(qBr(Mklg.Nhoods, "Avg.HP"))/1000),
                    name = "House Price\n(1,000$)",
                    na.translate=FALSE) +
  labs(title = "Mean House Price",
       caption = 'Figure 3-6-0')  +
  mapThememin()

#1IDV : House Age
M361 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(AVg.AG)), color = greyPalette5[1], size = 0.2) +
  # use the pre-reversed palette instead of rev() on the data, which would misalign rows
  scale_fill_manual(values = blue2Palette5,
                    labels = qBr(Mklg.Nhoods, "AVg.AG"),
                    name = "Age(years)\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Mean House Age", 
       caption = 'Figure 3-6-1') +
  mapThememin()


#2IDV : House Area
M362 <- ggplot()+
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(Avg.AR)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = round(as.numeric(qBr(Mklg.Nhoods, "Avg.AR"))/100),
                    name = "Area\n(100f²)",
                    na.translate=FALSE) +
  labs(title = "Mean House Shape Area",
       caption = 'Figure 3-6-2') +
  mapThememin()


#3IDV : Median Household Income
M363 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(MedHHInc)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = round(as.numeric(qBr(Mklg.Nhoods, "MedHHInc"))/1000),
                    name = "Income\n(1,000$)",
                    na.translate=FALSE) +
  labs(title = "Median Household Income",
       caption = 'Figure 3-6-3') +
  mapThememin()

#4IDV : percent of White
M364 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(pctWhite)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = qBr(Mklg.Nhoods, "pctWhite"),
                    name = "White(%)\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Percent of White",
       caption = 'Figure 3-6-4') +
  mapThememin()

#5IDV : percent of Vacant Housing
M365 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(pctVacnant)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = qBr(Mklg.Nhoods, "pctVacnant"),
                    name = "Vacant(%)\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Percent of Vacant Housing",
       caption = 'Figure 3-6-5') +
  mapThememin()

#6IDV : Gas Consumption
M366 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(gasConsumption)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = qBr(Mklg.Nhoods, "gasConsumption"),
                    name = "Consumption\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Gas Consumption",
       caption = 'Figure 3-6-6') +
  mapThememin()

#7IDV : Crime Violent
M367 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(violent)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = qBr(Mklg.Nhoods, "violent"),
                    name = "Crime\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Violent Crime",
       caption = 'Figure 3-6-7') +
  mapThememin()

#8IDV : Elementary Proficiency
M368 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(testElementary)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = qBr(Mklg.Nhoods, "testElementary"),
                    name = "Proficiency(%)\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Elementary Proficiency",
       caption = 'Figure 3-6-8') +
  mapThememin()

#9IDV : Percent of Poverty
M369 <- ggplot()+
  geom_sf(data = Mklg.Nhoods, aes(fill = q5(pctPoverty)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5,
                    labels = qBr(Mklg.Nhoods, "pctPoverty"),
                    name = "Poverty(%)\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Percent of Poverty",
       caption = 'Figure 3-6-9') +
  mapThememin()

# Display the reference price map and the nine independent-variable maps
M352 + M361 + M362 + M363 + M364 + M365 + M366 + M367 + M368 + M369 +
  plot_layout(ncol = 2)

According to Figure 3-6-1, houses are older toward the outskirts of the city. At the same time, Figure 3-6-2 shows that house size also increases toward the outskirts, and that the southeast of the city contains large houses built relatively recently.

Figure 3-6-3 maps the median income distribution, and Figure 3-6-4 maps the share of White residents. Their clustering patterns are very similar to each other, and both closely resemble Figure 3-5.

Figure 3-6-5 shows the share of vacant houses, which does not appear clustered. Figure 3-6-6, on the other hand, shows gas consumption, which follows a trend similar to the income distribution in Figure 3-6-3.

Figure 3-6-7 shows the distribution of violent crime; most crime occurs around the city center. Figure 3-6-8 shows elementary school proficiency; this map exhibits the spatial autocorrelation most similar to the housing price distribution in Figure 3-5.

3-7. Maps of Interest

#Include any other maps/graphs/charts you think might be of interest

# GroceryStores / MedicalFacilities / Parks / PlacesOfWorship

# Density map
MI1 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, fill = greyPalette5[2], color = greyPalette5[1], size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(add.a)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = bluePalette5[1], high = bluePalette5[5],
                      breaks=c(0.00000000002,0.00000000016),
                      labels=c("Min","Max"),
                      name = "Density") +
  scale_alpha(range = c(0.05, 0.45), guide = FALSE) +
  labs(title = "Density of Grocery Stores", 
       subtitle = "Total : 219 places",
       caption = 'Figure 3-7-1') +
  mapThememin()

MI2 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, fill = greyPalette5[2], color = greyPalette5[1], size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(add.b)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = bluePalette5[1], high = bluePalette5[5],
                      breaks=c(0.00000000004,0.00000000045),
                      labels=c("Min","Max"),
                      name = "Density") +
  scale_alpha(range = c(0.05, 0.45), guide = FALSE) +
  labs(title = "Density of Medical Facilities", 
       subtitle = "Total : 906 places",
       caption = 'Figure 3-7-2') +
  mapThememin()

MI3 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, fill = greyPalette5[2], color = greyPalette5[1], size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(add.c)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = bluePalette5[1], high = bluePalette5[5],
                      breaks=c(0.00000000002,0.00000000022),
                      labels=c("Min","Max"),
                      name = "Density") +
  scale_alpha(range = c(0.05, 0.45), guide = FALSE) +
  labs(title = "Density of Parks", 
       subtitle = "Total : 351 places",
       caption = 'Figure 3-7-3') +
  mapThememin()

MI4 <- ggplot() +
  geom_sf(data = Mklg.Nhoods, fill = greyPalette5[2], color = greyPalette5[1], size = 0.2) +
  stat_density2d(data = data.frame(st_coordinates(add.d)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = bluePalette5[1], high = bluePalette5[5],
                      breaks=c(0.00000000003,0.00000000029),
                      labels=c("Min","Max"),
                      name = "Density") +
  scale_alpha(range = c(0.05, 0.45), guide = FALSE) +
  labs(title = "Density of Worship place", 
       subtitle = "Total : 596 places",
       caption = 'Figure 3-7-4') +
  mapThememin()

# Display the four density maps
MI1 + MI2 + MI3 + MI4 + plot_layout(ncol = 2)

Assuming that places people visit regularly in daily life are closely related to housing prices, four types of facilities were used for density mapping. Interestingly, hospitals were both the most numerous of the four facility types and the most densely concentrated. In particular, the areas where hospitals are concentrated have relatively high housing prices, as can be seen from the maps above. In contrast, grocery stores, the least numerous, are relatively evenly distributed. The remaining two, parks and religious facilities, are concentrated in the city center like hospitals. These, too, appear closely related to housing prices.

Ⅳ. Methods

The machine learning workflow in this study consists of four steps: data wrangling > exploratory analysis > feature engineering > OLS regression. The first three steps were carried out in Chapter 3: collecting and processing the data (data wrangling), checking correlations between variables and their spatial distributions on maps (exploratory analysis), and creating variables based on average nearest neighbor distances (feature engineering).

We use this data for model training. Specifically, 80% of the data was used as a training set to fit the predictive model, and the remaining 20% was held out as a test set to measure the accuracy and generalizability of the model. Among the various statistical models for abstracting reality, we used Ordinary Least Squares (OLS) regression, because its process is relatively transparent and computationally efficient. We assume that house price can be predicted from a combination of several independent variables; this is the hedonic model, which is traditionally used in house price prediction.

From the predictive model we derive R² (coefficient of determination), MAE (mean absolute error), and MAPE (mean absolute percentage error), which are generally used as measures of model fit and accuracy. R² takes values between 0 and 1 and is a key indicator of goodness of fit; for example, an R² of 0.7 means the model can account for about 70% of the variation in price. A scatterplot of predicted versus observed prices also allows a visual, intuitive check of prediction accuracy.

Finally, we measure the generalizability of the model through cross-validation. The procedure above tested a single model on one random 80/20 split of the data; cross-validation repeats this process many times to examine the gaps between folds. The distribution of fold results indicates how well our regression function generalizes.
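
For reference, the three metrics are defined as follows (standard definitions; note that the code in Section 5-2 computes the absolute percentage error relative to the predicted rather than the observed price):

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad \text{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{y_i}, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $y_i$ is the observed price and $\hat{y}_i$ the predicted price.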


Ⅴ. Results

5-1. Table of results: training set

#Split the ‘toPredict’ == “MODELLING” into a training and test set. Provide a polished table of your training set lm summary results (coefficients, R2 etc)

Mklg.Modelling <- Mklg.sf %>% filter(toPredict == "MODELLING")
Mklg.Challenge <- Mklg.sf %>% filter(toPredict == "CHALLENGE")

set.seed(127)

inTrain <- createDataPartition(y = paste(Mklg.Modelling$lot_num), p = 0.8, list = FALSE)

Mklg.training <- Mklg.Modelling[inTrain,]
Mklg.test <- Mklg.Modelling[-inTrain,]


reg.training <- lm(price ~ ., data = st_drop_geometry(Mklg.training) %>% 
                                    dplyr::select(price, shape_Area, MedHHInc, pctWhite, bedrooms, Age,
                                                  heatedarea, fullbaths, neighborhood, treeCanopy, numfirepla,
                                                  testElementary, raceAf, housingSize, foodService,
                                                  violent, ageDeath, Grocery,
                                                  Parks, college, wifi, daycare, Firestaion, charger,
                                                  buildingAge, internet, schoolAttendance,
                                                  pedSignal,busStop, lots, transit,shopping, goldLine,library,
                                                  row, art, storm, historic2,
                                                  bicycle, drivingCommuters, impSurface, recreation,
                                                  middleProficiency, voters, Landuse , Incidents, request311, pctWorkableAge)) # Incidents, request311,

stargazer(reg.training, title = "Table 5-1. Summary Statistics of Training Set", style = "apsr", type = "html", single.row = TRUE,  notes.align = "l", notes.label = "")
Table 5-1. Summary Statistics of Training Set
price
shape_Area 1.244*** (0.035)
MedHHInc 0.999*** (0.052)
pctWhite 552.436*** (106.832)
bedrooms -46,218.190*** (1,717.270)
Age -633.458*** (60.860)
heatedarea 189.128*** (1.827)
fullbaths 43,370.140*** (2,130.209)
neighborhood -13.540*** (1.484)
treeCanopy -2,685.152*** (247.963)
numfirepla 8,727.476*** (2,541.672)
testElementary 3,001.035*** (325.217)
raceAf -1,498.343*** (314.208)
housingSize 122.539*** (6.146)
foodService 5,081.309*** (586.052)
violent -5,494.006*** (518.049)
ageDeath 11.412 (294.877)
Grocery -8.302*** (1.032)
Parks 3.194*** (0.566)
college -4.563*** (0.295)
wifi 3.442*** (0.391)
daycare 2.926*** (0.763)
Firestaion 3.486*** (0.689)
charger 2.547*** (0.385)
buildingAge 3,055.759*** (271.270)
internet 1,188.687*** (307.010)
schoolAttendance 1,843.176*** (282.046)
pedSignal 0.834*** (0.288)
busStop -2.251*** (0.556)
lots -4.425*** (0.429)
transit 5.038*** (0.467)
shopping 2.086** (0.967)
goldLine 0.856** (0.354)
library 3.768*** (0.246)
row 4.784*** (0.638)
art 3.082*** (0.395)
storm -1.654*** (0.338)
historic2 -12.069*** (0.366)
bicycle 24,696.270*** (5,150.500)
drivingCommuters 2,496.087*** (393.158)
impSurface 6,488.097*** (441.383)
recreation -441.988*** (91.708)
middleProficiency -1,881.922*** (296.886)
voters -4,451.392*** (600.893)
Landuse 102.093*** (36.979)
Incidents 12.314*** (1.144)
request311 -18.949*** (5.555)
pctWorkableAge -3,066.023*** (431.138)
Constant -216,712.000*** (46,163.270)
N 36,873
R2 0.683
Adjusted R2 0.683
Residual Std. Error 195,795.900 (df = 36825)
F Statistic 1,690.453*** (df = 47; 36825)
*p < .1; **p < .05; ***p < .01

This table summarizes the results on the training set. The R² value offers insight into how much of the variation in our dependent variable (housing price) the independent variables account for. An R² of 0.683 represents a moderately strong association between our independent variables and housing price; these variables explain approximately 68.3% of the variability in housing price in Mecklenburg County. The F-statistic tests whether our model as a whole predicts significantly better than an intercept-only baseline; an F-statistic this large, significant at well beyond the 95% confidence level, indicates that it does.
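
The quantities discussed above can be read directly off the fitted lm object; a small sketch:

# Sketch: fit statistics for the training regression
s <- summary(reg.training)
s$r.squared      # R2 (~0.683)
s$adj.r.squared  # adjusted R2
s$fstatistic     # F statistic with its degrees of freedom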


5-2. Table of goodness of fit: test set

This table shows how well the model computed above predicts data it has never seen before. First, R² = 0.70, which tells us that our model can explain about 70% of the variation in price. Second, the Mean Absolute Error (MAE) is $115,126, meaning that the average difference between the predicted and actual price is $115,126. Finally, the Mean Absolute Percent Error (MAPE) is 0.28, suggesting that the model has an error of about 28%.

#Provide a polished table of mean absolute error and MAPE for a single test set. Check out the “kable” function for markdown to create nice tables

Mklg.test <-
  Mklg.test %>%
  mutate(price.Predict = predict(reg.training, Mklg.test),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)

Mklg.training <-
  Mklg.training %>%
  mutate(price.Predict = predict(reg.training, Mklg.training),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)

Mklg.Challenge <-
  Mklg.Challenge %>%
  mutate(price = predict(reg.training, Mklg.Challenge),
         price.Predict = predict(reg.training, Mklg.Challenge),
         price.Error = price.Predict - price,
         price.AbsError = abs(price.Predict - price),
         price.APE = (abs(price.Predict - price)) / price.Predict)

Mklg.Challenge.csv <- Mklg.Challenge %>% dplyr::select(musaID, price.Predict) %>% st_drop_geometry()

write_csv(Mklg.Challenge.csv, file =  "Thwitches.csv")

R = cor(Mklg.test$price, Mklg.test$price.Predict, method = "pearson", use = "complete.obs")
R2 = R*R
MAE = mean(Mklg.test$price.AbsError, na.rm = T)
MAPE = mean(Mklg.test$price.APE, na.rm = T)

Mklg.test.summary <- as.data.frame(cbind(R2, MAE, MAPE))

colnames(Mklg.test.summary)<-c("R Square", "MAE", "MAPE")

kable(Mklg.test.summary, caption = "<span style='font-size:14px'><b>Table 5-2. Summary table of Test set results</b></span>", align = "c") %>% 
  kable_minimal(full_width = T, html_font = "Arial", font_size = 14) %>%
  footnote(general = "", fixed_small_size = T, general_title = "")
Table 5-2. Summary table of Test set results
R Square MAE MAPE
0.7024178 115126.5 0.2882856


5-3. Cross-Validation MAE histogram

The generalizability of the model can be measured through cross-validation; in this study, 500 folds were tested. There are two main ways to interpret this test. The first is simply to look at the standard deviation of the fold MAEs: $15,244 indicates how much the error varies between folds. In addition, since the test-set MAE in Table 5-2 is $115,126 and the mean MAE across the 500 cross-validation folds is $116,036, the two estimates agree closely, which speaks well of the model. The variation can also be examined through a histogram of the fold MAEs. If the model generalizes well, the fold MAEs should form a tight cluster; our distribution is roughly normal but does not appear tightly clustered.

#Provide the results of your cross-validation tests. This includes mean and standard deviation MAE. Do 100 folds and plot your cross-validation MAE as a histogram. Is your model generalizable to new data?


fitControl <- trainControl(method = "cv", number = 500)

set.seed(123)

reg.cv <- 
  train(price ~ ., data = st_drop_geometry(Mklg.sf) %>% 
                                dplyr::select(price, shape_Area, MedHHInc, pctWhite, bedrooms, Age,
                                                  heatedarea, fullbaths, neighborhood, treeCanopy, numfirepla,
                                                  testElementary, raceAf, housingSize, foodService,
                                                  violent, ageDeath, Grocery,
                                                  Parks, college, wifi, daycare, Firestaion, charger,
                                                  buildingAge, internet, schoolAttendance,
                                                  pedSignal,busStop, lots, transit,shopping, goldLine,library,
                                                  row, art, storm, historic2,
                                                  bicycle, drivingCommuters, impSurface, recreation,
                                                  middleProficiency, voters, Landuse , Incidents, request311, pctWorkableAge), 
     method = "lm", trControl = fitControl, na.action = na.pass)

#reg.cv
#reg.cv$resample[1:5,]
#reg.cv$resample[,3]

MAE.cv = data.table(MAE.cv= reg.cv$resample[,3])

#MAE.cv

MAE.sd <- as.character(round(sd(reg.cv$resample[,3])))
MAE.mean <- as.character(round(mean(reg.cv$resample[,3])))

caption <- paste("Sd =",MAE.sd, "  Mean =",MAE.mean, " K = 500")

ggplot(MAE.cv, aes(x = MAE.cv))+
  geom_histogram(fill=bluePalette5[3], color = greyPalette5[1], size = 2)+
  labs(title = "Cross-Validation MAE histogram",
       subtitle = caption,
       x="MAE",
       y="Count",
       caption = "Figure 5-3") +
  corTheme2()  


5-4. Predicted prices graph

This figure plots observed price as a function of predicted price. The grey line just under the blue line represents a perfect fit, and the blue line represents the average predicted fit. If the model fit perfectly, the two lines would overlap.

#Plot predicted prices as a function of observed prices

Mklg.training$set <- "Training"
Mklg.test$set <- "Test"
Mklg.Challenge$set <- "challenge"

Mklg.All <-
  rbind(Mklg.training, Mklg.test) 

Mklg.All.Challenge <-
  rbind(Mklg.All ,Mklg.Challenge) 

#head(Mklg.All)

ggplot(Mklg.All, aes(x = price.Predict, y = price))+
  geom_point(size =0.8, alpha = 0.8, color = bluePalette5[3])+
  geom_abline(color = greyPalette5[4])+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Predicted prices as a function of Observed prices",
       subtitle = "Grey line : Perfect prediction     Blue line : Average Prediction",
       x="Predicted price",
       y="Observed price",
       caption = "Figure 5-4") +
  corTheme2()  


5-5. Map of residuals

Figure 5-5 shows the residuals of the house price prediction model. Although some errors are large, the noise does not appear to be strongly clustered. To check this, a house's price can be compared with the mean price of neighboring houses: in this study, we define the lag price as the average price of the five houses closest to each house, and we run a Moran's I test on it.

##Provide a map of your residuals for your test set. Include a Moran’s I test and a plot of the spatial lag in errors.


ggplot() +
  geom_sf(data = Nhoods, fill = greyPalette5[2], size = 0.3, color = greyPalette5[1]) +
  geom_sf(data = Mklg.test, aes(colour = q5(price.Error)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Residual", 
       subtitle = "Test set",
       caption = 'Figure 5-5') +
  scale_colour_manual(values = bluePalette5,
                      labels=round(as.numeric(qBr(Mklg.test,"price.Error"))/1000),
                      name="Residual\n(1,000$)",
                      na.translate=FALSE) +
  mapThememin()


5-6. Spatial lag and Moran’s I test

Figure 5-6-1 is a scatterplot of price as a function of spatial lag price. It confirms that as a house's price increases, the average price of nearby houses also rises. The near-complete overlap of the grey and blue lines shows there is little difference between the two, which is evidence that house prices cluster. Figure 5-6-2 is built from the errors of our fitted model; similarly, as a house's error increases, so does the error of nearby house prices. Spatial autocorrelation can also be measured with Moran's I, shown in Figure 5-6-3. Moran's I ranges from -1 to 1, with values closer to 1 indicating clustering and values closer to -1 indicating dispersion; Moran's I = 0 means values are randomly distributed. The observed Moran's I of 0.19 indicates a modest degree of clustering in the errors.
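
For reference, Moran's I for values $x$ with spatial weights $w_{ij}$ (here, row-standardized weights over the five nearest neighbors) is defined as:

$$I = \frac{n}{\sum_i \sum_j w_{ij}} \cdot \frac{\sum_i \sum_j w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}{\sum_i (x_i - \bar{x})^2}$$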

coords <- st_coordinates(Mklg.sf) 
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
Mklg.sf$lagPrice <- lag.listw(spatialWeights, Mklg.sf$price)

coords.test <-  st_coordinates(Mklg.test) 
neighborList.test <- knn2nb(knearneigh(coords.test, 5))
spatialWeights.test <- nb2listw(neighborList.test, style="W")
Mklg.test$lagPrice <- lag.listw(spatialWeights.test, Mklg.test$price)
Mklg.test$lagPriceError <- lag.listw(spatialWeights.test, Mklg.test$price.Error) 


#Spatial lag : Price as a function of the spatial lag of price

SL1 <- ggplot(Mklg.test, aes(x = lagPrice, y = price))+
  geom_point(size =0.8, alpha = 0.8, color = bluePalette5[3])+
  geom_abline(color = greyPalette5[4])+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Prices as a function of Spatial Lag Price",
       subtitle = "Grey line : Perfect     Blue line : Average",
       x="Spatial Lag Price",
       y="Observed price",
       caption = "Figure 5-6-1") +
  corTheme3()  

#Spatial lag : Error as a function of the spatial lag of price

SL2 <- ggplot(Mklg.test, aes(x = lagPriceError, y = price.Error))+
  geom_point(size =0.8, alpha = 0.8, color = bluePalette5[3])+
  geom_abline(color = greyPalette5[4])+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  scale_y_continuous(position = "right")+
  labs(title = "Errors as a function of Spatial Lag Errors",
       subtitle = "Grey line : Perfect     Blue line : Average",
       x="Spatial Lag Error",
       y="Error",
       caption = "Figure 5-6-2") +
  corTheme4()  

moranTest <- moran.mc(Mklg.test$price.Error, 
                      spatialWeights.test, nsim = 999)


MT <- ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = bluePalette5[5],size=1) +
  scale_x_continuous(limits = c(-1, 1))+
  labs(title = "Observed and permuted Moran's I",
       subtitle = "Observed Moran's I = 0.1958 (blue line)",
       x="Moran's I",
       y="Count",
       caption = "Figure 5-6-3")  +
  corTheme2()  

SL1 + SL2

MT


5-7. Map of Predicted House Price

Figure 5-7-1 maps the distribution of observed house prices, and Figure 5-7-2 maps the prices predicted by the model for the entire dataset. Visually comparing the degree of clustering in the two maps suggests that the model was trained relatively well. Generalizability can additionally be checked through the map of MAPE and the maps of census context that follow.

#Provide a map of your predicted values for where ‘toPredict’ is both “MODELLING” and “CHALLENGE”

PP <- ggplot() +
  geom_sf(data = Nhoods, fill = greyPalette5[2], size = 0.3, color = greyPalette5[1]) +
  geom_sf(data = Mklg.All.Challenge, aes(colour = q5(price.Predict)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Predicted House Price", 
       subtitle = "Entire data set",
       caption = 'Figure 5-7-2')+
  scale_colour_manual(values = bluePalette5,
                      labels=round(as.numeric(qBr(Mklg.All,"price.Predict"))/1000),
                      name="Predicted\nPrice\n(1,000$)",
                      na.translate=FALSE)+
  mapThememin()

OP <- ggplot() +
  geom_sf(data = Nhoods, fill = greyPalette5[2], size = 0.3, color = greyPalette5[1]) +
  geom_sf(data = Mklg.All.Challenge, aes(colour = q5(price)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Observed House Price", 
       subtitle = "Entire data set",
       caption = 'Figure 5-7-1')+
  scale_colour_manual(values = purplePalette5,
                      labels=round(as.numeric(qBr(Mklg.All,"price"))/1000),
                      name="Observed\nPrice\n(1,000$)",
                      na.translate=FALSE)+
  mapThememin()

NT <- ggplot() +
  geom_sf(data = Nhoods, fill = greyPalette5[2], size = 0.3, color = greyPalette5[1]) +
  geom_sf(data = Mklg.All.Challenge %>% filter(price.Predict < 0) , aes(colour = q5(price)), alpha = 0.4, size = 0.5, show.legend = "point")+
  labs(title = "Map of Observed House Price", 
       subtitle = "Entire data set",
       caption = 'Figure 5-7-2')+
  scale_colour_manual(values = purplePalette5,
                      labels=round(as.numeric(qBr(Mklg.All,"price"))/1000),
                      name="Observed\nPrice\n(1,000$)",
                      na.translate=FALSE)+
  mapThememin()

#NT

OP + PP + plot_layout(nrow = 1)

5-8. Map of MAPE by Neighborhood

Figure 5-8 examines generalizability through the distribution of prediction errors by neighborhood. The regions with large deviations are regions with few housing transactions, and therefore little data to learn from. Given the basic premise of machine learning, learning from experience, it is difficult to compensate for this lack of data within the current study.

#Using the test set predictions, provide a map of mean absolute percentage error (MAPE) by neighborhood

Mklg.test.MAPE <- Mklg.test %>%
  group_by(GEOID) %>%
  summarise(meanMAPE = mean(price.APE), meanPrice = mean(price)) %>%
  st_drop_geometry()

Mklg.test.MAPE <- Nhoods %>% 
  left_join(Mklg.test.MAPE, by = "GEOID")

ggplot() +
  geom_sf(data = Mklg.test.MAPE, aes(fill = q5(meanMAPE)), color = greyPalette5[1], size = 0.2) +
  scale_fill_manual(values = bluePalette5 ,
                    labels = round(as.numeric(qBr2(Mklg.test.MAPE, "meanMAPE"))*100)/100,
                    name = "MAPE\n(Quintile)",
                    na.translate=FALSE) +
  labs(title = "Map of MAPE by neighborhood", 
       subtitle = "Test data set",
       caption = 'Figure 5-8') +
  mapThememin()


5-9. Scatterplot of MAPE

Figure 5-9 is a scatterplot of MAPE by neighborhood as a function of mean price by neighborhood. It shows that MAPE is tightly clustered except for a few outliers, which suggests the current model is relatively well designed.

#Provide a scatterplot plot of MAPE by neighborhood as a function of mean price by neighborhood

ggplot(Mklg.test.MAPE %>% filter(meanMAPE < 2000) %>% filter(meanMAPE > -80), aes(x = meanMAPE, y = meanPrice))+
  geom_point(size=1)+
  geom_smooth(method = "lm", se = FALSE, colour = bluePalette5[4], size = 1)+
  labs(title = "Scatterplot of MAPE as a function of mean price",
       subtitle = "by neighborhood",
       x="MAPE",
       y="price",
       caption = "Figure 5-9") +
  corTheme2()  


5-10. Split by Census Groups

To check generalizability, income and race contexts were used for the analysis. The distribution of predicted prices in Figure 5-10-1 is roughly similar to the income and race contexts shown in Figures 5-10-2 and 5-10-3.

#Using tidycensus, split your study area into two groups (perhaps by race or income) and test your model’s generalizability. Is your model generalizable

Mklg.Nhoods.split <- Mklg.Nhoods %>%
  mutate(raceContext = ifelse(pctWhite > 50, "White", "Non-White"),
         incomeContext = ifelse(MedHHInc > mean(MedHHInc,na.rm = T), "High", "Low"))

CT1 <- ggplot() +
  geom_sf(data = Mklg.Nhoods.split, aes(fill = raceContext), color = greyPalette5[2], size = 0.2) +
  scale_fill_manual(values = c(bluePalette5[1],bluePalette5[4]),
                    name = "RaceContext",
                    na.translate=FALSE) +
  labs(title = "Race Context", 
       subtitle = "Majority White = Percent of White > 50%",
       caption = 'Figure 5-10-2') +
  mapThememin()

CT2 <- ggplot() +
  geom_sf(data = Mklg.Nhoods.split, aes(fill = incomeContext), color = greyPalette5[2], size = 0.2) +
  scale_fill_manual(values = c(bluePalette5[4],bluePalette5[1]),
                    name = "IncomeContext",
                    na.translate=FALSE) +
  labs(title = "Income Context", 
       subtitle = "High Income = Income > Mean Income",
       caption = 'Figure 5-10-3') +
  mapThememin()

PP + labs(caption = 'Figure 5-10-1')

CT1 + CT2
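
The maps above set up the comparison; a hedged sketch of the remaining summary step (not run above), joining test-set errors to the two contexts by GEOID and averaging MAPE within each group, would look like:

# Sketch: mean test-set MAPE by census context
Mklg.test %>%
  st_drop_geometry() %>%
  left_join(st_drop_geometry(Mklg.Nhoods.split) %>%
              dplyr::select(GEOID, raceContext, incomeContext),
            by = "GEOID") %>%
  group_by(raceContext, incomeContext) %>%
  summarise(mean.MAPE = mean(price.APE, na.rm = TRUE), .groups = "drop")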

Ⅵ. Discussion

1. Is this an effective model?

This model is moderately effective. With about 40 variables and modest computation time, we can predict housing prices with a 28% mean absolute percentage error and relatively stable performance across test sets. While this is not accurate enough to base policy on, it can provide context for planners and other professionals, and it is generalizable enough to be used on different datasets; it is therefore a modestly effective model.

2. What were some of the more interesting variables?

As explained in 3-7 (Maps of Interest), grocery stores, hospitals, religious facilities, and parks were interesting. Among the four facility types, hospitals were both the most numerous and the most densely concentrated, and the areas where hospitals cluster have relatively high housing prices, as the maps above show. In contrast, grocery stores, the least numerous, are relatively evenly distributed; parks and religious facilities are concentrated in the city center like hospitals, and these also appear closely related to housing prices. Elementary school proficiency was also interesting: it had a strong spatial pattern that appeared highly correlated with household income and the percentage of White population. This supports the spatial theory that predominantly high-income White areas have schools with greater funding and thus better performance.

3. How much of the variation in prices could you predict?

Our model’s Mean Absolute Percent Error (MAPE) is 0.28, meaning predictions are off by about 28% of the sale price on average. The R² is 0.70, which tells us that the model explains about 70% of the variation in prices.
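
As a quick reference, these two statistics can be computed along the following lines; the column names SalePrice and SalePrice.Predict are illustrative, not necessarily those used in the code above:

# Illustrative calculation of the headline statistics on the test set
errors <- Mklg.test$SalePrice.Predict - Mklg.test$SalePrice
MAPE <- mean(abs(errors) / Mklg.test$SalePrice, na.rm = TRUE)
R2 <- 1 - sum(errors^2, na.rm = TRUE) /
  sum((Mklg.test$SalePrice - mean(Mklg.test$SalePrice, na.rm = TRUE))^2, na.rm = TRUE)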

4. Describe the more important features.

Many variables were used, but when checked against the maps, the variables whose spatial pattern resembles the clustering of actual housing prices, such as the share of white residents and household income, appear to have contributed most to the model’s performance. Household income on its own shows one of the stronger correlations with housing prices, suggesting that people who live in more expensive homes tend to earn more, which is intuitive.

5. Describe the error in your predictions.

The mean absolute error was $115,126, which is rather large; this model therefore would not be accurate enough to inform taxation policy, for example, but could be useful for planners seeking a rough picture of the housing market. Also, Figure 5-8 shows that the regions with the largest errors are regions with few housing transactions and thus little data to learn from. Since machine learning models learn from experience, it is difficult to compensate for this lack of data within the current study.
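
This sparse-data explanation can be checked directly. The sketch below is illustrative, assuming a neighborhood identifier (Nhood_name) and an absolute-percent-error column (SalePrice.APE) on the test set:

# Illustrative check: neighborhoods with few sales should show larger mean errors
Mklg.test %>%
  st_drop_geometry() %>%
  group_by(Nhood_name) %>%
  summarize(saleCount = n(),
            meanAPE = mean(SalePrice.APE, na.rm = TRUE)) %>%
  arrange(desc(meanAPE))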

6. According to your maps, can you account for the spatial variation in prices?

Our model can account for some of the spatial variation in prices, since it proved fairly accurate and generalizable across subsets of houses in different locations. However, the Moran’s I of the error terms was 0.1958, indicating some degree of spatial clustering in the residuals; the model therefore retains a small amount of spatial bias. Also, judged against the race and income contexts, the model appears able to explain the spatial price distribution in part.
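
For reference, a Moran’s I test of this kind (as in section 5-6) can be run with spdep along these lines; the k-nearest-neighbor weights and the SalePrice.Error column name are illustrative:

# Illustrative Moran's I test on the residuals using k-nearest-neighbor weights
library(spdep)

coords  <- st_coordinates(Mklg.test)
nbList  <- knn2nb(knearneigh(coords, k = 5))
weights <- nb2listw(nbList, style = "W")

moran.test(Mklg.test$SalePrice.Error, weights)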

7. Where did the model predict particularly well? Poorly?

Looking at the map of residuals in 5-5, we can see that the error is distributed almost randomly, and no particular region’s predictions appear wildly off. Our model had a decent R² of about 0.70; however, for application purposes a model with a higher R² would be needed. Our mean absolute percent error was also somewhat high at around 28%, and our errors were slightly spatially correlated, suggesting that some neighborhoods would be disproportionately affected by the model’s shortcomings. However, the model does well at providing a big-picture view of Mecklenburg County with relatively few variables and low computation power required. In addition, it showed impressive generalizability, with a mostly normal distribution of mean absolute percent error across cross-validation test sets.

8. Why do you think this might be?

A large reason we believe this model did not perform more accurately is that we were not able to pull additional data of importance. In addition, we believe that statistical variable-selection methods, such as forward and backward stepwise selection under AIC or BIC, or random forest methods, would have given our model a greater edge in producing accurate predictions on test sets. We believe our model generalized well because of the size and quality of the dataset on which it was trained, meaning the model was not skewed by outlier observations unrepresentative of other areas.
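
The stepwise idea mentioned above is a quick experiment in base R. This is a sketch, assuming an existing lm fit named reg:

# Illustrative stepwise selection on an existing lm fit (reg is assumed)
reg.step.aic <- step(reg, direction = "both", k = 2)               # k = 2 penalizes by AIC
reg.step.bic <- step(reg, direction = "both", k = log(nobs(reg)))  # k = log(n) gives BIC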

6-1. Accuracy

When tested on the test set, the model produced a mean absolute error (MAE) of $115,126, which is 38% of the median predicted housing price. Whether this counts as accurate is relative to the goal of the project. Looking at our visualization of observed housing prices as a function of predicted prices, one can see that we may be slightly underestimating housing prices; this bias could be adjusted for with further data engineering.

The model’s accuracy can also be assessed by comparing the test-set results with the cross-validation results. The MAE on the test set is $115,126, while the average cross-validation MAE is $116,036: a gap of only about $900, which suggests the model is relatively stable. Nevertheless, it seems difficult to use errors of this size directly in sensitive policies such as tax assessment; more careful variable design would be needed first.

To summarize the headline statistics: R² is 0.70, meaning the model explains about 70% of the variation in prices; the Mean Absolute Error (MAE) is $115,126, meaning the average gap between predicted and actual price is $115,126; and the Mean Absolute Percent Error (MAPE) is 0.28, meaning an average error of about 28%.
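
For reference, the cross-validation side of this comparison can be pulled from a caret fit along these lines; the object name reg.cv and the 100-fold setting are illustrative:

# Illustrative cross-validation with caret; the average fold MAE is compared
# against the single held-out test-set MAE
library(caret)

fitControl <- trainControl(method = "cv", number = 100)
reg.cv <- train(SalePrice ~ ., data = st_drop_geometry(Mklg.training),
                method = "lm", trControl = fitControl, na.action = na.pass)

mean(reg.cv$resample$MAE)   # average MAE across the folds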


6-2. Generalizability

The generalizability of our model can be assessed in four main ways. First, the Moran’s I test of section 5-6 shows that the current model’s errors are only weakly clustered. Second, visually comparing the distributions of actual and predicted housing prices in section 5-7, the clustering patterns are similar. Third, the scatterplot of MAPE in section 5-9 shows most neighborhoods tightly concentrated in one region; some of the outliers appear to occur where there is little actual housing-transaction data. Fourth, reviewed in the context of race and income, the pattern of housing prices is relatively similar across groups. Together, these points suggest that our model generalizes relatively well.

Ⅶ. Conclusion

1. Would you recommend your model to Zillow? Why or why not?

While this model has a modest R² of 0.70 (70% of the variation in housing prices in Mecklenburg County can be explained by the model), we do not think that is high enough to recommend it. Our mean absolute error is $115,126, which is a significant risk for Zillow to take on when basing business decisions on the model’s estimates.

2. How might you improve this model?

We would add variables, especially ones capturing trends in recent housing price appreciation: comparing census variables across past years to calculate rates of housing price increase would add depth to the study and pick up possible effects of gentrification. These additional variables would ideally improve both the accuracy and the generalizability of the model. Furthermore, in the process of designing variables, we think there is a high chance of getting better results by log-transforming skewed variables.
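
A minimal sketch of that log-transformation idea, assuming a training frame Mklg.training and an existing predictor set: fit on log(SalePrice) and back-transform the predictions with exp().

# Illustrative log-price specification; predictions are back-transformed with exp()
reg.log <- lm(log(SalePrice) ~ ., data = st_drop_geometry(Mklg.training))
Mklg.test$SalePrice.Predict.log <- exp(predict(reg.log, st_drop_geometry(Mklg.test)))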