Which linear model is best?


Recently I have been working on a Kaggle competition where participants are tasked with predicting Russian housing prices. In developing a model for the challenge, I came across a few methods for selecting the best regression
model for a given dataset. Let’s load up some data and take a look.

swiss <- data.frame(datasets::swiss)
## [1] 47  6
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

This data provides data on Swiss fertility and some socioeconomic
indicators. Suppose we want to predict fertility rate. Each observation
is as a percentage. In order to assess our prediction ability, create
test and training data sets.

index <- sample(nrow(swiss), 5)
train <- swiss[-index,]
test <- swiss[index,]

Next, have a quick look at the data.

plot(train$Education, train$Fertility)
plot(train$Catholic, train$Fertility)
plot(train$Infant.Mortality, train$Fertility)


Looks like there could be some interesting relationships here. The
following block of code will take a model formula (or model matrix) and
search for the best combination of predictors.

leap <- regsubsets(Fertility~., data = train, nbest = 10)
leapsum <- summary(leap)
plot(leap, scale = 'adjr2')

leap plot

According to the adjusted R-squared value (larger is better), the best
two models are: those with all predictors and all predictors less
Examination. Both have adjusted R-squared values around .69 – a decent fit. Fit these models so we can evaluate them further.

swisslm <- lm(Fertility~., data = train)
swisslm2 <- lm(Fertility~.-Examination, data = train)
#use summary() for a more detailed look.

First we'll compute the mean squared error (MSE).

mean((train$Fertility-predict(swisslm, train))[index]^2)
## [1] 44.21879
mean((train$Fertility-predict(swisslm2, train))[index]^2)
## [1] 36.4982

Looks like the model with all predictors is marginally better. We're
looking for a low value of MSE here. We can also look at the Akaike
information criterion (AIC) for more information. Lower is better here
as well.

step1 <- stepAIC(swisslm, direction = "both")
## Stepwise Model Path 
## Analysis of Deviance Table
## Initial Model:
## Fertility ~ Agriculture + Examination + Education + Catholic + 
##     Infant.Mortality
## Final Model:
## Fertility ~ Agriculture + Education + Catholic + Infant.Mortality
##            Step Df Deviance Resid. Df Resid. Dev      AIC
## 1                                  36   1746.780 168.5701
## 2 - Examination  1 53.77608        37   1800.556 167.8436

Here, the model without Examination scores lower than the full
model. It seems that both models are evenly matched, though I might be
inclined to use the model without Examination.

Since we sampled our original data to make the train and test datasets,
the difference in these tests is subject to change based on the training
data used. I encourage anyone who wants to test themselves to change the
set.seed at the beginning and evaluate the above results again. Even
better: use your own data!

If you learned something or have questions feel free to leave a comment
or shoot me an email.

Happy modeling,



Using R as a GIS


In real estate, spatial data is the name of the game. Countless programs
in other domains utilize the power of this data, which is becoming more
prevalent by the day.

In this post I will go over a few simple, but powerful tools to get you
started using using geographic information in R.

##First, some libraries##
#install.packages('GISTools', dependencies = T)

GISTools provides an easy-to-use method for creating shading schemes
and choropleth maps. Some of you may have heard of the sp package,
which adds numerous spatial classes to the mix. There are also functions
for analysis and making things look nice.

Let’s get rolling: source the vulgaris dataset, which contains
location information for Syringa Vulgaris (the Lilac) observation
stations and US states. This code plots the states and vulgaris

data(&quot;vulgaris&quot;) #load data
par = (mar = c(2,0,0,0)) #set margins of plot area
plot(vulgaris, add = T, pch = 20)

alt text

One thing to note here is the structure of these objects. us_states is
a SpatialPolygonsDataFrame, which stores information for plotting shapes
(like a shapefile) within its attributes. vulgaris by contrast is a
SpatialPointsDataFrame, which contains data for plotting individual
points. Much like a data.frame and $, these objects harbor
information that can be accessed via @.

Station Year Type Leaf Bloom Station.Name State.Prov Lat Long Elev
3695 61689 1965 Vulgaris 114 136 COVENTRY CT 41.8 -72.35 146
3696 61689 1966 Vulgaris 122 146 COVENTRY CT 41.8 -72.35 146
3697 61689 1967 Vulgaris 104 156 COVENTRY CT 41.8 -72.35 146
3698 61689 1968 Vulgaris 97 134 COVENTRY CT 41.8 -72.35 146
3699 61689 1969 Vulgaris 114 138 COVENTRY CT 41.8 -72.35 146
3700 61689 1970 Vulgaris 111 135 COVENTRY CT 41.8 -72.35 146

Let’s take a look at some functions that use this data.

newVulgaris kable(head(data.frame(newVulgaris)))
x y
3 4896 -67.65 44.65
3 4897 -67.65 44.65
3 4898 -67.65 44.65
3 4899 -67.65 44.65
3 4900 -67.65 44.65
3 4901 -67.65 44.65

gIntersection, as you may have guessed from the name, returns the
intersection of two spatial objects. In this case, we are given the
points from vulgaris that are within us_states. However, the rest of
the vulgaris data has been stripped from the resulting object. We’ve
got to jump through a couple of hoops to get that information back.

&lt;br /&gt;newVulgaris &lt;- data.frame(newVulgaris)
tmp &lt;- rownames(newVulgaris)
tmp &lt;- strsplit(tmp, &quot; &quot;)
tmp &lt;- (sapply(tmp, &quot;[[&quot;, 2))
tmp &lt;- as.numeric(tmp)
vdf &lt;- data.frame(vulgaris)
newVulgaris &lt;- subset(vdf, row.names(vdf) %in% tmp)
Station Year Type Leaf Bloom Station.Name State.Prov Lat Long Elev Long.1 Lat.1 optional
3695 61689 1965 Vulgaris 114 136 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3696 61689 1966 Vulgaris 122 146 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3697 61689 1967 Vulgaris 104 156 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3698 61689 1968 Vulgaris 97 134 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3699 61689 1969 Vulgaris 114 138 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE
3700 61689 1970 Vulgaris 111 135 COVENTRY CT 41.8 -72.35 146 -72.35 41.8 TRUE

Look familiar? Now we’ve got a data frame with the clipped vulgaris
values and original data preserved.

vulgarisSpatial ```

After storing our clipped data frame as a SpatialPointsDataFrame, we can
again make use of it - in this case we add a shading scheme to the
`vulgaris` points.

``` r
shades shades$cols plot(us_states)
choropleth(vulgarisSpatial, vulgarisSpatial$Elev,shading = shades, add = T, pch = 20)

alt text

Colors are pretty, but what do they mean? Let’s add a legend.

us_states@bbox #Get us_states bounding box coordinates.
 ##min max
 ## r1 -124.73142 -66.96985
 ## r2 24.95597 49.37173
choropleth(vulgarisSpatial, vulgarisSpatial$Elev,shading = shades, add = T, pch = 20)
par(xpd=TRUE) #Allow plotting outside of plot area.
choro.legend(-124, 30, shades, cex = .75, title = &quot;Elevation in Meters&quot;) # Plot legend in bottom left. Takes standard legend() params.

alt text

It looks like there’s a lot going on in the Northeastern states. For a
closer look, create another clipping (like above) and plot it. Using the
structure below, we can create a selection vector. I have hidden the
full code since it is repetitive (check GitHub for the full code.)

index '...'
choropleth(vulgarisNE, vulgarisNE$Elev,shading = shades, add = T, pch = 20)
par(xpd = T)
choro.legend(-73, 39.75, shades, cex = .75, title = &quot;Elevation in Meters&quot;)

alt text

Hopefully this has been a useful introduction (or refresher) on spatial
data. I always learn a lot in the process of writing these posts. If you
have any ideas or suggestions please leave a comment or feel free to
contact me!

Happy mapping,


Mapping Housing Data with R

What is my home worth?  Many homeowners in America ask themselves this question, and many have an answer.  What does the market think, though?  The best way to estimate a property’s value is by looking at other, similar properties that have sold recently in the same area – the comparable sales approach.  In an effort to allow homeowners to do some exploring (and because I needed a new project), I developed a small Shiny app with R.

My day job allows me access to the local multiple listing service, which provides a wealth of historic data.  The following project makes use of that data to map real estate that has sold near Raleigh, NC in the past six months.  Without getting too lost in the weeds I’ll go over a few key parts of the process.  Feel free to jump over to my GitHub page to check out the full source code.  Click here to view the app!

  1. Geocode everything.  The data did not come with latitude and longitude coordinates, so we’ll have to do some geocoding.  I haven’t found an efficient way to do this in R, so, like in the mailing list example, I’ll use QGIS to process my data and return a .csv for each town I’m interested in.
    Screen Shot 2017-03-12 at 5.43.28 PM
  2. Setup your data.  To make sure that everything runs smoothly later on, we’ve got to import our data using readr and make sure each attribute is typed properly.
    apex <- read_csv("apex2.csv")
    #Remove non-character elements from these columns.
    df$`Sold Price` <- as.numeric(gsub("[^0-9]","",df$`Sold Price`))
    df$`List Price` <- as.numeric(gsub("[^0-9]","",df$`List Price`))
    #Some re-typing for later.
    df$Fireplace <- as.numeric(df$Fireplace)
    df$`New Constr` <- as.factor(df$`New Constr`)
    #Assign some placeholders.
    assign("latitude", NA, envir = .GlobalEnv)
    assign("longitude", NA, envir = .GlobalEnv)
  3. Get info from the user.  The first thing the app wants you to do is give it some characteristics about the subject property, a property that you are interested in valuating.  A function further down uses this information to produce a map using these inputs.
     #What city's dataset are we using?
     selectInput("city", label = "City", c("Apex", "Cary", "Raleigh"))
     #Get some info.
     textInput("address",label = "Insert Subject Property Address", value = "2219 Walden Creek Drive"),
     numericInput("dist", label = "Miles from Subject", value = 5, max = 20),
     numericInput("footage",label = "Square Footage", value = 2000),
     selectInput("acres",label = "How Many Acres?", acresf)
     #Changes datasets based on what city you choose on the frontend.
     #This expression is followed by two more else if statements.
    observeEvent(input$city, {
     if(input$city == "Apex") {
     cityschools <-schoolsdf$features.properties %>%
     filter(ADDRCITY_1 == "Apex")
     assign("cityschools", cityschools, envir = .GlobalEnv)
     #Draw the map on click.
     observeEvent(input$submit, {
     output$map01 <- renderLeaflet({distanceFrom(input$address, input$footage, input$acres,tol = .15, input$dist)
  4. Filter the data.  The distanceFrom function above uses dplyr to filter the properties in the selected city by square footage, acreage, and distance from the subject property.  The tol argument is used to give a padding around square footage – few houses match exactly in that respect.
     #Filter once.
     houses_filtered <- houses %>%
      filter(Acres == acres)%>%
      filter(LvngAreaSF >= ((1-tol)*sqft)) %>%
      filter(LvngAreaSF <= ((1+tol)*sqft))
     #This grabs lat & long from Google.
     longitude_subj <- as.numeric(longitude)
     latitude_subj <- as.numeric(latitude)
     #Use the comparable house locations.
     xy <- houses_filtered[,1:2]
     xy <- as.matrix(xy)
     #Calculate distance.
     d <- spDistsN1(xy, c(longitude_subj, latitude_subj), longlat = TRUE)
     d <- d/1.60934
     d <- substr(d, 0,4)
     #Filter again.
     distance <- houses_filtered %>%
      filter(distanceMi <= dist)
  5. Draw the map. The most important piece, the map, is drawn using Leaflet.  I have the Schools layer hidden initially because it detracts from the main focus – the houses.
    map <- leaflet() %>%
     addTiles(group = "Detailed") %>%
     addProviderTiles("CartoDB.Positron", group = "Simple") %>%
     addAwesomeMarkers(lng = longitude, lat = latitude, popup = subj_address, icon = awesomeIcons(icon='home', markerColor = 'red'), group = "Subject Property") %>%
     addAwesomeMarkers(lng = distance$X, lat = distance$Y, popup = paste(distance$Address,distance$`Sold Price`, distance$distanceMi, sep = ""), icon = awesomeIcons(icon = 'home', markerColor = 'blue'), group = "Comps")%>%
     addAwesomeMarkers(lng = schoolsdf$long, lat = schoolsdf$lat, icon = awesomeIcons(icon = 'graduation-cap',library = 'fa', markerColor = 'green', iconColor = '#FFFFFF'), popup = schoolsdf$features.properties$NAMELONG, group = "Schools")%>%
       baseGroups = c("Simple", "Detailed"),
       overlayGroups = c("Subject Property", "Comps", "Schools"),
       options = layersControlOptions(collapsed = FALSE))
    map <- map %>% hideGroup(group = "Schools") 
  6. Regression model.  The second tab at the top of the page leads to more information input that is used in creating a predictive model for the subject property.  The implementation is somewhat messy, so if you’d like to check it out, the code is at the bottom of app.R in the GitHub repo.

That’s it!  It took a while to get all the pieces together, but I think the final product is useful and I learned a lot along the way.  There are a few places I want to improve: simplify the re-typing sections, make elements refresh without clicking submit, among others.  If you have any questions about the code please leave a comment or feel free to send me an email.

Happy coding,

Kiefer Smith





Initial Node

Stories are an integral part of the human experience.  They don’t have to be true, but there is something that about seeing “based on a true story” that grabs our attention.  Fiction is entertaining and can be instructive, but a true story we can really connect with.  A true story takes place in our universe – a place where we know the rules.

Finding truth has become increasingly difficult, a problem that became well-known during the 2016 election season.  My goal with this blog (and in life) is to tell interesting stories that are based in fact, based on data.  While no science is infallible, I feel that a story based on numbers is as close to “based on a true story” as we can get.

I mainly use R for data processing and analysis, but I’m slowly learning Python.  These will be my tools for telling stories.  If you have a burning question or find some interesting data, I’d be happy to dig into it.  As this blog progresses I hope to improve my data science skills and hopefully get better at writing blog posts.

– Kiefer