Mapping Housing Data with R

What is my home worth?  Many homeowners in America ask themselves this question, and many have an answer.  What does the market think, though?  The best way to estimate a property’s value is by looking at other, similar properties that have sold recently in the same area – the comparable sales approach.  In an effort to allow homeowners to do some exploring (and because I needed a new project), I developed a small Shiny app with R.

My day job allows me access to the local multiple listing service, which provides a wealth of historical data.  The following project makes use of that data to map real estate that has sold near Raleigh, NC in the past six months.  Without getting too lost in the weeds, I’ll go over a few key parts of the process.  Feel free to jump over to my GitHub page to check out the full source code.  Click here to view the app!

  1. Geocode everything.  The data did not come with latitude and longitude coordinates, so we’ll have to do some geocoding.  I haven’t found an efficient way to do this in R, so, as in the mailing list example, I’ll use QGIS to process my data and return a .csv for each town I’m interested in.  (A sketch of an R-only approach follows below.)
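    For anyone who wants to try keeping this step in R anyway, here is a rough sketch (not what I used for this project) with the ggmap package, which wraps Google’s geocoding API.  It assumes the raw MLS export lives in apex_raw.csv with an Address column (both names are illustrative), and newer ggmap versions require registering a Google API key with register_google().

    #Sketch of an R-only geocoding pass (I used QGIS instead).
    library(readr)
    library(ggmap)

    apex_raw <- read_csv("apex_raw.csv")

    #mutate_geocode() looks up each address and appends lon/lat columns.
    apex_geo <- mutate_geocode(apex_raw, Address)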
  2. Set up your data.  To make sure everything runs smoothly later on, we’ve got to import the data with readr and make sure each attribute is typed properly.  (A sketch of doing this typing at import time follows the code below.)
    library(readr)
    apex <- read_csv("apex2.csv")
    
    #Strip the "$" and commas from the price columns so they parse as numbers.
    apex$`Sold Price` <- as.numeric(gsub("[^0-9]", "", apex$`Sold Price`))
    apex$`List Price` <- as.numeric(gsub("[^0-9]", "", apex$`List Price`))
    
    #Some re-typing for later.
    apex$Fireplace <- as.numeric(apex$Fireplace)
    apex$`New Constr` <- as.factor(apex$`New Constr`)
    
    #Placeholders for the subject property's coordinates, filled in later by geocoding.
    assign("latitude", NA, envir = .GlobalEnv)
    assign("longitude", NA, envir = .GlobalEnv)
    
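    One improvement I mention at the end of this post is simplifying all this re-typing.  readr can do a lot of it at import time via col_types; a rough sketch, using the column names above:

    #Possible cleanup: declare column types when reading the file instead of fixing them afterwards.
    library(readr)

    apex <- read_csv("apex2.csv",
                     col_types = cols(
                       `Sold Price` = col_number(),   #col_number() drops the "$" and ","
                       `List Price` = col_number(),
                       Fireplace    = col_number(),
                       `New Constr` = col_character() #convert to factor afterwards
                     ))
    apex$`New Constr` <- as.factor(apex$`New Constr`)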
    
  3. Get info from the user.  The first thing the app asks for is some characteristics of the subject property, the property you are interested in valuing.  A function further down uses these inputs to produce the map.  (A sketch of one way to simplify the city-switching logic follows the code below.)
     #What city's dataset are we using?
     selectInput("city", label = "City", c("Apex", "Cary", "Raleigh"))

     #Get some info.
     textInput("address", label = "Insert Subject Property Address", value = "2219 Walden Creek Drive"),
     numericInput("dist", label = "Miles from Subject", value = 5, max = 20),
     numericInput("footage", label = "Square Footage", value = 2000),
     selectInput("acres", label = "How Many Acres?", acresf)

     #Changes datasets based on what city you choose on the frontend.
     #This expression is followed by two more else if statements.
     observeEvent(input$city, {
       if (input$city == "Apex") {
         framework_retype(apex)
         cityschools <- schoolsdf$features.properties %>%
           filter(ADDRCITY_1 == "Apex")
         assign("cityschools", cityschools, envir = .GlobalEnv)

     #Draw the map on click.
     observeEvent(input$submit, {
       output$map01 <- renderLeaflet({
         distanceFrom(input$address, input$footage, input$acres, tol = .15, input$dist)
       })
     })
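
     The if/else-if chain above gets repetitive with three cities.  One possible simplification (a sketch only, not how the app is currently written; it assumes the Cary and Raleigh data frames are named cary and raleigh) is to pick the data frame with switch() and filter the schools by the selected city directly:

     #Sketch: collapse the per-city branches into one observer.
     observeEvent(input$city, {
       framework_retype(switch(input$city,
                               "Apex"    = apex,
                               "Cary"    = cary,
                               "Raleigh" = raleigh))

       cityschools <- schoolsdf$features.properties %>%
         filter(ADDRCITY_1 == input$city)
       assign("cityschools", cityschools, envir = .GlobalEnv)
     })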
    
    
  4. Filter the data.  The distanceFrom function above uses dplyr to filter the properties in the selected city by square footage, acreage, and distance from the subject property.  The tol argument gives some padding around square footage, since few houses match exactly in that respect.  A sketch of how these pieces fit together into one function follows the code below.
     #Filter once: acreage and square footage within the tolerance band.
     houses_filtered <- houses %>%
      filter(Acres == acres) %>%
      filter(LvngAreaSF >= ((1 - tol) * sqft)) %>%
      filter(LvngAreaSF <= ((1 + tol) * sqft))

     #getGeoInfo() (defined elsewhere in the app) grabs the subject's lat & long
     #from Google and fills in the global placeholders set up earlier.
     getGeoInfo(subj_address)
     longitude_subj <- as.numeric(longitude)
     latitude_subj <- as.numeric(latitude)

     #Grab the comparable houses' coordinates (the first two columns are X/Y).
     xy <- as.matrix(houses_filtered[, 1:2])

     #spDistsN1() from the sp package returns great-circle distances in kilometers;
     #convert to miles and attach them to the data so we can filter on distance.
     d <- spDistsN1(xy, c(longitude_subj, latitude_subj), longlat = TRUE)
     houses_filtered$distanceMi <- round(d / 1.60934, 2)

     #Filter again: keep only comps within the requested radius.
     distance <- houses_filtered %>%
      filter(distanceMi <= dist)
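
     Put together, the function looks roughly like this (a sketch only; the real distanceFrom() in app.R goes on to build the Leaflet map shown in the next step, and it assumes houses holds the selected city's data, as above):

     library(dplyr)
     library(sp)

     distanceFrom <- function(subj_address, sqft, acres, tol, dist) {
       #Acreage match and square footage within the tolerance band.
       houses_filtered <- houses %>%
         filter(Acres == acres,
                LvngAreaSF >= (1 - tol) * sqft,
                LvngAreaSF <= (1 + tol) * sqft)

       #Geocode the subject, then keep comps within `dist` miles of it.
       getGeoInfo(subj_address)
       xy <- as.matrix(houses_filtered[, 1:2])
       d  <- spDistsN1(xy, c(as.numeric(longitude), as.numeric(latitude)), longlat = TRUE)
       houses_filtered$distanceMi <- round(d / 1.60934, 2)

       filter(houses_filtered, distanceMi <= dist)
     }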
    
  5. Draw the map. The most important piece, the map, is drawn using Leaflet.  I have the Schools layer hidden initially because it detracts from the main focus – the houses.
    map <- leaflet() %>%
     addTiles(group = "Detailed") %>%
     addProviderTiles("CartoDB.Positron", group = "Simple") %>%
     #The subject property gets its own red marker.
     addAwesomeMarkers(lng = longitude, lat = latitude, popup = subj_address,
                       icon = awesomeIcons(icon = 'home', markerColor = 'red'),
                       group = "Subject Property") %>%
     #Comparable sales, with address, sold price, and distance in the popup.
     addAwesomeMarkers(lng = distance$X, lat = distance$Y,
                       popup = paste(distance$Address, distance$`Sold Price`,
                                     distance$distanceMi, sep = " "),
                       icon = awesomeIcons(icon = 'home', markerColor = 'blue'),
                       group = "Comps") %>%
     addAwesomeMarkers(lng = schoolsdf$long, lat = schoolsdf$lat,
                       icon = awesomeIcons(icon = 'graduation-cap', library = 'fa',
                                           markerColor = 'green', iconColor = '#FFFFFF'),
                       popup = schoolsdf$features.properties$NAMELONG,
                       group = "Schools") %>%
      addLayersControl(
       baseGroups = c("Simple", "Detailed"),
       overlayGroups = c("Subject Property", "Comps", "Schools"),
       options = layersControlOptions(collapsed = FALSE))
    
    map <- map %>% hideGroup(group = "Schools")
  6. Regression model.  The second tab at the top of the page asks for some more information, which is used to create a predictive model for the subject property.  The implementation is somewhat messy, so if you’d like to check it out, the code is at the bottom of app.R in the GitHub repo.  The basic idea is sketched below.
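     One simple way to build such a model is an ordinary linear regression on the filtered comps.  A bare-bones sketch of the idea (not necessarily the app's actual model; it assumes Acres is numeric and reuses the distance data frame from step 4):

     #Sketch: price the subject property off the comps with a simple linear model.
     comps_model <- lm(`Sold Price` ~ LvngAreaSF + Acres, data = distance)
     summary(comps_model)

     #Predict a value for a hypothetical 2,000 sq ft house on a quarter acre.
     predict(comps_model, newdata = data.frame(LvngAreaSF = 2000, Acres = 0.25))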

That’s it!  It took a while to get all the pieces together, but I think the final product is useful and I learned a lot along the way.  There are a few things I still want to improve, among them simplifying the re-typing sections and making elements refresh without clicking submit.  If you have any questions about the code, please leave a comment or feel free to send me an email.

Happy coding,

Kiefer Smith

Mapping Happiness and Isoline Functions


Most of the emails I get are either work-related or spam.  Sometimes the spam turns out to be interesting.  About once a month I’ll get a digest of articles from Teleport.  This month there was an article from Forbes about mapping global happiness using news headlines.  I’m assuming the author used natural language processing of some sort, as he mentions evaluating the context in which each location is written about (sentiment analysis).

I’m not entirely sure how accurate the methodology is (and the final product is somewhat hard to draw conclusions from), but it’s a super cool concept nonetheless.  Unfortunately, the author did not leave us with a GitHub repo to pore through, but he did mention making use of Google’s BigQuery platform and Carto’s mapping system.

Being the fantastic procrastinator that I am, I took a look at Carto’s services.  Turns out they have a pretty cool feature (with an API) that creates time and distance isolines.  Might try using something like that in an upcoming project.  Stay tuned!  Or check out my GitHub for a sneak peek.

R Weekly


During my Monday morning ritual of avoiding work,  I found this publication that is written in R, for people who use R – R Weekly.  The authors do a pretty awesome job of aggregating useful, entertaining, and informative content about what’s happening surrounding our favorite programming language.  Check it out, give the authors some love on GitHub, and leave a like if you find something useful there.

Have a good week,

Kiefer Smith

Creating a Mailing List in QGIS and R

My day job as a real estate agent requires a myriad of skills, ranging from accounting to negotiation to business analysis.  Frequently (about every three months) I whip out my marketing skills to advertise my business.  This time I decided to send out postcards to an entire neighborhood in which I had sold homes recently.  Typically, agents will buy a mail route from the post office and hand over their postcards.  In the spirit of frugality and proving a point, I cracked my knuckles and went hunting for data.

Get the shapefiles.  Wake County Open Data (or your local open data hub) has a wealth of county-level data including subdivision boundaries and individual address points.  Download both shapefiles and load them into your favorite GIS program.  This step can probably be done in R, but I find using QGIS fairly intuitive and much faster at plotting large shapefiles.


Filter the addresses.  After loading the address and subdivision shapefiles into QGIS, clip the address shapefile using the subdivision shapefile to save the addresses of interest in a new layer.  Save that puppy as a .csv and we can load it up in R.  (If you’d rather skip QGIS, a rough R-only sketch follows.)
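
Here is roughly what the clip looks like in R with the sf package.  The shapefile names and the subdivision name field are illustrative, so check the actual file and attribute names in the downloaded data.

#Sketch: clip address points to one subdivision in R instead of QGIS.
library(sf)
library(dplyr)

addresses    <- st_read("Addresses.shp")
subdivisions <- st_read("Subdivisions.shp")

#Both layers need the same CRS; st_transform() one of them if they differ.
walden_creek_poly <- filter(subdivisions, SUBDIVISIO == "WALDEN CREEK")
walden_creek_pts  <- st_intersection(addresses, walden_creek_poly)

#Write out the attribute table (with X/Y columns) for the mailing list step below.
st_write(walden_creek_pts, "walden creek.csv", layer_options = "GEOMETRY=AS_XY")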


Manipulate in R.  Now we’ve got the info we want.  A few lines of code will give us something the post office (or Excel) will understand.


library(readr)

walden_creek <- read_csv("~/Desktop/walden creek.csv")

#Build one label per address: street, city, state, zip.
adds <- paste(walden_creek$FULLADDR, walden_creek$POSTAL_CIT, "NC", "27523", sep = ",")

#Write the list out as a .csv for the post office (or Excel).
write.table(adds, "adds.csv", sep = ",")

Short and sweet, but I thought this was an interesting way to use data for a practical purpose.  People seem to be using R in exciting ways these days – if you see any creative, different projects please share.

– Kiefer Smith

Raleigh Permit Trends

Click here for an interactive version.

I’ve been looking into development trends in Raleigh lately using open data.  Here’s a historical look at building permits over the past seven years using Plotly.  What trends do you see?

In my development environment the graph was stacked bars (far easier on the eyes), but when I uploaded it to the hosting site the bars ended up side-by-side; a possible fix is sketched below.  I could also probably incorporate some sort of sorting to make the bars look nicer.
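
If I revisit the chart, the stacking is probably just a layout setting, since plotly lets you set the bar mode explicitly.  A quick sketch (the permits data frame and its column names are made up for illustration):

#Sketch: force stacked bars rather than grouped bars.
library(plotly)

plot_ly(permits, x = ~year, y = ~count, color = ~permit_type, type = "bar") %>%
  layout(barmode = "stack")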

Have suggestions for a visualization?  Leave a comment!

-Kiefer

What is the real unemployment rate? Or, how cherry-picking data is dishonest.

This video by the vlogbrothers does a great job of breaking down what the unemployment rate in America really means, while also showing how reporting data improperly can cause issues.  There are six main ways (probably more) to measure unemployment!  Mixing measurement types or presenting one out of context can distort what is happening in reality.

Keep those critical thinking skills sharp, folks.  We’re gonna need them.

– Kiefer

Raleigh New Builds

Over the past ten years, Raleigh and the outlying areas have been growing at a breakneck pace.  The availability of jobs and the awesome local amenities have made the Oak City one of the most desirable places in the country to live.  Where are all these people living?  Some of them get lucky in the hot resale market, but many are choosing to build new.  In this graph, I used data from Raleigh Open Data and plot_ly in R to display the builders who completed the most new builds in 2016.  Click the picture for an interactive version!

[Chart: builders with the most completed new builds in 2016]
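
The gist of the code behind the chart is just a count of completed new builds grouped by builder.  A rough sketch (the data frame and column names are illustrative, not the exact fields from Raleigh Open Data):

#Sketch: builders with the most completed new builds in 2016.
library(dplyr)
library(plotly)

top_builders <- new_builds_2016 %>%
  count(builder, sort = TRUE) %>%
  top_n(10, n)

plot_ly(top_builders, x = ~reorder(builder, -n), y = ~n, type = "bar") %>%
  layout(xaxis = list(title = "Builder"), yaxis = list(title = "New builds completed in 2016"))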

Initial Node

Stories are an integral part of the human experience.  They don’t have to be true, but there is something about seeing “based on a true story” that grabs our attention.  Fiction is entertaining and can be instructive, but a true story is something we can really connect with.  A true story takes place in our universe – a place where we know the rules.

Finding truth has become increasingly difficult, a problem that became well-known during the 2016 election season.  My goal with this blog (and in life) is to tell interesting stories that are based in fact, based on data.  While no science is infallible, I feel that a story based on numbers is as close to “based on a true story” as we can get.

I mainly use R for data processing and analysis, but I’m slowly learning Python.  These will be my tools for telling stories.  If you have a burning question or find some interesting data, I’d be happy to dig into it.  As this blog progresses I hope to improve my data science skills and get better at writing blog posts.

– Kiefer