Take your data frames to the next level.

March 31, 2017 ~ realdataweb


In R-rockstar Hadley Wickham’s book R for Data Science (free online), the section on model building elaborates on something pretty cool that I had no idea about: list columns.

Most of us have probably seen the following data frame column format:

df <- data.frame("col_uno" = c(1,2,3),"col_dos" = c('a','b','c'), "col_tres" = factor(c("google", "apple", "amazon")))

And the output:

df

##   col_uno col_dos col_tres
## 1       1       a   google
## 2       2       b    apple
## 3       3       c   amazon

This is an awesome way to organize data and one of R’s strong points. However, we can use list functionality to go deeper. Check this out:

library(tidyverse)
library(datasets)

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

nested <- iris %>%
 group_by(Species) %>%
 nest()

nested

## # A tibble: 3 × 2
##   Species          data
##   <fctr>          <list>
## 1 setosa        <tibble [50 × 4]>
## 2 versicolor    <tibble [50 × 4]>
## 3 virginica     <tibble [50 × 4]>
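To see what a list column actually holds, it can help to build a tiny one by hand. The sketch below is illustrative only: the tibble `pets` and everything in it is made up. It just shows that each cell of a list column can hold a whole vector, which you pull out with `[[ ]]`.

```r
library(tibble)

# Hypothetical toy data: each row stores a vector of a different length.
pets <- tibble(
  owner = c("ana", "ben", "cy"),
  animals = list(c("cat", "dog"), "fish", character(0))
)

# A list column is an ordinary list, so [[ ]] extracts a single cell.
pets$animals[[1]]

# lengths() gives one count per row.
lengths(pets$animals)
```

The same indexing works on the nested iris data: `nested$data[[1]]` returns the 50 × 4 tibble for setosa.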

Using nest we can compartmentalize our data frame for readability and more efficient iteration. As a simple example, we can use map from the purrr package to compute the mean of each column in our nested data.

means <- map(nested$data, colMeans)
means

## [[1]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.006        3.428        1.462        0.246 
## 
## [[2]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.936        2.770        4.260        1.326 
## 
## [[3]]
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        6.588        2.974        5.552        2.026
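Since the book section that introduced me to list columns is about model building, here is a rough sketch of that use case: fit one model per species and keep it in a new list column. The formula (`Sepal.Length ~ Petal.Length`) and the names `by_species`, `fits`, `model`, and `slope` are my own picks for illustration, not anything prescribed by the book.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Nest iris by species, then drop the grouping to get a plain three-row tibble.
by_species <- iris %>%
  group_by(Species) %>%
  nest() %>%
  ungroup()

# Fit one linear model per species and keep it next to the data it came from.
fits <- by_species %>%
  mutate(
    model = map(data, ~ lm(Sepal.Length ~ Petal.Length, data = .x)),
    # map_dbl() returns a plain numeric column instead of a list.
    slope = map_dbl(model, ~ coef(.x)[["Petal.Length"]])
  )

fits
```

Keeping the model in the same row as its data means nothing gets out of sync when you filter or reorder the species.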

Once you’re done messing around with data-ception, use unnest to revert your data back to its original state.

#Name the list column to unnest; newer versions of tidyr warn if it is left out.
head(unnest(nested, cols = data))

## # A tibble: 6 × 5
##   Species Sepal.Length Sepal.Width Petal.Length Petal.Width
##    <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1  setosa          5.1         3.5          1.4         0.2
## 2  setosa          4.9         3.0          1.4         0.2
## 3  setosa          4.7         3.2          1.3         0.2
## 4  setosa          4.6         3.1          1.5         0.2
## 5  setosa          5.0         3.6          1.4         0.2
## 6  setosa          5.4         3.9          1.7         0.4

I was pretty excited to learn about this property of data.frames and will definitely make use of it in the future. If you have any neat examples of nested dataset usage, please feel free to share in the comments. As always, I’m happy to answer questions or talk data!

Kiefer Smith

Mapping Housing Data with R

March 16, 2017 ~ realdataweb

What is my home worth? Many homeowners in America ask themselves this question, and many have an answer. What does the market think, though? The best way to estimate a property’s value is by looking at other, similar properties that have sold recently in the same area – the comparable sales approach. In an effort to allow homeowners to do some exploring (and because I needed a new project), I developed a small Shiny app with R.

My day job allows me access to the local multiple listing service, which provides a wealth of historic data. The following project makes use of that data to map real estate that has sold near Raleigh, NC in the past six months. Without getting too lost in the weeds I’ll go over a few key parts of the process. Feel free to jump over to my GitHub page to check out the full source code. Click here to view the app!

Geocode everything. The data did not come with latitude and longitude coordinates, so we’ll have to do some geocoding. I haven’t found an efficient way to do this in R, so, like in the mailing list example, I’ll use QGIS to process my data and return a .csv for each town I’m interested in.

Set up your data. To make sure that everything runs smoothly later on, we’ve got to import our data using readr and make sure each attribute is typed properly.

library(readr)
apex <- read_csv("apex2.csv")

#The lines below work on a data frame called df; here it stands for the imported city data.
df <- apex

#Strip everything except digits (such as "$" and ",") from the price columns.
df$`Sold Price` <- as.numeric(gsub("[^0-9]","",df$`Sold Price`))
df$`List Price` <- as.numeric(gsub("[^0-9]","",df$`List Price`))

#Some re-typing for later.
df$Fireplace <- as.numeric(df$Fireplace)
df$`New Constr` <- as.factor(df$`New Constr`)

#Assign some global placeholders for the subject property's coordinates.
assign("latitude", NA, envir = .GlobalEnv)
assign("longitude", NA, envir = .GlobalEnv)
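To make the price clean-up concrete, here is a small sketch on made-up strings (the values in `raw_prices` are hypothetical, not from the MLS data). It also shows one caveat of the digits-only pattern that is worth knowing about.

```r
# Hypothetical raw values as they might arrive from a listing export.
raw_prices <- c("$350,000", "$1,250,500", "415000")

# Keep digits only, then convert to numeric.
clean_prices <- as.numeric(gsub("[^0-9]", "", raw_prices))
clean_prices

# Caveat: the same pattern also deletes decimal points.
as.numeric(gsub("[^0-9]", "", "$99.50"))   # becomes 9950, not 99.5

# Allowing "." in the character class preserves decimals.
as.numeric(gsub("[^0-9.]", "", "$99.50"))
```

For whole-dollar sale prices the digits-only pattern is fine; if a column might contain cents, the second pattern is the safer choice.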

Get info from the user. The first thing the app asks for is some characteristics of the subject property, that is, the property you are interested in valuing. A function further down uses these inputs to produce the map.

 #What city's dataset are we using?
 selectInput("city", label = "City", c("Apex", "Cary", "Raleigh"))

 #Get some info.
 textInput("address",label = "Insert Subject Property Address", value = "2219 Walden Creek Drive"),
 numericInput("dist", label = "Miles from Subject", value = 5, max = 20),
 numericInput("footage",label = "Square Footage", value = 2000),
 selectInput("acres",label = "How Many Acres?", acresf)
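For context, here is a minimal sketch of how inputs like these could sit inside a complete Shiny UI. The layout (`sidebarLayout`), the contents of `acresf`, and the `submit` button are assumptions on my part for illustration; the real app’s layout is in the GitHub repo.

```r
library(shiny)
library(leaflet)

# Hypothetical acreage choices; the app's real acresf is defined elsewhere.
acresf <- c("0-.25 Acres", ".26-.5 Acres", ".51-.75 Acres")

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      selectInput("city", label = "City", choices = c("Apex", "Cary", "Raleigh")),
      textInput("address", label = "Insert Subject Property Address",
                value = "2219 Walden Creek Drive"),
      numericInput("dist", label = "Miles from Subject", value = 5, max = 20),
      numericInput("footage", label = "Square Footage", value = 2000),
      selectInput("acres", label = "How Many Acres?", choices = acresf),
      # The server code reacts to input$submit, so a button with that id is assumed.
      actionButton("submit", "Submit")
    ),
    mainPanel(
      # Placeholder that renderLeaflet() fills via output$map01.
      leafletOutput("map01")
    )
  )
)
```

Each input’s first argument is its id, which is how the server refers to it (`input$city`, `input$dist`, and so on).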

 #Changes datasets based on what city you choose on the frontend.
observeEvent(input$city, {
  if(input$city == "Apex") {
    framework_retype(apex)
    cityschools <- schoolsdf$features.properties %>%
      filter(ADDRCITY_1 == "Apex")
    assign("cityschools", cityschools, envir = .GlobalEnv)
  }
  #This expression is followed by two more else if statements (omitted here).
})

 #Draw the map on click.
observeEvent(input$submit, {
  output$map01 <- renderLeaflet({
    distanceFrom(input$address, input$footage, input$acres, tol = .15, input$dist)
  })
})
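One possible way to simplify the city switching, sketched under my own assumptions rather than taken from the app: keep the per-city data frames in a named list and look them up inside a reactive, instead of assigning into the global environment. The names `datasets`, `city_data_for`, `comps`, and `output$n_comps` are hypothetical.

```r
library(shiny)

# Hypothetical stand-ins for the per-city data frames.
datasets <- list(
  Apex = data.frame(Address = "101 Example St"),
  Cary = data.frame(Address = "202 Sample Ave"),
  Raleigh = data.frame(Address = "303 Placeholder Rd")
)

# Look up a city's data by name, failing loudly on an unknown city.
city_data_for <- function(city, datasets) {
  if (!city %in% names(datasets)) {
    stop("No dataset for city: ", city)
  }
  datasets[[city]]
}

# Sketch of the server side: the reactive re-runs whenever input$city changes,
# and eventReactive() waits for the submit button before recomputing.
server <- function(input, output, session) {
  houses <- reactive(city_data_for(input$city, datasets))
  comps <- eventReactive(input$submit, {
    houses()
  })
  output$n_comps <- renderText(nrow(comps()))
}
```

The lookup replaces the chain of else if branches, and adding a city becomes a one-line change to the list.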

Filter the data. The distanceFrom function above uses dplyr to filter the properties in the selected city by square footage, acreage, and distance from the subject property. The tol argument is used to give a padding around square footage – few houses match exactly in that respect.

 #Filter once.
 houses_filtered <- houses %>%
  filter(Acres == acres)%>%
  filter(LvngAreaSF >= ((1-tol)*sqft)) %>%
  filter(LvngAreaSF <= ((1+tol)*sqft))

 #This grabs lat & long from Google.
 getGeoInfo(subj_address)
 longitude_subj <- as.numeric(longitude)
 latitude_subj <- as.numeric(latitude)

 #Use the comparable house locations.
 xy <- houses_filtered[,1:2]
 xy <- as.matrix(xy)

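 #Calculate distance. spDistsN1 (from the sp package) returns kilometers when longlat = TRUE.
 d <- spDistsN1(xy, c(longitude_subj, latitude_subj), longlat = TRUE)
 #Convert kilometers to miles.
 d <- d/1.60934
 #Round instead of using substr() so the distances stay numeric for the filter below.
 houses_filtered$distanceMi <- round(d, 2)

 #Filter again.
 distance <- houses_filtered %>%
  filter(distanceMi <= dist)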
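Here is a self-contained sketch of the same two-step filter on made-up data. To keep it free of package dependencies it swaps in a plain haversine formula on a spherical Earth for `spDistsN1`, so results can differ slightly from the app’s. The coordinates, the `miles_between` helper, and the `subj` list are all hypothetical.

```r
# Great-circle distance in miles via the haversine formula (spherical Earth).
miles_between <- function(lon1, lat1, lon2, lat2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  km <- 2 * 6371 * asin(pmin(1, sqrt(a)))  # 6371 km = mean Earth radius
  km / 1.60934                              # kilometers to miles
}

# Hypothetical subject property and comps (coordinates are made up).
subj <- list(lon = -78.85, lat = 35.73, sqft = 2000)
houses <- data.frame(
  X = c(-78.86, -78.80, -78.60, -78.85),
  Y = c(35.74, 35.75, 35.80, 35.73),
  LvngAreaSF = c(1900, 2500, 2100, 2250)
)
tol <- 0.15
dist <- 5

# Step 1: keep houses within +/- tol of the subject's square footage.
in_range <- houses$LvngAreaSF >= (1 - tol) * subj$sqft &
  houses$LvngAreaSF <= (1 + tol) * subj$sqft
comps <- houses[in_range, ]

# Step 2: attach a numeric distance column, then keep nearby houses only.
comps$distanceMi <- miles_between(comps$X, comps$Y, subj$lon, subj$lat)
comps <- comps[comps$distanceMi <= dist, ]
comps
```

With `tol = 0.15` and a 2,000 square foot subject, the size window is 1,700 to 2,300 square feet, so the 2,500 square foot house drops out before any distance is computed.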

Draw the map. The most important piece, the map, is drawn using Leaflet. I have the Schools layer hidden initially because it detracts from the main focus – the houses.

map <- leaflet() %>%
 addTiles(group = "Detailed") %>%
 addProviderTiles("CartoDB.Positron", group = "Simple") %>%
 addAwesomeMarkers(lng = longitude, lat = latitude, popup = subj_address, icon = awesomeIcons(icon='home', markerColor = 'red'), group = "Subject Property") %>%
 addAwesomeMarkers(lng = distance$X, lat = distance$Y, popup = paste(distance$Address, distance$`Sold Price`, distance$distanceMi, sep = "<br>"), icon = awesomeIcons(icon = 'home', markerColor = 'blue'), group = "Comps")%>%
 addAwesomeMarkers(lng = schoolsdf$long, lat = schoolsdf$lat, icon = awesomeIcons(icon = 'graduation-cap',library = 'fa', markerColor = 'green', iconColor = '#FFFFFF'), popup = schoolsdf$features.properties$NAMELONG, group = "Schools")%>%
  addLayersControl(
   baseGroups = c("Simple", "Detailed"),
   overlayGroups = c("Subject Property", "Comps", "Schools"),
   options = layersControlOptions(collapsed = FALSE))

map <- map %>% hideGroup(group = "Schools")
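Leaflet popups accept HTML, so the popup text can be made easier to read. The sketch below is just one way to do it, using made-up comps; the column names and the label wording are my own assumptions.

```r
# Hypothetical comps with made-up addresses, prices, and distances.
comps <- data.frame(
  Address = c("101 Example St", "202 Sample Ave"),
  SoldPrice = c(350000, 1250500),
  distanceMi = c(0.89, 3.4),
  stringsAsFactors = FALSE
)

# One popup string per house: address, formatted price, distance in miles.
popup_text <- sprintf(
  "%s<br>Sold: $%s<br>%.2f mi",
  comps$Address,
  formatC(comps$SoldPrice, format = "d", big.mark = ","),
  comps$distanceMi
)
popup_text
```

A vector like `popup_text` can be passed directly as the `popup` argument of `addAwesomeMarkers`.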

Regression model. The second tab at the top of the page leads to more information input that is used in creating a predictive model for the subject property. The implementation is somewhat messy, so if you’d like to check it out, the code is at the bottom of app.R in the GitHub repo.

That’s it! It took a while to get all the pieces together, but I think the final product is useful and I learned a lot along the way. There are a few places I want to improve: simplify the re-typing sections, make elements refresh without clicking submit, among others. If you have any questions about the code please leave a comment or feel free to send me an email.

Happy coding,

Kiefer Smith

Mapping Happiness and Isoline Functions

March 4, 2017 ~ realdataweb


Most of the emails I get are either work-related or spam. Sometimes the spam turns out to be interesting. About once a month I’ll get a digest of articles from Teleport. This month there was an article from Forbes about mapping global happiness using news headlines. I’m assuming the author used natural language processing of some sort, as he mentions evaluating the context in which each location is written about (sentiment analysis).

Not entirely sure how accurate the methodology is (and the final product is somewhat hard to draw conclusions from), but it’s a super cool concept nonetheless. Unfortunately, the author did not leave us with a GitHub repo to pore through, but did mention making use of Google’s BigQuery platform and Carto’s mapping system.

Being the fantastic procrastinator that I am, I took a look at Carto’s services. Turns out they have a pretty cool feature (with an API) that creates time and distance isolines. Might try using something like that in an upcoming project. Stay tuned! Or check out my GitHub for a sneak peek.

Real Data

Adventures in Data Science
