Clustering

Hello, everyone! I’ve been meaning to get a new blog post out for the
past couple of weeks. During that time I’ve been messing around with
clustering. Clustering, or cluster analysis, is a data mining method
that groups similar observations together. It is closely related to
classification, but where classification assigns observations to known
labels, clustering searches for groups without any; it is more about
exploration than an end result.

Note: This post is far from an exhaustive look at all clustering has to
offer; check out this guide for more. I am currently reading Data Mining
by Aggarwal, which is very informative.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

For simplicity, we’ll use the iris dataset. We’re going to try to use
the numeric data to correctly group the observations by species. There
are 50 of each species in the dataset, so ideally we would end up with
three clusters of 50.
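
A quick table confirms that balance:

table(iris$Species)
## 
##     setosa versicolor  virginica 
##         50         50         50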

library(ggplot2)
ggplot() +
  geom_point(aes(iris$Sepal.Length, iris$Sepal.Width, col = iris$Species))

[Plot: Sepal.Length vs. Sepal.Width, colored by Species]

As you can see, there are already some groupings present. Let’s use
hierarchical clustering first.

iris2 <- iris[, 1:4]  #not going to use the `Species` column
medians <- apply(iris2, 2, median)
mads <- apply(iris2, 2, mad)
iris3 <- scale(iris2, center = medians, scale = mads)  #robust scaling: center on medians, scale by MADs

dist <- dist(iris3)  #Euclidean distances on the scaled data
hclust <- hclust(dist, method = 'median')  #there are several linkage methods for hclust
cut <- cutree(hclust, 3)  #cut the dendrogram into three clusters

table(cut)
## cut
##  1  2  3 
## 49 34 67
iris <- cbind(iris, cut)
iris$cut <- factor(iris$cut)
levels(iris$cut) <- c('setosa', 'versicolor', 'virginica')  #name each cluster for its apparent majority species
err <- iris$Species == iris$cut
table(err)
## err
## FALSE  TRUE 
##    38   112
ggplot() +
  geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris$cut))

[Plot: Sepal.Length vs. Sepal.Width, colored by hierarchical cluster]

Nice groupings here, but it looks like the algorithm has some trouble
distinguishing versicolor from virginica. If we used this clustering to
classify the observations, we’d get an error rate of about .25.
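Equivalently, as a quick check of that number:

mean(iris$Species != iris$cut)
## [1] 0.2533333

Let’s try another clustering technique: DBSCAN.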

library(dbscan)
db <- dbscan(iris3, eps = 1, minPts = 20)  #eps: neighborhood radius; minPts: neighbors needed for a core point
table(db$cluster)
## 
##  0  1  2 
##  5 48 97
iris2$cluster <- factor(db$cluster)  #attach the cluster labels

ggplot() +
  geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = iris2$cluster))

[Plot: Sepal.Length vs. Sepal.Width, colored by DBSCAN cluster]

DBSCAN classifies points into three categories on the basis of density:
core, border, and noise points (the noise points land in cluster 0
above). Versicolor and virginica overlap into one dense region, so they
are treated as a single cluster. Since our data is not structured in
such a way that density separates the species, DBSCAN is probably not a
wise choice here.
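
For what it’s worth, a common heuristic for choosing eps is the
k-nearest-neighbor distance plot that ships with the dbscan package;
look for a knee in the sorted distances:

kNNdistplot(iris3, k = 20)  #k set to match minPts
abline(h = 1, lty = 2)      #the eps = 1 used above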

Let’s look at one last algo: the ROCK. No, not that ROCK.

library(cba)
distm <- as.matrix(dist)  #reuse the distance matrix from the hierarchical step
rock <- rockCluster(distm, 3, theta = .02)
## Clustering:
## computing distances ...
## computing links ...
## computing clusters ...
iris$rock <- rock$cl
levels(iris$rock) <- c('setosa', 'versicolor', 'virginica')  #again, name clusters for their majority species

ggplot() +
  geom_point(aes(iris2$Sepal.Length, iris2$Sepal.Width, col = rock$cl))

[Plot: Sepal.Length vs. Sepal.Width, colored by ROCK cluster]

err <- (iris$Species == iris$rock)
table(err)
## err
## FALSE  TRUE 
##    24   126

While it may not look like it in the plot, ROCK does the best job of the
three at recovering the species: the error rate drops to 16%. This
method is typically reserved for categorical data, but since we passed
it a distance matrix it shouldn't cause any problems here.
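
To see where the remaining mistakes fall, a confusion matrix of true
species against ROCK labels does the trick:

table(iris$Species, iris$rock)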

I have been working on a project using some of these (and similar) data
mining procedures to explore spatial data and search for distinct
groups. While clustering the iris data may not be all that meaningful,
I think it is illustrative of the power of clustering. I have yet to try
higher-dimensional clustering techniques, which might perform even better.

Have any comments? Questions? Suggestions for future posts? I am always
happy to hear from readers; please contact me!

Happy clustering,

Kiefer

Using R as a GIS

In real estate, spatial data is the name of the game. Countless
applications in other domains also draw on spatial data, which is
becoming more prevalent by the day.

In this post I will go over a few simple but powerful tools to get you
started using geographic information in R.

First, some libraries:

#install.packages('GISTools', dependencies = T)
library(GISTools)

GISTools provides an easy-to-use method for creating shading schemes
and choropleth maps. Some of you may have heard of the sp package,
which adds numerous spatial classes to the mix. There are also functions
for spatial analysis and for making maps look nice.

Let’s get rolling: load the vulgaris dataset, which contains location
information for Syringa vulgaris (the common lilac) observation
stations, along with US state boundaries. This code plots the states and
the vulgaris points.

data("vulgaris")  #load data
par(mar = c(2, 0, 0, 0))  #set margins of plot area
plot(us_states)
plot(vulgaris, add = T, pch = 20)

[Map: US states with vulgaris observation stations plotted as points]

One thing to note here is the structure of these objects. us_states is
a SpatialPolygonsDataFrame, which stores information for plotting shapes
(like a shapefile) within its attributes. vulgaris by contrast is a
SpatialPointsDataFrame, which contains data for plotting individual
points. Much as you access a data.frame’s columns with $, these objects
harbor information in slots that you access with @.
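
For example (a quick sketch; these slot names are the sp package’s
standard ones):

slotNames(vulgaris)    #"data" "coords.nrs" "coords" "bbox" "proj4string"
head(vulgaris@coords)  #the raw point coordinates
vulgaris@proj4string   #the coordinate reference system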

kable(head(vulgaris@data))

      Station  Year  Type      Leaf  Bloom  Station.Name  State.Prov   Lat    Long  Elev
3695    61689  1965  Vulgaris   114    136  COVENTRY      CT          41.8  -72.35   146
3696    61689  1966  Vulgaris   122    146  COVENTRY      CT          41.8  -72.35   146
3697    61689  1967  Vulgaris   104    156  COVENTRY      CT          41.8  -72.35   146
3698    61689  1968  Vulgaris    97    134  COVENTRY      CT          41.8  -72.35   146
3699    61689  1969  Vulgaris   114    138  COVENTRY      CT          41.8  -72.35   146
3700    61689  1970  Vulgaris   111    135  COVENTRY      CT          41.8  -72.35   146

Let’s take a look at some functions that use this data.

library(rgeos)  #for gIntersection()
newVulgaris <- gIntersection(vulgaris, us_states, byid = TRUE)  #byid = TRUE keeps per-feature results
kable(head(data.frame(newVulgaris)))

             x      y
3 4896  -67.65  44.65
3 4897  -67.65  44.65
3 4898  -67.65  44.65
3 4899  -67.65  44.65
3 4900  -67.65  44.65
3 4901  -67.65  44.65

gIntersection, as you may have guessed from the name, returns the
intersection of two spatial objects. In this case, we are given the
points from vulgaris that are within us_states. However, the rest of
the vulgaris data has been stripped from the resulting object. We’ve
got to jump through a couple of hoops to get that information back.

newVulgaris <- data.frame(newVulgaris)
tmp <- rownames(newVulgaris)  #row names look like "3 4896": state ID, then point ID
tmp <- strsplit(tmp, " ")
tmp <- sapply(tmp, "[[", 2)   #keep the point ID
tmp <- as.numeric(tmp)
vdf <- data.frame(vulgaris)
newVulgaris <- subset(vdf, row.names(vdf) %in% tmp)

      Station  Year  Type      Leaf  Bloom  Station.Name  State.Prov   Lat    Long  Elev  Long.1  Lat.1  optional
3695    61689  1965  Vulgaris   114    136  COVENTRY      CT          41.8  -72.35   146  -72.35   41.8  TRUE
3696    61689  1966  Vulgaris   122    146  COVENTRY      CT          41.8  -72.35   146  -72.35   41.8  TRUE
3697    61689  1967  Vulgaris   104    156  COVENTRY      CT          41.8  -72.35   146  -72.35   41.8  TRUE
3698    61689  1968  Vulgaris    97    134  COVENTRY      CT          41.8  -72.35   146  -72.35   41.8  TRUE
3699    61689  1969  Vulgaris   114    138  COVENTRY      CT          41.8  -72.35   146  -72.35   41.8  TRUE
3700    61689  1970  Vulgaris   111    135  COVENTRY      CT          41.8  -72.35   146  -72.35   41.8  TRUE

Look familiar? Now we’ve got a data frame with the clipped vulgaris
values and original data preserved.

vulgarisSpatial <- SpatialPointsDataFrame(coords = newVulgaris[, c("Long", "Lat")], data = newVulgaris)  #promote the clipped data frame back to a spatial object

After storing our clipped data frame as a SpatialPointsDataFrame, we can
again make use of it - in this case we add a shading scheme to the
`vulgaris` points.

shades <- auto.shading(vulgarisSpatial$Elev, n = 5)  #quantile-based shading scheme (class count here is illustrative)
shades$cols <- brewer.pal(5, 'YlOrRd')               #swap in a Brewer palette (palette choice is illustrative)
plot(us_states)
choropleth(vulgarisSpatial, vulgarisSpatial$Elev, shading = shades, add = T, pch = 20)

[Map: vulgaris points shaded by elevation over US states]

Colors are pretty, but what do they mean? Let’s add a legend.

us_states@bbox  #Get us_states bounding box coordinates.
##           min        max
## r1 -124.73142  -66.96985
## r2   24.95597   49.37173
plot(us_states)
choropleth(vulgarisSpatial, vulgarisSpatial$Elev, shading = shades, add = T, pch = 20)
par(xpd = TRUE)  #Allow plotting outside of plot area.
choro.legend(-124, 30, shades, cex = .75, title = "Elevation in Meters")  #Plot legend in bottom left. Takes standard legend() params.

[Map: elevation choropleth with legend in the bottom left]

It looks like there’s a lot going on in the Northeastern states. For a
closer look, create another clipping (like above) and plot it. Using the
structure below, we can create a selection vector for the states to
keep. I have hidden the full code since it is repetitive; check GitHub
for it.

index <- '...'  #selection vector for the Northeastern states (full code hidden; see GitHub)
plot(us_states[index,])
choropleth(vulgarisNE, vulgarisNE$Elev, shading = shades, add = T, pch = 20)  #vulgarisNE: points clipped to these states, as above
par(xpd = T)
choro.legend(-73, 39.75, shades, cex = .75, title = "Elevation in Meters")

[Map: Northeastern states, elevation choropleth with legend]

Hopefully this has been a useful introduction (or refresher) on spatial
data. I always learn a lot in the process of writing these posts. If you
have any ideas or suggestions please leave a comment or feel free to
contact me!

Happy mapping,

Kiefer

Raleigh Permit Trends

[Figure: bar chart of Raleigh building permits by year]
Click here for an interactive version.

I’ve been looking into development trends in Raleigh lately using open data.  Here’s a historical look at building permits over the past seven years using Plotly.  What trends do you see?

In my development environment the graph was stacked bars (far easier on the eyes), but when I uploaded it to the hosting site the bars ended up side-by-side.  Also, I could probably have incorporated some sort of sorting algorithm to make the bars look nicer.
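
For what it’s worth, in R’s plotly the stacking behavior is controlled
by layout(barmode = ...). Here is a minimal sketch; the permits data
frame is a hypothetical stand-in for the real permit summary:

library(plotly)
#Hypothetical permit counts, invented for illustration only
permits <- data.frame(
  year = rep(2010:2016, 2),
  type = rep(c('Residential', 'Commercial'), each = 7),
  n    = c(210, 250, 300, 340, 400, 430, 470,
            90,  95, 120, 130, 150, 160, 180)
)
plot_ly(permits, x = ~year, y = ~n, color = ~type, type = 'bar') %>%
  layout(barmode = 'stack')  #'group' produces the side-by-side bars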

Have suggestions for a visualization?  Leave a comment!

-Kiefer