Exploring the World of Hexagonal Bin Plots

Xiangyu Wang
Jan 15, 2021

by Xiangyu Wang

with Greg Page

For understanding the relationship between two numeric variables, scatterplots are a valuable tool. They can inform us about the correlations (positive or negative) and the types of relationships (linear or non-linear) among variables. In addition, they can help with outlier detection and with understanding the overall distribution of the individual variables that they show.

When we wish to analyze truly massive datasets, however, scatterplots’ value can be limited because of a problem known as overplotting. Overplotting is the result of too many data points landing atop one another, thereby rendering many of the individual points indistinguishable. While there are several approaches that can be taken to address this issue (among these are jittering, transparency adjustments, scale adjustments, and sampling), this article will focus on the use of an alternative graphing method known as the hexagonal bin plot.
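
As a brief aside, two of the alternatives mentioned above, transparency and sampling, can be sketched in a few lines. The data frame sim below is simulated purely for illustration (it is not the vehicles data), and the alpha value and sample size are arbitrary assumptions rather than recommended settings.

```r
library(ggplot2)

set.seed(42)
# Simulated stand-in data (hypothetical, not the Craigslist listings)
n <- 50000
sim <- data.frame(x = rnorm(n), y = rnorm(n))

# Transparency: each point is drawn at 5% opacity, so dense regions look darker
p_alpha <- ggplot(sim, aes(x = x, y = y)) + geom_point(alpha = 0.05)

# Sampling: plot a random subset of 2,000 rows instead of every row
sim_small <- sim[sample(nrow(sim), 2000), ]
p_sample <- ggplot(sim_small, aes(x = x, y = y)) + geom_point()
```

Both approaches reduce the visual clutter, but neither reports an actual count of points per region, which is what the hexagonal bin plot adds.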

The dataset used in this article, vehicles.csv, contains information about more than 400,000 used car listings in the United States, scraped from Craigslist. The dataset is available on Kaggle. After calling library() to load a few packages, we imported the file as shown below:

library(ggplot2)

library(naniar)

library(tidyverse)

library(hexbin)

used_car <- read.csv("vehicles.csv")

str(used_car)

After removing 1050 rows with missing values for the year variable, we filtered the dataset so that it only included cars with sale price values greater than $0 and less than $100,000, and whose year of manufacture was 1920 or later.

missing <- miss_var_summary(used_car)

View(missing)

used_car <- drop_na(used_car, year)

used_car2 <- filter(used_car, price > 0 & price < 100000)

used_car2 <- filter(used_car2, year >= 1920)
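
The cleaning steps above could also be chained into a single pipeline. The sketch below runs the same three filters on a tiny made-up data frame (toy is hypothetical and stands in for used_car), so the effect of each step is easy to trace by eye.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the listings data (values are made up for illustration)
toy <- data.frame(
  year  = c(2015, NA, 1910, 2008, 1965, 2019),
  price = c(12000, 8000, 30000, 0, 45000, 250000)
)

toy_clean <- toy %>%
  drop_na(year) %>%                      # drop rows with a missing year
  filter(price > 0, price < 100000) %>%  # keep prices strictly between $0 and $100,000
  filter(year >= 1920)                   # keep cars made in 1920 or later

toy_clean
```

Only the 2015 and 1965 rows survive: the others are removed for a missing year, a pre-1920 year, a $0 price, and a price above $100,000, respectively.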

ggplot(used_car2, aes(x=year, y=price)) + geom_point() + xlab("Year of Manufacture") + ylab("Sale Price") + ggtitle("Relationship Between Year of Manufacture and Price")

The scatterplot shows a greater concentration of data points near the right side, but it tells us little about the relative concentration of points in particular places. For example, we can’t look at this plot and know whether the dataset contains more cars priced below $12,500 made in 1980, 2000, or 2020. This is due to the overplotting problem mentioned above: additional data points are simply drawn atop one another, so we can’t come away with any real sense of the relative density of different areas of the plot.

To address this limitation, we can instead present the data with a hexagonal bin plot. In a hexagonal bin plot, the plotting area is divided into a grid of equally sized hexagons, and a color gradient indicates how many data points fall inside each one.
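
To make the binning idea concrete, the sketch below uses the hexbin() function from the hexbin package directly on simulated data. The variables x and y and the choice of xbins = 20 are assumptions made for illustration; they are not taken from the vehicles data.

```r
library(hexbin)

set.seed(1)
# Simulated stand-in data (hypothetical)
n <- 10000
x <- rnorm(n)
y <- x + rnorm(n)

# Bin the points into a hexagonal grid with roughly 20 hexagons across the x range
hb <- hexbin(x, y, xbins = 20)

# Each non-empty hexagon stores how many points fell inside it
head(hb@count)

# Every point lands in exactly one hexagon, so the counts add up to n
sum(hb@count) == n

# Centre (x, y) of the densest hexagon
centers <- hcell2xy(hb)
densest <- which.max(hb@count)
c(centers$x[densest], centers$y[densest])
```

These per-hexagon counts are the quantity that the color gradient encodes in a hexagonal bin plot.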

Here is the same data shown in the previous figure, but with a hexagonal bin plot rather than a scatterplot:

ggplot(used_car2, aes(x=year, y=price)) + geom_hex() + xlab("Year of Manufacture") + ylab("Sale Price") + ggtitle("Relationship Between Year of Manufacture and Price")

The default color gradient used by ggplot shows areas of greater density with lighter colors, as indicated in the legend on the plot above. The legend indicates the number of records that fall into each particular hexagon. This plot shows us a pattern that the scatterplot did not reveal — the greatest density of points here comes from cars made after 2000, whose resale price is less than $25,000. The single-densest region on the plot above is the one that includes cars priced below $12,500, built in 2012 or 2013.

By adjusting the bins parameter inside geom_hex(), we can alter the granularity of the graph rather than rely on the default value of 30. In the graph below, bins is set to 15. This instructs ggplot to size the hexagons so that roughly 15 of them span each axis, meaning each one covers approximately 1/15th of the distance along the x-axis.

ggplot(used_car2, aes(x=year, y=price)) + geom_hex(bins=15) + xlab("Year of Manufacture") + ylab("Sale Price") + ggtitle("Relationship Between Year of Manufacture and Price")
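
As a rough illustration of how a bins value translates into hexagon size, the sketch below divides each axis range by the number of bins. The helper approx_binwidth() and the two ranges are hypothetical, and ggplot2 derives the real sizes from its own scale ranges, so actual hexagons may differ slightly.

```r
# Hypothetical axis ranges, chosen to roughly match the filtered data
year_range  <- c(1920, 2021)
price_range <- c(0, 100000)

# Approximate hexagon width and height implied by a given bins value
approx_binwidth <- function(bins, x_range, y_range) {
  c(width = diff(x_range) / bins, height = diff(y_range) / bins)
}

approx_binwidth(30, year_range, price_range)  # default bins
approx_binwidth(15, year_range, price_range)  # half as many bins, hexagons twice as large
```

When hexagon size matters more than hexagon count, geom_hex() also accepts a binwidth argument of the form c(width, height) in place of bins.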

A small bins value is useful for delivering a truly “big picture” perspective. It makes it easy to see the areas that stand out, but it also hides some details. For instance, in the graph above, we cannot identify the pause in civilian vehicle manufacturing in the early 1940s. By using a larger number of bins, we can gain more specificity. The graph below, which uses bins = 40, offers a much more detailed view than the one above:

ggplot(used_car2, aes(x=year, y=price)) + geom_hex(bins=40) + xlab("Year of Manufacture") + ylab("Sale Price") + ggtitle("Relationship Between Year of Manufacture and Price")

We can further sharpen the distinctions among hexagons by applying a logarithmic transformation to the fill scale inside scale_fill_gradient(). In the graph below, the middle shade of blue now indicates an area of 100x greater density than the lightest shade of blue (on the previous hexagonal bin plot, the same degree of color difference only represented a 2x difference in density). Also, on this graph, the color gradient has been reversed: darker colors now indicate areas of greater density, while lighter colors indicate areas of lesser density. All of these changes resulted from modifications to the scale_fill_gradient() function parameters.

ggplot(used_car2, aes(x=year, y=price)) + geom_hex() + scale_fill_gradient(low='lightblue', high='darkblue', trans='log10') + labs(title='Relationship Between Year of Manufacture and Price', x='Year of Manufacture', y='Price')
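
To see why a log10 fill scale separates low-count hexagons more clearly, the sketch below compares where a few counts would fall along a gradient running from 0 (lowest) to 1 (highest). The counts are hypothetical round numbers, not values read from the plot.

```r
# Hypothetical hexagon counts spanning several orders of magnitude
counts <- c(1, 10, 100, 1000, 10000)

# On a linear scale, position along the gradient is proportional to the count,
# so the small counts are squeezed into the bottom of the range
linear_pos <- (counts - min(counts)) / (max(counts) - min(counts))

# On a log10 scale, each tenfold increase moves the same distance along the gradient
log_pos <- (log10(counts) - log10(min(counts))) /
  (log10(max(counts)) - log10(min(counts)))

round(linear_pos, 4)
round(log_pos, 2)
```

On the linear scale, the first four counts all sit in the bottom tenth of the gradient and are nearly indistinguishable in color, whereas the log scale spaces all five evenly.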

Compared to the first hexagonal bin plot, the plot shown immediately above reveals more detail about the data. For instance, it shows us that higher-priced cars tend to be either newly-built models or “classics” that are more than 50 years old.

When making a hexagonal bin plot, an analyst can use any color combination to identify areas of greater or lesser density. The plot shown below uses light green for lower-density areas, while depicting denser areas in crimson:

ggplot(used_car2, aes(x=year, y=price)) + geom_hex() + scale_fill_gradient(low='#33FFA8', high='#A91C10') + labs(title='Relationship Between Year of Manufacture and Price', x='Year of Manufacture', y='Price')

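We can also overlay a best-fit curve on a hexagonal bin plot by adding geom_smooth(). In the plot below, a smoothed best-fit curve is shown. With its default settings and a dataset of 1,000 or more observations, geom_smooth() fits this curve with a generalized additive model (GAM); loess, a form of local regression, is used only for smaller datasets.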

ggplot(used_car2, aes(x=year, y=price)) + geom_hex() + geom_smooth() + scale_fill_gradient(low='#33FFA8', high='#A91C10') + labs(title='Relationship Between Year of Manufacture and Price', x='Year of Manufacture', y='Price')
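
When called with no arguments, geom_smooth() chooses its smoothing method automatically: loess for fewer than 1,000 observations, and otherwise a generalized additive model from the mgcv package. The sketch below spells out that choice explicitly on simulated data; the data frame sim and the black line color are assumptions made for illustration.

```r
library(ggplot2)

set.seed(7)
# Simulated stand-in data (hypothetical), large enough that the default would be a GAM
n <- 5000
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.5)

# Naming the method and formula explicitly documents which smoother is drawn;
# method = "gam" requires the mgcv package to be installed
p <- ggplot(sim, aes(x = x, y = y)) +
  geom_hex() +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), colour = "black")
```

Stating the method in the call also avoids the informational message that geom_smooth() prints when it picks a method on its own.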

The shape of the smoothing curve here is consistent with what the log-based hexagonal bin plot showed us earlier: there are some instances of higher-priced classic cars in the dataset, and among the cars built after 2000, newer cars tend to be more expensive.

In sum, hexagonal bin plots are a valuable component of the visualization toolkit for any analyst who works with large datasets and wants to plot relationships between variables. Whereas scatterplots are prone to the problem of overplotting, a hexagonal bin plot can offer a viewer a much better sense of the relative density of points in different areas of the graph.

The author is a Master’s student in Applied Business Analytics at Boston University, Metropolitan College. He will graduate in Spring 2021. His co-author is a Senior Lecturer in Applied Business Analytics at Boston University.
