Co-author: Rishabh Jain
At Locale.ai, we work with a number of last-mile, hyperlocal and mobility companies. For most of these companies, geospatial analysis is critical, and they typically have internal dashboards built on a BI platform or in-house using open-source tools.
One of the reasons they use our product is to highlight which areas to focus on and how to contextualize their strategies to those areas. A caveat here is that these areas are not the traditionally defined areas of a city, such as zip codes. In this post, we take a deep dive into why traditional area definitions are not a good basis for geospatial analysis.
Maps don’t always tell the truth!
Based on intuitive logic, companies either go with zip-code boundaries or some hand-drawn neighbourhood boundaries to map their most critical metrics like market potential, utilisation, average customer value, etc. These maps feed important decisions like which areas to expand into, run promotions in, or provision more supply for.
Now, if you’re a business with little or no intra-city operations, the difference might not be significant to you. But if you provide services in the last mile or at a hyperlocal level, the differences in insights have a significant impact on the decisions that the business and city teams make.
Which brings me to the next question: Why is this difference significant?
Often, when we use zip codes or arbitrarily defined neighbourhood boundaries, the nuances of the different areas inside them tend to get dissolved.
In other words, a zip code that we treat as one big cell for analysis consists of many smaller cells without any similarity in demographics or economic potential.
This is illustrated by the fact that, if you live in Palo Alto, you are a part of the world’s foremost innovation hub and are paying median home prices of $1.18 million. However, right across the train tracks, 18% of East Palo Alto residents live below the poverty line, and the average yearly income per person is $18,385.
The characteristic behaviour of these two areas would be very different. While those differences would be present in the data that you collect, they often aren’t so easy to unearth.
In the next section, we would like to show you how easily the maps can lie and why different members in your team recommend completely different areas in a city to focus on based on the same set of metrics!
NYC City Manager’s Woes
Let’s consider that you are Uber’s city manager, and want to run contextual promotions and discounts in areas where you have high market potential or market share. For this exercise, we used the Uber Cab dataset available here and the NYC Yellow Taxi dataset available here. Now, we can simply define market-potential as
Greener means more Market Potential (MP) for Uber, and pinker means it is already doing well in that area. When we filter for areas with MP greater than 70% to identify the top areas to run promotions in, we see the following areas:
The areas we got are quite different!
Modifiable Areal Unit Problem
Hence, it is safe to conclude that the decision completely changes based on what set of boundaries we use for our analysis. This is a classical problem in geospatial data science known as the Modifiable Areal Unit Problem (MAUP).
Because of the MAUP, our decision becomes reliant on the shape and size of the area instead of the actual characteristics of the users within it.
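To make the MAUP concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the 16 pickup points, the two zoning schemes, and the "taxi share" ratio, which only stands in for a metric like MP and is not the exact definition or data used in this experiment. The same points are aggregated with two different sets of boundaries, and the areas that pass the 70% filter change.

```python
from collections import defaultdict

# Hypothetical pickups as (x, y, operator) on a 4 x 4 plane (arbitrary units).
PICKUPS = [
    # south-west corner: taxi only
    (0.5, 0.5, "taxi"), (1.0, 0.5, "taxi"), (0.5, 1.5, "taxi"), (1.5, 1.0, "taxi"),
    # north-west corner: taxi only
    (0.5, 2.5, "taxi"), (1.0, 3.0, "taxi"), (1.5, 2.5, "taxi"), (0.5, 3.5, "taxi"),
    # south-east corner: uber only
    (2.5, 0.5, "uber"), (3.0, 1.0, "uber"), (3.5, 0.5, "uber"), (2.5, 1.5, "uber"),
    # north-east corner: evenly split
    (2.5, 2.5, "taxi"), (3.0, 3.0, "taxi"), (3.5, 2.5, "uber"), (2.5, 3.5, "uber"),
]


def vertical_strips(x, y):
    """Zoning scheme A: split the plane into a west and an east half."""
    return "west" if x < 2 else "east"


def horizontal_strips(x, y):
    """Zoning scheme B: split the plane into a south and a north half."""
    return "south" if y < 2 else "north"


def share_by_zone(pickups, zone_of):
    """Return {zone: fraction of pickups in that zone made by taxis}."""
    taxi = defaultdict(int)
    total = defaultdict(int)
    for x, y, operator in pickups:
        zone = zone_of(x, y)
        total[zone] += 1
        if operator == "taxi":
            taxi[zone] += 1
    return {zone: taxi[zone] / total[zone] for zone in total}


def top_zones(shares, threshold=0.7):
    """Zones whose share is strictly above the threshold, sorted by name."""
    return sorted(zone for zone, share in shares.items() if share > threshold)


by_vertical = share_by_zone(PICKUPS, vertical_strips)
by_horizontal = share_by_zone(PICKUPS, horizontal_strips)
print(by_vertical, top_zones(by_vertical))
print(by_horizontal, top_zones(by_horizontal))
```

With vertical strips the filter picks "west"; with horizontal strips it picks "north", a zone that leaves out the taxi-only south-west corner entirely. The points never moved, only the boundaries did.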
The ideal area should have the ideal shape and size. Let’s consider finding that ideal shape. We could divide the geographic plane into squares of the same size, or maybe triangles, pentagons, or something else! We at Locale use hexagons instead of any other shape and you can get a glimpse of why here:
Let’s use hexes to plot the MP for NYC. Uber has a well-suited library that converts latitude/longitude pairs to hexagons of a given size. You can find out more about it here.
From the visualization, we can see the distribution of Market Potential across the city in a much more uniform way. We can filter these hexagons by market potential to get the best areas to run promotions in.
Another way to reduce bias further would be to not bind locations to predefined areas at all, which leads to our next section.
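As a rough illustration of what hex binning does, the sketch below assigns points on a flat plane to pointy-top hexagons using axial coordinates and cube rounding, then counts points per hexagon. This is not how Uber's library works internally, and the function names (point_to_hex, hex_center, count_per_hex) are hypothetical. It also assumes coordinates that are already projected onto a plane, which raw latitude/longitude values are not.

```python
import math
from collections import Counter


def _hex_round(q, r):
    """Round fractional axial coordinates to the nearest hexagon (cube rounding)."""
    s = -q - r  # third cube coordinate; q + r + s is always 0
    rq, rr, rs = round(q), round(r), round(s)
    dq, dr, ds = abs(rq - q), abs(rr - r), abs(rs - s)
    # Recompute the coordinate with the largest rounding error from the other two.
    if dq > dr and dq > ds:
        rq = -rr - rs
    elif dr > ds:
        rr = -rq - rs
    return (rq, rr)


def point_to_hex(x, y, size):
    """Axial (q, r) id of the pointy-top hexagon containing the planar point (x, y).

    `size` is the distance from a hexagon's centre to any of its corners.
    """
    q = (math.sqrt(3) / 3 * x - y / 3) / size
    r = (2 / 3 * y) / size
    return _hex_round(q, r)


def hex_center(q, r, size):
    """Planar centre of the hexagon with axial id (q, r)."""
    return (size * math.sqrt(3) * (q + r / 2), size * 1.5 * r)


def count_per_hex(points, size):
    """Count how many points fall into each hexagon."""
    return Counter(point_to_hex(x, y, size) for x, y in points)


# Hypothetical points in arbitrary planar units.
sample_points = [(0.1, 0.1), (-0.2, 0.3), (3.5, 0.0)]
print(count_per_hex(sample_points, size=1.0))
```

Once every pickup carries a hexagon id, any per-area metric can be computed per hexagon in the same way it was computed per zip code, but every cell now has the same shape and size.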
Enter, Geo-spatial Clustering!
The idea is to let data decide what the significant areas are on which MP should ideally be computed. A simple density-based clustering algorithm like DBSCAN could be a good place to start! We have written about the other types of clustering in case you want to check that out:
We cluster all locations based on their proximity to each other, which gives us dense clusters. Then, we compute the convex hull of each cluster to get area boundaries. Now, we have areas generated from the data itself, suitable for computing MP with less human bias. Again, we filter for areas where MP is more than 70% and get these places:
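The two steps above (density-based clustering, then a convex hull per cluster) can be sketched in plain Python as follows. This is a small from-scratch version written for illustration, not the code from the experiment; in practice you would more likely reach for a library implementation such as scikit-learn's DBSCAN. The points and the eps and min_samples values are made up, and distances are plain Euclidean distances on a flat plane.

```python
import math


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        # Brute-force neighbourhood query; includes the point itself.
        return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_samples:  # core point: keep growing the cluster
                queue.extend(nbrs)
    return labels


def convex_hull(points):
    """Convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def cluster_boundaries(points, eps, min_samples):
    """Return {cluster id: hull vertices}, ignoring noise points."""
    members = {}
    for point, label in zip(points, dbscan(points, eps, min_samples)):
        if label != -1:
            members.setdefault(label, []).append(point)
    return {label: convex_hull(pts) for label, pts in members.items()}


# Hypothetical locations: two dense blobs and one isolated point.
POINTS = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5),
    (5.0, 5.0),
]
labels = dbscan(POINTS, eps=1.5, min_samples=3)
hulls = cluster_boundaries(POINTS, eps=1.5, min_samples=3)
print(labels)
print(hulls)
```

The isolated point is labelled as noise and gets no area at all, which is part of the appeal: boundaries only appear where the data is dense enough to support them.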
It’s worth noting that a simple density-based approach can work here because of the nature and definition of the problem.
Since we consider only locations, a density-based spatial clustering extracts the underlying areas which are more closely knit in terms of the user behaviour.
For this decision to be actually reliable for a business use case, instead of using just one metric, we take a set of metrics that largely affect the user behaviour in one area. For example, office-goers in an area vs university students might show completely different travel patterns.
Now, as highlighted here, using the automated learning techniques for similarity and clustering analyses that we have built at Locale.ai, our algorithms work hard to find what areas behave similarly based on the metrics that you care about (growth potential, unit economics, power users) and show you what areas to focus on for each unit of your business.
This enables you to save or export these area profiles, reuse them for different kinds of analysis, and track them to maximize your revenue, demand and profitability in each of those areas rather than relying on arbitrarily defined boundaries.
The complete experiment along with the code is available here. Feel free to experiment with some other dataset!
If you’ve reached this far, here is a sneak peek into why this forms the base of our entire platform and what all techniques we have developed on top of this:
Similarity Analysis and Feature Importance
Our geospatial models not only find clusters of areas with unique characteristics, but also tell us why each area is unique. Hence, while identifying clusters of areas based on a combination of metrics, they also pinpoint the factors that have the most significant impact on these metrics.
Two different areas don’t necessarily have to be close in order to behave similarly. You can probably use one as a test area to observe and apply the same strategies across cities.
Real-time Monitoring with Anomaly Detection
It is not quite enough to just know that SLAs are not being met in Bangalore or SFO. Knowing where orders are getting delayed or cancelled, or whether an abnormally high delay in an area is caused by a sudden supply drop because of rain, can be very valuable.
Adding the context of location to anomalies can make the insights very actionable and decision-making much simpler.
At Locale, we are building an analytics and visualization product to help city and business teams get precise, real-time insights about how their operations perform without any compromises and dependencies. Get in touch for a demo here, or reach out to me on LinkedIn or Twitter.