Geospatial clustering is the method of grouping a set of spatial objects into groups called “clusters”. Objects within a cluster show a high degree of similarity, whereas the clusters are as much dissimilar as possible. The goal of clustering is to do a generalization and to reveal a relation between spatial and non-spatial attributes.
Let’s understand spatial clustering with a small example.
Suppose you are the head of a food delivery chain in a metro city and you want to find out the preferences of the customers so that you can scale up your business. Since it’s not feasible to look at the details of each and every customer hence you segregate them into groups and strategize a business plan for each of the groups/cluster .
Spatial clustering can be divided into five broad types which are as follows :
1. Partition clustering
2. Hierarchical clustering
3. Fuzzy clustering
4. Density-based clustering
5. Model-based clustering
Let’s take a deep dive and understand each of them in detail!
A partition clustering is a segregation of the data points into non-overlapping subsets (clusters) such that each data point is in exactly one subset. Basically, it classifies the data into groups by satisfying these two requirements :
1. Each data point belongs to one cluster only.
2. Each cluster has at least one data point.
Partition clustering are of three types: K-means clustering, K-medoids clustering/PAM and CLARA (Classification Large Application )
1. K-means Clustering
K-means clustering is a partitioning method and this method decomposes the dataset into a set of K-partitions based on their attributes. You can read more about k means here.
This is an extension of the PAM method for large datasets. Instead of finding medoids for the entire data set, CLARA considers a small sample of the data with fixed size and applies the algorithm to generate an optimal set of medoids for the sample.
Usually, partition-based clustering is used to find groups that have not been explicitly labeled in the data. It helps to assign any new data point to the correct cluster. Businesses use partition-based clustering to segment purchase history, group inventory by sales activity, identifying groups in health monitoring, etc.
Hierarchical clustering is a clustering method like partition-based clustering but the way it classifies the data points is different. It first considers each data point to be a separate cluster. Then merges the most similar cluster which is close to each other. It keeps iterating till all clusters are merged.
The two types of hierarchical clustering are as follows: Agglomerative and Divisive
Agglomerative Hierarchical Clustering
This works with a simple algorithm by which the proximity matrix of points/clusters is calculated. At every iteration, it merges the closest points/clusters are merged and the proximity matrix is updated. This is continued till one cluster or k-clusters are formed. This can be considered as a “bottom-up” approach.
Though hierarchical clustering can be computationally expensive but produces intuitive results. It is useful when not much is known about the data at hand since hierarchical clustering doesn’t need many assumptions. One real-life scenario where hierarchical clustering is extensively is for the mapping of viruses during the epidemic spread and for customer segmentation in the banking industry or retail industry.
In fuzzy c-means clustering, we find out the centroid of the data points and then calculate the distance of each data point from the given centroids until the clusters formed becomes constant. It is different from partition-based clustering in a way that it allows data points to be partially classified into more than one cluster.
Every data point can theoretically belong to all groups with a membership function ranging between 0 and 1. 0 is where the data point is at the farthest possible point from a cluster’s center and 1 is where the data point is closest to the center.
Fuzzy clustering is useful when you need to do image segmentation or when your goal is to segment water, vegetation and rock areas in satellite images. It is useful in cases where the number of clusters can’t be decided apriori. In such cases, clusters with weak boundaries can be merged. Fuzzy clustering is computationally expensive as compared to K-means since for each point is calculates the probability of it belonging to each cluster.
Density-based clustering works by grouping regions of high density and separating them from regions of low density. The most well known density-based clustering algorithm is the DBSCAN algorithm (Density-based spatial clustering with the application of noise ).
The density is calculated by using two parameters which are as follows
1.EPS: This defines the neighborhood around the data point i.e if the distance between two points is less than or equal to eps then they are said to be neighbors
2. MinPts: THis defines the minimum number of data points that form a neighborhood. The size of the dataset and the value of MinPts are directly proportional.
DBSCAN algorithm visits every point and if it contains MinPts within eps then cluster formation starts. Any other point is defined as noise. This process continues till a density connected cluster is formed and then it restarts with a new point.
DBSCAN is mostly used for clustering in planar space. Good results can be achieved if it is used for mapping the effect of natural disasters or plotting the location of weather stations in a city. This can also be used when the data is composed of non-discrete points and is good for handling outliers. Recommender engines/systems make use of DBSCAN to recommend products/shows to their customers.
This method of clustering uses certain models for clusters and tries to optimize the fit between the data and the models. In the model-based clustering approach, the data are viewed as coming from a mixture of probability distributions, each of which represents a different cluster. In other words, in model-based clustering, it is assumed that the data are generated by a mixture of probability distributions in which each component represents a different cluster. Each component (i.e., cluster) is modeled by either a normal distribution or Gaussian distribution.
Expectation-maximization is a well-known model-based clustering algorithm. A particular clustering algorithm is said to work well when the data conforms to the model.
This is useful in incases where clusters have weak borders and data points have mixed membership between clusters. It is also much more flexible in terms of cluster covariance. Cluster assignment is much more flexible in this case and the clusters take any shape depending on the distribution.
At Locale, we are big fans of geospatial data and passionate about the community. You can check out similar tutorials here in this series here: