Kinds of clustering methods


Clustering is one of the biggest topics in data science, so big that you will easily find tons of books discussing every last bit of it. The subtopic of text clustering is no exception. This article therefore cannot deliver an exhaustive overview, but it covers the main aspects. That said, let us start by establishing common ground on what clustering is and what it isn't.

Clustering means grouping objects so that objects inside a cluster are as similar as possible, while objects in different clusters are as dissimilar as possible. But who defines what "similar" means? We will come back to that later.

Now, you may have heard of classification before. When classifying objects, you also put them into different groups, but there are a few important differences. Classifying means putting new, previously unseen objects into groups based on objects whose group affiliation is already known, the so-called training data. This means we have something reliable to compare new objects to. When clustering, we start with a blank canvas: all objects are new! Because of that, we call classification a supervised method and clustering an unsupervised one.

This also means that for classification the correct number of groups is known, whereas in clustering there is no such number. Note that it is not just unknown; it simply does not exist. It is up to us to choose a suitable number of clusters for our purpose. Often this means trying out a few and picking the one that delivers the best results.
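The idea of trying out several cluster counts can be sketched with a tiny 1-D k-means. Everything here is illustrative: the toy data, the fixed seed, and the use of inertia (within-cluster sum of squared distances) as the comparison criterion are all assumptions, not part of the article.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Tiny 1-D k-means; returns (centers, inertia)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group
        # (keep the old center if a group went empty).
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.1, 9.0, 8.8]
for k in (1, 2, 3, 4):
    _, inertia = kmeans_1d(data, k)
    print(k, round(inertia, 2))
```

Inertia always shrinks as k grows, so in practice you would look for the point where adding another cluster stops paying off, rather than simply taking the smallest value.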
Kinds of clustering methods

Before we dive right into concrete clustering algorithms, let us first establish some ways in which we can describe and distinguish them:

In hard clustering, every object belongs to exactly one cluster. In soft clustering, an object can belong to one or more clusters. The membership can be partial, meaning an object may belong to some clusters more strongly than to others.
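The difference can be illustrated with a toy example. The points, the two fixed cluster centers, and the inverse-distance weighting used for the soft memberships are all assumptions made here for illustration:

```python
# Toy 1-D points and two fixed cluster centers (assumed for illustration).
points = [0.0, 1.0, 4.5, 9.0, 10.0]
centers = [1.0, 9.0]

for p in points:
    # Soft membership: inverse distance to each center,
    # normalized so the memberships sum to 1.
    weights = [1.0 / (abs(p - c) + 1e-9) for c in centers]
    total = sum(weights)
    soft = [w / total for w in weights]
    # Hard clustering keeps only the strongest membership.
    hard = soft.index(max(soft))
    print(f"{p:5}: soft={[round(s, 2) for s in soft]}, hard cluster {hard}")
```

A point like 4.5, sitting between the centers, gets a split membership such as roughly 0.56/0.44, while hard clustering forces it entirely into cluster 0.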

In hierarchical clustering, clusters are iteratively combined in a hierarchical manner, finally ending up in one root (or super-cluster, if you will). You can also look at a hierarchical clustering as a binary tree. All clustering methods not following this principle can simply be described as flat clustering, but are sometimes also called non-hierarchical or partitional. You can always convert a hierarchical clustering into a flat one by “cutting” the tree horizontally on a level of your choice.

The tree-shaped diagram of a hierarchical clustering is called a dendrogram. Objects connected on a lower hierarchy are more similar than objects connected high up the tree.

Hierarchical methods can be further divided into two subcategories. Agglomerative (“bottom up”) methods start by putting each object into its own cluster and then keep unifying them. Divisive (“top down”) methods do the opposite: they start from the root and keep dividing it until only single objects are left.
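An agglomerative run can be sketched in a few lines of plain Python. This is a minimal single-linkage version on 1-D points, stopped early to yield a flat clustering; the data and the choice of single linkage are assumptions for illustration:

```python
def agglomerative(points, n_clusters):
    """Agglomerative ("bottom up") clustering with single linkage:
    start with one cluster per point, then repeatedly merge the
    two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerative([1.0, 1.1, 5.0, 5.2, 9.9, 10.0], 3))
```

Stopping the merging once a chosen number of clusters remains is exactly the "cut the tree horizontally" conversion from a hierarchical to a flat clustering mentioned above.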

The clustering process

It should be clear what the clustering process looks like, right? You take some data, apply the clustering algorithm of your choice and ta-da, you are done! While this might theoretically be possible, it is usually not the case. Especially when working with text, there are several steps you have to take before and after clustering. In reality, the process of clustering text is often messy and marked by many unsuccessful trials. However, if you tried to draw it in an idealized, linear manner, it might look like this:
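Such an idealized text-clustering pipeline can be sketched end to end in code. The toy documents, the raw term-frequency vectors, and the greedy similarity threshold of 0.2 are all assumptions chosen for illustration, not a recommended production setup:

```python
import math
from collections import Counter

docs = [
    "Cats chase mice",
    "Mice fear cats",
    "Stocks fell sharply today",
    "Markets and stocks tumbled",
]

# 1. Preprocessing: lowercase and tokenize (real pipelines would also
#    remove stop words, stem, etc.).
tokens = [d.lower().split() for d in docs]

# 2. Vectorization: raw term-frequency vectors.
vectors = [Counter(t) for t in tokens]

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

# 3. Clustering: greedily attach each document to the first cluster
#    whose first member is similar enough (threshold is an assumption).
clusters = []
for i, v in enumerate(vectors):
    for cluster in clusters:
        if cosine(v, vectors[cluster[0]]) > 0.2:
            cluster.append(i)
            break
    else:
        clusters.append([i])

# 4. Inspection: look at the grouped documents.
for c in clusters:
    print([docs[i] for i in c])
```

Even this toy version shows the extra steps around the clustering itself: preprocessing and vectorization before, inspection of the results after.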

Quite a few extra steps, right? Don’t worry — you would probably have intuitively done it right anyway. However, it is helpful to consider each step on its own and keep in mind that alternative options for solving the problem might exist.
