Clustering in Data Mining

As we mentioned in our introductory article, a complete information mining process consists of three steps.

The data mining task is performed after a preliminary data processing stage, which consists of gathering and cleaning data from a data warehouse, and is itself composed of a set of well-defined exploration algorithms.

When analysing large sets of data in this step, one of the most widely used techniques is the aggregation of similar objects within the dataset. This important method is usually referred to as clustering in data mining.

Cluster analysis is a fundamental task in data analysis fields such as statistical exploration, machine learning, and any kind of pattern-oriented work.

It is a single technique within the world of data mining, but it can be implemented in several different ways. In this resource we cover the main concepts of clustering to provide a better understanding of the methodology.

Let’s jump into that!

What is Clustering?

Clustering is the grouping of a particular set of objects based on their characteristics, aggregating them according to their similarities. In data mining, this methodology partitions the data using a specific clustering algorithm, chosen to best suit the desired analysis.

In cluster analysis, an object may either strictly belong to a cluster or not belong to it at all; this type of grouping is called hard partitioning. Soft partitioning, on the other hand, assigns every object to each cluster to a certain degree. More specific schemes are also possible, such as allowing objects to belong to multiple clusters, forcing an object to participate in exactly one cluster, or building hierarchical trees of group relationships.
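As an illustration of the difference, here is a minimal sketch using scikit-learn; the toy data, the library choice, and every parameter value below are assumptions made only for this example:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy data: 300 two-dimensional points around 3 centres (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Hard partitioning: every object receives exactly one cluster label
hard_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Soft partitioning: every object belongs to each cluster with some degree
soft_degrees = GaussianMixture(n_components=3, random_state=42).fit(X).predict_proba(X)

print(hard_labels[:5])                 # one label per object
print(np.round(soft_degrees[:5], 3))   # one row per object; each row sums to 1
```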

There are several different ways to implement this partitioning, based on distinct cluster models. Different algorithms are applied to each model, leading to different properties and results. These models are distinguished by how clusters are organized and by the type of relationship between their members. The most important ones are:

  • Centroid – each cluster is represented by a single mean vector, and each object is compared against these mean values
  • Distribution – the cluster is built using statistical distributions
  • Connectivity – the connectivity in these models is based on a distance function between elements
  • Group – the algorithms have only group membership information
  • Graph – cluster organization and the relationships between members are defined by a graph-linked structure
  • Density – members of the cluster are grouped in regions where observations are dense and similar

Clustering Algorithms in Data Mining

Based on the cluster models just described, there are many clustering algorithms that can be applied to a data set in order to partition the information. In this article we will briefly describe the most important ones. It is worth mentioning that every method has its advantages and drawbacks; the choice of algorithm will always depend on the characteristics of the data set and on what we want to do with it.

Centroid-based

In this type of grouping method, every cluster is represented by a vector of values (its centroid). Each object is assigned to the cluster whose centroid is closest to it, compared to all other clusters. The number of clusters must be pre-defined, which is the biggest limitation of this kind of algorithm. This methodology is the closest to classification and is widely used for optimization problems.
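As a concrete sketch, the snippet below runs k-means, a typical centroid-based algorithm, with scikit-learn; the toy data and the choice of three clusters are assumptions made only for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Illustrative data: three well-separated blobs in two dimensions
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# The number of clusters must be fixed in advance, the main limitation noted above
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)        # one mean vector (centroid) per cluster
print(kmeans.labels_[:10])            # each object is assigned to its nearest centroid
print(kmeans.predict([[0.0, 0.0]]))   # new objects also go to the closest centroid
```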

Distribution-based

Built on pre-defined statistical models, the distribution-based methodology groups objects whose values are likely to belong to the same distribution. Because values are assumed to be generated randomly from these distributions, this approach needs a well-defined and often complex model to fit real data well. However, these processes can achieve near-optimal solutions and also capture correlations and dependencies between attributes.
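A Gaussian mixture model is one common example of this family. The sketch below, using scikit-learn with invented toy data and an assumed number of components, shows the idea:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Illustrative data, assumed to come from a mixture of Gaussian distributions
X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

# Fit one multivariate Gaussian per component; full covariance matrices
# also capture correlations and dependencies between attributes
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=1).fit(X)

print(gmm.means_)              # estimated mean of each distribution
print(gmm.covariances_.shape)  # (3, 2, 2): one covariance matrix per component
print(gmm.predict(X)[:10])     # most likely distribution for each object
```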

Connectivity-based

In this type of algorithm, every object is related to its neighbours, and the degree of that relationship depends on the distance between them. Based on this assumption, clusters are built from nearby objects and can be described by a maximum distance limit. Because of this relationship between members, these clusters have a hierarchical representation. The distance function varies depending on the focus of the analysis.
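Agglomerative (hierarchical) clustering is a typical connectivity-based method. The sketch below uses scikit-learn; the toy data, the Ward linkage, and the maximum distance limit are assumptions chosen only for the example:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Illustrative data
X, _ = make_blobs(n_samples=200, centers=3, random_state=2)

# Merge nearby objects step by step; stop merging once clusters would be
# farther apart than the chosen maximum distance limit
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0,
                              linkage="ward").fit(X)

print(agg.n_clusters_)    # number of clusters implied by the distance limit
print(agg.labels_[:10])   # cluster membership of the first objects
```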

Density-based

These algorithms create clusters in regions of the data set where members are highly concentrated. They combine a notion of distance with a density threshold to group members into clusters. These kinds of processes may perform less well at detecting the border areas of a group.
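DBSCAN is the best-known density-based algorithm. The sketch below uses scikit-learn; the half-moon toy data and the eps/min_samples values are assumptions made for illustration only:

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Illustrative data: two interleaved half-moons, a shape centroid methods handle poorly
X, _ = make_moons(n_samples=300, noise=0.05, random_state=3)

# eps provides the distance notion, min_samples the density threshold
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(set(db.labels_))   # cluster ids; label -1 marks low-density noise/border points
```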

Main Applications of Cluster Analysis

Since clustering is a very valuable data analysis technique, it has many different applications across the sciences. Almost any large data set can be processed by this kind of analysis, producing useful results with many distinct types of data.

One of the most important applications is image processing, detecting distinct kinds of patterns in image data. This can be very effective in biology research, for distinguishing objects and identifying patterns. Another use is the classification of medical exams.

Personal data combined with shopping habits, location, interests, actions and countless other indicators can be analysed with this methodology, revealing valuable information and trends. Examples include market research, marketing strategies, web analytics, and many others.

Other applications based on clustering algorithms include climatology, robotics, recommender systems, and mathematical and statistical analysis, giving the technique a broad spectrum of use.