Reference no: EM133238409
Case: Clustering is one of the core techniques in data mining; it allows us to identify similarities and patterns in data. Across the existing clustering techniques, the clustering problem can be defined as grouping records into n groups (clusters) so that two records in one cluster have more in common than two records in different clusters. Some clustering methods define similarity as proximity to a cluster center, while others define it as proximity to other records in the cluster. Methods in the first group tend to find roughly circular (spherical) clusters, while methods in the second group work for clusters of other shapes as long as there is space between the clusters (see Python docs for more details). Also, some clustering methods let us specify the number of clusters in advance, while others do not.
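To make the two notions of similarity concrete, here is a minimal pure-Python sketch. It is an illustration only, not the implementation of any particular library: `center_based` is a small k-means-style loop, `neighbor_based` simply links records that lie within a distance `eps` of each other, and the data (two parallel rows of points), the starting centers, and the value of `eps` are all assumptions chosen for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+

def center_based(points, centers, rounds=10):
    """Similarity = proximity to a cluster center (k-means-style loop)."""
    centers = list(centers)  # work on a copy of the starting centers
    labels = []
    for _ in range(rounds):
        # Each record joins its nearest center.
        labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                  for p in points]
        # Each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def neighbor_based(points, eps):
    """Similarity = proximity to other records (link anything within eps)."""
    labels = [None] * len(points)
    cluster = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:  # spread the label to every reachable neighbor
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] is None and dist(points[i], points[j]) <= eps:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels

# Assumed toy data: two long, thin rows of records separated by a gap.
row_a = [(float(x), 0.0) for x in range(10)]
row_b = [(float(x), 3.0) for x in range(10)]
points = row_a + row_b

by_center = center_based(points, centers=[points[0], points[-1]])
by_neighbors = neighbor_based(points, eps=1.5)

print("center-based:  ", by_center)
print("neighbor-based:", by_neighbors)
```

In this toy setup the center-based grouping cuts both rows into a left half and a right half, because it looks for compact blobs around two centers, while the neighbor-based grouping follows each row end to end because the rows are separated by a gap wider than `eps`.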
Clustering has many applications and can be used with various kinds of data. It may also serve as part of a recommendation mechanism (e.g., "People who bought this also liked that"), which can be used for more targeted marketing.
Depending on the application, clustering methods may be used differently. However, common clustering methods do not allow the analyst to prioritize some variables over others, and they do not allow special rules for cluster allocation beyond the similarity-based ones discussed earlier.
Directions:
Compose a practical case for data mining that could employ clustering with a new set of conditions for grouping records, conditions that do not fit the existing paradigm of simple similarity with equal treatment of all variables.
For example, a dataset of anonymous commuting rides may be deanonymized with clustering analysis. The condition for clustering may be to find rides with similar departure and arrival points that happened around the same time of day. However, no more than one ride in a cluster may take place on any given day (one person cannot ride two vehicles on the same route around the same time on the same day). Clusters of similar rides conducted on different days may then suggest the same commuter.
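The commuting example can be sketched as a greedy grouping with an extra "one ride per day" rule. This is only one possible reading of the case, not a standard algorithm: the `Ride` record, the `similar` thresholds (`max_dist`, `max_minutes`), and the sample rides are hypothetical, and the greedy pass depends on the order of the input. It also ignores rides that straddle midnight.

```python
from dataclasses import dataclass
from math import dist  # Euclidean distance, Python 3.8+

@dataclass(frozen=True)
class Ride:
    day: str        # calendar day of the ride
    origin: tuple   # (x, y) departure point, arbitrary units
    dest: tuple     # (x, y) arrival point, arbitrary units
    minute: int     # departure time, minutes after midnight

def similar(a, b, max_dist=0.5, max_minutes=20):
    """Similarity rule: close departure, close arrival, close time of day."""
    return (dist(a.origin, b.origin) <= max_dist
            and dist(a.dest, b.dest) <= max_dist
            and abs(a.minute - b.minute) <= max_minutes)

def cluster_rides(rides):
    """Greedy grouping: similar rides share a cluster, but never two per day."""
    clusters = []
    for ride in rides:
        for cluster in clusters:
            fits = all(similar(ride, other) for other in cluster)
            day_free = all(ride.day != other.day for other in cluster)
            if fits and day_free:
                cluster.append(ride)
                break
        else:
            # No existing cluster accepts this ride, so start a new one.
            clusters.append([ride])
    return clusters

# Hypothetical sample: two commuters on the same route, plus one unrelated ride.
rides = [
    Ride("Mon", (0.0, 0.0), (5.0, 5.0), 480),
    Ride("Mon", (0.1, 0.0), (5.0, 5.1), 485),
    Ride("Tue", (0.0, 0.1), (5.1, 5.0), 478),
    Ride("Tue", (0.1, 0.1), (5.0, 5.0), 490),
    Ride("Wed", (9.0, 9.0), (1.0, 1.0), 1020),
]

clusters = cluster_rides(rides)
for number, cluster in enumerate(clusters):
    print(number, [(r.day, r.minute) for r in cluster])
```

Without the day rule, the first four rides would fall into a single cluster. With it, they split into two candidate commuters, each with at most one ride per day, which is the kind of allocation rule that plain similarity-based clustering does not express.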