Clustering in DBDW

introduced the concept of data mining and to the free and open source software Waikato Environment for Knowledge Analysis (WEKA), which allows you to mine your own data for trends and patterns. I also talked about the first method of data mining — regression — which allows you to predict a numerical value for a given set of input values. This method of analysis is the easiest to perform and the least powerfu


I introduced the concept of data mining and to the free and open source software Waikato Environment for Knowledge Analysis (WEKA), which allows you to mine your own data for trends and patterns. I also talked about the first method of data mining — regression — which allows you to predict a numerical value for a given set of input values. This method of analysis is the easiest to perform and the least powerful method of data mining, but it served a good purpose as an introduction to WEKA and provided a good example of how raw data can be transformed into meaningful information.
In this article, I will take you through two additional data mining methods that are slightly more complex than a regression model, but more powerful in their respective goals. Where a regression model could only give you a numerical output with specific inputs, these additional models allow you to interpret your data differently. As I said in Part 1, data mining is about applying the right model to your data. You could have the best data about your customers (whatever that even means), but if you don't apply the right models to it, it will just be garbage. Think of this another way: If you only used regression models, which produce a numerical output, how would Amazon be able to tell you "Other Customers Who Bought X Also Bought Y?" There's no numerical function that could give you this type of information. So let's delve into the two additional models you can use with your data.
In this article, I will also make repeated references to the data mining method called "nearest neighbour," though I won't actually delve into the details until. However, I included it in the comparisons and descriptions for this article to make the discussions complete.

Load the data file labor.arff into WEKA using the same steps we used to load data into the Pre-process tab. Take a few minutes to look around the data in this tab. Look at the columns, the attribute data, the distribution of the columns, etc. Your screen should look like Figure 5 after loading the data.  

With this data set, we are looking to create clusters, so instead of clicking on the Classify tab, click on the Cluster tab. Click Choose and select SimpleKMeans from the choices that appear (this will be our preferred method of clustering for this article). Your WEKA Explorer window should look like Figure 6 at this point.