Load the data file labor.arff
into WEKA using the same steps we used earlier to load data into the Preprocess tab. Take a few
minutes to look around the data in this tab: the columns, the attribute
data, the distribution of each column, and so on. Your screen should look like Figure
5 after loading the data.
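
If you would like to see roughly what the tab summarizes for each attribute, here is a minimal Python sketch that reads ARFF text and reports a per-column summary. It is not how WEKA itself works internally. The sample content, attribute names, and the helper names parse_arff and summarize are all hypothetical, and the parser assumes the simplest form of the format (no quoted names, no sparse rows).

```python
from collections import Counter

# Hypothetical ARFF content for illustration only; NOT the contents of labor.arff.
SAMPLE_ARFF = """\
% toy data set
@relation toy
@attribute hours numeric
@attribute plan {none,partial,full}
@attribute class {bad,good}
@data
1,none,bad
2,?,good
3,full,good
"""

def parse_arff(text):
    """Return (attributes, rows) from simple, dense ARFF text."""
    attributes = []  # (name, type) pairs in declaration order
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
        elif line.lower().startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif line.lower().startswith("@data"):
            in_data = True
    return attributes, rows

def summarize(attributes, rows):
    """Min/max for numeric columns, value counts for nominal ones."""
    summary = {}
    for i, (name, kind) in enumerate(attributes):
        present = [row[i] for row in rows if row[i] != "?"]  # "?" marks a missing value
        info = {"missing": len(rows) - len(present)}
        if kind.lower() in ("numeric", "real", "integer"):
            numbers = [float(v) for v in present]
            info["min"], info["max"] = min(numbers), max(numbers)
        else:
            info["counts"] = dict(Counter(present))
        summary[name] = info
    return summary

attributes, rows = parse_arff(SAMPLE_ARFF)
summary = summarize(attributes, rows)
for name, info in summary.items():
    print(name, info)
```

The Preprocess tab shows the same kind of information (value ranges, counts per label, and missing values) for whichever attribute you select.
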
In Part 1, I introduced
the concept of data mining and the free and open source software Waikato
Environment for Knowledge Analysis (WEKA), which allows you to mine your own
data for trends and patterns. I also talked about the first method of data
mining — regression — which allows you to predict a numerical value for a given
set of input values. This method of analysis is the easiest to perform and the
least powerful method of data mining, but it served a good purpose as an
introduction to WEKA and provided a good example of how raw data can be
transformed into meaningful information.
In this
article, I will take you through two additional data mining methods that are
slightly more complex than a regression model, but more powerful in their
respective goals. Where a regression model could only give you a numerical
output with specific inputs, these additional models allow you to interpret
your data differently. As I said in Part 1, data mining is about applying the
right model to your data. You could have the best data about your customers
(whatever that even means), but if you don't apply the right models to it, it
will just be garbage. Think of this another way: If you only used regression
models, which produce a numerical output, how would Amazon be able to tell you
"Other Customers Who Bought X Also Bought Y?" There's no numerical
function that could give you this type of information. So let's delve into the
two additional models you can use with your data.
In this
article, I will also make repeated references to the data mining method called
"nearest neighbour," though I won't actually delve into the details
until a later article. I have included it in the comparisons and descriptions for this
article to make the discussions complete.
With this data set, we
are looking to create clusters, so instead of clicking on the Classify
tab, click on the Cluster tab. Click Choose
and select SimpleKMeans from the choices that appear (this
will be our preferred method of clustering for this article). Your WEKA
Explorer window should look like Figure 6 at this point.
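
To give a feel for what a k-means clusterer does with your data, here is a minimal sketch of the general k-means idea in plain Python. It is not WEKA's SimpleKMeans code, which has its own options and its own handling of attributes. The points, the function names, and the parameters (k, max_iter, seed) are assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved, so we have converged
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical two-dimensional data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

Which cluster ends up numbered 0 and which 1 depends on the random starting centroids, so read the cluster numbers only as labels.
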