Learn to Conduct Data-mining Analyses With AKN Data

The philosophy behind data mining, and the software used to conduct data-mining analyses, are relatively unfamiliar to ecologists compared with statistical analysis and statistical software. We have therefore written a paper that describes the philosophy behind data mining and provides some basic examples of the types of insights that can be gained from it (August 2007 issue of the Journal of Wildlife Management). That paper, however, does not provide the “nuts and bolts” information needed to learn how to actually conduct a data-mining analysis. Thus, we have produced a set of example computer code and a sample data set that can be used to learn how to conduct data-mining analyses.

The example program is written in the R statistical language. While many different software packages are available for data mining, we chose to create our example in R because it is free, open-source software that is likely familiar to more ecologists than any of the other software packages currently able to conduct data-mining analyses. Our example computer code shows you how to do three things: (1) create a bagged decision-tree analysis and produce predictions, (2) identify important predictor variables, and (3) conduct auxiliary analyses and produce graphs that visualize the relationships between the predictors and the response (i.e. dependent) variable. The role of each of these tasks is described in the Journal of Wildlife Management paper.

The example programs were written under the assumption that the user already has at least a basic knowledge of the R programming language and of how to use the R online documentation. The standard installation of R should include all of the components needed to run the example program successfully. Our example is copiously commented, with a description of the purpose of each step in the program. We have attempted to write the program in a way that is easily understandable, at the expense of programming elegance; i.e. we have often broken a process into several smaller steps instead of using a single but more complex program statement. Also, we have deliberately not made use of optional modules (called “libraries”) in R, even when these libraries contain functions that could be used. Nevertheless, in order to fully understand how the existing program works and to modify it successfully, we recommend that users read the R documentation. In the future, we hope to expand on our example programs in a number of ways, such as creating generic functions for common tasks like measuring variable importance, and making use of optional R libraries to provide more detailed graphical output and measures of predictive performance.
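For readers who want a feel for the three tasks described above before opening the downloads, the following is a minimal sketch in R. It is not the example program distributed here. The data are simulated, and the variable names (elevation, forest, noise, abundance) and the helper functions predict_bagged and rmse are hypothetical. It assumes the rpart package, which is normally bundled with a standard R installation, and it uses permutation importance and a partial-dependence curve as stand-ins for whatever methods the downloadable program actually uses.

```r
# Illustrative sketch only: simulated data, hypothetical names.
library(rpart)  # recursive-partitioning trees; assumed to be installed with R

set.seed(42)

# Simulated stand-in for a real data set: abundance depends on elevation only.
n <- 300
dat <- data.frame(
  elevation = runif(n, 0, 1000),
  forest    = runif(n, 0, 1),
  noise     = rnorm(n)
)
dat$abundance <- 5 + 0.01 * dat$elevation + rnorm(n, sd = 1)

train <- dat[1:200, ]
test  <- dat[201:300, ]

# (1) Bagging: fit one tree to each bootstrap resample of the training rows.
n_trees <- 50
trees <- vector("list", n_trees)
for (i in seq_len(n_trees)) {
  rows <- sample(nrow(train), replace = TRUE)
  trees[[i]] <- rpart(abundance ~ elevation + forest + noise,
                      data = train[rows, ])
}

# The bagged prediction is the average of the individual tree predictions.
predict_bagged <- function(trees, newdata) {
  per_tree <- do.call(cbind, lapply(trees, predict, newdata = newdata))
  rowMeans(per_tree)
}

rmse <- function(obs, fit) sqrt(mean((obs - fit)^2))

test_pred <- predict_bagged(trees, test)
base_err  <- rmse(test$abundance, test_pred)

# (2) Permutation importance: shuffle one predictor in the held-out data
# and record how much the prediction error increases.
importance <- sapply(c("elevation", "forest", "noise"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(test$abundance, predict_bagged(trees, shuffled)) - base_err
})
print(importance)

# (3) Partial dependence: fix elevation at each grid value for every row,
# then average the bagged predictions over the other predictors.
grid <- seq(min(train$elevation), max(train$elevation), length.out = 20)
pd <- sapply(grid, function(g) {
  tmp <- train
  tmp$elevation <- g
  mean(predict_bagged(trees, tmp))
})
plot(grid, pd, type = "l",
     xlab = "elevation", ylab = "Mean predicted abundance")
```

Because the simulated response depends only on elevation, shuffling elevation should raise the error far more than shuffling forest or noise, and the partial-dependence curve should rise with elevation. With real data, the same steps would be run on the sample data set after replacing the simulated data frame and the model formula.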

Download the PDF of Data-Mining Discovery of Pattern and Process in Ecological Systems

Download the Word Document "How to Instructions and R code"

Download sample data for sample data-mining analysis.