Exploratory Analysis for House Finch
We have performed preliminary analyses of data from Project FeederWatch to determine the factors that affect the presence or absence of House Finches (Carpodacus mexicanus). We selected the House Finch because enough is known about this species to verify the results of these analyses.


Figure 1: Bird Conservation Regions of North America.
We are using non-parametric data mining and machine learning techniques
to explore the ecological patterns in these data. Decision trees have
proven to be especially well suited for the predictive modeling of
massive environmental data. “Ensemble methods”, like Bagging and
Boosting, are then used to further improve predictive performance by
combining the predictions from multiple related decision tree analyses.
Advantages of decision tree-based ensemble methods include their
ability to:
- Produce highly accurate predictive models, often outperforming traditional statistical methods,
- Accurately describe a wide variety of species’ distributions while making very few assumptions about the underlying relationships between predictors and responses,
- Efficiently analyze very large data sets while automatically screening large numbers of potential predictors, and
- Automatically deal with missing data.
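
To make the idea of bagging concrete, here is a minimal sketch in plain Python. It is not the software used for these analyses: it substitutes one-split trees ("stumps") for full decision trees, and the predictors (`elevation`, `housing_density`) and the presence/absence labels are made-up toy values chosen only for illustration.

```python
import random

def fit_stump(rows, labels):
    """Find the single-predictor threshold split with the fewest training errors."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            # Predict the majority class on each side of the split.
            left_pred = int(sum(left) * 2 >= len(left)) if left else 0
            right_pred = int(sum(right) * 2 >= len(right)) if right else 0
            errors = (sum(y != left_pred for y in left)
                      + sum(y != right_pred for y in right))
            if best is None or errors < best[0]:
                best = (errors, j, threshold, left_pred, right_pred)
    return best[1:]

def predict_stump(stump, row):
    j, threshold, left_pred, right_pred = stump
    return left_pred if row[j] <= threshold else right_pred

def fit_bagged(rows, labels, n_trees=25, seed=0):
    """Fit each stump to a bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return ensemble

def predict_bagged(ensemble, row):
    """Fraction of stumps voting 'present': a crude occurrence score."""
    return sum(predict_stump(s, row) for s in ensemble) / len(ensemble)

# Toy data (hypothetical): each row is [elevation, housing_density];
# the species is recorded as present (1) only where housing density is high.
rows = [[e, d] for e in (100, 300, 500, 700) for d in (10, 30, 50, 70, 90)]
labels = [1 if d >= 50 else 0 for _, d in rows]

ensemble = fit_bagged(rows, labels)
print(predict_bagged(ensemble, [100, 90]))  # score at a high-density site
print(predict_bagged(ensemble, [700, 10]))  # score at a low-density site
```

Averaging the votes of many trees, each fit to a different resample, is what distinguishes bagging from a single tree. Boosting instead fits trees in sequence, each one concentrating on the cases the earlier trees predicted poorly.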
Although basic data mining models have good predictive performance, they are essentially “black box” methods, which makes it difficult to extract information about predicted patterns of occurrence. We are developing general-purpose tools to provide useful summaries, both visual and numeric, of the patterns revealed by any predictive model. For example, the bar charts in Figure 2 show the relative influence of the ten most important predictors of House Finch occurrence in BCR 13, with larger scores indicating greater importance. These scores are normalized to sum to 100, which makes it easy to compare them with the relative influence of the same predictors of House Finch occurrence in BCR 27.

Figure 2: The relative influence of the ten top predictors for the
House Finch in BCRs 13 and 27.
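
The normalization behind Figure 2 can be sketched as follows. This is an illustration only: the function name `normalize_influence`, the predictor names, and the raw scores are hypothetical, and the sketch assumes the scores are rescaled across all predictors supplied before the top ones are selected.

```python
def normalize_influence(raw_scores, top_n=10):
    """Rescale raw importance scores to sum to 100 and return the top_n."""
    total = sum(raw_scores.values())
    scaled = {name: 100.0 * score / total
              for name, score in raw_scores.items()}
    # Sort from most to least influential.
    ranked = sorted(scaled.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Hypothetical raw scores for one region (not results from this study).
raw_scores = {
    "year": 4.0,
    "day_of_season": 3.0,
    "elevation": 2.0,
    "housing_density": 1.0,
}

ranked = normalize_influence(raw_scores)
for name, score in ranked:
    print(f"{name:>16s} {score:5.1f}")
```

Because each region's scores are placed on the same 0 to 100 scale, a predictor's share of the total influence in one region can be read directly against its share in another.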
More detailed information about how a particular predictor affects
occurrence is provided by partial dependence plots. For example, in
Figure 3a the prevalence of House Finches can be seen to decline in BCR
13, as mycoplasmal conjunctivitis disease caused mortality in this
region, while simultaneously House Finches were still colonizing and
expanding their range in the southeastern lowlands of BCR 27.
Figure 3b shows another contrast between these two regions, with House
Finches being less prevalent at bird feeders in the more northern BCR
13 in mid-winter, while being more prevalent at bird feeders further
south at this same time. This latter result may indicate the extent
of migration of House Finches from northern parts of their range
into more southern regions over the winter. Both plots are examples of the
types of exploratory aids that can be produced from analyses of AKN
data.
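
For readers unfamiliar with partial dependence, the sketch below shows the standard calculation for one predictor: set that predictor to each value on a grid for every observation, then average the model's predictions. The model `toy_model`, its coefficients, and the data rows are hypothetical stand-ins, not the fitted models or the data behind Figure 3.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction over all rows with one predictor fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)             # copy so the data are not altered
            modified[feature_index] = value  # overwrite the predictor of interest
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def toy_model(row):
    """Hypothetical occurrence model: declines over years, rises with latitude."""
    years_since_start, latitude = row
    p = 0.6 - 0.04 * years_since_start + 0.005 * (latitude - 40.0)
    return min(1.0, max(0.0, p))

# Hypothetical observations: [years_since_start, latitude].
rows = [[0, 38.0], [3, 41.0], [6, 44.0], [9, 42.0]]
grid = [0, 2, 4, 6, 8]

curve = partial_dependence(toy_model, rows, feature_index=0, grid=grid)
for year, mean_prediction in curve:
    print(year, round(mean_prediction, 3))
```

Plotting the resulting pairs gives a curve like those in Figure 3. Averaging over the observed values of the other predictors is what lets the curve isolate the effect of the single predictor of interest.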

Figure 3: a) Inter-seasonal trends and b) Intra-seasonal trends for BCRs
13 (blue) and 27 (red).
Further Developments
We are currently working to extend these exploratory tools in a
number of ways to address other important ecological questions. We do
this by combining the best features of the predictive,
non-parametric data mining tools used in computer science with the
existing inferential tools used in statistics. Specifically, we will
be creating new, hybrid analytical
techniques to:
- Automatically detect important interactions between predictors,
- Estimate confidence regions and carry out statistical tests, and
- Develop spatially explicit models of occurrence and abundance based on data mining models.