Exploratory Analysis for House Finch

 

We have performed preliminary analyses of data from Project FeederWatch to determine the factors that affect the presence or absence of House Finches (Carpodacus mexicanus). We selected the House Finch because enough is known about this species to verify the results of these analyses.

We selected data from two separate biogeographic areas, drawn from the Bird Conservation Regions (BCRs) that Partners in Flight uses to classify North America. These regions were the Lower Great Lakes/St. Lawrence Plain (BCR 13; 115,000 observations available) and the Southeastern Coastal Plain (BCR 27; 23,000 observations). We chose these two regions because their contrasting climates and vegetation were expected to produce different responses of House Finches to the same set of environmental predictors.

Figure 1: Bird Conservation Regions of North America.

We are using non-parametric data mining and machine learning techniques to explore the ecological patterns in these data. Decision trees have proven especially well suited to the predictive modeling of very large environmental data sets. “Ensemble methods”, such as bagging and boosting, then improve predictive performance further by combining the predictions of multiple related decision trees. Advantages of decision tree-based ensemble methods include their ability to:

  • Produce highly accurate predictive models, often outperforming traditional statistical methods,
  • Accurately describe a wide variety of species’ distributions while making very few assumptions about the underlying relationships between predictors and responses,
  • Efficiently analyze very large data sets while automatically screening large numbers of potential predictors, and
  • Automatically deal with missing data.
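
As a rough illustration of how bagging combines decision trees, the sketch below fits one-split trees ("stumps") to bootstrap resamples of a presence/absence data set and averages their votes. It is a minimal sketch under stated assumptions, not the software used for these analyses: the data are synthetic, and all function names (fit_stump, fit_bagged, predict_bagged) are hypothetical.

```python
import random

def fit_stump(rows, labels):
    """Fit a one-split decision tree (a "stump") by minimizing training errors."""
    best = None
    for j in range(len(rows[0])):                 # candidate predictor
        for t in sorted({r[j] for r in rows}):    # candidate threshold
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            # Each side predicts its majority class (1 = present, 0 = absent).
            p_left = int(2 * sum(left) >= len(left))
            p_right = int(2 * sum(right) >= len(right)) if right else 0
            errors = (sum(y != p_left for y in left)
                      + sum(y != p_right for y in right))
            if best is None or errors < best[0]:
                best = (errors, j, t, p_left, p_right)
    return best[1:]

def predict_stump(stump, row):
    j, t, p_left, p_right = stump
    return p_left if row[j] <= t else p_right

def fit_bagged(rows, labels, n_trees=25, seed=0):
    """Bagging: fit each stump to a bootstrap resample of the observations."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ensemble.append(fit_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx]))
    return ensemble

def predict_bagged(ensemble, row):
    """Fraction of stumps voting "present": an estimated probability of occurrence."""
    return sum(predict_stump(s, row) for s in ensemble) / len(ensemble)

# Synthetic data (an assumption for illustration only): two predictors,
# with presence determined entirely by the first one.
rng = random.Random(42)
rows = [[rng.random(), rng.random()] for _ in range(100)]
labels = [int(r[0] > 0.5) for r in rows]

ensemble = fit_bagged(rows, labels)
print(predict_bagged(ensemble, [0.9, 0.1]))  # site with a high first predictor
print(predict_bagged(ensemble, [0.1, 0.9]))  # site with a low first predictor
```

Real analyses would use deeper trees and many more predictors. Stumps are used here only to keep the example short.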

Although basic data mining models have good predictive performance, they are essentially “black box” methods, making it difficult to extract information about predicted patterns of occurrence. We are developing general purpose tools to provide useful summaries, both visual and numeric, of the patterns revealed by any predictive model. For example, the bar charts in Figure 2 show the relative influence of the ten most influential predictors of House Finch occurrence in BCR 13, with larger scores indicating greater importance. These scores are normalized to sum to 100, allowing one to easily compare the relative influence of the same predictors of House Finch occurrence in BCR 27.

Figure 2: The relative influence of the ten top predictors for the House Finch in BCR 13 and BCR 27.
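
The normalization described above can be sketched as follows. This is an illustration only: the function name relative_influence and the predictor names and raw scores are hypothetical placeholders, not values from these analyses. It also assumes the scores are normalized over all predictors supplied, which the text does not specify.

```python
def relative_influence(raw_scores, top=10):
    """Rescale raw importance scores to sum to 100 and return the top predictors."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("raw importance scores must have a positive sum")
    scaled = {name: 100.0 * score / total for name, score in raw_scores.items()}
    # Largest relative influence first; keep only the `top` predictors.
    return sorted(scaled.items(), key=lambda item: item[1], reverse=True)[:top]

# Placeholder raw scores (made up for illustration, not real results).
raw = {"predictor_a": 3.0, "predictor_b": 1.0, "predictor_c": 1.0}
ranked = relative_influence(raw)
for name, score in ranked:
    print(f"{name}: {score:.1f}")
```

Because each region's scores share the same 0 to 100 scale, the score of a given predictor in one region can be read directly against its score in the other.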

More detailed information about how a particular predictor affects occurrence is provided by partial dependence plots. For example, Figure 3a shows the prevalence of House Finches declining in BCR 13, where mycoplasmal conjunctivitis caused mortality, while House Finches were simultaneously still colonizing and expanding their range in the southeastern lowlands of BCR 27. Figure 3b shows another contrast between the two regions: House Finches are less prevalent at bird feeders in the more northern BCR 13 in mid-winter, but more prevalent at feeders further south at the same time. This latter result may indicate the extent to which House Finches migrate from the northern parts of their range into more southern regions for the winter. Both plots are examples of the types of exploratory aids that can be produced from analyses of AKN data.

Figure 3: a) Inter-seasonal trends and b) Intra-seasonal trends for BCR 13 (blue) and 27 (red). 
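
A partial dependence curve of the kind plotted in Figure 3 can be computed for any predictive model by fixing one predictor at each value of a grid, leaving the other predictors at their observed values, and averaging the model's predictions. The sketch below shows the idea under stated assumptions: toy_model is a hypothetical stand-in for a fitted model, and the two observations are invented for illustration.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all observations with one predictor held fixed."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the original data are untouched
            modified[feature] = value  # hold the predictor of interest fixed
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve

def toy_model(row):
    """Hypothetical fitted model: predicted probability of occurrence."""
    return 0.1 + 0.5 * row[0] + 0.2 * row[1]

# Two invented observations with two predictors each.
observations = [[0.0, 0.0], [0.0, 1.0]]
curve = partial_dependence(toy_model, observations, feature=0, grid=[0.0, 0.5, 1.0])
for value, mean_prediction in curve:
    print(f"predictor 0 = {value:.1f}: mean prediction = {mean_prediction:.2f}")
```

Plotting the grid values against the averaged predictions gives the partial dependence plot for that predictor. Computing the curve separately for each region allows the two to be overlaid, as in Figure 3.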

Further Developments

We are currently working to extend these exploratory tools in a number of ways to address other important ecological questions. We do this by combining the best features of the predictive, non-parametric data mining tools employed in computer science with the existing inferential tools used in statistics. Specifically, we will be creating new, hybrid analytical techniques to:

  • Automatically detect important interactions between predictors,
  • Estimate confidence regions and perform statistical tests, and
  • Develop spatially explicit models of occurrence and abundance based on data mining models.