Project FeederWatch Exploratory Analysis

A detailed description of the methods used for our exploratory analysis.

Contents

  1. Project FeederWatch Data Description
  2. Data Mining Analysis
  3. Identifying Important Predictors of Species' Winter Distributions
  4. Visualizations with Partial Dependence Plots
  5. References

1 Project FeederWatch Data Description

Project FeederWatch (PFW) [http://birds.cornell.edu/pfw] is a winter-long survey of birds visiting feeders across the United States and Canada. From mid-November until the very beginning of April, and as frequently as once a week, each participant in PFW records and reports the birds that they have seen attracted to bird feeders at one pre-determined location. Observers report the maximum number of birds seen at one time during an observation period for each species. Each observation period is 2 days long, typically a weekend. Observer effort is recorded. Every observer is also encouraged to provide information describing the environment around their feeder site.

Project FeederWatch data were analyzed with the goal of describing the winter distributions of common feeder birds across the continental United States and southern Canada. In addition to the information provided by PFW participants, our data set contained over 100 site-specific predictor variables describing local habitat, climate, and human environment. These additional predictor variables were almost all available only for the United States. In order to better capture regional variation in species’ winter distributions, analyses were conducted separately for each species and Bird Conservation Region (BCR) for which there was sufficient data to perform robust data-mining.

2 Data Mining Analysis

Restricting ourselves to BCRs that lie at least partially within the continental US, we used the following criteria to select Species x BCR combinations for inclusion in the analysis:

  1. there were more than 1000 PFW reports, either presence or absence, of a species within a given BCR,
  2. the percentage of reports for which the species was present within a given BCR was greater than 5%, and
  3. the number of feeder locations making reports within the BCR was greater than 100.
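The three criteria above can be sketched as a simple filter. This is a minimal illustration, not the code used for the analysis: the record layout (species, BCR, site identifier, presence flag) and the function name `select_combinations` are assumptions made for the example, and criterion 3 is read here as a count of distinct feeder locations per BCR.

```python
from collections import defaultdict

def select_combinations(reports, min_reports=1000, min_presence=0.05, min_sites=100):
    """Return the (species, bcr) pairs that pass all three selection criteria.

    `reports` is an iterable of (species, bcr, site_id, present) tuples,
    one per report; this layout is hypothetical.
    """
    n_reports = defaultdict(int)   # reports per (species, BCR)
    n_present = defaultdict(int)   # presence reports per (species, BCR)
    sites = defaultdict(set)       # distinct feeder locations per BCR
    for species, bcr, site_id, present in reports:
        n_reports[(species, bcr)] += 1
        if present:
            n_present[(species, bcr)] += 1
        sites[bcr].add(site_id)

    selected = []
    for (species, bcr), total in n_reports.items():
        if (total > min_reports                                   # criterion 1
                and n_present[(species, bcr)] / total > min_presence  # criterion 2
                and len(sites[bcr]) > min_sites):                 # criterion 3
            selected.append((species, bcr))
    return sorted(selected)
```

All three thresholds are strict inequalities, matching the wording "more than" and "greater than" in the list.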

Using these selection criteria we analyzed a total of 636 Species x BCR combinations comprising 89 species over 27 BCRs. For each Species x BCR combination we used non-parametric data mining techniques to create predictive models of the probability of a species' presence and to identify the predictors most closely associated with the presence and absence of each species. Decision trees have proven to be especially well suited for the predictive modeling of massive environmental data (De’ath and Fabricius 2000, Elith et al. 2006). Rather than relying on a single decision tree, each of our models used an “ensemble” method, which combines the predictions from multiple related decision trees in order to improve predictive performance (bagging and boosting are examples; Breiman 1996 & 2001, Freund and Schapire 1996, Friedman 2001). Advantages of decision tree-based ensemble methods include their ability to:

  • automatically produce highly accurate predictive models, often matching or even exceeding the performance of traditional statistical models specified by experts,
  • accurately describe a wide variety of species’ distributions while making very few assumptions about the underlying relationships between predictors and responses,
  • efficiently analyze very large data sets while automatically screening large numbers of potential predictors, and
  • automatically deal with missing data.

For our analysis we used a bagged ensemble of 100 ID3 decision trees, produced by the IND package (Buntine 1993). Recent work has shown that for binary classification bagged trees are competitive with the best available learning methods (Caruana et al. 2003, Niculescu-Mizil and Caruana 2005).
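To illustrate the idea of bagging, here is a minimal sketch in plain Python. It is not the IND package and does not build ID3 trees: as a stated simplification, each ensemble member is a one-split tree (a "stump"), and the names `fit_stump` and `fit_bagged` are hypothetical. What it does show is the core of the method: each tree is fit to a bootstrap resample of the data, and the ensemble prediction is the average of the individual trees' predicted probabilities of presence.

```python
import random

def fit_stump(rows, labels):
    """Fit a one-split tree by choosing the (feature, threshold) with lowest squared error."""
    best = None
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            # Each leaf predicts the observed presence rate on its side of the split.
            p_left = sum(left) / len(left) if left else 0.0
            p_right = sum(right) / len(right) if right else 0.0
            err = (sum((y - p_left) ** 2 for y in left)
                   + sum((y - p_right) ** 2 for y in right))
            if best is None or err < best[0]:
                best = (err, j, t, p_left, p_right)
    _, j, t, p_left, p_right = best
    return lambda r: p_left if r[j] <= t else p_right

def fit_bagged(rows, labels, n_trees=100, seed=0):
    """Fit n_trees stumps on bootstrap resamples and return the averaged predictor."""
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n rows with replacement
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return lambda r: sum(tree(r) for tree in trees) / len(trees)
```

Calling `fit_bagged(rows, labels)` returns a function that maps a row of predictor values to an estimated probability of presence between 0 and 1.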

3 Identifying Important Predictors of Species' Winter Distributions

We wanted to identify the most important features that consistently predicted bird prevalence across all species and BCRs. To do this we needed to compute a measure of predictor importance for all 200 variables in each of the 636 analyses, potentially a very large computational task. We developed new heuristic measures that analyze the structure of the decision trees to efficiently identify the predictors with the greatest impact on the accuracy of model predictions (Caruana et al. 2006). Using these measures we ranked the importance of each variable in each analysis. While the rankings varied somewhat among species and BCRs, several variables consistently appeared in the species- and BCR-specific lists of the 20 most influential variables. We chose the 9 most consistently influential variables for graphing. Nine was chosen pragmatically, as a 3 x 3 display of graphs fits well on a web page.

The nine predictor variables that most frequently appeared in the top 20 lists were:

  1. Year
  2. Date
  3. Longitude
  4. Latitude
  5. # Half Days Observer Effort
  6. # Hanging Feeders
  7. # Suet Feeders
  8. Elevation(m)
  9. Human Population Density

Note that the first variable "Year" does not represent a calendar year. Instead, each "Year" is a different winter season, spanning 2 calendar years. Similarly, "Date" is measured from the start of a PFW season, and is not the calendar date.
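The selection of the most consistently influential variables can be sketched as a frequency count over the per-analysis top-20 lists. This is only an illustration of the counting step: the importance scores themselves are taken as given (the heuristic measures based on tree structure are not reproduced here), and the function name and input layout are assumptions.

```python
from collections import Counter

def most_consistent_predictors(importances, top_k=20, n_keep=9):
    """Return the n_keep predictors that appear most often in the per-analysis top_k lists.

    `importances` maps an analysis id, e.g. (species, bcr), to a dict of
    {predictor name: importance score}; this layout is hypothetical.
    """
    counts = Counter()
    for scores in importances.values():
        # Rank this analysis's predictors from most to least important.
        ranked = sorted(scores, key=scores.get, reverse=True)
        counts.update(ranked[:top_k])
    return [name for name, _ in counts.most_common(n_keep)]
```

With the defaults, `top_k=20` and `n_keep=9` correspond to the top 20 lists and the nine graphed variables described above.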

These nine predictors describe general classes of important processes affecting ecological distributions:

  • temporal and spatial variation in the natural environment,
  • intensity of human modification of the environment (including modifications at individual sites, such as bird feeders), and
  • variation in the observational process (observer effort).

4 Visualizations with Partial Dependence Plots

After the most important predictors have been identified, the next step is to reveal how the modeled probability of occurrence (the response variable of our model) is affected by these predictors. A natural approach to this investigation is to graph predictions from the model as a function of an individual predictor. For example, we may want to know how the probability of occurrence of American Goldfinch varies as a function of elevation. Unfortunately, the probability of occurrence may be simultaneously affected by many predictors, making it difficult to isolate and describe the effects of any individual predictor (e.g. habitat type, weather, and elevation all simultaneously affect Goldfinch distribution).

A predictor may play one of two roles when investigating effects: it is either a “focal” predictor, whose effect we are investigating, or a “nuisance” predictor, the term we use for all non-focal predictors in the model. In order to separate the effects of the focal predictor, e.g. elevation, from the effects of the nuisance predictors, we compute and plot the mean probability of occurrence as a function of elevation, where the mean is taken across the joint values of the nuisance predictors. For example, to compute the effect of a 1000 meter elevation on Goldfinch prevalence we replace the actual observed elevation values in our data with 1000, leaving all other predictor values unchanged. Then the mean probability of occurrence at 1000m is computed from the model's predictions on this “synthetic” data set. This averaging procedure is repeated for each desired elevation. By taking the mean we “average out” the variation in the modeled response due to all other predictors in the model, making it easier to uncover additive effects of elevation.

This average predicted value is known as the partial dependence of the model on the focal predictor; it measures the effect of the focal predictor on the modeled response (here probability of occurrence) after accounting for the average effect of all other predictors (Friedman 2001, Hastie, Tibshirani and Friedman 2001). The averaging takes place over a synthetic data set constructed to preserve the observed joint density of the nuisance predictors. In practice, this marginalization can be computationally expensive and is often approximated using Monte Carlo techniques. Because these computations rely only on model predictions, partial dependence functions can be used to interpret any predictive model.
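The averaging procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: `predict` stands for any fitted model that maps a row of predictor values to a probability of occurrence, rows are plain sequences, and the function name `partial_dependence` is hypothetical. It averages over every row in the data; a Monte Carlo approximation would pass a random subsample of rows instead.

```python
def partial_dependence(predict, rows, focal_index, grid):
    """Return [(value, mean prediction)] for each grid value of the focal predictor.

    For each grid value, every row's focal predictor is overwritten with that
    value while the nuisance predictors keep their observed values, and the
    model's predictions on this synthetic data set are averaged.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            synthetic = list(row)           # copy so the observed data are untouched
            synthetic[focal_index] = value  # e.g. set elevation to 1000 for every row
            total += predict(synthetic)
        curve.append((value, total / len(rows)))
    return curve
```

For example, with elevation stored in column 0, `partial_dependence(predict, rows, 0, [0, 500, 1000, 1500])` gives the points of a one dimensional partial dependence plot for elevation. Because the function only calls `predict`, it works with any predictive model.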

Follow these links to see how partial dependence functions (PDFs) can be used to visualize and discover predictor effects:

  • In Visualizing Predictor Effects with Partial Dependence Plots we use one dimensional PDFs to visualize the additive effects of winter season and within-season date on the occurrence of Eastern House Finch, and we use two dimensional PDFs to visualize the interacting effects of winter season and within-season date that describe the irruptive winter migrations of American Goldfinch.
  • In Visualizing Dynamics of Wild Birds we use three dimensional PDFs to visualize the spatio-temporal dynamics of the irruptive winter migration of American Goldfinch.

5 References

  • Breiman, L. (1996). Bagging Predictors. Machine Learning 26: 123-140.
  • Breiman, L. (2001). Random Forests. Machine Learning 45: 5-32.
  • Caruana, R., Niculescu, S., Rao, B., and Simms, C. (2003). Evaluating the C-section rate of different physician practices: Using machine learning to model standard practice. In The American Medical Informatics Conference (AMIA).
  • Caruana, R., Elhawary, M., Munson, A., Riedewald, M., Sorokina, D., Fink, D., Hochachka, W., and Kelling, S. (2006). Mining Citizen Science Data to Predict Occurrence of Wild Bird Species. In Proc. Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, New York. pp. 909-915.
  • De’ath, G. and Fabricius, K. E. (2000). Classification and regression trees: a powerful yet simple technique for ecological data analysis. Ecology 81(11): 3178–3192.
  • Elith, J., C.H. Graham, R.P. Anderson, M. Dudik, S. Ferrier, A. Guisan, R.J. Hijmans, F. Huettmann, J.R. Leathwick, A. Lehmann, J Li, L.G. Lohmann, B.A. Loiselle, G. Manion, G. Moritz, M. Nakamura, Y. Nakazawa, J. McC. Overton, A.T. Peterson, S.J. Phillips, K. Richardson, R. Scachetti-Pereira, R.E. Schapire, J. Soberón, S. Williams, M.S. Wisz, and N.E. Zimmermann. 2006. Novel methods improve prediction of species' distributions from occurrence data. Ecography 29:129-151.
  • Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics 29: 1189-1232.
  • Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, 148-156.
  • Hastie, T., Tibshirani, R., and Friedman, J. H. (2001). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 552 pp.
  • Niculescu-Mizil, A. and Caruana, R. (2005). Predicting good probabilities with supervised learning. In Proc. Int. Conf. on Machine Learning (ICML).