AKN Publications
Data Intensive Science: A New Paradigm for Biodiversity Studies (PDF)
Authors: Steve Kelling, Wesley M. Hochachka, Daniel Fink, Mirek Riedewald, Rich Caruana, Grant Ballard, Giles Hooker
Publication: Bioscience, in press
Abstract: The increasing availability of massive volumes of scientific data requires new synthetic analysis techniques to explore and identify interesting patterns that were otherwise not apparent. For biodiversity studies a “data driven” approach is necessary due to the complexity of ecological systems, particularly when viewed at large spatial and temporal scales. Data intensive science organizes large volumes of data from multiple sources and fields that are then analyzed using techniques tailored to the discovery of complex patterns in high dimensional data through visualizations, simulations, and various types of model building. By interpreting and analyzing these models, truly novel and surprising patterns that are “born from the data” can be discovered. These patterns in turn provide valuable insight for concrete hypotheses about the underlying ecological processes that created the observed data. Data intensive science allows scientists to analyze bigger and more complex systems efficiently, and complements more traditional scientific processes of hypothesis generation and experimental testing to refine our understanding of the natural world. Accepted to Bioscience, March 2009.
eBird: A Citizen-based Bird Observation Network in the Biological Sciences
Authors: Brian L Sullivan; Christopher L Wood; Marshall J Iliff; Rick E Bonney; Daniel Fink; Steven T Kelling
Publication: Biological Conservation, in press
Abstract: New technologies are rapidly changing the way we collect, archive, analyze, and share scientific data. For example, over the next several years it is estimated that more than 1 billion autonomous sensors will be deployed over large spatial and temporal scales, and will gather vast quantities of data. Networks of human observers play a major role in gathering scientific data, and whether in astronomy, meteorology, or observations of nature, they continue to contribute significantly. In this paper we present an innovative use of the Internet and information technologies that better enhances the opportunity for citizens to contribute their observations to science and the conservation of bird populations. eBird is building a web-enabled community of bird watchers who collect, manage, and store their observations in a globally accessible unified database. Through its development as a tool that addresses the needs of the birding community, eBird sustains and grows participation. Birders, scientists, and conservationists are using eBird data worldwide to better understand avian biological patterns and the environmental and anthropogenic factors that influence them. Developing and shaping this network over time, eBird has created a near real-time avian data resource producing millions of observations per year. Accepted to Biological Conservation, May 2009.
Gaussian Semiparametric Analysis Using Hierarchical Predictive Models (PDF)
Authors: Daniel Fink and Wesley Hochachka.
Publication: In: D.L. Thomson et al. (eds.), Modeling Demographic Processes in Marked Populations, Environmental and Ecological Statistics 3, DOI 10.1007/978-0-387-78151-8_46, Copyright Springer Science+Business Media, LLC 2009
Abstract: The Hierarchical Predictive Model (HPM) is a semiparametric mixed model where the fixed effects are fit with a user-specified non-parametric component. This approach extends current spline-based semiparametric mixed model formulations, allowing for more flexible nonparametric estimation. Greater adaptability simplifies model specification making it easier to analyze data sets with large numbers of predictors. Greater automation also extends the scope of exploratory analyses that may be performed with mixed models. Using a HPM, the analyst may select the predictive model to best suit their needs, exploiting the strengths of currently available predictive methods. A simulation study is used to demonstrate the advantages of accounting for known hierarchical structure in predictive models and to illustrate the adaptability of current decision-tree based predictive models. A HPM of the relative abundance of the North American House Finch (Carpodacus mexicanus) is used to demonstrate exploratory analysis with a real data set.
Regional Analysis of Riparian Bird Species Response to Vegetation and Local Habitat Features (PDF)
Authors: Nadav Nur, Grant Ballard, and Geoffrey R. Geupel.
Publication: The Wilson Journal of Ornithology 120(4):840–855, 2008
Abstract: We investigated relationships between riparian bird abundance and local vegetation characteristics and habitat features across the Sacramento/San Joaquin Valley, California. Number of detections was analyzed for each of 21 species from point count surveys over a 4-year period at 22 sites from three regions (Sacramento River, Cosumnes River, and San Joaquin River) in relation to 16 measures of habitat and vegetation composition within 50 m of 184 survey points. Tree variables, including tree height and trunk diameter, were often important, as was specific composition of tree species, especially Fremont cottonwood (Populus fremontii) and valley oak (Quercus lobata). Effects of mugwort (Artemisia douglasiana) and blackberry (Rubus spp.) were generally positive. The median partial R² due to vegetation/habitat characteristics was 16% after controlling for regional differences in abundance per species. Comparisons of model results at the local versus regional scale revealed spatial variation in bird abundance was independent of spatial variation in habitat variables. The effect of a habitat variable differed among the three regions for 11 of 16 variables. Models that used one or more of the first three principal components (extracted from the 16 vegetation and habitat variables) had substantially lower predictive ability than models built using individual variables. The results emphasize the importance of both understory vegetation and tree characteristics at different spatial scales. Local vegetation and habitat characteristics are important in explaining variation in local abundance, but there is a need to develop models specific to each subregion. Received 28 July 2006. Accepted 1 March 2008.
Significance of organism observations: Data discovery and access in biodiversity research. (PDF)
Author: Steve Kelling.
Publication: Report for the Global Biodiversity Information Facility, Copenhagen. 2008.
Data-Mining Discovery of Pattern and Process in Ecological Systems (PDF)
Authors: W.M. Hochachka, R. Caruana, D. Fink, A. Munson, M. Riedewald, D. Sorokina, S. Kelling
Publication: Journal of Wildlife Management 71(7): 2427-2437. 2007.
Abstract: Most ecologists use statistical methods as their main analytical tools when analyzing data to identify relationships between a response and a set of predictors; thus, they treat all analyses as hypothesis tests or exercises in parameter estimation. However, little or no prior knowledge about a system can lead to creation of a statistical model or models that do not accurately describe major sources of variation in the response variable. We suggest that under such circumstances data mining is more appropriate for analysis. In this paper we 1) present the distinctions between data-mining (usually exploratory) analyses and parametric statistical (confirmatory) analyses, 2) illustrate 3 strengths of data-mining tools for generating hypotheses from data, and 3) suggest useful ways in which data mining and statistical analyses can be integrated into a thorough analysis of data to facilitate rapid creation of accurate models and to guide further research.
keywords/keyphrases: bagging, data mining, decision trees, exploratory data analysis, hypothesis generation, machine learning, prediction
Mining Citizen Science Data to Predict Prevalence of Wild Bird Species (PDF)
Authors: R. Caruana, M. Elhawary, A. Munson, M. Riedewald, D. Sorokina, D. Fink, W. Hochachka, S. Kelling.
Publication: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, PA, 2006
Description: Ecologists are interested in identifying which features have the strongest effect on the distribution and abundance of bird species, as well as describing the forms of these relationships. We show how data mining can be successfully applied to environmental data sets, enabling ecologists to discover unanticipated relationships. We compare a variety of methods for measuring attribute importance with respect to the probability of a bird being observed at a feeder and discuss the biological relevance of the results.
keywords/keyphrases: attribute importance, bagging, decision trees, model inspection, partial dependence function, sensitivity analysis