Knowledge discovery in databases (KDD) was initially defined as the "non-trivial extraction of implicit, previously unknown, and potentially useful information from data" (Frawley et al., 1991). A revised version of this definition states that "KDD is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al., 1996). According to this definition, data mining (DM) is a step in the KDD process concerned with applying computational techniques (i.e., DM algorithms implemented as computer programs) to actually find patterns in the data. In a sense, DM is the central step in the KDD process. The other steps in the KDD process are concerned with preparing data for DM, as well as evaluating the discovered patterns (the results of DM).
The above definitions contain imprecise notions, such as knowledge and pattern. To make these (slightly) more precise, additional explanations are necessary concerning data, patterns, and knowledge, as well as validity, novelty, usefulness, and understandability. For example, the discovered patterns should be valid on new data with some degree of certainty (typically prescribed by the user). The patterns should potentially lead to some actions that are useful (according to user-defined utility criteria). Patterns can be treated as knowledge: according to Frawley et al., "a pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user's criteria) is called knowledge."
This article focuses on DM and does not deal with the other aspects of the KDD process (such as data preparation). Since DM is concerned with finding patterns in data, the notions of most direct relevance here are the notions of data and patterns. Another key notion is that of a DM algorithm, which is applied to data to find valid patterns in the data. Different DM algorithms address different DM tasks, that is, have different intended uses for the discovered patterns.
Data are sets of facts, for example, cases in a database. Most commonly, the input to a DM algorithm is a single flat table comprising a number of attributes (columns) and records (rows). When data from more than one table in a database need to be taken into account, it is left to the user to join (or otherwise manipulate) the relevant tables to create a single table, which is then used as input to a DM algorithm.
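As a minimal sketch of this preparatory step, the Python fragment below joins two small tables into a single flat table of the kind a DM algorithm would take as input. The tables, their column names (site_id, altitude_m, temperature_c, oxygen_mg_l), and the helper join_to_flat_table are hypothetical and chosen only for illustration; in practice the join would usually be done in the database itself.

```python
# Hypothetical example data: two related tables, as they might sit in a database.
sites = [
    {"site_id": 1, "altitude_m": 320},
    {"site_id": 2, "altitude_m": 250},
]
samples = [
    {"sample_id": 10, "site_id": 1, "temperature_c": 12.5, "oxygen_mg_l": 9.1},
    {"sample_id": 11, "site_id": 1, "temperature_c": 14.0, "oxygen_mg_l": 8.4},
    {"sample_id": 12, "site_id": 2, "temperature_c": 17.5, "oxygen_mg_l": 6.2},
]


def join_to_flat_table(left, right, key):
    """Inner-join two lists of records on `key`, returning one flat table.

    Assumes `key` is unique in `right` (one site per site_id).
    """
    index = {row[key]: row for row in right}
    flat = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            # Merge the columns of both records into a single row.
            flat.append({**row, **match})
    return flat


# One row per sample, with the attributes of its site attached as extra columns.
flat_table = join_to_flat_table(samples, sites, "site_id")
for row in flat_table:
    print(row)
```

Each row of flat_table is one record, and each key is one attribute, which matches the single-table format described above.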
The output of a DM algorithm is typically a pattern or a set of patterns that are valid in the given data. A pattern is defined as a statement (expression) in a given language that describes (relationships among) the facts in a subset of the given data and is (in some sense) simpler than the enumeration of all facts in the subset. Different classes of pattern languages are considered in DM, depending on the DM task at hand. Typical representatives are equations; classification and regression trees; and association, classification, and regression rules. A given DM algorithm will typically have a built-in class of patterns that it considers: the particular language of patterns considered will depend on the given data (the attributes and their values).
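To make the notion of a pattern more concrete, the following Python sketch encodes one classification rule, "IF oxygen < 6.0 THEN quality = poor", and checks how well it holds on a small table. The records, the attribute names, and the threshold are invented for illustration and are not taken from any real dataset. The rule is a single statement, yet it describes a whole subset of the records, which is the sense in which a pattern is simpler than listing the facts it covers.

```python
# Hypothetical flat table: each record has two attributes and a class label.
records = [
    {"oxygen": 9.0, "temperature": 11.0, "quality": "good"},
    {"oxygen": 8.5, "temperature": 12.0, "quality": "good"},
    {"oxygen": 4.0, "temperature": 19.0, "quality": "poor"},
    {"oxygen": 3.5, "temperature": 21.0, "quality": "poor"},
    {"oxygen": 5.0, "temperature": 18.0, "quality": "good"},
]


def rule_condition(record):
    """Condition part of the illustrative rule: IF oxygen < 6.0."""
    return record["oxygen"] < 6.0


# Conclusion part of the rule: THEN quality = "poor".
RULE_CONCLUSION = "poor"

# The subset of the data that the rule talks about.
covered = [r for r in records if rule_condition(r)]
# The covered records for which the rule's conclusion is actually true.
correct = [r for r in covered if r["quality"] == RULE_CONCLUSION]

coverage = len(covered) / len(records)  # fraction of records the rule applies to
accuracy = len(correct) / len(covered)  # fraction of covered records it gets right

print(f"coverage = {coverage:.2f}, accuracy = {accuracy:.2f}")
```

Whether an accuracy of two out of three is "certain enough" is exactly the kind of criterion that is left to the user.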
Many DM algorithms come from the fields of machine learning and statistics. A common view in machine learning is that machine learning algorithms perform a search (typically heuristic) through a space of hypotheses (patterns) that explain (are valid in) the data at hand. Similarly, we can view DM algorithms as searching, exhaustively or heuristically, a space of patterns in order to find interesting patterns that are valid in the given data.
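The search view can be illustrated with a toy example. The Python sketch below exhaustively enumerates a very small pattern space, rules of the form "IF attribute < threshold THEN class", and keeps the most accurate rule that covers at least a minimum number of records. The data, the pattern language, the min_coverage parameter, and all function names are assumptions made for this sketch; it does not correspond to any specific published algorithm, and a heuristic method would replace the full enumeration with, for example, a greedy strategy.

```python
# Hypothetical flat table used only for this illustration.
records = [
    {"oxygen": 9.0, "temperature": 11.0, "quality": "good"},
    {"oxygen": 8.5, "temperature": 12.0, "quality": "good"},
    {"oxygen": 4.0, "temperature": 19.0, "quality": "poor"},
    {"oxygen": 3.5, "temperature": 21.0, "quality": "poor"},
    {"oxygen": 7.0, "temperature": 14.0, "quality": "good"},
]


def candidate_patterns(records, attributes, target):
    """Enumerate the pattern space: every (attribute, threshold, class) triple."""
    classes = sorted({r[target] for r in records})
    for attr in attributes:
        for threshold in sorted({r[attr] for r in records}):
            for label in classes:
                yield (attr, threshold, label)


def evaluate(pattern, records, target):
    """Return (number of covered records, accuracy on the covered records)."""
    attr, threshold, label = pattern
    covered = [r for r in records if r[attr] < threshold]
    if not covered:
        return 0, 0.0
    hits = sum(1 for r in covered if r[target] == label)
    return len(covered), hits / len(covered)


def exhaustive_search(records, attributes, target, min_coverage=2):
    """Scan the whole pattern space and keep the first most-accurate pattern."""
    best, best_score = None, -1.0
    for pattern in candidate_patterns(records, attributes, target):
        n_covered, score = evaluate(pattern, records, target)
        if n_covered < min_coverage:
            continue  # too few records to count as a valid pattern here
        if score > best_score:
            best, best_score = pattern, score
    return best, best_score


best, score = exhaustive_search(records, ["oxygen", "temperature"], "quality")
print(best, score)
```

Even in this tiny setting the number of candidates grows with the number of attributes, distinct values, and classes, which is why richer pattern languages typically call for heuristic rather than exhaustive search.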
In this article, we first look at the prototypical format of data and the main tasks addressed in DM. We next describe the most common types of patterns that are considered by DM algorithms, such as equations, trees, and rules. We also outline some of the main DM algorithms that search for patterns of these types.
Environmental sciences comprise the scientific disciplines, or parts of them, that consider the physical, chemical, and biological aspects of the environment. A typical representative of environmental sciences is ecology, which studies the relationships among members of living communities and between those communities and their abiotic (nonliving) environment.
Such a broad, complex, and interdisciplinary field holds much potential for the application of KDD methods. However, environmental sciences also pose many challenges to existing KDD methods. In this article, we attempt to give an overview of KDD applications in environmental sciences, complemented with a sample of case studies in which the author has been involved. Besides exemplifying the use of DM, these case studies also illustrate important KDD/DM-related issues that arise in environmental applications.