Introduction
Making stochastic decisions is a part of everyday life. For example, suppose the fictional Camford University must always have a consistent energy supply. However, the local power grid cannot supply the necessary uninterruptible power due to transformer burnout, wildfires, or other natural disasters. Camford wants to establish backup sources of wind energy for when the public electrical grid fails. The potential savings are uncertain due to the variability of wind and the uncertainty of generating sufficient power.
Even with a considerable amount of data to help with the decision making, there are still problems to overcome. Much of this data may have probabilities attached to it, be statistical in nature, or be stochastic. Traditional relational databases do not process probabilistic data and lack the additional functionality needed for analyzing statistical data or defining stochastic attributes. To overcome these limitations, considerable research has been done in the fields of probabilistic and statistical databases as well as stochastic simulation, but this work also has limitations.
Probabilistic databases are databases capable of managing large amounts of uncertain data (Biazzo, 2009). Typically, each tuple, or part of a tuple, has a probability associated with it corresponding to its likelihood. Implicitly, this defines a probability distribution over all “possible worlds,” each corresponding to a regular database. However, computing the exact confidence of the resulting tuples for a query is typically an NP-complete problem. As a result, researchers have to use approximation techniques (Koch & Olteanu, 2008) or “systems that do not scale to the same extent as” relational databases (Dalvi, Ré, & Suciu, 2009).
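The possible-worlds semantics can be made concrete with a small sketch in Python. The table of candidate tuples and their probabilities below is invented for illustration; the exponential enumeration over all subsets also shows why exact query confidence is intractable for large tables.

```python
import itertools

# Hypothetical probabilistic table: each tuple carries an independent
# probability of actually being present in the database.
tuples = [
    ("TurbineA", 0.8),  # present with probability 0.8
    ("TurbineB", 0.5),
    ("TurbineC", 0.3),
]

def possible_worlds(rows):
    """Enumerate every subset of tuples (a 'possible world') with its probability."""
    for mask in itertools.product([False, True], repeat=len(rows)):
        prob = 1.0
        world = []
        for (name, p), present in zip(rows, mask):
            if present:
                world.append(name)
                prob *= p
            else:
                prob *= 1.0 - p
        yield world, prob

# Confidence of the query "TurbineA is in the table" = total probability
# of all worlds that contain TurbineA.
confidence = sum(p for world, p in possible_worlds(tuples) if "TurbineA" in world)
print(round(confidence, 4))  # 0.8
```

With n tuples there are 2^n worlds, which is why practical systems fall back on approximation rather than this brute-force enumeration.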
Statistical databases are usually relational databases containing statistical data that have been extended to support advanced statistical analysis techniques. Unfortunately, the quality of the data is not always optimal, and there are security problems that result in either queries that are too restrictive or users potentially obtaining private information they should not have access to (Gehrke, 2002). “The attention given to protecting mining output is fairly limited” (Wang & Liu, 2011), and users are still prohibited from controlling “precisely the formation of query sets” (Denning, 1980).
The shortcoming of probabilistic and statistical databases is that they deal with explicit probability distributions, whereas in many applications probability distributions are complex and implicitly defined, e.g., by stochastic simulation. For instance, the amount of power that can be wind-generated for Camford, as well as the possible savings from using wind-generated power versus more conventional means of backup, can be difficult to define with explicit probability distributions. Stochastic simulation allows a user to create a complex model to simulate an event and serves as an implicitly defined stochastic model. However, it is time consuming to make decisions based on trial and error, even after applying heuristics to speed up the process (Altiparmak, Dengiz, & Bulgak, 2002).
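A minimal Monte Carlo sketch in Python illustrates what "implicitly defined" means here: the distribution of savings is never written down in closed form, it only emerges from repeated runs of the simulation. The wind model, turbine power curve, and cost figures below are all invented for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def simulate_hourly_savings():
    """One simulation trial: draw a wind speed, derive power output and savings."""
    # Weibull distributions are a common model for wind speed (parameters hypothetical).
    wind_speed = random.weibullvariate(8.0, 2.0)          # m/s
    if wind_speed < 3.0:                                   # below cut-in speed: no power
        power_kw = 0.0
    else:                                                  # toy cubic power curve, capped
        power_kw = min(500.0, 2.0 * wind_speed ** 3)
    grid_cost = 0.12                                       # $/kWh, hypothetical
    return power_kw * grid_cost                            # hourly savings in dollars

# The "distribution" of savings exists only as the empirical output of many trials.
trials = [simulate_hourly_savings() for _ in range(10_000)]
mean_savings = sum(trials) / len(trials)
print(f"estimated mean hourly savings: ${mean_savings:.2f}")
```

Each decision evaluated this way costs thousands of simulation runs, which is exactly the trial-and-error expense the text refers to.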
In Farley, Brodsky, and Chen (2011) and Farley, Brodsky, Egge, and McDowall (2010), we proposed the language SimQL to bridge this gap and allow users to make decisions using queries similar to those they are familiar with in standard relational databases (i.e., SQL), yet with stochastic attributes implicitly defined by a simulation. In Farley, Brodsky, and Chen (2012), we introduced an efficient regression-based algorithm for top-k selection, the General Optimal Regression Budget Allocation ScHeme (GORBASH).