Proper Treatment 正當作法/ cs504/ 2007/ Evolving outline
2008-08-17 19:19

1/17 Agency

Specifying a problem is often a long way towards finding and implementing an algorithm to solve it. The problem that perception (and learning and scientific inquiry) helps solve is action under uncertainty. The Monty Hall problem illustrates this help and the need to make assumptions explicit. Computational modeling represents such decision processes as programs, which we scientists and (in geeky self-reference) the agents we study can then interpret (in particular, run and solve in separate modules).
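To make the assumptions concrete, here is a minimal Monty Hall simulation sketch in Python. It assumes the standard setup (the prize is placed uniformly at random, and the host always opens a goat door other than the contestant's pick); changing those assumptions changes the answer.

    import random

    def monty_hall_trial(switch):
        """One round, assuming the prize is placed uniformly at random and
        the host always opens a goat door other than the contestant's pick."""
        doors = [0, 1, 2]
        prize = random.choice(doors)
        pick = random.choice(doors)
        # The host opens a door that is neither the pick nor the prize.
        opened = random.choice([d for d in doors if d != pick and d != prize])
        if switch:
            pick = next(d for d in doors if d != pick and d != opened)
        return pick == prize

    trials = 100_000
    for switch in (False, True):
        wins = sum(monty_hall_trial(switch) for _ in range(trials))
        print(f"switch={switch}: win rate {wins / trials:.3f}")
    # Staying wins about 1/3 of the time; switching wins about 2/3.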

1/24 Generation

Facing uncertainty requires hypothesizing and confronting all possibilities, if only to be able to enumerate them conceptually, and if only to dismiss the vast majority of them in the face of experience. The definition of such a generative model usually follows a dependency (often causal) structure, such as making a Markov assumption. A generative model yields possible outcomes for an actor, possible worlds for a learner, or possible theories for a scientist, whether scored or unscored by a probability distribution.
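As a sketch of what a generative model looks like as a program, the following Python fragment enumerates every possible world of a toy Markov chain of weather, each scored by a probability. The transition numbers are made up for illustration.

    from itertools import product

    # Toy Markov dependency: tomorrow's weather depends only on today's.
    transition = {'sunny': {'sunny': 0.8, 'rainy': 0.2},
                  'rainy': {'sunny': 0.4, 'rainy': 0.6}}

    def generate(today, days):
        """Enumerate every possible weather sequence with its probability."""
        for seq in product(transition, repeat=days):
            p, prev = 1.0, today
            for state in seq:
                p *= transition[prev][state]
                prev = state
            yield seq, p

    for seq, p in sorted(generate('sunny', 3), key=lambda pair: -pair[1]):
        print(seq, round(p, 3))
    # The scores of all eight possible worlds sum to 1.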

1/31 Combining uncertainty

A joint probability distribution represents multiple sources of information. These sources and their uncertainty can be combined in two ways: by the product rule, which chains conditional distributions into a joint distribution, and by the sum rule, which marginalizes out the variables we do not care about.

Normal distributions exemplify these operations on uncertain information numerically with their easily plotted error bars and ellipses.
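For instance, multiplying two normal densities over the same quantity (say, two independent noisy measurements of one length) yields another normal density, with a precision-weighted mean and shrunken error bars. A sketch, with made-up measurements:

    def fuse_gaussians(mu1, var1, mu2, var2):
        """Product of two normal densities, renormalized: the result is
        normal with precision-weighted mean and strictly smaller variance."""
        precision = 1 / var1 + 1 / var2
        mean = (mu1 / var1 + mu2 / var2) / precision
        return mean, 1 / precision

    # Two measurements of the same length, with different error bars.
    print(fuse_gaussians(10.0, 4.0, 12.0, 1.0))  # (11.6, 0.8)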

2/7 Explanation

It is worth distinguishing the many ways to generate the same outcome, because each possibility may call for a different course of action. A generative model, used “backwards” in terms of its dependency structure, defines what it means to explain some experience or to justify a conclusion. An agent acting rationally in an uncertain world combines its perception, action, and prior knowledge to update its belief and guide its behavior.

Mildly noisy observations may, however, result in wildly inaccurate explanations; we can combine observations to mitigate this error. The same sensory observations are usually consistent with (that is, explained by) multiple possible states of affairs, as illustrated by illusions in vision and ambiguities in language.
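A minimal sketch of such belief updating in Python, with made-up sensor probabilities: each observation reweights the belief by its likelihood, and repeated observations sharpen an explanation that any single observation leaves ambiguous.

    def update_belief(prior, likelihood, observations):
        """Bayes' rule over discrete states, assuming the observations are
        independent given the state."""
        belief = dict(prior)
        for obs in observations:
            belief = {s: p * likelihood[s][obs] for s, p in belief.items()}
            total = sum(belief.values())
            belief = {s: p / total for s, p in belief.items()}
        return belief

    prior = {'fire': 0.1, 'no fire': 0.9}
    likelihood = {'fire': {'hot': 0.8, 'cold': 0.2},
                  'no fire': {'hot': 0.3, 'cold': 0.7}}
    print(update_belief(prior, likelihood, ['hot']))      # still quite uncertain
    print(update_belief(prior, likelihood, ['hot'] * 4))  # four readings settle it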

2/14 Independence

Conditional independence is a simplifying assumption that nature ignores one variable when picking the value for another. For example, a Markov model assumes that, given the state at each time step, the next state is independent of all previous states, and the current observation is independent of all other states and observations. This assumption gives rise to shortest-path algorithms that reuse computation to efficiently infer probabilities (such as of the sequence of most likely states) and explanations (such as the most likely sequence of states). It also reduces the amount of training data we need to set the parameters of a model (such as transition probabilities) so as to generalize and not just memorize.
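The Viterbi algorithm is the shortest-path instance of this idea: thanks to the Markov assumption, the best path to each state at each step can be extended without revisiting earlier steps. A sketch with made-up parameters:

    def viterbi(states, start, trans, emit, observations):
        """Most likely state sequence of a hidden Markov model, by reusing
        the best partial paths that the Markov assumption makes sufficient."""
        best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
        for obs in observations[1:]:
            best = {s: max(((p * trans[prev][s] * emit[s][obs], path + [s])
                            for prev, (p, path) in best.items()),
                           key=lambda pair: pair[0])
                    for s in states}
        return max(best.values(), key=lambda pair: pair[0])

    states = ('rainy', 'sunny')
    start = {'rainy': 0.6, 'sunny': 0.4}
    trans = {'rainy': {'rainy': 0.7, 'sunny': 0.3},
             'sunny': {'rainy': 0.4, 'sunny': 0.6}}
    emit = {'rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
            'sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
    print(viterbi(states, start, trans, emit, ['walk', 'shop', 'clean']))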

2/21 Reusing computation

The algorithms for hidden Markov models exemplify a general pattern. Even when our model generates a conceptually exponential number of possibilities, we can take advantage of the dependency structure of the model—how its parts ignore or forget each other—to decompose the problem into parts whose solutions are independent of each other. Such a decomposition then lets us reuse computation and perform efficient inference. This strategy is variously called dynamic programming, memoization, and tabling.
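Counting the paths through a grid is a minimal illustration: the naive recursion explores exponentially many routes, but the subproblems overlap, so tabling each answer makes the computation quadratic.

    from functools import lru_cache

    @lru_cache(maxsize=None)  # tabling: each subproblem is solved once
    def paths(rows, cols):
        """Number of monotone corner-to-corner paths through a grid."""
        if rows == 0 or cols == 0:
            return 1
        return paths(rows - 1, cols) + paths(rows, cols - 1)

    print(paths(20, 20))  # 137846528820, from only 21 * 21 table entries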

Another example of this pattern is parsing with (probabilistic) context-free grammars. Context-freeness (unfortunately not present in Swiss German) lets us reuse partial parses to build complete ones, not to mention represent ambiguity compactly.
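A CKY-style recognizer sketch, for a toy grammar in Chomsky normal form (the grammar and sentence are made up): because the grammar is context-free, a parse of words i through j can be reused inside every larger parse, so the chart needs only one cell per span.

    from itertools import product

    grammar = {('NP', 'VP'): 'S', ('Det', 'N'): 'NP', ('V', 'NP'): 'VP'}
    lexicon = {'the': 'Det', 'dog': 'N', 'cat': 'N', 'saw': 'V'}

    def cky(words):
        n = len(words)
        chart = {(i, i + 1): {lexicon[w]} for i, w in enumerate(words)}
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                chart[(i, j)] = {grammar[(a, b)]
                                 for k in range(i + 1, j)
                                 for a, b in product(chart[(i, k)], chart[(k, j)])
                                 if (a, b) in grammar}
        return 'S' in chart[(0, n)]

    print(cky('the dog saw the cat'.split()))  # True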

2/28 Constraints

Another way to reduce computational complexity in the face of a combinatorial explosion of possibilities is to propagate constraints, again following dependency structure. Two examples of this strategy are how humans solve Sudoku puzzles and how computers interpret line drawings. The latter task is already illustrative in the simple case of trihedral polyhedra. Propagating equality constraints (such as of grammatical agreement features), or unification, is especially efficient.

In the worst case, we still need to resort to backtracking search. When the search space is large, approximation (such as pruning) and heuristics become essential.
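A sketch of both strategies on Sudoku, in Python: propagation repeatedly fills cells whose candidates narrow to a single value, and backtracking search takes over (guessing the most constrained cell first, a heuristic) when propagation stalls. The puzzle is a well-known example.

    def peers(i, j):
        """Cells sharing a row, column, or 3x3 box with cell (i, j)."""
        row = {(i, c) for c in range(9)}
        col = {(r, j) for r in range(9)}
        box = {(r, c) for r in range(i // 3 * 3, i // 3 * 3 + 3)
                      for c in range(j // 3 * 3, j // 3 * 3 + 3)}
        return (row | col | box) - {(i, j)}

    def candidates(grid, i, j):
        used = {grid[r][c] for r, c in peers(i, j)}
        return [v for v in '123456789' if v not in used]

    def propagate(grid):
        """Fill forced cells until nothing changes; detect dead ends."""
        changed = True
        while changed:
            changed = False
            for i in range(9):
                for j in range(9):
                    if grid[i][j] == '.':
                        cs = candidates(grid, i, j)
                        if not cs:
                            return False      # contradiction: no value fits
                        if len(cs) == 1:      # forced: propagate the constraint
                            grid[i][j] = cs[0]
                            changed = True
        return True

    def search(grid):
        """Backtracking search, guessing the most constrained cell first."""
        if not propagate(grid):
            return False
        empty = [(i, j) for i in range(9) for j in range(9) if grid[i][j] == '.']
        if not empty:
            return True
        i, j = min(empty, key=lambda ij: len(candidates(grid, *ij)))
        for v in candidates(grid, i, j):
            trial = [row[:] for row in grid]
            trial[i][j] = v
            if search(trial):
                grid[:] = trial
                return True
        return False

    puzzle = [list(row) for row in (
        "53..7....", "6..195...", ".98....6.",
        "8...6...3", "4..8.3..1", "7...2...6",
        ".6....28.", "...419..5", "....8..79")]
    if search(puzzle):
        print('\n'.join(''.join(row) for row in puzzle))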

3/7 Learning

Not only can objective truth be thought of as a probability (the likelihood), but so can subjective beauty, as a prior distribution (aka innate universal): Bayesian probability treats ignorance as uncertainty, learning as inference, and belief as a distribution over hypotheses, where rational action maximizes expected utility. Bayes' rule lets us update our prior belief with new experience to give a posterior belief. This update is especially easy with conjugate priors such as the Dirichlet and the Gaussian.
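Conjugacy makes the update a matter of counting. A sketch, assuming a Dirichlet prior over the faces of a die: the prior contributes imaginary pseudo-counts, the observations contribute real counts, and the posterior is again a Dirichlet.

    from collections import Counter

    def dirichlet_update(pseudo_counts, observations):
        """Posterior Dirichlet parameters: pseudo-counts plus real counts."""
        posterior = Counter(pseudo_counts)
        posterior.update(observations)
        return dict(posterior)

    prior = {face: 1 for face in range(1, 7)}  # one imaginary roll per face
    rolls = [3, 3, 6, 1, 3, 3]
    post = dirichlet_update(prior, rolls)
    total = sum(post.values())
    print({face: round(n / total, 2) for face, n in post.items()})  # posterior mean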


Planning and theorem proving?

Sampling techniques for hypothesis testing; basics of the bootstrap?

2/21 Likelihood

Abstracting away from actions and utilities, we can call a model good if it is true and beautiful. In terms of truth, it is popular to measure how well a model matches our past observations by its likelihood, which is the probability it assigns to the observations. This probability depends on the observations only through their sufficient statistics. The likelihood also connects to entropy: the negative log-probability of the observations measures the amount of information they carry under the model. A quick way to approximate a likelihood is by rejection or importance sampling.
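A sketch of rejection sampling for a likelihood, on a sprinkler-and-wet-grass model with made-up probabilities: run the generative model forward many times and count how often it reproduces the observation.

    import random

    def sample_world():
        """One forward run of the (made-up) sprinkler model."""
        rain = random.random() < 0.2
        sprinkler = random.random() < (0.01 if rain else 0.4)
        wet = random.random() < (0.95 if rain or sprinkler else 0.05)
        return rain, sprinkler, wet

    trials = 100_000
    wet_runs = sum(1 for _ in range(trials) if sample_world()[2])
    print(f"P(grass wet) is approximately {wet_runs / trials:.3f}")  # about 0.52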

Examples: Sprinkler and wet grass; successive sightings of a moving object; points from a mixture of Gaussians; a hidden Markov model.

Recitation: Continuous probability theory? Markov chain Monte Carlo?

Play: Trade off bias and variance by adjusting the model parameters in the examples.

2/28 Inference

Probabilistic inference includes computing the likelihood, the most probable explanation, and the rational course of action. Local constraints exert global influences in these computations. Exact inference is computationally intractable in the worst case, but it often helps to take advantage again of dependency structure and to approximate.
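A sketch of exact inference by enumeration, on the same made-up sprinkler model as above, that exhibits explaining away: learning that the sprinkler was on lowers the probability that rain explains the wet grass.

    from itertools import product

    def joint(rain, sprinkler, wet):
        """Joint probability of one world of the (made-up) sprinkler model."""
        p = 0.2 if rain else 0.8
        p *= (0.01 if rain else 0.4) if sprinkler else (0.99 if rain else 0.6)
        p *= (0.95 if rain or sprinkler else 0.05) if wet \
             else (0.05 if rain or sprinkler else 0.95)
        return p

    def prob(query, evidence):
        """P(query | evidence), by summing over all eight worlds."""
        worlds = [w for w in product([True, False], repeat=3) if evidence(*w)]
        return sum(joint(*w) for w in worlds if query(*w)) \
             / sum(joint(*w) for w in worlds)

    print(prob(lambda r, s, w: r, lambda r, s, w: w))        # P(rain | wet), about 0.37
    print(prob(lambda r, s, w: r, lambda r, s, w: w and s))  # about 0.006: explained away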

Examples: Sprinkler and wet grass (explaining away); a hidden Markov model; probabilistic parsing.

Play: Try different observations and inference tasks in the examples.

3/21 Generalization

The maximum-likelihood (or more generally maximum-a-posteriori) hypothesis tends to overfit the observed data and generalize poorly. When there are too many hypotheses and not enough data, the curse of dimensionality prevents us from choosing a good single hypothesis. To make a better bias-variance tradeoff, we can average the hypotheses (as a Bayesian would), smooth the data, or prune the model (by (cross) validation, minimum description length, or other criteria).

The uniform prior may look neutral like Helvetica, but it hides its assumptions in its parameterization. Another prior is to invoke Occam’s razor and penalize entropy (complexity or uncertainty) in the hypothesis. This penalty turns out to be equivalent to negative experience.
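Add-one (Laplace) smoothing makes the prior-as-experience reading concrete: the prior is imaginary experience, one pretend observation of each outcome, so outcomes never seen in the data still get a small probability. A sketch with a die:

    from collections import Counter

    def smoothed_probs(observations, outcomes, pseudo=1):
        """Relative frequencies after adding `pseudo` imaginary observations
        of every outcome (pseudo=0 recovers maximum likelihood)."""
        counts = Counter(observations)
        total = len(observations) + pseudo * len(outcomes)
        return {o: (counts[o] + pseudo) / total for o in outcomes}

    rolls = [3, 3, 6, 1, 3, 3]
    print(smoothed_probs(rolls, range(1, 7)))
    # Maximum likelihood would give faces 2, 4, and 5 probability zero.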

Examples: Dice; Gaussian; n-gram language models.

Recitation: Decision trees and ID3?

3/28 Optimization

Sometimes we can compute the best model using calculus alone. Other times we can only guess a model and then improve it step by step. This iterative process is called gradient descent because it is like trying to hike to the bottom of some terrain: one usually follows the steepest downhill, but bumpy terrain calls for jumping, inertia, and earthquakes (random restarts, momentum, and simulated annealing). The expectation-maximization algorithm iteratively optimizes a model alongside an explanation.
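A minimal gradient-descent sketch on a one-dimensional bumpy terrain (the function and step size are made up): each step follows the steepest downhill, and different starting points roll into different valleys, which is what motivates the restarts and the rest.

    def descend(grad, x, rate=0.01, steps=1000):
        """Follow the steepest downhill in fixed-size steps."""
        for _ in range(steps):
            x -= rate * grad(x)
        return x

    # Terrain f(x) = x**4 - 3*x**2 + x has two valleys; its slope is:
    grad = lambda x: 4 * x**3 - 6 * x + 1
    print(descend(grad, x=2.0))   # about 1.12: the shallower valley
    print(descend(grad, x=-2.0))  # about -1.30: the deeper valley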

Examples: Gaussians and their mixtures; a hidden Markov model of visual input over time; Electric Sheep.

Play: Watch and tweak gradient descent.

4/4 Discrimination

Perceptrons and other neural networks are not generative but optimized to discriminate, for example to minimize mean squared error or maximize the margin. Naive Bayes models are also used to discriminate, but they can be understood generatively, as assuming that features are conditionally independent given the class.
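A sketch of the classic mistake-driven perceptron rule (the toy data and learning rate are made up): there is no generative story, just weights nudged whenever the current line misclassifies a point.

    def train_perceptron(data, epochs=20, rate=1.0):
        """Nudge the weights toward each misclassified point's label."""
        w, b = [0.0, 0.0], 0.0
        for _ in range(epochs):
            for x, label in data:  # label is +1 or -1
                if label * (w[0] * x[0] + w[1] * x[1] + b) <= 0:
                    w = [wi + rate * label * xi for wi, xi in zip(w, x)]
                    b += rate * label
        return w, b

    # Linearly separable toy data: only the (1, 1) corner is positive.
    data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
    w, b = train_perceptron(data)
    print(w, b)  # a separating line: sign(w . x + b) matches every label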

4/11 Final proposals

4/18 Time

Discrete time steps versus event queues. Continuous and particle filtering. Markov decision processes and reinforcement learning.

4/25 Final projects