Partially observable Markov decision process


A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent's decision process in which the system dynamics are assumed to be determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on the observations it receives, the observation probabilities, and the underlying MDP.
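
To make the belief maintenance concrete, below is a minimal sketch of the Bayesian belief update, assuming finite state, action, and observation sets encoded as NumPy arrays; the function name, array shapes, and indexing convention are illustrative assumptions, not a standard API.

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Bayesian belief update after taking action `a` and observing `o`.

    b : (|S|,) current belief, a probability distribution over states
    T : (|S|, |A|, |S|) transitions, T[s, a, s'] = P(s' | s, a)
    O : (|S|, |A|, |Omega|) observations, O[s', a, o] = P(o | s', a)
    """
    # Predict: probability of landing in each next state s' under action a.
    predicted = b @ T[:, a, :]              # sum_s b(s) * P(s' | s, a)
    # Correct: weight each s' by the likelihood of the received observation.
    unnormalized = O[:, a, o] * predicted   # P(o | s', a) * predicted(s')
    return unnormalized / unnormalized.sum()
```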

The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the operations research community, and was later adapted by the artificial intelligence and automated planning communities.

An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. This mapping from beliefs to optimal actions is known as the optimal policy of the agent for interacting with its environment.
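
For context, the optimal value function over beliefs is piecewise linear and convex (for finite horizons), so such a policy is often represented by a finite set of so-called alpha-vectors, one per conditional plan. The sketch below shows how a greedy action could be read off such a set; the array layout and names are illustrative assumptions.

```python
import numpy as np

def best_action(belief, alpha_vectors, actions):
    """Greedy action from a value function represented by alpha-vectors.

    alpha_vectors : (k, |S|) array, one alpha-vector per conditional plan
    actions       : (k,) array, the first action of each plan
    """
    values = alpha_vectors @ belief        # value of each plan at this belief
    return actions[int(np.argmax(values))]
```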

A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple $(S, A, T, R, \Omega, O, \gamma)$, where

- $S$ is a set of states,
- $A$ is a set of actions,
- $T$ is a set of conditional transition probabilities $T(s' \mid s, a)$ between states,
- $R : S \times A \to \mathbb{R}$ is the reward function,
- $\Omega$ is a set of observations,
- $O$ is a set of conditional observation probabilities $O(o \mid s', a)$,
- $\gamma \in [0, 1)$ is the discount factor.

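One possible concrete encoding of this 7-tuple, for finite state, action, and observation sets, is a small container of arrays; the field names and shapes below are an illustrative assumption rather than a fixed convention.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """The 7-tuple (S, A, T, R, Omega, O, gamma) for finite sets,
    with states, actions, and observations indexed 0..n-1."""
    n_states: int        # |S|
    n_actions: int       # |A|
    n_observations: int  # |Omega|
    T: np.ndarray        # (|S|, |A|, |S|): T[s, a, s'] = P(s' | s, a)
    R: np.ndarray        # (|S|, |A|): reward for taking action a in state s
    O: np.ndarray        # (|S|, |A|, |Omega|): O[s', a, o] = P(o | s', a)
    gamma: float         # discount factor in [0, 1)
```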
At each time period, the environment is in some state $s \in S$. The agent takes an action $a \in A$, which causes the environment to transition to state $s'$ with probability $T(s' \mid s, a)$. At the same time, the agent receives an observation $o \in \Omega$ which depends on the new state of the environment (and possibly on the action just taken) with probability $O(o \mid s', a)$. Finally, the agent receives a reward equal to $R(s, a)$. Then the process repeats. The goal is for the agent to choose actions at each time step that maximize its expected future discounted reward: $E\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right]$, where $r_{t}$ is the reward earned at time $t$. The discount factor $\gamma$ determines how much immediate rewards are favored over more distant rewards. When $\gamma = 0$, the agent only cares about which action will yield the largest expected immediate reward; as $\gamma$ approaches 1, the agent cares about maximizing the expected sum of future rewards.
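
Putting the pieces together, the following sketch rolls out a belief-based policy against a simulated environment and accumulates the discounted return, reusing the hypothetical POMDP container and update_belief function sketched above; the finite horizon truncates the infinite discounted sum.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(pomdp, policy, b0, s0, horizon=100):
    """Roll out `policy` (a function from belief to action index) and
    return the discounted reward accumulated over `horizon` steps."""
    b, s, total, discount = b0, s0, 0.0, 1.0
    for _ in range(horizon):
        a = policy(b)
        total += discount * pomdp.R[s, a]   # reward for (state, action)
        discount *= pomdp.gamma
        # Hidden state transition, then an observation from the new state.
        s = rng.choice(pomdp.n_states, p=pomdp.T[s, a])
        o = rng.choice(pomdp.n_observations, p=pomdp.O[s, a])
        b = update_belief(b, a, o, pomdp.T, pomdp.O)  # agent's Bayesian update
    return total
```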

