Decision-making optimization

Integrating POMDPs with Deep Reinforcement Learning for planning in high-dimensional environments

Beyond the challenges discussed in the POMDP project description (i.e. optimality properties over long planning horizons and efficient integration of uncertainties in the decision optimization process), another major challenge hindering efficient decision-making for large-scale engineering systems is the curse of dimensionality: the state and action sets grow exponentially with the number of components in the system.

For example, a system with 10 components having 5 possible damage states each (e.g. no, minor, major, severe, near-collapse damage) and 3 available actions each (e.g. do nothing, repair, replace) has nearly 10M possible states (5^10) and about 60K possible actions (3^10) at every decision step from a system-level standpoint. This exponential scaling with the number of components renders the planning problem practically intractable for standard optimization schemes, from gradient-based and mixed-integer programming approaches to genetic algorithms and conventional Markov decision process schemes.
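The counts quoted above can be reproduced with a few lines of arithmetic. The sketch below assumes that every component has the same number of damage states and actions and that the joint sets are simple Cartesian products; the function name is illustrative only.

```python
def joint_space_sizes(n_components, n_states=5, n_actions=3):
    """Return (joint states, joint actions) for identical, independent component sets."""
    return n_states ** n_components, n_actions ** n_components


# Show how quickly the joint sets grow with the number of components.
for n in (2, 5, 10, 20):
    states, actions = joint_space_sizes(n)
    print(f"{n:>2} components: {states:,} joint states, {actions:,} joint actions")
```

For 10 components this gives 9,765,625 joint states and 59,049 joint actions, i.e. the "nearly 10M" and "60K" figures in the text.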

Simplifying techniques to reduce this complexity mainly focus on optimizing decision rules based on heuristic risk- and condition-based thresholds that activate certain decisions; block and/or periodic policies; clustering techniques that enforce uniform policies for components within a cluster; ranking/scoring metrics that prescribe component prioritization; and combinations thereof. Such simplifications, however, explore only a limited subset of the space of all possible policies (i.e. combinations of actions with possible posterior system probability distributions), and thus describe at best locally optimal, and otherwise suboptimal, policies. But how can we solve the original optimization problem?
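To make the notion of a threshold-based decision rule concrete, here is a minimal sketch of a condition-based policy. The state encoding (integers 0 to 4, from no damage to near-collapse) and the two threshold parameters are assumptions made for illustration, not values taken from the project.

```python
def threshold_policy(damage_states, repair_at=2, replace_at=4):
    """Map each component's damage state to an action via two fixed thresholds.

    damage_states: list of ints, 0 (no damage) .. 4 (near-collapse); hypothetical encoding.
    """
    actions = []
    for state in damage_states:
        if state >= replace_at:
            actions.append("replace")
        elif state >= repair_at:
            actions.append("repair")
        else:
            actions.append("do nothing")
    return actions


print(threshold_policy([0, 1, 2, 3, 4]))
```

The whole policy is described by two numbers applied uniformly to all components, which is what makes such rules easy to optimize and also why they cover only a small part of the full policy space.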

Recent advances in Artificial Intelligence (AI) and machine learning provide a powerful answer to this question through Deep Reinforcement Learning (DRL). Integrating POMDPs with DRL allows us to parametrize the involved value and/or policy functions with (deep) neural networks, thus alleviating the curse of dimensionality related to joint probabilistic state representations. The network parameters are trained through reinforcement learning, which allows us to solve the underlying problem through real-time interaction with the belief-MDP environment, a process that gradually updates an initial policy until convergence.
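In a belief-MDP, the quantity the agent interacts with is a probability distribution over hidden states that is revised after every action and observation. A minimal sketch of that revision for a single component is given below, using the standard discrete Bayes filter (predict with the transition matrix, then correct with the observation likelihood). The matrices in the usage example are made-up numbers, not parameters from the project.

```python
def belief_update(belief, transition, observation, obs):
    """One Bayesian belief update for a discrete hidden state.

    belief[i]         : P(state = i) before the step
    transition[i][j]  : P(next state = j | state = i) under the chosen action
    observation[j][o] : P(observation = o | next state = j)
    obs               : index of the observation actually received
    """
    n = len(belief)
    # Prediction step: propagate the belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correction step: weight by the likelihood of the received observation.
    unnormalized = [predicted[j] * observation[j][obs] for j in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the predicted belief")
    return [u / total for u in unnormalized]


# Two-state example (intact, damaged) with hypothetical numbers.
T = [[0.9, 0.1], [0.0, 1.0]]
O = [[0.8, 0.2], [0.3, 0.7]]
print(belief_update([1.0, 0.0], T, O, obs=1))
```

The updated belief is what a value or policy network would take as input at the next decision step.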

The nonlinear approximation of the value function (e.g. expected life-cycle cost) can be achieved either directly, by parametrizing the action-value or the value function (e.g. Deep Q-Networks in the former case), or indirectly, by parametrizing the policy function (policy gradients), or by combinations of the above (actor-critic). The Deep Centralized Multi-agent Actor Critic (DCMAC) algorithm integrates POMDPs with the actor-critic DRL framework. A critic network parametrizes the expected life-cycle cost, and a multi-agent actor outputs conditionally independent policies based on the entire system belief. In this approach, the actor is 'centralized', in the sense that parameters are shared for the hidden layers and global information is accessible to all agents, whereas the output actions are 'decentralized'. Drawing from the concept of point-based POMDPs, but without the need for a joint probabilistic representation, DCMAC operates directly on the space of marginal component beliefs. The basic training concept is shown below.
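As a rough illustration of the 'shared hidden layers, decentralized outputs' structure described above (not of the training procedure, and not the authors' implementation), the sketch below runs a forward pass of a factorized actor in plain Python. The layer sizes, the single hidden layer, the random initialization, and all names are assumptions chosen to keep the example short.

```python
import math
import random


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class FactorizedActor:
    """Shared hidden layer fed by all marginal beliefs, one softmax head per component."""

    def __init__(self, n_components=10, n_states=5, n_actions=3, hidden=32, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(rng, hidden, n_components * n_states)
        self.heads = [init_layer(rng, n_actions, hidden) for _ in range(n_components)]

    def forward(self, marginal_beliefs):
        # Global information: every head sees the beliefs of all components.
        x = [p for belief in marginal_beliefs for p in belief]
        h = [max(0.0, v) for v in linear(x, *self.shared)]  # ReLU hidden layer
        # Decentralized outputs: one action distribution per component.
        return [softmax(linear(h, *head)) for head in self.heads]


def joint_log_prob(action_probs, actions):
    """Log-probability of a joint action under conditionally independent heads."""
    return sum(math.log(p[a]) for p, a in zip(action_probs, actions))


actor = FactorizedActor()
beliefs = [[1.0, 0.0, 0.0, 0.0, 0.0] for _ in range(10)]
probs = actor.forward(beliefs)
print(len(probs), "heads with", len(probs[0]), "action probabilities each")
```

With 10 components, the input has 50 entries (10 marginal beliefs of 5 states) and the output has 30 (10 heads of 3 actions), compared with the roughly 10M joint states and 60K joint actions of the unfactorized formulation.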

Below is an example of a bridge truss structure subject to deterioration. This structural system consists of multiple components that undergo section losses due to operation in a corrosive environment, described by a nonstationary gamma process. The optimization task is to learn a policy that prescribes individualized component-level structural interventions (e.g. repainting of corroding surfaces, major member repairs, or replacements), along with joint system-level inspections.
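For readers unfamiliar with gamma-process deterioration, a minimal simulation sketch follows. It assumes a power-law shape function a(t) = c * t^b, so that increments over each year are gamma-distributed with shape a(t) - a(t-1), which makes the process nonstationary when b differs from 1. The parameter values, units, and function name are hypothetical and are not taken from the bridge truss example.

```python
import random


def simulate_section_loss(years, shape_coeff=0.5, shape_exponent=1.5, scale=1.0, seed=0):
    """Simulate one cumulative section-loss path from a nonstationary gamma process.

    Yearly increments are Gamma(shape = a(t) - a(t-1), scale), with a(t) = shape_coeff * t**shape_exponent.
    """
    rng = random.Random(seed)
    loss = 0.0
    path = [loss]
    for t in range(1, years + 1):
        d_shape = shape_coeff * (t ** shape_exponent - (t - 1) ** shape_exponent)
        loss += rng.gammavariate(d_shape, scale)  # increments are always non-negative
        path.append(loss)
    return path


path = simulate_section_loss(years=50)
print(f"simulated section loss after 50 years: {path[-1]:.2f} (arbitrary units)")
```

Because the increments are non-negative, each simulated path is monotonically non-decreasing, which matches the physical picture of section loss that cannot recover without an intervention.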

This policy realization depicts the expected section losses (as estimated by recurrent Bayesian updates), the actions for each member, and the non-periodic times of inspection visits. An illustrative description of how DCMAC reasons about the environment can be observed in the figure below. This is a 2D embedding of the 350-dimensional last layer of the critic. Each point represents a belief the system may reach during its life-cycle. It can be seen that for every possible belief we can obtain both a complete list of the actions needed at the current timestep and the remaining cost of the policy from this timestep on.

A sample of the DCMAC policy for a k-out-of-n system with 10 components subject to deterioration is also shown in the video below. The difference now is that only maintenance actions are planned, whereas noisy observations are continuously available for each of the components. As in the previous example, each component develops its own distinct policy, while a centralized overarching life-cycle objective is being optimized.
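A k-out-of-n system functions as long as at least k of its n components function. The sketch below computes the probability of that event for components with different (independent) survival probabilities, using a simple dynamic program over the number of working components. The independence assumption and the example probabilities are illustrative and are not taken from the video example.

```python
def k_out_of_n_reliability(p_work, k):
    """Probability that at least k of the components work, assuming independence.

    p_work: list with each component's probability of functioning.
    """
    if not 0 <= k <= len(p_work):
        raise ValueError("k must be between 0 and the number of components")
    # dist[j] = probability that exactly j of the components seen so far work.
    dist = [1.0]
    for p in p_work:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1.0 - p)   # this component has failed
            new[j + 1] += q * p       # this component works
        dist = new
    return sum(dist[k:])


# Ten components with hypothetical survival probabilities, at least 7 required.
print(k_out_of_n_reliability([0.95] * 5 + [0.85] * 5, k=7))
```

This kind of system-level failure criterion is what couples the components: the consequences of neglecting one component depend on the condition of all the others.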

Comparing the life-cycle costs of the DCMAC policies with conventional baseline policies based on condition- and time-based considerations, with or without component prioritization, we can observe in the figure below that DCMAC outperforms all its counterparts by 15-50% (p: precision of observations; T: time; C: condition; B: based; M: maintenance; I: repairs not available and no prioritization; II: repairs available and no prioritization; III: repairs not available and prioritization; IV: repairs available and prioritization).

Another notable feature is that the synergy of POMDPs and multi-agent DRL through DCMAC is shown to provide policies that outperform the optimized baselines even when these perform under better observability conditions (i.e. having better information). For example, DCMAC with 70% observability is better than CBMs and TCBMs performing under 100% observability (i.e. perfect information).

References:

Andriotis, C.P., and Papakonstantinou, K.G., “Deep reinforcement learning driven inspection and maintenance planning under incomplete information and constraints”, Reliability Engineering & System Safety (under review), arXiv preprint arXiv:2007.01380, 2020.

Andriotis, C.P., and Papakonstantinou, K.G., “Managing engineering systems with large state and action spaces through deep reinforcement learning”, Reliability Engineering & System Safety, 191 (11), 106483, 2019.

Andriotis, C.P., and Papakonstantinou, K.G., “Life-cycle policies for large engineering systems under complete and partial observability”, 13th International Conference on Applications of Statistics and Probability in Civil Engineering (ICASP), Seoul, South Korea, June, 2019.


Contact

Faculty of Architecture & the Built Environment

Delft University of Technology

Julianalaan 134, 2628 BL, Delft

email: c.andriotis [at] tudelft [dot] nl
