I'm working on a reinforcement learning problem where the environment returns a reward pair $(r_{t+1}^{(a)}, r_{t+1}^{(b)})$ at each step. The goal is to maximize the nonlinear objective $$ \mathbb{E}\!\left[\lim_{T \to \infty} \frac{\sum_{k=t}^{t+T-1} r_{k+1}^{(a)}}{\sum_{k=t}^{t+T-1} \left(r_{k+1}^{(a)} + r_{k+1}^{(b)}\right)}\right], $$ i.e., the agent's cumulative reward as a fraction of the total cumulative reward. My intention was to use a Deep Q-Network (DQN) as the primary reinforcement learning model for this environment. However, because this objective is nonlinear in the rewards (it is not an expected discounted sum), I ran into difficulties applying the original DQN algorithm. As an alternative, I am considering framing the problem as a multi-objective reinforcement learning problem, and I would like validation on whether this is an appropriate approach.
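To make the multi-objective framing concrete, here is a minimal sketch of what I have in mind: the two reward streams are exposed to the agent as a 2-dimensional reward vector, and the ratio objective is only evaluated over a whole trajectory. The wrapper class and the assumption that the underlying environment's `step` returns the pair `(r_a, r_b)` are hypothetical and just illustrate the framing, not a working solution to the original scalar-DQN problem.

```python
import numpy as np

class TwoStreamEnvWrapper:
    """Hypothetical wrapper exposing the reward pair (r_a, r_b) as a
    2-dimensional reward vector, the way a multi-objective agent would see it."""

    def __init__(self, env):
        # Assumed: env.step(action) returns (obs, (r_a, r_b), done, info).
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, (r_a, r_b), done, info = self.env.step(action)
        # Vector-valued reward: one component per objective.
        return obs, np.array([r_a, r_b]), done, info


def empirical_ratio_objective(reward_pairs):
    """Monte-Carlo estimate of the ratio objective over one trajectory:
    sum of r_a divided by the sum of (r_a + r_b)."""
    rewards = np.asarray(reward_pairs, dtype=float)  # shape (T, 2)
    total_a = rewards[:, 0].sum()
    total = rewards.sum()
    return total_a / total if total != 0 else 0.0


# Example: evaluate the ratio objective on a fake trajectory of reward pairs.
trajectory = [(1.0, 0.5), (0.2, 0.8), (0.7, 0.3)]
print(empirical_ratio_objective(trajectory))  # 1.9 / 3.5 ≈ 0.543
```

The difficulty, of course, is that the ratio of two cumulative sums cannot be decomposed into a per-step scalar reward, which is why the standard DQN update does not apply directly.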
Comment (Alberto, May 13, 2024): "Agent's cumulative rewards as a fraction of the total cumulative rewards": is it a multi-agent environment, or what do you mean by "total"?

Comment (Neil Slater, May 13, 2024): @Alberto, I think the "total" is over the "rewards" $r^{(a)} + r^{(b)}$. These appear not to be rewards in the usual MDP model sense, but some other observed values that are taking a similar role to MDP reward.