Machine Learning Trading


5 minute read

Notice a tyop typo? Please submit an issue or open a PR.



One problem with Q-learning is that it takes many experience tuples to converge. Gathering experience from the real world is expensive because it requires us to take real action - executing a live trade in our case - to gather information.

To address this problem, researchers developed a technique called Dyna. Dyna works by building models of the transition function, TT, and reward function, RR. Then, after each real interaction with the world, we hallucinate many additional interactions, typically a few hundred, which we use to update the Q-table.

Dyna-Q Big Picture

Dyna-Q is an algorithm developed by Richard Sutton intended to speed up learning, or policy convergence, for Q-learning. Remember that Q-learning is a model-free method, meaning that it does not rely on, or even know, the transition function, TT, and the reward function, RR. Dyna-Q augments traditional Q-learning by incorporating estimations of both TT and RR, based on experience.

Let's quickly recap the Q-learning algorithm we've been using thus far. We first initialize our Q-table, and then we begin iterating. We observe a state, ss, execute an action, aa, after which we observe a new state, ss', and a reward, rr. Using this experience tuple, we update our Q-table. Finally, we repeat the process, continuously gaining experience from the world and improving our Q-table.

When we augment Q-learning with Dyna-Q, we add three new pieces. First, we add logic that enables us to build models of TT and RR. Then, for lack of a better term, we hallucinate a number of experiences. Finally, we update our Q-table according to the experience tuples we generated during the hallucination. We repeat these steps potentially several hundred times for each real experience.

When we talk about updating our model, what we want to do is find new values for TT and RR. Remember that TT is the probability that if we are in state ss and take action aa we will end up in state ss', while RR is our expected reward from taking action aa in state ss.

Before we talk about how we hallucinate an experience, it's essential to understand why we might want to do this in the first place. The issue is that interacting with the real world can be very expensive, in terms of time, money, or some other resource, while hallucinations are cheap. By hallucinating a large number of experiences - say 100 or 200 - for every real experience, we can amortize the cost of a real experience over many Q-table updates.

We can hallucinate an experience in two steps. First, we randomly select a state, ss, and an action, aa. Next, we infer the new state, ss', using TT, and we infer the reward, rr, using RR. Using this synthetic experience tuple, we can update our Q-table.

Learning T

Remember that T[s,a,s]T[s,a,s'] represents the probability that if we are in state, ss, and we take action, aa, we end up in state, ss'. To learn a model of TT, we need to observe and record the frequency with which state transitions occur.

We can introduce a new table, TcT_c, of dimensions, (S,A,S)(S,A,S), where SS is the number of total possible states, and AA is the number of total possible actions. We initialize the cells in TcT_c to a very small number to avoid a potential divide-by-zero situation.

As we iterate through the Q-learning process, we accumulate experience tuples. For each ss, aa, and ss' that we acquire, we increment the count in Tc[s,a,s]T_c[s, a, s']. In this fashion, we can record the frequency of state transitions, which serves as an empirical model of TT.

How to Evaluate T Quiz

Assume we have been interacting with the real world for a while, and we would like to consult our model of TT. Can you write an equation for TT in terms of TcT_c?

How to Evaluate T Quiz Solution

NOTE: The denominator in this equation should reference TcT_c, not TT.

What we want to find here is the probability of a particular resulting state, ss', given a current state, ss and an action, aa. Thus, we need a fraction where the numerator is the number of transitions from ss to ss', by way of aa, and the denominator is the total number of transitions out of ss, by way of aa.

Let's consider the numerator first. The total number of transitions from ss to ss', as a result of aa, is simply the recorded value, Tc[s,a,s]T_c[s,a,s'].

Next, let's consider the denominator. The total number of times we transitioned out of ss by taking aa is the sum Tc[s,a,s1]+Tc[s,a,s2]+...+Tc[s,a,sn]T_c[s,a,s_1] + T_c[s,a,s_2] + ... + T_c[s,a,s_n], where nn is the size of SS, the state space.

Altogether, we have the following equation:

T[s,a,s]=Tc[s,a,s]inTc[s,a,si]T[s,a,s'] = \frac{T_c[s,a,s']}{\sum_{i}^{n}T_c[s,a,s_i]}

Learning R

In addition to learning a model of TT, we also need to learn a model of RR. Remember that R[s,a]R[s,a] gives us the expected reward for taking an action, aa, from a state, ss. Recall also that each experience tuple contains an immediate reward, rr, for taking the corresponding action from the corresponding state.

Given this, we can formulate a simple equation for RR, which we update after each real experience. Notice that this update equation is very similar to the Q-table update equation, incorporating the same learning rate, α\alpha, that we used in that equation.

R[s,a]=(1α)R[s,a]+αrR'[s,a] = (1 - \alpha) * R[s,a] + \alpha * r

Dyna-Q Recap

OMSCS Notes is made with in NYC by Matt Schlenker.

Copyright © 2019-2023. All rights reserved.

privacy policy