This project implements a simple tabular Q‑learning agent to solve a CartPole-like environment in Rust. The agent discretizes the continuous 4D state space into buckets and learns a Q-table mapping state → action values, using ε‑greedy exploration.
- `BUCKETS`: Array of 4 numbers, each specifying how many discrete bins to use per observation dimension.
- `LOWER_BOUNDS`, `UPPER_BOUNDS`: Continuous min/max values for each dimension (cart position, cart velocity, pole angle, pole angular velocity).
- `ALPHA`, `GAMMA`, `EPSILON`, `EPISODES`: Learning rate, discount factor, exploration rate, and number of training episodes.
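For illustration, the constants might be declared like this. The numeric values below are assumptions; only `EPISODES = 1000` matches the run instructions later in this README.

```rust
// Illustrative values only; tune to taste.
const BUCKETS: [usize; 4] = [3, 3, 6, 6];                 // bins per observation dimension (assumed)
const LOWER_BOUNDS: [f64; 4] = [-2.4, -3.0, -0.21, -3.0]; // x, x_dot, theta, theta_dot (assumed)
const UPPER_BOUNDS: [f64; 4] = [2.4, 3.0, 0.21, 3.0];
const ALPHA: f64 = 0.1;       // learning rate (assumed)
const GAMMA: f64 = 0.99;      // discount factor (assumed)
const EPSILON: f64 = 0.1;     // exploration rate (assumed)
const EPISODES: usize = 1000; // matches the training run described below
```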
The `CartPole` struct (sketched below):
- Stores the current state as a 4-element array `[x, ẋ, θ, θ̇]`.
- `reset()`: Initializes the state with small random noise.
- `step(action)`: Applies a simple force-based physics update to the position and angle; returns `(new_state, reward = 1.0, done_flag)`.
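A minimal sketch of such an environment, assuming the rand 0.8 API; the force magnitude, timestep, pole dynamics, and termination bounds below are illustrative assumptions rather than the project's actual values.

```rust
use rand::Rng;

struct CartPole {
    state: [f64; 4], // [x, x_dot, theta, theta_dot]
}

impl CartPole {
    /// Start an episode with small random noise around the upright position.
    fn reset(&mut self) -> [f64; 4] {
        let mut rng = rand::thread_rng();
        for s in self.state.iter_mut() {
            *s = rng.gen_range(-0.05..0.05);
        }
        self.state
    }

    /// Apply a crude force-based update; returns (new_state, reward, done).
    fn step(&mut self, action: usize) -> ([f64; 4], f64, bool) {
        let force = if action == 1 { 10.0 } else { -10.0 }; // assumed push strength
        let dt = 0.02;                                      // assumed timestep
        let [x, x_dot, theta, theta_dot] = self.state;

        // Simplified dynamics: the force accelerates the cart directly and a
        // gravity term tips the pole; this is not the full CartPole physics.
        let x_acc = force;
        let theta_acc = 9.8 * theta.sin() - 0.05 * force * theta.cos();

        self.state = [
            x + dt * x_dot,
            x_dot + dt * x_acc,
            theta + dt * theta_dot,
            theta_dot + dt * theta_acc,
        ];

        // Terminate when the cart or pole leaves the assumed bounds.
        let done = self.state[0].abs() > 2.4 || self.state[2].abs() > 0.21;
        (self.state, 1.0, done)
    }
}
```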
The `discretize()` function (sketched below):
- Converts each continuous state value to a discrete index using min/max normalization and the bucket count.
- Clamps indices so they stay within `[0 .. BUCKETS[i] - 1]`.
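A sketch of this mapping, reusing the constants assumed above:

```rust
/// Discrete state: one bucket index per observation dimension.
type State = (usize, usize, usize, usize);

/// Map a continuous observation to bucket indices via min/max normalization.
fn discretize(obs: &[f64; 4]) -> State {
    let mut idx = [0usize; 4];
    for i in 0..4 {
        // Normalize into [0, 1], scale by the bucket count, then clamp to a valid index.
        let ratio = (obs[i] - LOWER_BOUNDS[i]) / (UPPER_BOUNDS[i] - LOWER_BOUNDS[i]);
        let bucket = (ratio * BUCKETS[i] as f64).floor() as isize;
        idx[i] = bucket.clamp(0, BUCKETS[i] as isize - 1) as usize;
    }
    (idx[0], idx[1], idx[2], idx[3])
}
```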
The Q-table:
- Stored as a `HashMap<State, [f64; 2]>`, where `State = (usize, usize, usize, usize)`.
- Initialized lazily when new states are encountered (see the snippet below).
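Lazy initialization is conveniently expressed with the `HashMap` entry API; a small sketch, reusing the `State` alias from the discretization sketch:

```rust
use std::collections::HashMap;

type QTable = HashMap<State, [f64; 2]>;

/// Return the Q-values for a state, inserting a zero-initialized entry the
/// first time that state is encountered.
fn q_values(q: &mut QTable, s: State) -> &mut [f64; 2] {
    q.entry(s).or_insert([0.0; 2])
}
```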
The training loop (a Rust sketch follows this list):
- Loop for `EPISODES` episodes; in each episode:
  - Reset the environment.
  - While not done and steps < 500:
    - Discretize the current state.
    - Choose an action:
      - With probability `EPSILON`: a random action (exploration).
      - Otherwise: the action with the higher Q-value (greedy).
    - Execute the action and observe `(next_state, reward, done)`.
    - Discretize the next state and initialize its Q entry if needed.
    - Apply the Q-learning update:
      - `td_target = reward + GAMMA * max(Q[next_state])`
      - `td_error = td_target - Q[current_state][action]`
      - `Q[current_state][action] += ALPHA * td_error`
    - Update the current state and increment the step counter.
  - Every 100 episodes, print the episode number and the steps survived.
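A rough sketch of this training loop, reusing the `CartPole`, `discretize`, `State`, and constant sketches above (the real code's structure may differ):

```rust
use rand::Rng;
use std::collections::HashMap;

fn train() -> HashMap<State, [f64; 2]> {
    let mut rng = rand::thread_rng();
    let mut q: HashMap<State, [f64; 2]> = HashMap::new();
    let mut env = CartPole { state: [0.0; 4] };

    for episode in 0..EPISODES {
        let mut state = discretize(&env.reset());
        let mut steps = 0;
        let mut done = false;

        while !done && steps < 500 {
            // ε-greedy action selection over the two actions.
            let qs = *q.entry(state).or_insert([0.0; 2]);
            let action: usize = if rng.gen::<f64>() < EPSILON {
                rng.gen_range(0..2)
            } else if qs[0] >= qs[1] {
                0
            } else {
                1
            };

            // Step the environment and discretize the next observation.
            let (obs, reward, is_done) = env.step(action);
            let next_state = discretize(&obs);
            let next_max = q
                .entry(next_state)
                .or_insert([0.0; 2])
                .iter()
                .cloned()
                .fold(f64::NEG_INFINITY, f64::max);

            // Q-learning update.
            let entry = q.entry(state).or_insert([0.0; 2]);
            let td_target = reward + GAMMA * next_max;
            let td_error = td_target - entry[action];
            entry[action] += ALPHA * td_error;

            state = next_state;
            steps += 1;
            done = is_done;
        }

        // Periodic progress report.
        if (episode + 1) % 100 == 0 {
            println!("episode {:>4}: survived {} steps", episode + 1, steps);
        }
    }
    q
}
```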
- After training, runs 5 evaluation episodes (see the sketch after this list):
  - Reset the environment.
  - Always pick the greedy action (no exploration).
  - Run until done or 500 steps, and print the steps survived per evaluation.
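A matching evaluation sketch under the same assumptions:

```rust
use std::collections::HashMap;

fn evaluate(q: &HashMap<State, [f64; 2]>) {
    let mut env = CartPole { state: [0.0; 4] };
    for run in 0..5 {
        let mut state = discretize(&env.reset());
        let mut steps = 0;
        loop {
            // Purely greedy: no exploration during evaluation.
            let qs = q.get(&state).copied().unwrap_or([0.0; 2]);
            let action = if qs[0] >= qs[1] { 0 } else { 1 };
            let (obs, _reward, done) = env.step(action);
            state = discretize(&obs);
            steps += 1;
            if done || steps >= 500 {
                break;
            }
        }
        println!("evaluation {}: survived {} steps", run + 1, steps);
    }
}
```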
A Rust implementation of a tabular Q‑learning agent solving a simplified CartPole environment using state discretization.
- Requires Rust and Cargo.
- Add `rand = "0.x"` to your `Cargo.toml`.
`cargo run --release`

This will train the Q-learning agent over 1000 episodes and then evaluate it with 5 test runs, printing the steps survived.
- `CartPole` struct: Simulates the pole physics and returns the next state, reward, and terminal signal.
- `discretize()` function: Maps continuous observations into discrete indices for Q-table lookup.
- Q-table: A `HashMap` where keys are discretized state tuples and values are the Q-values for the two actions.
- Learning loop: Implements the ε-greedy policy, the Q-learning update rule, and state transitions.
- Evaluation loop: Tests the learned policy without exploration.
- Discretization resolution: You can increase the `BUCKETS` sizes for a finer state representation, though that also increases the Q-table size.
- Hyperparameter tuning: Adjust `ALPHA`, `GAMMA`, `EPSILON`, and the number of buckets or episodes to improve learning.
- Alternative policies: Replace ε-greedy with decaying ε or Boltzmann exploration (a small sketch follows this list).
- Function approximation: For a more advanced version, consider replacing the Q-table with a neural network (DQN).
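For example, a decaying-ε schedule only needs a per-episode epsilon; the start value, floor, and decay factor below are arbitrary assumptions:

```rust
/// Exponentially decay epsilon toward a small floor as training progresses.
fn epsilon_for_episode(episode: usize) -> f64 {
    let start = 1.0;        // explore heavily at first (assumed)
    let floor = 0.05;       // keep a little exploration forever (assumed)
    let decay: f64 = 0.995; // per-episode decay factor (assumed)
    (start * decay.powi(episode as i32)).max(floor)
}
```

In the training sketch above, `EPSILON` would then be replaced by `epsilon_for_episode(episode)`.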