A very cool puzzle created by Radu Mariescu-Istodor (Lecturer in Computer Science at Karelia University of Applied Sciences) to decode drawings by analyzing video of a triangle of balls, captured from the top of a pencil.
https://radufromfinland.com/decodeTheDrawings/
Excluded directories (data)
- drawings/ - txt files with drawing data
- drawings_ai/ - txt files with drawing data generated by neural network
- frames/ - screenshots for visual comparison
- models/ - neural network model files
- output/ - generated plots
- output_ai/ - plots generated by neural network
- video_locations/ - metadata for simulated videos used to train neural network
- videos/ - videos from Radu and simulation data
- Install Python 3.12+ if not already installed (earlier versions might also work)
- Install required Python packages:
pip install opencv-python numpy scipy matplotlib moviepy torch tqdm scikit-learn
- Install Node.js (if not already installed)
- Navigate to the JavaScript directory and install dependencies:
cd javascript
npm install
- Place Radu's video files in the videos/ directory
- Videos should be named 1.mp4, 2.mp4, etc., or 1.webm, 2.webm, etc.
- Process video data: Run the main processing script. This extracts ball positions and audio data, and creates a plot and point data of the result:
python process_data.py
- Train neural network (optional): If you want to use the AI approach:
python train_nn.py
- Make predictions: Use the trained model to predict drawings. This creates a plot and point data of the result:
python predict_nn.py
- Run JavaScript simulation (optional): To validate approaches or generate training data:
cd javascript
npm run dev
This will start a Vite development server. Open the provided local URL (usually http://localhost:5173) in your browser to run the 3D simulation.
Modify VIDEO to select the video to process, and modify MODEL (predictions only) to select the video the model was trained on:
VIDEO = "1" # Change to process different videos
- Decoded drawings are saved as text files in drawings/ (traditional approach) or drawings_ai/ (neural network approach)
- Visualization plots are generated in output/ or output_ai/
- Each line in the drawing files contains x,z coordinates of the pen position
- Adjust parameters in global_settings.py for different camera setups or ball configurations
- Modify filtering and processing parameters in the respective processing scripts
- Python (for analysis)
- JavaScript (for simulation)
- Three.js (3D simulation)
- SciPy (optimization and filtering)
- OpenCV (image analysis)
- MoviePy (audio analysis)
- Matplotlib (for visualization and result analysis)
- PyTorch (for the Neural Network)
A JavaScript simulation of the first drawing (a circle) was used to validate the solution on a known problem with no disturbances: the camera is always horizontal and pointing at the centroid.
The simulation was also used to validate the assumption that roll and pitch are less important than yaw. Update: roll cannot be ignored when combined with yaw.
The horizontal offset estimation was also validated using the simulation.
Both the OpenCV and the mathematical approaches failed to estimate the camera intrinsics - shame on me! :-(
I therefore used a ruler and visually estimated the horizontal field of view at 60 degrees, assuming square pixels and a principal point exactly at the center of the screen.
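The estimate above can be turned into pinhole intrinsics with a few lines. This is only a minimal sketch: the function names and the 1280x720 frame size in the usage note are illustrative assumptions, not values taken from the project.

```python
import math

def focal_length_px(image_width_px, hfov_deg=60.0):
    """Focal length in pixels for a pinhole camera with the given horizontal FOV."""
    return (image_width_px / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)

def intrinsics(image_width_px, image_height_px, hfov_deg=60.0):
    """3x3 intrinsic matrix, assuming square pixels and a centered principal point."""
    f = focal_length_px(image_width_px, hfov_deg)
    cx = image_width_px / 2.0   # principal point x (assumed at the screen center)
    cy = image_height_px / 2.0  # principal point y (assumed at the screen center)
    return [[f, 0.0, cx],
            [0.0, f, cy],
            [0.0, 0.0, 1.0]]
```

For a hypothetical 1280x720 frame, `focal_length_px(1280)` gives roughly 1108.5 pixels.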
For image processing, OpenCV was used. Due to the camera projection of 3D balls onto a 2D surface, they appear as ellipses. OpenCV can find contours, fit ellipses, and extract the center and axes of these ellipses.
Under the assumption that the camera starts 18 cm in front of the centroid of the equilateral triangle, the initial minor-axis lengths of the balls (in pixels) are used to calculate the distances from the balls to the camera, using the fact that apparent size is inversely proportional to distance. These 3 distances, combined with the known viewing angle and ball positions, are used to reconstruct the apex of the viewing 'pyramid'.
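The inverse-proportionality step can be sketched as follows. The function names are hypothetical, and the initial distance from the camera to each individual ball is passed in as an input, since it depends on the triangle geometry rather than being exactly the distance to the centroid.

```python
def calibrate(initial_minor_axis_px, initial_distance_cm):
    """Return the constant k = size * distance for one ball, from the first frame."""
    return initial_minor_axis_px * initial_distance_cm

def distance_cm(minor_axis_px, k):
    """Distance to the ball in a later frame: size is inversely proportional to distance."""
    if minor_axis_px <= 0:
        raise ValueError("minor axis must be positive")
    return k / minor_axis_px
```

For example, a ball that measured 40 px at 18 cm and later measures 20 px is estimated to be 36 cm away.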
To correct camera rotation, the angle between the blue and green balls is used. This angle changes slowly due to perspective while moving. Sudden changes are therefore likely errors. These errors are detected by subtracting a low-pass filter from the measured data. The error angle is then reversed by rotating the coordinates of the red, green, and blue balls around the center of the screen before estimating the horizontal offset.
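A minimal sketch of this correction is shown below. It assumes a centered moving average as the low-pass filter (the actual filter may differ), a hypothetical per-frame dict with keys `red`, `green`, and `blue`, and a blue-to-green angle that stays away from the ±π wrap-around.

```python
import math

def moving_average(values, window=5):
    """Centered moving average; the window shrinks at both ends of the sequence."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def line_angle(p_blue, p_green):
    """Angle (radians) of the line from the blue ball to the green ball."""
    return math.atan2(p_green[1] - p_blue[1], p_green[0] - p_blue[0])

def rotate_about(point, center, angle):
    """Rotate a 2D point around a center by the given angle (radians)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * dx - s * dy, center[1] + s * dx + c * dy)

def correct_roll(frames, center, window=5):
    """Undo sudden angle changes by rotating all balls around the screen center."""
    angles = [line_angle(f["blue"], f["green"]) for f in frames]
    smooth = moving_average(angles, window)
    corrected = []
    for frame, measured, filtered in zip(frames, angles, smooth):
        error = measured - filtered  # the fast-changing part is treated as roll error
        corrected.append({name: rotate_about(p, center, -error)
                          for name, p in frame.items()})
    return corrected
```

After the correction, the blue-to-green angle of every frame equals the low-pass filtered angle, so only the slow, perspective-driven change remains.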
The assumption that the camera always points at the centroid is not true due to small aiming errors. Simulation experiments show that rotation (roll), which is very small, and the y-axis (pitch) are not as problematic. The x-axis (yaw) is the most important error to correct. We take advantage of the fact that vertical lines stay vertical under perspective projection (assuming roll is small). We can therefore use the x-positions of the blue, red, and green balls along with the x-position of the screen center, combined with the known triangle size, to calculate the camera's true center and find the horizontal offset. This is done using the cross-ratio of the 4 points, which is projectively invariant. This offset is then used to shift the pencil perpendicular to the viewing direction (this is not fully accurate, but helps reduce the error).
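The cross-ratio step can be sketched as follows. This is a sketch under stated assumptions, not the project's code: the layout (blue at `-half_base`, red at 0, green at `+half_base` along the horizontal axis of the triangle plane) and all function names are hypothetical.

```python
def cross_ratio(a, b, c, d):
    """Cross-ratio (a, b; c, d) of four collinear points given as 1D coordinates."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

def solve_fourth_point(a, b, c, k):
    """Return d such that cross_ratio(a, b, c, d) == k."""
    denom = k * (c - b) - (c - a)
    if abs(denom) < 1e-12:
        raise ValueError("aim point is at infinity")
    return (k * (c - b) * a - (c - a) * b) / denom

def aim_offset(x_blue_px, x_red_px, x_green_px, x_center_px, half_base):
    """Horizontal position on the triangle plane that the screen center points at.

    ASSUMED layout: blue at -half_base, red at 0, green at +half_base
    (same length unit as the returned offset).
    """
    # The cross-ratio measured in the image equals the one in the world.
    k = cross_ratio(x_blue_px, x_red_px, x_green_px, x_center_px)
    return solve_fourth_point(-half_base, 0.0, half_base, k)
```

Because the cross-ratio is invariant under projection, no camera parameters are needed for this step. A result of 0 means the camera is aimed exactly at the red ball's x-position; note that `cross_ratio` divides by zero if the screen center coincides with the blue ball's x-position.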
As mentioned earlier, the camera doesn't always point at the triangle's center. Using the camera intrinsics and estimated distance, the angle toward (or away from) the triangle can be calculated. Simple trigonometry then determines the offset toward or away from the triangle.
To detect when the pen is lifted, the audio track is analyzed. By calculating the dB level of the audio and applying a low-pass filter to avoid reacting to brief sounds, it becomes clear when the pen is touching the paper versus when it's in the air.
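The idea can be sketched in plain Python as below. The frame length, the -40 dB threshold, the exponential smoothing, and the assumption that drawing is louder than hovering are all illustrative choices, not values from the project.

```python
import math

def frame_db(samples, frame_len):
    """RMS level in dB (relative to full scale 1.0) for each non-overlapping frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        levels.append(20.0 * math.log10(max(rms, 1e-10)))  # clamp to avoid log(0)
    return levels

def low_pass(values, alpha=0.2):
    """Exponential moving average; a smaller alpha smooths more."""
    out, state = [], values[0]
    for v in values:
        state = alpha * v + (1.0 - alpha) * state
        out.append(state)
    return out

def pen_down(samples, frame_len=1024, threshold_db=-40.0, alpha=0.2):
    """One boolean per frame: True where the smoothed level exceeds the threshold.

    ASSUMPTION: the pen scratching on paper is louder than the pen in the air.
    """
    return [lvl > threshold_db for lvl in low_pass(frame_db(samples, frame_len), alpha)]
```

The smoothing delays each transition by a few frames, which is the price for ignoring brief clicks.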
Simulation data was used to train a neural network to predict positions in real videos. The results were not spectacular, but still much better than expected. With more training data, results improved, but the biggest improvement came from normalizing the data to the range [-1, 1] around the screen center. Reducing the network from 4 to 3 hidden layers also improved results, indicating there was (and might still be) overfitting to the simulated training data.
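The normalization mentioned above could look like the sketch below. It assumes the network inputs are pixel coordinates (for example ball centers) and that each axis is scaled independently by half the frame size; the function name and both choices are assumptions.

```python
def normalize_point(x_px, y_px, width, height):
    """Map pixel coordinates to [-1, 1] with (0, 0) at the screen center."""
    cx, cy = width / 2.0, height / 2.0
    return ((x_px - cx) / cx, (y_px - cy) / cy)
```

With this mapping the screen center becomes (0, 0) and the frame corners become (±1, ±1), independent of the video resolution.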
Fun fact: I also tried using real video 1 (under the assumption that Radu drew a perfect circle at constant speed) as training data for the network, and used it to predict the star. The result - a morphed star - was far from perfect, but at the same time far better than expected! Perhaps using the (perfect) five-pointed star as training data (tune the corners!) could be an effective approach.
Note: The AI has not learned to detect pen lifting; instead, I used the audio analysis approach as in the non-AI decodings.
- Color models: Compare different color models. RGB works well in simulation, but HSV might be better for real videos
- Filtering: Filtering makes images smoother but not necessarily more accurate. More experiments needed. Kalman filtering/sensor fusion might perform better than simple filtering
- Position estimation: Current decoding assumes the camera is held correctly, with corrections applied afterward. Estimating position without assumptions might be more robust
- Distance estimation: Currently based on the minor axis size of circles. More sophisticated methods exist in these papers (which I unfortunately couldn't get working):