Available documentation in Romanian here
Authors: Bechea Flavia-Ioana, Radu Marian-Sebastian
Date: January 2026
This project addresses a ranking problem for upsell from the perspective of a restaurant owner. The goal is to increase sales by recommending relevant additional products (specifically sauces) to customers based on their current basket.
The core objective is to construct a hierarchy of candidate products (sauces) ordered by their estimated relevance to the customer and their potential revenue impact. Formally, for each candidate product
We evaluate the system by constructing a partial cart (removing a target sauce) and verifying if the top
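The leave-one-sauce-out protocol described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual evaluation code: the `SAUCES` list (apart from the two sauces named in this README), the `Scorer` type, and the function name `hit_rate_at_k` are all assumptions, and `k` is left as a parameter.

```python
from typing import Callable, Dict, Sequence, Set

# Hypothetical sauce catalogue; only the first two names appear in this README.
SAUCES = ["Crazy Sauce", "Garlic Sauce", "Ketchup"]

# A scorer maps a partial cart to one relevance score per candidate sauce.
Scorer = Callable[[Set[str]], Dict[str, float]]


def hit_rate_at_k(receipts: Sequence[Set[str]], scorer: Scorer, k: int) -> float:
    """Fraction of held-out sauces that appear among the top-k recommendations."""
    hits = 0
    trials = 0
    for receipt in receipts:
        # Every sauce actually bought becomes a held-out target in turn.
        for target in sorted(receipt & set(SAUCES)):
            partial_cart = receipt - {target}  # hide the target sauce
            scores = scorer(partial_cart)
            # Only sauces missing from the partial cart can be recommended.
            candidates = [s for s in SAUCES if s not in partial_cart]
            ranked = sorted(candidates, key=lambda s: scores.get(s, 0.0), reverse=True)
            hits += int(target in ranked[:k])
            trials += 1
    return hits / trials if trials else 0.0
```

Any model that outputs a score per sauce (the baseline, Naive Bayes, or Logistic Regression) can be plugged in as `scorer`, which keeps the comparison between models on equal footing.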
The dataset consists of restaurant receipts from September to December 2025.
- Receipt Grouping: Raw data was grouped by `id_bon` (receipt ID) so that each row represents a single transaction.
- Feature Engineering:
  - Binary Product Vectors: One column per product (e.g., Crazy Schnitzel, Fries), acting as a binary indicator (1 if present, 0 otherwise).
  - Temporal Features: `day_of_week` (1-7) and `hour`, extracted from the timestamp to capture time-based preferences (e.g., weekend vs. weekday patterns).
  - Cart Statistics: Total value of the cart and the number of items.
- Target Variable: A binary variable indicating the presence of a specific sauce (e.g., Crazy Sauce, Garlic Sauce) in the receipt.
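The preprocessing steps above could look roughly like the sketch below. Only `id_bon`, `day_of_week`, and `hour` come from this README; the raw field names (`product`, `price`, `timestamp`), the prices, the output keys `cart_value`, `n_items`, and `target`, and the choice of Monday as day 1 are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Sequence

# Hypothetical raw rows, one per product sold (field names other than id_bon are assumed).
RAW_ROWS = [
    {"id_bon": 1, "product": "Crazy Schnitzel", "price": 30.0, "timestamp": "2025-09-06 13:15:00"},
    {"id_bon": 1, "product": "Fries", "price": 10.0, "timestamp": "2025-09-06 13:15:00"},
    {"id_bon": 1, "product": "Crazy Sauce", "price": 4.0, "timestamp": "2025-09-06 13:15:00"},
    {"id_bon": 2, "product": "Fries", "price": 10.0, "timestamp": "2025-09-08 19:40:00"},
]


def build_receipt_features(
    rows: Sequence[dict], products: Sequence[str], target_sauce: str
) -> List[Dict[str, float]]:
    """Turn per-product rows into one feature record per receipt."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["id_bon"]].append(row)  # receipt grouping

    features = []
    for id_bon, items in sorted(grouped.items()):
        names = {item["product"] for item in items}
        ts = datetime.strptime(items[0]["timestamp"], "%Y-%m-%d %H:%M:%S")
        record = {"id_bon": id_bon}
        for product in products:
            record[product] = int(product in names)  # binary product vector
        record["day_of_week"] = ts.isoweekday()  # 1 = Monday ... 7 = Sunday (assumed convention)
        record["hour"] = ts.hour
        record["cart_value"] = sum(item["price"] for item in items)
        record["n_items"] = len(items)
        record["target"] = int(target_sauce in names)  # binary target for one sauce
        features.append(record)
    return features
```

In this sketch the cart statistics still include the target sauce; for training, one would likely want to compute them on the cart with the target removed, so that the label does not leak into the features.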
We explored several classification algorithms to predict the probability of a sauce being ordered:
- ID3 (Decision Tree)
- Naive Bayes
- Logistic Regression
- AdaBoost
Based on initial experiments, Naive Bayes and Logistic Regression showed the most promise, outperforming the tree-based models (ID3 and AdaBoost) on this sparse, binary dataset. Consequently, we implemented these two algorithms from scratch to deepen our understanding of their mechanics.
- Naive Bayes: Implemented with Laplace smoothing to handle the zero-frequency problem (feature values never observed together with a class during training). It assumes that features are conditionally independent given the class, a strong assumption in theory that nevertheless works surprisingly well for sparse transaction data.
- Logistic Regression: Implemented using Gradient Descent with L2 Regularization to prevent overfitting. It models the probability using the sigmoid function: $P(y=1|x) = \sigma(w^T x + b)$.
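As an illustration of the Logistic Regression item above, a from-scratch version with batch gradient descent and an L2 penalty might look like this. It is a sketch under assumptions rather than our exact implementation: the hyperparameters (`lr`, `l2`, `epochs`), the function names, and the decision to leave the bias `b` unregularized are all choices made for the example.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """P(y=1|x) = sigmoid(w^T x + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic_regression(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,      # learning rate (assumed value)
    l2: float = 0.01,     # L2 penalty strength (assumed value)
    epochs: int = 500,    # number of full-batch updates (assumed value)
) -> Tuple[List[float], float]:
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi  # gradient of log-loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            # Mean log-loss gradient plus the L2 term; the bias is not penalized.
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b
```

On a toy input such as `X = [[1, 0], [1, 1], [0, 1], [0, 0]]` with `y = [1, 1, 0, 0]`, the first weight grows positive because only the first feature separates the classes, while the L2 term keeps it from growing without bound.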
We compared our models against a Popularity Baseline, which simply recommends sauces based on their global frequency (ignoring the specific cart context).
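For reference, a popularity baseline of this kind can be written in a few lines. The function names and the example receipts below are hypothetical; the only idea taken from the text is that sauces are ranked by global frequency and the cart is ignored.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def fit_popularity_baseline(
    receipts: Sequence[Set[str]], sauces: Sequence[str]
) -> Dict[str, float]:
    """Global share of receipts containing each sauce."""
    counts = Counter()
    for receipt in receipts:
        for sauce in sauces:
            if sauce in receipt:
                counts[sauce] += 1
    n = len(receipts)
    return {sauce: (counts[sauce] / n if n else 0.0) for sauce in sauces}


def recommend_by_popularity(frequencies: Dict[str, float], k: int) -> List[str]:
    """Same top-k list for every customer: the cart is deliberately ignored."""
    return sorted(frequencies, key=lambda s: frequencies[s], reverse=True)[:k]


# Hypothetical training receipts.
train = [
    {"Fries", "Crazy Sauce"},
    {"Crazy Schnitzel", "Crazy Sauce", "Garlic Sauce"},
    {"Fries"},
    {"Crazy Sauce"},
]
freq = fit_popularity_baseline(train, ["Crazy Sauce", "Garlic Sauce"])
top1 = recommend_by_popularity(freq, 1)
```

Because the output never changes with the cart, any gain a model shows over this baseline can be attributed to its use of cart context.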
The primary metric is Hit Rate @ K (for
As seen in the chart below, both Logistic Regression and Naive Bayes significantly outperform the baseline, especially at
Figure 1: Hit Rate comparison between Manual Logistic Regression, Manual Naive Bayes, and the Baseline.
- Logistic Regression provides the most stable and accurate ranking. Unlike Naive Bayes, it does not assume that features are independent, and L2 regularization keeps it from overfitting.
- Naive Bayes is a close second, proving robust to noise.
- ID3 and AdaBoost (not shown in the manual comparison above, but analyzed in preliminary tests) tended to overfit or struggle with the class imbalance inherent in the dataset.
Logistic Regression successfully captures the linear relationship between cart items and sauce preferences. The confusion matrix for Crazy Sauce (a popular item) shows a strong ability to correctly identify positive cases (True Positives) while maintaining a reasonable false positive rate.
Figure 2: Confusion Matrix for Crazy Sauce using Logistic Regression.
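A confusion matrix like the one in Figure 2 can be derived from predicted probabilities as sketched below. The 0.5 decision threshold, the function names, and the toy labels and probabilities are assumptions for illustration; they are not the values behind the figure.

```python
from typing import Dict, Sequence, Tuple


def confusion_matrix(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5
) -> Dict[str, int]:
    """Count TP / FP / TN / FN for a binary sauce classifier."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for truth, prob in zip(y_true, y_prob):
        predicted = int(prob >= threshold)  # threshold is an assumed choice
        if predicted and truth:
            counts["TP"] += 1
        elif predicted and not truth:
            counts["FP"] += 1
        elif not predicted and truth:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


def tpr_fpr(cm: Dict[str, int]) -> Tuple[float, float]:
    """True positive rate (recall) and false positive rate."""
    positives = cm["TP"] + cm["FN"]
    negatives = cm["FP"] + cm["TN"]
    tpr = cm["TP"] / positives if positives else 0.0
    fpr = cm["FP"] / negatives if negatives else 0.0
    return tpr, fpr


# Synthetic example: 1 = receipt contains the sauce.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
cm = confusion_matrix(y_true, y_prob)
```

Lowering the threshold trades a higher true positive rate for more false positives, which is the balance the paragraph above refers to.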
Naive Bayes excels in speed and robustness. Despite the "naive" assumption of independence, it correctly identifies patterns in the binary data.
Figure 3: Confusion Matrix for Crazy Sauce using Naive Bayes.
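To make the independence assumption and the Laplace smoothing concrete, here is a minimal Bernoulli Naive Bayes sketch for binary features. It assumes all inputs are 0/1 (temporal and cart-value features would first need to be binned), uses `alpha = 1.0` as the smoothing constant, and also smooths the class prior; these choices and the function names are assumptions, not a description of our exact code.

```python
import math
from typing import Dict, List, Sequence, Tuple

Model = Dict[int, Tuple[float, List[float]]]  # class -> (log prior, P(x_j = 1 | class))


def fit_bernoulli_nb(
    X: Sequence[Sequence[int]], y: Sequence[int], alpha: float = 1.0
) -> Model:
    n, d = len(X), len(X[0])
    model: Model = {}
    for c in (0, 1):
        rows = [xi for xi, yi in zip(X, y) if yi == c]
        n_c = len(rows)
        log_prior = math.log((n_c + alpha) / (n + 2 * alpha))
        # Laplace smoothing: no conditional probability is ever exactly 0 or 1.
        theta = [
            (sum(row[j] for row in rows) + alpha) / (n_c + 2 * alpha)
            for j in range(d)
        ]
        model[c] = (log_prior, theta)
    return model


def predict_proba(model: Model, x: Sequence[int]) -> float:
    """P(y=1|x), treating features as independent given the class."""
    log_joint = {}
    for c, (log_prior, theta) in model.items():
        total = log_prior
        for xj, tj in zip(x, theta):
            total += math.log(tj) if xj else math.log(1.0 - tj)
        log_joint[c] = total
    # Normalize in log space to avoid underflow with many features.
    m = max(log_joint.values())
    e0 = math.exp(log_joint[0] - m)
    e1 = math.exp(log_joint[1] - m)
    return e1 / (e0 + e1)
```

With `X = [[1, 0], [1, 1], [0, 1], [0, 0]]` and `y = [1, 1, 0, 0]`, the first feature never occurs in class 0, yet its smoothed probability is 0.25 rather than 0, so a single unseen combination cannot zero out the whole product.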
Our analysis shows that Logistic Regression is the most effective model for this upsell ranking task. It successfully identifies patterns between cart items and sauce preferences, significantly outperforming the simple popularity baseline.
While Naive Bayes also performed well, tree-based models (ID3, AdaBoost) were less effective due to the dataset's sparsity. Ultimately, leveraging Machine Learning to understand the cart context provides much more accurate recommendations than relying on global popularity alone.