Hidden Gem Finder - Product Requirements Document
Build a machine learning system that identifies undervalued football players in lower European leagues who have high potential to succeed at top-5 leagues. The system will provide scouts and analysts with data-driven recommendations, complete with explanations and similar player comparisons.
- Top football clubs spend millions on scouting networks
- Most "hidden gems" are discovered too late or missed entirely
- Human scouts can only watch a fraction of available matches
- Lower league data is underutilized despite being publicly available
- Historical data shows patterns: players who break out share common statistical profiles
- ML can process thousands of players simultaneously
- Objective metrics complement subjective scouting reports
- Early identification = lower transfer fees, competitive advantage
| Metric | Target | Rationale |
|---|---|---|
| ROC-AUC | > 0.75 | Strong discriminative ability |
| Precision @ 20 | > 0.30 | 6+ actual breakouts in top 20 predictions |
| Recall @ 100 | > 0.50 | Catch half of all breakouts in top 100 |
| Brier Score | < 0.15 | Well-calibrated probability estimates |
| Explanation Coverage | 100% | Every prediction has SHAP explanation |
- System would have flagged known success stories (Haaland at Salzburg, Luis Diaz at Porto)
- Explanations make sense to domain experts
- Actionable output format for scouts
| Area | Details |
|---|---|
| Player Types | Outfield players only (FW, MF, DF) |
| Source Leagues | Eredivisie, Portuguese Liga, Belgian Pro League, Championship, Serie B, Ligue 2, Austrian Bundesliga, Scottish Premiership |
| Target Leagues | Premier League, La Liga, Bundesliga, Serie A, Ligue 1 |
| Training Period | 2015-2022 seasons |
| Validation Period | 2022-2023 season |
| Prediction Period | Current season |
| Age Range | 17-26 years old |
- Goalkeepers (different metrics entirely)
- Youth/academy players without senior minutes
- MLS, South American, Asian leagues (future extension)
- Real-time match analysis
- Video/computer vision features
| Source | Data Type | Priority | Notes |
|---|---|---|---|
| FBref | Advanced stats (per 90) | P0 | Primary source, most comprehensive |
| Transfermarkt | Market values, transfers, contracts | P0 | Required for target variable |
| Understat | xG, xA, shot maps | P1 | Attacking metrics depth |
| Sofascore | Match ratings, form | P2 | Supplementary signal |
| Capology | Wage data | P3 | Future enhancement |
player_id: str (primary key, generated UUID)
name: str
date_of_birth: date
nationality: str
primary_position: str (ST, LW, CM, CB, etc.)
secondary_positions: list[str]
height_cm: int (optional)
foot: str (optional)
player_id: str (foreign key)
season: str (e.g., "2023-24")
team: str
league: str
minutes_played: int
games_started: int
games_as_sub: int
# Attacking
goals: int
assists: int
xg: float
xa: float
npxg: float (non-penalty xG)
shots_per90: float
shots_on_target_pct: float
# Passing
passes_per90: float
pass_completion_pct: float
progressive_passes_per90: float
key_passes_per90: float
through_balls_per90: float
# Possession
progressive_carries_per90: float
successful_dribbles_per90: float
touches_in_box_per90: float
# Defensive
tackles_per90: float
interceptions_per90: float
blocks_per90: float
clearances_per90: float
aerials_won_pct: float
# Valuation
market_value_eur: int
wage_eur: int (if available)
contract_expires: date (if available)
transfer_id: str (primary key)
player_id: str (foreign key)
transfer_date: date
season: str
from_team: str
from_league: str
to_team: str
to_league: str
transfer_fee_eur: int (0 for free, -1 for loan)
is_loan: bool
A player is labeled as "breakout" (y=1) if within 3 seasons they:
- Transferred to a top-5 league, AND
- Played 900+ minutes there (not a failed move)
def create_label(player_id, observation_season, transfers_df, minutes_df):
TOP_5 = ["Premier League", "La Liga", "Bundesliga", "Serie A", "Ligue 1"]
MIN_MINUTES = 900
LOOKFORWARD_YEARS = 3
# Find transfers to top-5 within window
future_transfers = transfers_df[
(transfers_df.player_id == player_id) &
(transfers_df.season > observation_season) &
(transfers_df.season <= observation_season + LOOKFORWARD_YEARS) &
(transfers_df.to_league.isin(TOP_5))
]
if future_transfers.empty:
return 0
# Check meaningful playing time
top5_minutes = minutes_df[
(minutes_df.player_id == player_id) &
(minutes_df.league.isin(TOP_5)) &
(minutes_df.season > observation_season)
].minutes_played.sum()
return 1 if top5_minutes >= MIN_MINUTES else 0- Age at observation time
- Minutes played (current season)
- All per-90 statistics from schema
- Current market value
- Contract years remaining
| Feature | Formula | Rationale |
|---|---|---|
| xG Overperformance | goals - xG | Clinical finishing ability |
| xA Overperformance | assists - xA | Creative quality beyond chance creation |
| League Difficulty Coefficient | UEFA/Elo based multiplier | Normalize across leagues |
| Age-Position Percentile | Rank within age band + position | Context for raw numbers |
| YoY Growth | (stat_current - stat_prev) / stat_prev | Trajectory matters |
| Minutes Growth | minutes_current / minutes_prev | Rising importance |
| Consistency Score | 1 - std(match_ratings) / mean(match_ratings) | Scouts value consistency |
| Involvement Score | (touches + passes) / minutes * 90 | Central to team play |
| Goal Contribution per90 | (goals + assists) / 90 | Combined attacking output |
Based on historical transfer success rates:
LEAGUE_COEFFICIENTS = {
"Eredivisie": 0.85,
"Portuguese Liga": 0.82,
"Championship": 0.78,
"Belgian Pro League": 0.75,
"Serie B": 0.72,
"Ligue 2": 0.70,
"Austrian Bundesliga": 0.68,
"Scottish Premiership": 0.65,
}| Position Group | Primary Features |
|---|---|
| Forwards (ST, LW, RW, CF) | xG, npxG, shots, dribbles, touches in box |
| Midfielders (CM, CAM, CDM) | Progressive passes, key passes, xA, pass completion |
| Defenders (CB, LB, RB) | Tackles, interceptions, aerials, clearances |
┌─────────────────────────────────────────────────────────────────┐
│ DATA LAYER │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ FBref │ │Transfermarkt│ │ Understat │ │
│ │ Scraper │ │ Scraper │ │ Scraper │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Raw Data Lake │ │
│ │ (Parquet) │ │
│ └────────┬────────┘ │
└──────────────────────────┼──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ PROCESSING LAYER │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Data Cleaning │───▶│Feature Engineer │ │
│ │ & Merging │ │ │ │
│ └─────────────────┘ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Training Dataset │ │
│ │ (features + labels) │ │
│ └────────────┬────────────┘ │
└─────────────────────────────────┼───────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ MODEL LAYER │
├─────────────────────────────────────────────────────────────────┤
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ LightGBM │ │ XGBoost │ │ Logistic │ │
│ │ Classifier │ │ Classifier │ │ (baseline) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Ensemble Model │ │
│ │ (weighted avg) │ │
│ └────────┬────────┘ │
└──────────────────────────┼──────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────────┐
│ OUTPUT LAYER │
├─────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ SHAP Explainer │ │ Player Similarity│ │ Streamlit App │ │
│ │ │ │ Engine (cosine) │ │ + PDF Reports │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
hidden-gem-finder/
├── CLAUDE.md # AI assistant instructions
├── PRD.md # This document
├── README.md # Project overview
├── requirements.txt # Python dependencies
├── pyproject.toml # Project config
│
├── config/
│ ├── leagues.yaml # League metadata, coefficients
│ ├── features.yaml # Feature definitions
│ └── model_params.yaml # Hyperparameters
│
├── data/
│ ├── raw/ # Untouched scraped data
│ │ ├── fbref/
│ │ ├── transfermarkt/
│ │ └── understat/
│ ├── processed/ # Cleaned, merged data
│ │ ├── players.parquet
│ │ ├── player_seasons.parquet
│ │ └── transfers.parquet
│ └── outputs/ # Predictions, reports
│
├── notebooks/
│ ├── 01_data_exploration.ipynb
│ ├── 02_feature_engineering.ipynb
│ ├── 03_model_training.ipynb
│ ├── 04_evaluation.ipynb
│ └── 05_predictions.ipynb
│
├── src/
│ ├── __init__.py
│ ├── scrapers/
│ │ ├── __init__.py
│ │ ├── base_scraper.py
│ │ ├── fbref_scraper.py
│ │ ├── transfermarkt_scraper.py
│ │ └── understat_scraper.py
│ ├── data/
│ │ ├── __init__.py
│ │ ├── cleaning.py
│ │ ├── merging.py
│ │ └── labeling.py
│ ├── features/
│ │ ├── __init__.py
│ │ ├── engineering.py
│ │ ├── selection.py
│ │ └── league_adjustment.py
│ ├── models/
│ │ ├── __init__.py
│ │ ├── trainer.py
│ │ ├── evaluator.py
│ │ └── predictor.py
│ ├── explainability/
│ │ ├── __init__.py
│ │ ├── shap_analysis.py
│ │ └── similarity.py
│ └── reporting/
│ ├── __init__.py
│ ├── dashboard.py
│ └── pdf_report.py
│
├── tests/
│ ├── __init__.py
│ ├── test_scrapers.py
│ ├── test_features.py
│ ├── test_models.py
│ └── fixtures/ # Mock data for tests
│
├── notes/ # Session notes, decisions
│ └── .gitkeep
│
└── outputs/
├── models/ # Saved model artifacts
├── reports/ # Generated PDFs
└── predictions/ # Prediction CSVs
| Component | Technology | Rationale |
|---|---|---|
| Language | Python 3.11+ | ML ecosystem |
| Data Processing | Pandas, Polars | Efficient dataframes |
| ML Framework | scikit-learn, LightGBM, XGBoost | Industry standard |
| Hyperparameter Tuning | Optuna | Efficient search |
| Explainability | SHAP | Global + local explanations |
| Dashboard | Streamlit | Rapid prototyping |
| Data Storage | Parquet | Efficient, typed storage |
| Testing | pytest | Standard testing |
| Linting | ruff | Fast, comprehensive |
Goals:
- Project infrastructure
- Working scrapers for all sources
- Raw data for 2015-2024
Deliverables:
- Git repo with structure
- Virtual environment + requirements.txt
- FBref scraper with tests
- Transfermarkt scraper with tests
- Understat scraper with tests
- Raw data collected and stored
- Data exploration notebook
Success Criteria:
- All scrapers pass tests with mock responses
- Can scrape a full season in <30 minutes per source
- Raw data validates against expected schemas
Goals:
- Merge datasets reliably
- Construct target variable
- Build feature pipeline
Deliverables:
- Player matching/merging logic
- Target variable construction
- Feature engineering pipeline
- League adjustment implementation
- Feature selection analysis
- Final training dataset
Success Criteria:
- Player matching achieves >95% precision
- No temporal leakage in features
- Feature importance analysis completed
- Dataset ready for modeling
Goals:
- Train performant models
- Robust evaluation
- Ensemble if beneficial
Deliverables:
- Train/val/test splits (walk-forward)
- Baseline logistic regression
- Tuned LightGBM model
- Tuned XGBoost model
- Ensemble implementation
- Comprehensive evaluation notebook
Success Criteria:
- ROC-AUC > 0.75 on test set
- Precision @ 20 > 0.30
- Models validated on 2022-2023 holdout
- Case study validation (known breakouts)
Goals:
- Interpretable predictions
- Player comparison system
Deliverables:
- SHAP integration
- Global feature importance
- Local explanations per player
- Player embedding space
- Similarity search function
- Explanation validation with domain knowledge
Success Criteria:
- Every prediction has explanation
- Explanations pass sanity check
- "Similar to X" comparisons are sensible
Goals:
- Production prediction pipeline
- User-facing interface
- Documentation
Deliverables:
- Prediction pipeline for current season
- Streamlit dashboard
- PDF report generation
- Complete README
- Test coverage >80%
- Final predictions for current season
Success Criteria:
- Dashboard loads in <5 seconds
- PDF reports generate correctly
- End-to-end pipeline runs without intervention
- Documentation complete
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Player name matching fails | High | High | Fuzzy matching + multiple identifiers (DOB, team) |
| Severe class imbalance | High | Medium | SMOTE, class weights, precision-focused metrics |
| Data quality in lower leagues | Medium | High | Imputation, uncertainty flags, exclude worst leagues |
| Model overfits to specific leagues | Medium | High | League as feature, cross-validate across leagues |
| Scraping blocked/rate limited | Medium | Medium | Delays, caching, respect ToS |
| Feature leakage | Medium | Critical | Time-based splits, feature audit checklist |
| Transfermarkt values unreliable | Medium | Medium | Use as feature, not ground truth |
from sklearn.metrics import (
roc_auc_score,
precision_score,
recall_score,
brier_score_loss
)
def evaluate_model(y_true, y_pred_proba, k=20):
"""Comprehensive evaluation suite."""
# Sort by predicted probability
top_k_idx = np.argsort(y_pred_proba)[-k:]
metrics = {
"roc_auc": roc_auc_score(y_true, y_pred_proba),
"precision_at_k": y_true[top_k_idx].mean(),
"recall_at_100": y_true[np.argsort(y_pred_proba)[-100:]].sum() / y_true.sum(),
"brier_score": brier_score_loss(y_true, y_pred_proba),
}
return metricsdef walk_forward_cv(df):
"""Time-based cross-validation."""
folds = [
{"train": ["2015-16", "2016-17", "2017-18", "2018-19"],
"val": "2019-20", "test": "2020-21"},
{"train": ["2015-16", "2016-17", "2017-18", "2018-19", "2019-20"],
"val": "2020-21", "test": "2021-22"},
{"train": ["2015-16", "2016-17", "2017-18", "2018-19", "2019-20", "2020-21"],
"val": "2021-22", "test": "2022-23"},
]
return foldsMust validate that model would have flagged:
- Erling Haaland (Salzburg 2019)
- Luis Diaz (Porto 2021)
- Darwin Nunez (Benfica 2021)
- Cody Gakpo (PSV 2022)
- Alexis Mac Allister (Brighton via Argentinos)
- Moises Caicedo (Brighton)
- Position-specific models
- Wage-adjusted value metric
- Injury history integration
- Contract situation features
- StatsBomb event data
- Video highlight embeddings
- Social/news sentiment signals
- Agent/intermediary network data
- Real-time weekly updates
- Betting market integration (true "undervalued")
- API for third-party integrations
- Mobile app for scouts
| Player | From | To | Season | Fee |
|---|---|---|---|---|
| Erling Haaland | RB Salzburg | Dortmund | 2019-20 | €20M |
| Luis Diaz | Porto | Liverpool | 2021-22 | €45M |
| Darwin Nunez | Benfica | Liverpool | 2022-23 | €75M |
| Cody Gakpo | PSV | Liverpool | 2022-23 | €42M |
| Viktor Gyokeres | Coventry | Sporting | 2023-24 | €20M |
| Khvicha Kvaratskhelia | Dinamo Batumi | Napoli | 2022-23 | €10M |
| League | Coefficient | Our Adjustment |
|---|---|---|
| Premier League | 90.5 | 1.00 (baseline) |
| La Liga | 76.6 | 0.95 |
| Bundesliga | 73.4 | 0.93 |
| Serie A | 68.7 | 0.90 |
| Ligue 1 | 56.4 | 0.85 |
| Eredivisie | 54.9 | 0.85 |
| Portuguese Liga | 51.4 | 0.82 |
| Belgian Pro League | 36.5 | 0.75 |
POSITION_GROUPS = {
"FW": ["ST", "CF", "LW", "RW", "SS"],
"MF": ["CM", "CAM", "CDM", "LM", "RM", "AM", "DM"],
"DF": ["CB", "LB", "RB", "LWB", "RWB", "SW"],
}| Version | Date | Author | Changes |
|---|---|---|---|
| 0.1 | 2026-02-03 | Initial | Document creation |