
Hi, I'm Jonathan Sánchez 👋

Data Scientist | Marketing Analytics | Machine Learning

Industrial Engineer from Universidad de Chile (distinción máxima, i.e. highest distinction), currently working at ClaroVTR as an Efficiencies Engineer — building end-to-end forecasting systems with XGBoost / LightGBM / Prophet ensembles for enterprise clients (~CLP $30M/month in identified savings).

CV


🚀 Featured Projects

End-to-end quantile ML pipeline for monthly data-quota forecasting and per-subscriber plan optimization — a portfolio reproduction of a production system I built at work. Combines XGBoost Quantile Regression (direct P90), LightGBM with a custom asymmetric loss (penalizes under-prediction 1.5×), DTW shape clustering, and Prophet in a tier-based ensemble, then translates each forecast into a concrete top-up-bag recommendation via an integer-pricing optimizer.

  • Quantile XGBoost with reg:quantileerror targeting P90 — calibrated for the asymmetric cost of under-prediction
  • DTW clustering (tslearn) on min-max-normalized series — groups subscribers by shape, not level
  • Behavior features — abrupt_change, acceleration, consecutive_overage, pct_plan_last, partial-cycle detection
  • Tier-based ensemble — different model blends for low / mid / high / very_high volume subscribers
  • Pricing optimizer with property-based test suite — solves the small/large bag breakpoint
  • End-to-end demo runs on a synthetic 120-subscriber dataset; ~93% P90 coverage on the validation fold
  • Engineering — pyproject.toml with optional extras, OAuth2 client via env vars (no hard-coded secrets), 14 passing tests

Python · XGBoost · LightGBM · tslearn · Prophet · scikit-learn · xlsxwriter


End-to-end analysis on the UCI Bank Marketing dataset (45k calls from a Portuguese bank) to predict term-deposit subscription. Includes a v2 iteration that diagnoses and fixes a SMOTE-in-CV data leakage bug — the kind of methodology issue real production pipelines must catch.

  • EDA with PySpark — conversion rates by segment, temporal patterns, imbalance analysis
  • Modeling — Decision Tree, Random Forest, XGBoost with GridSearchCV
  • Best model: Random Forest, ROC-AUC 0.7959 on hold-out (with duration excluded to avoid leakage)
  • Key business insight: previously-contacted clients convert at 63.8% vs 9.3% for new ones — 7× more likely to subscribe

PySpark · scikit-learn · XGBoost · imbalanced-learn · Developed in Databricks


Quantifying the gender income gap among ~5,000 small merchants in Latin America (~245k weekly observations) using transactional data from a digital payments platform.

  • Fixed-effects regression with fixest::feols — progressive controls for hours, business category, zone and age brackets
  • Regularized models — Ridge / LASSO via glmnet, Backward / Forward selection
  • Machine learning — CART, MARS, KNN, Random Forest (caret-tuned)
  • Key finding: raw gap ≈ 20.7%, of which a substantial part is mediated by hours worked and business category — but a meaningful hourly-productivity gap persists after controls

R · fixest · glmnet · caret · earth · randomForest
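The repo itself is in R (`fixest::feols`); as an illustrative Python analogue, the progressive-controls logic looks like this — synthetic data, with column names and effect sizes invented for the sketch:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
female = rng.integers(0, 2, n)
hours = 40 - 4 * female + rng.normal(0, 5, n)   # hours partly mediate the gap
category = rng.choice(["retail", "food", "services"], n)
log_income = (
    0.02 * hours - 0.10 * female
    + 0.10 * (category == "food")
    + rng.normal(0, 0.3, n)
)
df = pd.DataFrame({"log_income": log_income, "female": female,
                   "hours": hours, "category": category})

raw = smf.ols("log_income ~ female", data=df).fit()
ctl = smf.ols("log_income ~ female + hours + C(category)", data=df).fit()
# The raw coefficient mixes the direct gap with hours/category
# composition; the controlled one isolates the conditional gap.
gap_raw = raw.params["female"]
gap_ctl = ctl.params["female"]
```

The shrinking of the coefficient as controls enter is the "mediated by hours worked and business category" pattern; whatever survives the full specification is the residual hourly-productivity gap.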


Discrete-choice analysis of how the visual salience of credit terms in digital advertising affects consumer choices. Built on a randomized experiment with 4 ad-design conditions (control + 3 treatments emphasizing financial information at increasing levels) and 6 binary choices per participant.

  • Conditional logit and mixed logit with mlogit — including unobserved heterogeneity via random coefficients
  • ML comparison — CART, SVM, KNN, Random Forest via caret
  • Key finding: simple logits show no treatment effect, but the mixed logit reveals a significant T3 effect once unobserved heterogeneity is allowed — a reminder that model choice can change a policy answer

R · mlogit · caret · randomForest · Discrete choice modeling
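Under the random-utility model behind `mlogit`, the conditional logit is a softmax over alternative-specific utilities. A minimal from-scratch sketch of the mechanics — synthetic choices, two made-up attributes, MLE via scipy (the repo's actual estimation uses R):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n, J = 1000, 3                       # choice situations, alternatives
X = rng.normal(size=(n, J, 2))       # 2 alternative-specific attributes
beta_true = np.array([1.0, -0.5])
# Random-utility model: systematic utility + Gumbel noise -> logit choices
u = X @ beta_true + rng.gumbel(size=(n, J))
choice = u.argmax(axis=1)

def neg_loglik(beta):
    v = X @ beta                          # systematic utilities, shape (n, J)
    v = v - v.max(axis=1, keepdims=True)  # numerical stability
    logp = v - np.log(np.exp(v).sum(axis=1, keepdims=True))  # log-softmax
    return -logp[np.arange(n), choice].sum()

beta_hat = minimize(neg_loglik, np.zeros(2), method="BFGS").x
```

The mixed logit in the project extends this by letting `beta` vary randomly across participants, which is exactly the unobserved heterogeneity that rescues the T3 effect.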


🛠️ Tech Stack

  • Languages: Python · R · SQL
  • ML / Stats: scikit-learn · XGBoost · LightGBM · Statsmodels · mlogit · fixest
  • Data: Pandas · NumPy · PySpark · Databricks
  • Visualization: Matplotlib · Seaborn · ggplot2 · Plotly
  • Tools: Git · Jupyter · RMarkdown · VS Code


🎓 About Me

  • 🏛️ Universidad de Chile — Industrial Engineering
  • 📊 Experienced in discrete choice modeling (Logit, Mixed Logit) and Machine Learning
  • 📍 Based in Santiago, Chile
  • 🌱 Currently exploring: MLOps, production ML pipelines, causal inference

📫 Connect

LinkedIn Email GitHub

📌 Pinned Repositories

  1. bank-marketing-analysis — End-to-end analysis on the UCI Bank Marketing dataset (45k calls): EDA in PySpark, Decision Tree / Random Forest / XGBoost in scikit-learn, plus a v2 branch fixing SMOTE-in-CV leakage with imblearn… (Jupyter Notebook)

  2. credit-choice-experiment — Discrete choice modeling on a randomized credit-ad experiment: conditional logit, mixed logit with unobserved heterogeneity, and ML comparison (CART, SVM, KNN, RF) in R

  3. cv — CV, Jonathan Sánchez Pesantes (LaTeX sources + compiled PDF) (TeX)

  4. gender-income-gap — Quantifying the gender income gap among ~5,000 small merchants in Latin America using fixed-effects regression (fixest), Ridge/LASSO and ML (CART, MARS, KNN, Random Forest) in R