Speaker: Jessica Lundin
Benchmark your model's performance before deploying; expect it to be somewhat worse in production. If production performance is better than your benchmark, that's a red flag that something might be broken.
Unknown production distribution: you don't know if your production user population is significantly different from what you tested the model on.
Ways to handle this:
- retrain with non-linear algorithms ("pull out a bigger hammer")
- feature engineering to linearize features (see the log-transform sketch after this list). You generally want the simplest model that gives acceptable accuracy; linear methods, while less shiny, are generally better: more performant and easier to understand.
- techniques for suspected distribution differences between preprod and prod:
- visualization (histograms, pair plots)
- clustering
- Kullback-Leibler (KL) divergence, a measure of how much one probability distribution differs from another (see the KL sketch after this list)
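A minimal sketch of the linearization idea: a plain linear regression on a skewed raw feature underfits, while the same model on the log-transformed feature fits well. The synthetic data and the choice of a log transform are illustrative assumptions, not from the talk.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 1000, size=(500, 1))               # skewed raw feature
y = 2.0 * np.log(x[:, 0]) + rng.normal(0, 0.1, 500)   # target is linear in log(x)

raw = LinearRegression().fit(x, y)
engineered = LinearRegression().fit(np.log(x), y)      # linearized feature

print(f"R^2, raw feature:  {raw.score(x, y):.3f}")
print(f"R^2, log(feature): {engineered.score(np.log(x), y):.3f}")
```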
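And a minimal sketch of the KL-divergence check: bin the same feature from preprod and prod on a shared grid and measure how far apart the two histograms are (0 means identical distributions). The sample data and bin count are assumptions for illustration.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
preprod = rng.normal(0.0, 1.0, 10_000)   # stand-in for a preprod feature sample
prod = rng.normal(0.3, 1.2, 10_000)      # stand-in for the same feature in prod

# Histogram both samples on a shared grid so the bins are comparable.
bins = np.histogram_bin_edges(np.concatenate([preprod, prod]), bins=50)
p, _ = np.histogram(preprod, bins=bins, density=True)
q, _ = np.histogram(prod, bins=bins, density=True)

# A small epsilon avoids division by zero in empty bins; entropy(p, q) = KL(P || Q).
eps = 1e-10
print(f"KL(preprod || prod) = {entropy(p + eps, q + eps):.4f}")
```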
Class imbalance: where one class vastly outnumbers the other.
It's hard to characterize rare events: the model may learn to predict that everything is the majority class.
- Precision: true positives over all predicted positives, TP / (TP + FP)
- Recall: true positives over all actual positives, TP / (TP + FN) (see the sketch below)
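A minimal sketch of why accuracy misleads under imbalance while precision and recall expose the problem. The 95:5 split and the always-predict-majority "model" are made up for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0] * 95 + [1] * 5   # rare positive class
y_pred = [0] * 100            # a model that always predicts the majority class

print("accuracy: ", accuracy_score(y_true, y_pred))                    # 0.95, looks great
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall:   ", recall_score(y_true, y_pred, zero_division=0))     # 0.0 -- catches no positives
```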
Ways to handle this:
- random over-sampling with replacement <-- supervised (see the sketch after this list)
- importance weighting <-- supervised
- boosting <-- supervised
- treat it as an anomaly detection problem (one-class SVM) <-- unsupervised (see the sketch after this list)
  - transforms your input data into a higher-dimensional space (the 'kernel trick' in SVM)
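A minimal sketch of random over-sampling with replacement on synthetic data; imbalanced-learn's RandomOverSampler does the same thing with more options (that library is an assumption, not mentioned in the talk).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)             # 95:5 imbalance

minority_idx = np.where(y == 1)[0]
n_needed = (y == 0).sum() - (y == 1).sum()
extra = rng.choice(minority_idx, size=n_needed, replace=True)  # sample with replacement

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))                    # [95 95] -- balanced
```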
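And a minimal sketch of the unsupervised route: fit a one-class SVM on "normal" data only and flag points far from it. The RBF kernel is where the kernel trick comes in; the synthetic data and the nu/gamma settings are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))     # majority-class data
outliers = rng.uniform(-6, 6, size=(10, 2))  # rare events

# Fit on normal data only; nu bounds the fraction of training points flagged as anomalous.
clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)

print(clf.predict(outliers))                 # -1 = anomaly, +1 = inlier
```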
What to log:
- timestamp + instance IDs
- model run time
- model results, performance metrics
- model convergence errors
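A minimal sketch of logging those items as one structured record per model run, so failures can be traced later. The field names (instance_id, runtime_s, auc) and the JSON-lines format are hypothetical choices, not from the talk.

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("model_runs")

start = time.time()
record = {"timestamp": start, "instance_id": str(uuid.uuid4())}
try:
    auc = 0.91  # stand-in for real model results / performance metrics
    record.update(auc=auc, error=None)
except Exception as err:  # e.g. a convergence failure
    record.update(auc=None, error=str(err))
record["runtime_s"] = round(time.time() - start, 3)
log.info(json.dumps(record))
```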
- Debugging is a manual process of digging into logs and data to resolve unexpected behavior
Resources mentioned:
- Caffe (deep learning framework)
- The Elements of Statistical Learning (free book) by Hastie, Tibshirani, and Friedman: http://statweb.stanford.edu/~tibs/ElemStatLearn/