
Cosine Similarity with spam message Feature Data Leakage #1

@blazysecon

Hi. Great tutorial. Just a quick note on Session 11: when creating the spam cosine-similarity feature on the training data, you should exclude the observation itself from the list of spam messages it is compared against:

# Mean cosine similarity of each training message to the spam messages,
# leaving the message itself out of its own average.
spam.indexes <- which(train$Label == "spam")
train.svd$SpamSimilarity <- rep(0.0, nrow(train.svd))
for(i in seq_len(nrow(train.svd))) {
    # Drop row i from the spam set (a no-op when row i is ham)
    spam.indexesCV <- setdiff(spam.indexes, i)
    train.svd$SpamSimilarity[i] <- mean(train.similarities[i, spam.indexesCV])
}
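
To make the effect visible, here is a minimal sketch on toy data. The matrix `toy.features`, the labels `toy.labels`, and every other name below are made up for illustration and are not from the tutorial; the similarity matrix is built in base R instead of with the tutorial's own code.

```r
# Hypothetical toy data: 4 messages in a 2-D feature space.
toy.features <- matrix(c(1, 0,
                         1, 1,
                         0, 1,
                         1, 2), ncol = 2, byrow = TRUE)
toy.labels <- factor(c("spam", "spam", "ham", "spam"))

# Cosine similarity matrix: scale each row to unit length, then take
# all pairwise dot products.
row.norms <- sqrt(rowSums(toy.features^2))
toy.unit <- toy.features / row.norms
toy.similarities <- toy.unit %*% t(toy.unit)

spam.idx <- which(toy.labels == "spam")

# Leaky version: each spam row averages in its own self-similarity of 1.
leaky <- rowMeans(toy.similarities[, spam.idx, drop = FALSE])

# Leave-one-out version: row i is removed from its own spam set.
loo <- sapply(seq_len(nrow(toy.similarities)), function(i) {
  mean(toy.similarities[i, setdiff(spam.idx, i)])
})

print(round(cbind(leaky, loo), 3))
```

In this toy case the ham row gets the same value either way, while every spam row scores higher in the leaky version. Note that with a single spam message in the training set the leave-one-out mean is taken over an empty set and returns `NaN`, so that case would need separate handling.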

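This removes the data leakage that leads to over-fitting: without the exclusion, every spam message's average includes its similarity to itself (always 1), so the feature partly encodes the training label. The RF results on the test data with the updated feature are much better: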

# Drill-in on results
confusionMatrix(preds, test.svd$Label)
Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  1445   32
      spam    2  192
                                      
               Accuracy : 0.98          
                 95% CI : (0.972, 0.986)
    No Information Rate : 0.866         
    P-Value [Acc > NIR] : < 2e-16       
                                        
                  Kappa : 0.907         
 Mcnemar's Test P-Value : 0.000000658   
                                        
            Sensitivity : 0.999         
            Specificity : 0.857         
         Pos Pred Value : 0.978         
         Neg Pred Value : 0.990         
             Prevalence : 0.866         
         Detection Rate : 0.865         
   Detection Prevalence : 0.884         
      Balanced Accuracy : 0.928         
                                        
       'Positive' Class : ham           
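
As a sanity check, the headline statistics can be recomputed by hand from the four counts in the table above. This is only a sketch: the matrix `cm` is typed in from the printed output, not taken from the `confusionMatrix` object, and "ham" is treated as the positive class, as reported.

```r
# Counts copied from the printed confusion matrix (rows = prediction,
# columns = reference).
cm <- matrix(c(1445, 2, 32, 192), nrow = 2,
             dimnames = list(Prediction = c("ham", "spam"),
                             Reference  = c("ham", "spam")))

accuracy    <- sum(diag(cm)) / sum(cm)                  # correct / total
sensitivity <- cm["ham", "ham"] / sum(cm[, "ham"])      # ham recall
specificity <- cm["spam", "spam"] / sum(cm[, "spam"])   # spam recall
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, balanced = balanced), 3))
```

Specificity here is the share of actual spam that was caught, so the 32 spam messages predicted as ham account for most of the remaining error.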
                            
