Joint loss in pretraining

Hi,
We found that video text joint loss in pretraining is calculated from masked video and text. Why not use the origin video and text like retrieval finetune?
https://github.com/microsoft/UniVL/blob/0a7c07f566a3b220731f4abcaa6e1ee59a686596/modules/modeling.py#L258