Hi,
We found that the video-text joint loss in pretraining is calculated from the masked video and text. Why not use the original video and text, as in retrieval fine-tuning?
sim_matrix_text_visual = self.get_similarity_logits(sequence_output_alm, visual_output_alm,
UniVL/modules/modeling.py
Line 258 in 0a7c07f
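
To make the question concrete, here is a minimal toy sketch of the two options being compared. It is not UniVL's code: the helper names (`similarity_matrix`, `contrastive_loss`, `mask_features`) are hypothetical, the loss is assumed to be a plain row-wise softmax cross-entropy over the similarity matrix, and masking is crudely modeled by zeroing one feature dimension rather than by masking tokens or frames.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_matrix(text_feats, video_feats):
    # sim[i][j] = similarity between text i and video j (plain dot product here).
    return [[dot(t, v) for v in video_feats] for t in text_feats]

def contrastive_loss(sim):
    # Mean negative log-softmax of the diagonal (matched pairs), row-wise.
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)

def mask_features(feats, masked_dims):
    # Toy stand-in for input masking: zero out the selected dimensions.
    return [[0.0 if j in masked_dims else x for j, x in enumerate(f)]
            for f in feats]

# Two made-up (text, video) pairs; pair i matches pair i.
text = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
video = [[0.9, 0.1, 0.4], [0.1, 0.9, 0.4]]

# Option A: loss on the original (unmasked) features, as in retrieval fine-tuning.
loss_original = contrastive_loss(similarity_matrix(text, video))

# Option B: loss on the masked features, as the pretraining code appears to do.
loss_masked = contrastive_loss(
    similarity_matrix(mask_features(text, {0}), mask_features(video, {0}))
)

print(loss_original, loss_masked)
```

In this toy setup the masked inputs carry less information for telling the pairs apart, which is why we are asking whether the joint loss would be better computed on the original inputs.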