<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Vision-Language Late-Interaction Retrieval - Joel Markapudi</title>
<style>
* {
margin: 0;
padding: 0;
box-sizing: border-box;
}
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen, Ubuntu, Cantarell, sans-serif;
line-height: 1.6;
color: #333;
background: #f5f5f0;
}
.container {
max-width: 920px;
margin: 0 auto;
padding: 50px 30px 40px 30px;
background: #fafaf5;
min-height: 100vh;
}
.back-link {
display: inline-flex;
align-items: center;
color: #4a90e2;
text-decoration: none;
margin-bottom: 30px;
font-size: 0.95rem;
transition: color 0.3s;
}
.back-link:hover {
color: #357abd;
}
h1 {
font-size: 2.2rem;
font-weight: 400;
margin-bottom: 18px;
color: #2c2c2c;
line-height: 1.3;
}
.project-overview {
color: #555;
font-size: 1.05rem;
margin-bottom: 25px;
padding-bottom: 20px;
border-bottom: 2px solid #e0e0d8;
line-height: 1.7;
}
.highlights {
margin-top: 25px;
}
.highlights ul {
list-style: none;
margin: 0;
padding: 0;
}
.highlights li {
margin-bottom: 16px;
padding-left: 20px;
position: relative;
color: #555;
line-height: 1.7;
}
.highlights li:before {
content: "•";
position: absolute;
left: 0;
color: #4a90e2;
font-weight: bold;
font-size: 1.2rem;
}
strong {
color: #2c2c2c;
font-weight: 500;
}
.external-links {
margin-top: 35px;
padding-top: 25px;
border-top: 1px solid #e0e0d8;
}
.external-links h2 {
font-size: 1.3rem;
font-weight: 500;
margin-bottom: 15px;
color: #2c2c2c;
}
.external-links a {
display: inline-block;
margin-right: 20px;
margin-bottom: 10px;
color: #4a90e2;
text-decoration: none;
font-size: 1rem;
transition: color 0.3s;
}
.external-links a:hover {
color: #357abd;
text-decoration: underline;
}
@media (max-width: 768px) {
.container {
padding: 35px 20px;
}
h1 {
font-size: 1.8rem;
}
}
</style>
</head>
<body>
<div class="container">
<a href="index.html" class="back-link">← Back to Portfolio</a>
<h1>Vision-Language Late-Interaction Retrieval with LoRA</h1>
<div class="project-overview">
Implemented a ColBERT-style late-interaction retrieval system combining ViT patch embeddings with CLIP text encodings, trained via parameter-efficient LoRA fine-tuning on 1,500 Flickr8k image-caption pairs. Achieved 100% caption→image retrieval accuracy on held-out 8-sample batches through MaxSim token→patch alignment with a symmetric InfoNCE contrastive loss, demonstrating learned cross-modal semantic geometry.
</div>
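<!-- Illustrative sketch (not project code): MaxSim late-interaction scoring in PyTorch.
     Shapes and the mean-over-tokens reduction are assumptions; classic ColBERT sums
     over tokens instead. -->

```python
import torch

def maxsim_scores(text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
    """Late-interaction scoring: every caption scored against every image.

    text_tokens:   [B, T, D] per-token text embeddings (projected into ViT space)
    image_patches: [B, P, D] per-patch image embeddings (e.g. P=196, D=768)
    returns:       [B, B] retrieval matrix; correct pairs lie on the diagonal
    """
    text_tokens = torch.nn.functional.normalize(text_tokens, dim=-1)
    image_patches = torch.nn.functional.normalize(image_patches, dim=-1)
    # Cosine similarity of every token of caption i against every patch of image j
    sim = torch.einsum("itd,jpd->ijtp", text_tokens, image_patches)
    # MaxSim: each query token independently aligns to its best patch,
    # then scores are averaged over tokens
    return sim.max(dim=-1).values.mean(dim=-1)
```

<!-- Averaging (rather than summing) over tokens keeps scores in [-1, 1],
     consistent with the 0.20-0.27 correct-pair range reported below. -->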
<div class="highlights">
<ul>
<li><strong>Multi-Vector Architecture:</strong> Built late-interaction system using ViT-base/patch16 encoder producing 196 spatial patch tokens (768-d) from 224×224 images, paired with CLIP text encoder generating per-token embeddings, unified through learned projection head mapping 512-d CLIP space to 768-d ViT embedding space for geometrically meaningful cross-modal comparisons</li>
<li><strong>MaxSim Retrieval Mechanism:</strong> Implemented ColBERT-style scoring computing per-token max-patch similarity with full [B×B] retrieval matrix generation, enabling fine-grained textual grounding to specific image regions rather than global pooled embeddings. Each query token independently aligns to most similar image patch, producing interpretable attention heatmaps for spatial localization</li>
<li><strong>Parameter-Efficient Fine-Tuning:</strong> Applied LoRA adapters (rank=16, α=32, dropout=0.05) to ViT attention layers (qkv and proj modules), training only 1.3M of 87M total parameters (1.05% trainable). Preserved pretrained ViT spatial understanding while adapting patch representations toward CLIP text embedding manifold through low-rank weight perturbations</li>
<li><strong>Contrastive Training Pipeline:</strong> Trained with symmetric InfoNCE loss over 30 epochs using AdamW optimizer (lr=3e-5, weight_decay=1e-2), ReduceLROnPlateau scheduler (patience=3), gradient clipping (max_norm=1.0), and early stopping (patience=6). Reduced loss 38% from 0.858 to 0.534, well below the random-alignment baseline of log(8)≈2.08 for batch size 8, demonstrating strong alignment convergence</li>
<li><strong>Retrieval Validation:</strong> Evaluated on 8-sample test batches achieving perfect diagonal dominance (100% caption→image retrieval accuracy), with score matrices showing correct pairs consistently scoring 0.20-0.27 while mismatches scored below 0.15. Validated that model learned semantic correspondences (e.g., "dog" captions correctly distinguishing between multiple dog images based on contextual differences)</li>
<li><strong>Interpretability Analysis:</strong> Generated token→patch MaxSim heatmaps visualizing spatial attention (e.g., "dog" token activating on dog regions, "sunlight" on bright areas), and t-SNE embeddings of pooled representations showing distinct image/text clusters due to late-interaction preserving modality-specific structure while enabling fine-grained cross-modal alignment through MaxSim rather than global embedding fusion</li>
</ul>
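<!-- Illustrative sketch (not project code): the low-rank weight perturbation behind
     the LoRA adapters described above, as a pure-PyTorch wrapper around one frozen
     linear layer. The project applies such adapters to the ViT qkv/proj modules. -->

```python
import torch

class LoRALinear(torch.nn.Module):
    """Frozen linear layer plus trainable low-rank update: y = Wx + (alpha/r) * B(A(x))."""

    def __init__(self, base: torch.nn.Linear, r: int = 16, alpha: int = 32,
                 dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pretrained weights stay frozen
        self.lora_a = torch.nn.Linear(base.in_features, r, bias=False)
        self.lora_b = torch.nn.Linear(r, base.out_features, bias=False)
        torch.nn.init.zeros_(self.lora_b.weight)  # update starts at zero: no-op at init
        self.scale = alpha / r
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(self.dropout(x)))
```

<!-- With r=16 on a 768-dim layer this adds only 2*16*768 = 24,576 trainable weights
     per wrapped projection, which is how the ~1% trainable-parameter budget arises. -->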
</div>
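<!-- Illustrative sketch (not project code): the symmetric InfoNCE objective over the
     [B x B] MaxSim score matrix. The temperature value here is an assumption. -->

```python
import math
import torch
import torch.nn.functional as F

def symmetric_info_nce(scores: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """scores: [B, B] matrix where scores[i, j] pairs caption i with image j;
    the matched pairs sit on the diagonal."""
    logits = scores / temperature
    targets = torch.arange(scores.size(0), device=scores.device)
    loss_t2i = F.cross_entropy(logits, targets)      # caption -> image direction
    loss_i2t = F.cross_entropy(logits.t(), targets)  # image -> caption direction
    return 0.5 * (loss_t2i + loss_i2t)
```

<!-- At batch size 8, an uninformative (uniform) score matrix yields loss
     log(8) ~ 2.08, the random baseline cited in the training bullet above. -->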
<div class="external-links">
<h2>Project Resources</h2>
<a href="https://github.com/mjsushanth/mlops-labs-portfolio/blob/main/Late_Interaction_MVR_PEFT_LoRA/notebooks/01_MaxSimMVR_ViT_LoRA_Lab.ipynb" target="_blank">→ Full Notebook</a>
<a href="https://github.com/mjsushanth/mlops-labs-portfolio/tree/main/Late_Interaction_MVR_PEFT_LoRA" target="_blank">→ GitHub Repository</a>
<a href="https://github.com/mjsushanth/mlops-labs-portfolio/blob/main/Late_Interaction_MVR_PEFT_LoRA/README.md" target="_blank">→ Technical Documentation</a>
</div>
</div>
</body>
</html>