[20230413] Weekly VLM2 - Flamingo #4

Description

@SoongE

Paper
Flamingo: a Visual Language Model for Few-Shot Learning (a.k.a. Flamingo)

Speaker
@SoongE

Summary
[Figure: summary screenshot]

Key Points

  • Effectively bridges pre-trained vision and language models
  • Uses interleaved visual and text data
  • Accepts any visual input via a Perceiver model
  • Performs well on several tasks

Methods

  • Freeze the vision and language models

    • Vision Encoder:
      • Trained with contrastive learning, using BERT on the text side
      • Trained on ALIGN + LTIP, combined via gradient accumulation
    • Fine-tuning or training from scratch instead of freezing results in a very large performance drop. The authors attribute this to catastrophic forgetting as the model is trained on the new objective.
  • Perceiver Resampler
    [Figure: Perceiver Resampler screenshot]

    • Returns a fixed-shape output regardless of the size of the vision input
    • Uses latent queries of fixed shape
    • Empirically performs better than conventional attention
  • Gated Cross-Attention
    [Figure: Gated Cross-Attention screenshot]

    • Tanh gate: similar to the gating in long short-term memory (LSTM)
      • Has a normalization effect
  • Train on mixture of datasets

    • Each dataset is weighted differently according to its size and quality (M3W, ALIGN, LTIP and VTP with weights λ_m of 1.0, 0.2, 0.2 and 0.03 respectively).
    • M3W: interleaved image-text
      • 43M HTML webpages
    • ALIGN and LTIP: image-text pair
      • ALIGN: large and low quality
      • LTIP: small and high quality
    • VTP: video-text pair
      • 27M short videos, about 22 seconds each
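The Perceiver Resampler and Gated Cross-Attention bullets above can be illustrated with a small NumPy sketch. This is a toy, single-head, single-layer version written for these notes, not the paper's implementation: the class names `ToyPerceiverResampler` and `ToyGatedCrossAttention`, the random projection weights, and the zero initialization of the gate parameter `alpha` are all assumptions of this sketch. It only aims to show two things: a variable number of visual features is mapped to a fixed number of latents, and a tanh gate controls how much visual information is mixed into the text stream.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_proj(rng, dim):
    """Random (dim, dim) projection; stands in for learned weights."""
    return rng.normal(size=(dim, dim)) / np.sqrt(dim)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query row attends over context rows."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


class ToyPerceiverResampler:
    """Maps (n_visual, dim) features to a fixed (num_latents, dim) output."""

    def __init__(self, dim, num_latents, rng):
        # Latent queries have a fixed shape, independent of the input size.
        self.latents = rng.normal(size=(num_latents, dim))
        self.w_q, self.w_k, self.w_v = (init_proj(rng, dim) for _ in range(3))

    def __call__(self, visual_features):
        attended = cross_attention(
            self.latents, visual_features, self.w_q, self.w_k, self.w_v
        )
        return self.latents + attended  # residual connection


class ToyGatedCrossAttention:
    """Text tokens attend to visual tokens; a tanh gate scales the update."""

    def __init__(self, dim, rng, alpha=0.0):
        # Assumption of this sketch: alpha starts at 0, so tanh(alpha) = 0
        # and the layer initially passes the text tokens through unchanged.
        self.alpha = alpha
        self.w_q, self.w_k, self.w_v = (init_proj(rng, dim) for _ in range(3))

    def __call__(self, text_tokens, visual_tokens):
        attended = cross_attention(
            text_tokens, visual_tokens, self.w_q, self.w_k, self.w_v
        )
        return text_tokens + np.tanh(self.alpha) * attended


rng = np.random.default_rng(0)
dim = 8
resampler = ToyPerceiverResampler(dim, num_latents=4, rng=rng)
gated = ToyGatedCrossAttention(dim, rng=rng)

few_patches = rng.normal(size=(10, dim))   # e.g. a small image
many_patches = rng.normal(size=(50, dim))  # e.g. several video frames
# Both outputs have shape (4, 8), whatever the number of input features.
print(resampler(few_patches).shape, resampler(many_patches).shape)
```

Because `tanh` is bounded in (-1, 1), the gate also keeps the scale of the visual update bounded, which is one way to read the "normalization effect" noted above.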
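The dataset weighting described above amounts to a weighted sum of per-dataset losses, sum over m of λ_m · L_m. A minimal sketch follows, using the weights listed in these notes. The helper name `mixture_loss` and the example loss values are made up for illustration and do not come from the paper.

```python
# Weights lambda_m as listed in the notes above.
DATASET_WEIGHTS = {"M3W": 1.0, "ALIGN": 0.2, "LTIP": 0.2, "VTP": 0.03}


def mixture_loss(per_dataset_loss):
    """Return sum_m lambda_m * L_m over all datasets in DATASET_WEIGHTS."""
    missing = set(DATASET_WEIGHTS) - set(per_dataset_loss)
    if missing:
        raise KeyError(f"missing losses for: {sorted(missing)}")
    return sum(
        weight * per_dataset_loss[name]
        for name, weight in DATASET_WEIGHTS.items()
    )


# Hypothetical per-dataset losses for one training step (illustrative only).
example_losses = {"M3W": 2.0, "ALIGN": 3.0, "LTIP": 1.5, "VTP": 4.0}
total = mixture_loss(example_losses)
print(round(total, 2))  # 2.0 + 0.6 + 0.3 + 0.12 = 3.02
```

With these weights, M3W dominates the objective, while VTP contributes only a small correction even when its raw loss is large.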

Strengths and Weaknesses

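  • Strengths
    • Performs well on many downstream tasks
  • Weaknesses
    • Inherits all the side effects of the LM.
    • Classification performance is worse than CLIP.
    • Outside the few-shot setting, task-specific models can perform better.
    • The training dataset and the model itself are both very large, which makes a fair comparison difficult.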
