🖼️ Vision-GPT Image Captioning

Upload an image to generate a descriptive caption with Vision-GPT (GPT-2 + ViT + Cross-Attention, trained on Flickr8k).


📊 Provide Feedback

Help improve the model! Enter the correct caption and we'll calculate its similarity to the generated caption.


📈 Feedback Results

Model Info

  • Architecture: GPT-2 (124M) + ViT-B/16 + Cross-Attention
  • Training: Flickr8k dataset (8k images)

Similarity Scoring

  • ≥80%: 🌟 High Reward (+10)
  • ≥60%: 👍 Good Reward (+5)
  • ≥50%: 🤔 Low Reward (+2)
  • <40%: 😞 Penalty (-5)
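
The scoring tiers above can be sketched in Python as follows. This is a minimal illustration, not the app's actual code. The function names similarity_percent and reward_for are hypothetical, and the page does not say how similarity is computed, so difflib's character-level ratio stands in as an assumed metric. The listed tiers also do not say what happens for scores from 40% up to (but not including) 50%, so the sketch returns a neutral reward of 0 there, which is likewise an assumption.

```python
from difflib import SequenceMatcher
from typing import Tuple


def similarity_percent(generated: str, reference: str) -> float:
    """Similarity between two captions as a percentage in [0, 100].

    Assumption: the real app's metric is not documented on this page;
    difflib's character-level ratio is used here only as a stand-in.
    """
    a = generated.strip().lower()
    b = reference.strip().lower()
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def reward_for(similarity: float) -> Tuple[str, int]:
    """Map a similarity percentage to the (label, reward) tiers listed above."""
    if similarity >= 80:
        return ("High Reward", 10)
    if similarity >= 60:
        return ("Good Reward", 5)
    if similarity >= 50:
        return ("Low Reward", 2)
    if similarity < 40:
        return ("Penalty", -5)
    # Scores in [40, 50) are not covered by the listed tiers.
    # Returning a neutral reward here is an assumption of this sketch.
    return ("Unspecified", 0)


if __name__ == "__main__":
    score = similarity_percent(
        "a dog runs through the grass",
        "a brown dog running through grass",
    )
    label, reward = reward_for(score)
    print(f"similarity={score:.1f}% -> {label} ({reward:+d})")
```

Checking the tiers from highest to lowest means each score matches at most one branch, and the final fallback makes the uncovered 40-50% range explicit rather than silently treating it as a penalty.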