Upload an image to generate a descriptive caption with Vision-GPT (GPT-2 + ViT + cross-attention, trained on Flickr8k).
Help improve the model! Enter the correct caption and we'll calculate its similarity to the generated caption.