Google released, and teased, a few open-source models in its launch today. One model that actually shipped is a vision language model based on SigLIP. It is remarkably easy to tune and extend to a variety of tasks, and this Colab Notebook shows how with clean, readable code.
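As a rough sketch of what that tuning setup can look like (assuming the public Hugging Face transformers checkpoints rather than the notebook's own code, and a transformers release recent enough to include PaliGemma support):

```python
# Minimal sketch: loading PaliGemma for fine-tuning via Hugging Face
# transformers. Assumes transformers >= 4.41 and that you have accepted
# the model license on the Hub; the checkpoint id is the public "pt" one.
import torch
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# A common lightweight recipe: freeze the SigLIP vision tower and the
# projector, and update only the Gemma language model during tuning.
for param in model.vision_tower.parameters():
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = False
```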
Google has introduced PaliGemma, an open-source vision language model with multimodal capabilities that outperforms comparable models on object detection and segmentation. Optimized for fine-tuning on specific tasks, PaliGemma opens possibilities for custom AI applications and ships with comprehensive resources for immediate use. It also demonstrates strong results on OCR and shows promise across a range of use cases when fine-tuned on custom data.
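To make the OCR claim concrete, here is a hedged inference sketch using the Hugging Face "mix" checkpoint and its "ocr" task prompt; the image URL is a placeholder, not something from the announcement:

```python
# Hedged sketch: OCR-style prompting of a PaliGemma "mix" checkpoint with
# Hugging Face transformers. The image URL below is a placeholder.
import requests
import torch
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma-3b-mix-224"  # mix checkpoints handle many tasks
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()

image = Image.open(
    requests.get("https://example.com/receipt.jpg", stream=True).raw
)
inputs = processor(text="ocr", images=image, return_tensors="pt")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, skipping the prompt.
generated = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```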
Microsoft's new many-to-many vision model can be tuned for specific downstream tasks. It isn't quite as powerful as PaliGemma, but it is easy to run in PyTorch.
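The blurb doesn't spell out a tuning recipe, but the standard PyTorch pattern for adapting a pretrained vision model to a downstream task looks like the sketch below; torchvision's ResNet-50 stands in for the unnamed model, and the class count is hypothetical:

```python
# Generic sketch of downstream tuning in PyTorch: freeze a pretrained
# backbone and train a new task head. torchvision's ResNet-50 is a
# stand-in here, not the Microsoft model itself.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for param in backbone.parameters():
    param.requires_grad = False  # keep pretrained features fixed

num_classes = 10  # hypothetical label count for the downstream task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters are updated during training.
optimizer = torch.optim.AdamW(backbone.fc.parameters(), lr=1e-3)
```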
PaliGemma is a strong vision language model built on SigLIP and Gemma 2B. This technical report walks through many of the decisions behind its architecture and data collection.