Fine-tuning Vision-Language Models for agricultural understanding — domain-adapting Google's SigLIP to accurately identify crop diseases, achieving a 32% improvement in zero-shot retrieval accuracy.
Agriculture relies heavily on visual diagnosis. General-purpose Vision-Language Models like CLIP lack specific "agronomic literacy" — they might identify an image as "a green leaf," but fail to detect specific pathologies such as Cercospora Leaf Spot or distinguish Early Blight from Late Blight.
We fine-tune google/siglip-base-patch16-224 on curated agricultural image-text pairs. Unlike standard CLIP, which uses a softmax-based contrastive loss that requires massive batch sizes, SigLIP uses a pairwise sigmoid loss that scores each image-text pair independently, enabling memory-efficient training and better performance on smaller, high-quality scientific datasets.
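The pairwise sigmoid loss can be sketched in a few lines of PyTorch. Note that in the actual SigLIP model the temperature and bias are learned scalars (initialized so that the temperature starts near 10 and the bias near -10); the fixed values below are illustrative, and `siglip_pairwise_loss` is a name chosen for this sketch.

```python
import torch
import torch.nn.functional as F

def siglip_pairwise_loss(img_emb, txt_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    Every one of the B*B pairs is scored independently: matching pairs
    (the diagonal) get label +1, all others -1. Because there is no
    softmax over the batch, the loss does not need huge batch sizes.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() * temperature + bias   # (B, B)
    labels = 2.0 * torch.eye(logits.size(0)) - 1.0        # +1 diag, -1 off-diag
    return -F.logsigmoid(labels * logits).mean()

# Toy batch of 4 paired embeddings.
img = torch.randn(4, 16)
txt = torch.randn(4, 16)
loss = siglip_pairwise_loss(img, txt)
print(loss)
```

Because each pair contributes its own sigmoid term, gradient quality degrades gracefully as the batch shrinks, which is exactly what makes small curated scientific datasets workable.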
Three-step pipeline from data preparation to evaluation.
Aggregated image-text pairs of healthy vs. diseased crops. Images resized to 224×224 and normalized. Dataset split 80/10/10 for train, validation, and test.
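A minimal sketch of the preprocessing and split described above, in plain PyTorch rather than the project's actual data pipeline; `preprocess` and `split_indices` are illustrative names, and the mean/std of 0.5 follows the common SigLIP convention of mapping pixels to [-1, 1].

```python
import random
import torch
import torch.nn.functional as F

def preprocess(image):
    """Resize a (C, H, W) float image in [0, 1] to 224x224 and
    normalize to [-1, 1] (mean 0.5, std 0.5 per channel)."""
    x = F.interpolate(image.unsqueeze(0), size=(224, 224),
                      mode="bilinear", align_corners=False).squeeze(0)
    return (x - 0.5) / 0.5

def split_indices(n, seed=0):
    """Shuffle dataset indices and split 80/10/10 into train/val/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

x = preprocess(torch.rand(3, 300, 400))   # shape (3, 224, 224)
train_idx, val_idx, test_idx = split_indices(1000)
```

Fixing the shuffle seed keeps the train/val/test membership reproducible across runs, which matters when comparing against the zero-shot baseline.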
Frozen vision encoder (lower layers) with trainable attention heads and projection layers. AdamW optimizer, learning rate 5e-6 with cosine decay, mixed precision (BF16/FP16).
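The freezing and optimizer setup might look like the sketch below. `TinySiglip` is a small stand-in whose attribute layout mirrors transformers' `SiglipModel` (`vision_model.encoder.layers`), so the example runs without downloading the checkpoint; `configure_trainable` and the layer counts are illustrative.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

def configure_trainable(model, n_trainable_layers=2):
    """Freeze the whole vision encoder, then re-enable gradients for
    the top n transformer layers; everything else stays trainable."""
    for p in model.vision_model.parameters():
        p.requires_grad = False
    for layer in model.vision_model.encoder.layers[-n_trainable_layers:]:
        for p in layer.parameters():
            p.requires_grad = True

# Stand-in for the real checkpoint so the sketch runs offline.
class TinyVisionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Module()
        self.encoder.layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(6)])

class TinySiglip(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_model = TinyVisionModel()
        self.text_model = nn.Linear(8, 8)  # stand-in for text tower + projections

model = TinySiglip()
configure_trainable(model, n_trainable_layers=2)

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=5e-6, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=1000)  # cosine decay over steps
```

Passing only the trainable parameters to AdamW avoids allocating optimizer state for the frozen lower layers.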
Evaluated using Recall@K (R@1, R@5, R@10) for image-text retrieval, compared against the zero-shot baseline of the original Google checkpoint.
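Recall@K can be computed directly from the image-text similarity matrix. This sketch assumes caption i is the ground-truth match for image i (the usual paired-retrieval convention); `recall_at_k` is a name chosen here.

```python
import torch

def recall_at_k(sim, ks=(1, 5, 10)):
    """Image-to-text Recall@K from an (N, N) similarity matrix where
    row i's ground-truth caption is column i."""
    n = sim.size(0)
    ranked = sim.argsort(dim=1, descending=True)          # (N, N) column indices
    correct = ranked.eq(torch.arange(n).unsqueeze(1))     # True where ground truth sits
    gt_rank = correct.float().argmax(dim=1)               # 0-based rank of the match
    return {k: (gt_rank < k).float().mean().item() for k in ks}
```

Running the same function on the similarity matrix from the original Google checkpoint and from the fine-tuned model gives the paired before/after numbers reported for the evaluation.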
Designed for scientific rigor and real-world agricultural use.
Injects agronomic knowledge into a general-purpose vision-language model.
SigLIP's pairwise loss enables efficient training on smaller scientific datasets.
Low learning rate preserves general visual features while learning domain specifics.
BF16/FP16 training for VRAM efficiency on consumer-grade GPUs.
A 32% improvement in zero-shot retrieval accuracy after fine-tuning, relative to the original checkpoint.
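The mixed-precision point above can be sketched with `torch.autocast`: activations run in reduced precision while the master weights stay FP32. The linear layer is a stand-in for a model tower; on a CUDA GPU you would pass `device_type="cuda"` (and add a `GradScaler` if using FP16 rather than BF16).

```python
import torch
from torch import nn

model = nn.Linear(16, 8)   # stand-in for the fine-tuned tower
x = torch.randn(4, 16)

# BF16 autocast: matmuls/linears run in bfloat16, weights remain float32.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = model(x)
```

BF16 keeps FP32's exponent range, so it usually trains stably without loss scaling, which is why it is preferred when the hardware supports it.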
Known constraints guiding future research directions.
Better performance on daylight field images; reduced accuracy under low-light or night conditions.
Rare diseases with fewer than 50 samples show lower retrieval scores compared to common diseases like Corn Rust.
Current implementation performs better on foliar (leaf) diseases than on fruit diseases, reflecting the composition of the training data.
Agri-SigLIP is open-source under the MIT License. Clone the repository, prepare your dataset, and start training or running inference on crop disease images.