Surprisingly, I've had good success feeding the description generated by llava into gemma:2b (see attached image), but I feel like the right approach would be to refine my initial llava prompt so it produces the right result in one go 🤔.
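
For anyone curious, here's a rough sketch of both options. It assumes the models are served by a local Ollama instance on its default port, which the posts don't actually state. The prompts, the function names, and the plain `llava` model tag are all hypothetical placeholders, not what the app really uses.

```python
import base64
import json
import urllib.request

# Assumption: Ollama running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical prompts, for illustration only.
DESCRIBE_PROMPT = "Describe this image."
SHORTEN_PROMPT = (
    "Rewrite this image description as alt text: one or two plain sentences, "
    "no flowery language.\n\n{description}"
)
ONE_SHOT_PROMPT = (
    "Write alt text for this image: one or two plain sentences, "
    "no flowery language."
)


def ollama_generate(model, prompt, images=None):
    """Send one non-streaming request to Ollama and return the generated text."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if images:
        # Ollama expects images as base64-encoded strings.
        payload["images"] = [base64.b64encode(img).decode("ascii") for img in images]
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"].strip()


def two_stage_alt_text(image_bytes, generate=ollama_generate):
    """llava describes the image, then gemma:2b condenses the description."""
    description = generate("llava", DESCRIBE_PROMPT, images=[image_bytes])
    return generate("gemma:2b", SHORTEN_PROMPT.format(description=description))


def one_shot_alt_text(image_bytes, generate=ollama_generate):
    """Ask llava for concise alt text directly, with no second model."""
    return generate("llava", ONE_SHOT_PROMPT, images=[image_bytes])
```

The `generate` parameter is only there so the two flows can be tried with a stub instead of a running server. Whether the one-shot prompt gets llava to stay short is exactly the open question.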
Spent a few nights this week building a simple menubar app that writes alt text for images using llava 1.6. The descriptions are really accurate, but IMO they're often far too long and flowery to be good alt text, so my focus now is shaping the description to be better alt text.