What vision AI could be used for in the future
Large language models, or LLMs, are the AI engines behind Google’s Gemini, ChatGPT, Anthropic’s Claude, and the rest. But they have a sibling: VLMs, or vision language models.

At the most basic level, large language model AI works by predicting the next word.
“Each word stands on its own, and then you can build a model on top of that: what word comes next?” explained Siva Reddy, professor of computer science and linguistics at McGill University.
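To make that concrete, here is a rough sketch of next-word prediction using the small, open-source GPT-2 model through the Hugging Face transformers library. The model and the prompt are illustrative stand-ins, not the systems named above.

```python
# Rough sketch: ask a small open-source language model (GPT-2) which word
# is likely to come next after a prompt. Requires: pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # a score for every word in the vocabulary

# Turn the scores for the position after the last word into probabilities
next_word_probs = torch.softmax(logits[0, -1], dim=-1)

# Print the five most likely next words
top = torch.topk(next_word_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob.item():.3f}")
```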
What, then, are pictures made of? They might be worth a thousand words, but they aren’t made of actual words. So researchers had to figure out how an AI is supposed to understand them.
“An image has so many details in it, what would you focus on? It’s a much, much harder problem than generating language or anything else,” Reddy said.
The answer, researchers figured out, was smaller images — little tiny patches of an image.
“You might say it's a sky, which has some tree leaves in it, and some clouds in it. And you break this image into these patches and you treat each patch as a word,” Reddy said.
Once images became words, the power of language model AI could start to be unleashed on them, and we get what are called vision language models — vision AI.
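In code, the patch idea looks roughly like the sketch below. It uses PyTorch with illustrative sizes (a 224-by-224 image cut into 16-by-16 patches); real vision language models learn these weights from data and also record where each patch sits in the image.

```python
# Rough sketch of "treating each patch as a word": cut an image into small
# patches, flatten each one, and map it to the same kind of vector a language
# model uses for a word token. Sizes here are illustrative.
import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)  # one RGB image, 224 x 224 pixels
patch_size = 16                      # cut it into 16 x 16 pixel patches

# Slice the image into a 14 x 14 grid of non-overlapping patches
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768]): 196 patches, 768 numbers each

# Project every flattened patch into a token embedding, the way a language
# model embeds each word, so the image becomes a "sentence" of 196 tokens
to_token = nn.Linear(3 * patch_size * patch_size, 768)
image_tokens = to_token(patches)
print(image_tokens.shape)  # torch.Size([1, 196, 768])
```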
“We've just seen a proliferation of usage in areas that I don't think we ever would have had the imagination to think of,” said Jay Allen, CEO of M87 Labs, which created an open source vision language model called Moondream.
“There was a rancher who owned just an incredible amount of tracts of land, and once in a while, cows would get lost, and the rancher used Moondream to have a drone fly around and to try to identify errant cows,” Allen recalled.
M87 Labs also trained its AI not just to look, but to look at people looking — something called gaze detection.
“And we had people start building retail applications where they could start collecting retail telemetry data to understand when somebody goes into a store, what do they look at and what tends to get the most types of eyeballs,” Allen said.
Yet it took six months to get the company’s AI to learn how to read pressure gauges in oil fields. Vision language model-based AI is at an inflection point, Allen said, but progress isn’t quite as lightning fast as it has been with other AIs.
“Vision is more complicated and more difficult, and so the improvements are coming at a slower pace over a longer period of time,” he said.
In part, this is because the human brain does a ton of visual processing just on its own, in ways that Allen said are not fully understood. Teaching computers to recreate that visual reasoning and understanding is tricky.
One challenge is how to train these models. We have civilization-sized amounts of writing for large language models to learn from. Images with a ton of context and a detailed explanation of what’s going on? Not nearly as much.
“There aren't tomes and tomes and an internet full of data that explains in painstakingly detailed ways how to interpret vision, because the human brain just does it,” Allen said. “We don't need to write it down. But with books, there are just endless tomes and endless internet data that explain things in a more straightforward way. Text is much more closely related to reasoning than images are to visual understanding.”
The visual data we do put on the internet for AIs to learn from is also narrow compared to what’s out there in text.
“The type of data is very limited. Imagine what people upload to the internet represents maybe a very small fraction of what we see every day,” said Mengye Ren, an assistant professor of computer science and data science at NYU.
He said it can be hard for AIs to generalize — like a self-driving car knows what a stop sign is, but could get confused by a graffiti-covered stop sign or a scooter that’s driving weirdly.
Even though AI models have seen so much data on the internet, once they are deployed into the world in, say, a car or eyeglasses or robots, “they may still lack some knowledge in terms of how to control themselves.”
Their scope can be culturally specific, said Reddy. “They don't know much about the non-Western-centric world. I come from India, and if I show these models some Indian concepts, they would have a very hard time saying, ‘What’s this?’ They would know that, ‘Oh, it's food.’ They would say, ‘It's maybe a party.’ They would not be able to say, ‘Oh, this is specifically a wedding, or this is idli or dosa.’”
So some companies are training their own vision AIs and then combining them with existing large language models like those from OpenAI, Anthropic, or Google.
One example is DroneDeploy, which uses AI to monitor safety hazards at construction sites and track construction progress.
“It understands the difference between, say, ladders, structures, drywall, insulation, fire suppression, those kinds of things,” said James Stripe, chief product officer at DroneDeploy. “And then there's another layer which has a much more detailed understanding of the world, and that's the layer that applies judgment.”
Judgment like, “This ledge should be blocked off so someone doesn’t fall off of it.”
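DroneDeploy has not published how its system is built, but the two-layer pattern Stripe describes (a custom detector that labels the scene, then a general-purpose model that applies judgment to those labels) might be sketched like this. Every function and name below is a hypothetical stand-in.

```python
# Hedged sketch of the two-layer pattern described above: layer one is a
# custom-trained vision model that labels what it sees; layer two hands those
# labels to a general-purpose LLM and asks for safety judgment.
# All functions and names here are hypothetical stand-ins, not DroneDeploy's.

def detect_site_objects(image_path: str) -> list[dict]:
    """Layer one: stand-in for a custom-trained detector run over a site photo."""
    # A real system would run a trained vision model here.
    return [
        {"label": "unguarded ledge", "location": "3rd floor, east side"},
        {"label": "ladder", "location": "stairwell B"},
    ]

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a hosted LLM (OpenAI, Anthropic, Google, etc.)."""
    return f"(the model's safety assessment of {len(prompt)} characters of findings)"

def judge_hazards(detections: list[dict]) -> str:
    """Layer two: ask the general-purpose model to apply judgment to the findings."""
    prompt = (
        "You are reviewing a construction site for safety hazards. "
        "Given these detected objects and locations, say what should be done:\n"
        f"{detections}"
    )
    return call_llm(prompt)

if __name__ == "__main__":
    findings = detect_site_objects("site_photo.jpg")
    print(judge_hazards(findings))
```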
What, then, does the future hold for vision language models? Imagine robots that understand the visual world well enough to fold laundry, or eyeglasses that can tell you where you last saw that bracelet you can’t find, or things we haven’t imagined at all.

!["I think [AI] is really cool. There is stuff out there that is fun to watch," said Bella Falco of Denver, Colorado. "There are also things that starting to really scare me, like fake creators."](https://img.apmcdn.org/cb0a9a7e54db934026285b941f4b74ded3dab5ea/widescreen/53f6b2-20251113-bella-falco-sitting-on-a-striped-couch-with-a-mug-600.jpg)
