Jul 31, 2023

The potential future of open-source generative AI

Tom Goldstein, computer science professor at the University of Maryland, explains the pros and cons of open-source generative AI models and why open approaches are unlikely to proliferate.

There’s a new large language model in town that threatens to out-open OpenAI’s ChatGPT.

LLaMA 2, from Facebook parent company Meta, has capabilities roughly in line with big-name competitors. However, it’s also open source, meaning the model’s source code is available for anyone to study or build upon for free.

OpenAI, Google and many other artificial intelligence innovators have opted to keep their latest models proprietary.

A more open approach has obvious benefits for research and enterprise but can also be advantageous for the companies that put these tools out.

Marketplace’s Meghan McCarty Carino spoke with Tom Goldstein, a professor of computer science at the University of Maryland, about the advantages and disadvantages of the open-source approach.

The following is an edited transcript of their conversation.

Tom Goldstein: One benefit that companies do get from open source is just that by open-sourcing tools and participating in the academic community, it helps increase public perceptions of their company. So, for example, Google has always been perceived as a very innovative, very technology-focused company that’s doing all sorts of new things. And one reason why is because they invest a lot in research and they do a really good job publicizing the research that they do. And by open-sourcing the research they do, they make really big impacts in the academic community. And that helps them not just in terms of the public perception of their company, but also when they want to hire people. It’s really easy to hire technical people when technical people know that all of the good open-source tools and a lot of the papers they read are coming out of Google. And now that Google has closed off a lot of its R&D because it’s locked in competition with Microsoft, Facebook probably sees this as an opportunity to move into that space and try to take over that throne.

Meghan McCarty Carino: What are some of the risks with the open-source approach when it comes to these foundation models?

Goldstein: People are very concerned about malicious uses of language models. And by having open-source models, there are certain ways in which you might increase the rate at which those malicious uses happen. Some of these are more benign, but they’re certainly annoyances. You might go on Amazon and see tons of strange-looking reviews, many of which are probably generated by language models. Now, there could be other more malicious things all the way from automatically generated spam that’s very difficult to filter, sophisticated spear-phishing [targeted email cybercrime] campaigns, all the way to more things like spreading conspiracy theories, potentially election manipulation. Now, those malicious use cases, they can certainly be done with proprietary tools. For example, you could use something like the ChatGPT [application programming interface] to engage in these kinds of behaviors. But it becomes potentially more difficult to police them when you have open-source tools because then even if these platforms do some sort of moderation to keep those sorts of use cases from happening, people still have their own open-source models that they can run on their own computers, that they have their own control of, that makes things just easier for them. But there’s also pros as well. The academic community in particular can’t do research on language models very well when there are no open-source language models available. And so there’s a lot of scientific applications that the open-source community would like to use them for. There’s just a lot of creative applications that people will come up with when they have access to these kinds of tools. So there’s really trade-offs when you open-source these kinds of models.

McCarty Carino: One question that’s come up a lot in reference to some of the more closed models is their training data. You know, there’s not a lot of transparency about what data they’ve been trained on, whether there might be, you know, some copyright issues involved in the training. Is it necessarily the case that open-source models have more transparency on what the training data is?

Goldstein: No, not necessarily. So historically, open-source models do tend to be more transparent about their data. And the LLaMA 1, the first version of the LLaMA model, was quite transparent. It actually showed exactly what their dataset was. With LLaMA 2, they changed that, so they decided to be a little bit more closed. And they’re actually just a lot more closed about the training process in general. They didn’t say anything about what data they used and where it came from. And that is likely because they’re concerned about liability issues. Historically, people have used web data to train all sorts of different models, and no one has really thought that this was problematic. But with generative models, there are organizations that are more concerned about this. There’s just a lot of uncertainty right now. The courts are eventually going to have to make decisions about what kind of data usage is OK for model training. But until that happens, I think large organizations are very skittish about making public statements about the sources of their data.

McCarty Carino: Do you think we might see more open-source models like LLaMA or more of the closed models like ChatGPT?

Goldstein: I think we’re probably going to see fewer and fewer models released in general, just because as these models get better, the bar gets pushed up, and it takes longer to try to exceed that bar. So the LLaMA 2 model, that is actually a pretty incredible effort, and it will be difficult for anyone to exceed that effort. Even with industrial resources, it will be difficult to exceed the quality of LLaMA 2. And with open-source community resources, it will be very difficult to exceed LLaMA 2. Over the last few months, we’ve actually seen a lot of new language models drop. Their development probably started months ago, when they were anticipating that their model would be state of the art. And then, by the time it dropped, it kind of landed with a thud. People just weren’t that excited about it because there were already great open-source models available. And I think we’re going to see the emergence of more closed-source models, for sure, over the next few months, and you’ll probably see other large American companies that want to do it. I know there are definitely large data providers in Asia that are interested in adding [large language models] to their own platform. And so I think we’re going to continue to see some development in terms of proprietary LLMs. I have a feeling that the rate at which we see these open-source LLMs is going to slow down.

More on this

We framed this discussion as open versus closed AI models, but in reality, it’s almost never just one or the other, as a recent article from Wired laid out.

Author Irene Solaiman of the AI company Hugging Face says popular models like ChatGPT and Midjourney are actually sort of in the middle of the spectrum. They’re broadly accessible to the public, which does allow for more feedback than models that require special access or skills to use.

But these popular models don’t make it possible for outsiders to really study how they work. On these assumptions, Meta’s LLaMA 2 would be further down the openness axis but still not all the way, as it hasn’t revealed its training data.

As we noted with Tom Goldstein, training data has become an increasingly contentious issue, with a number of artists, authors and organizations suing AI companies for allegedly using their copyrighted works to train models without consent.

AI companies have argued that their use of those works is protected under the fair use doctrine, which allows the use or remixing of copyrighted material in the service of creating something new. But a lot of companies have still been pretty cagey about what data they actually use.

The European Union could require more disclosure of training data under a new, comprehensive AI law making its way through the legislative process.

But as Time magazine reported this summer, big players like OpenAI have lobbied hard, and often successfully, to water down such requirements.
