Microsoft this week announced it will acquire Nuance, a Boston-based speech recognition and artificial intelligence company, for around $16 billion. It’s the company’s second-largest acquisition, after LinkedIn, and a big bet on speech recognition technology.
Nuance is used most in health care. About 10,000 health care facilities worldwide use it to capture conversations between patients and doctors and transcribe them in real time. I spoke with Daniel Hong, a research director at Forrester. He told me that a controlled environment like a clinic or doctor’s office can make the tech more accurate. The following is an edited transcript of our conversation.
Daniel Hong: In medicine, it’s pretty finite, the way people will express themselves about pain, about discomfort, and how the doctor can also talk about prescriptions. So if you have something that’s a bit more finite, the accuracy rates actually are higher.
Molly Wood: It sounds like you’re describing kind of two paths for interaction with voice recognition technologies. And one is sort of this high-accuracy, maybe low-flexibility, interaction, right? Where you’re using a lot of bank terms, using a lot of medical terms. And then, the other is this sort of free-floating, “I’m going to say whatever I want to Google.”
Hong: The Googles and the Amazons and Apples are actually pretty good with general questions. But now we’re seeing a little bit more complexity in terms of the questions that you can ask. So you can ask, “How old was Joe Biden’s wife when he became president?” Another area is context, having that conversation with Google Homes and Amazon Alexas. “How’s the weather in Florence? What about Paris? What about the temperature in these areas in Celsius?” And that’s an area where we’ve seen quite a bit of improvement over the last few years. But there’s still a long way to go in order to have that natural dialogue and full conversation with a speech recognition system.
Wood: So, of course, Google, Amazon and Apple have been in this space for a while now, from the consumer perspective. But how far ahead are they of the competition?
Hong: When it comes to speech, it’s really about how much data you have, how many utterances you’re capturing, from people that are engaging that speech application. And when you’re looking at the whole consumerism of speech, and you have all these devices, from the smartphones to the smart speakers, it’s just a lot more data. So if they’re capturing all that data, and they’re using machine learning, deep neural nets, to constantly improve and refine the accuracy and understanding what the consumer is trying to say, then they are kind of light-years ahead of others.
Wood: What do you think are the biggest growth areas you see for speech recognition? I mean, I’m surprised at how many people I know who still don’t just dictate every single thing to their phone like I do.
Hong: Yeah, I think we’re just gonna see the application of speech being used more on a day-to-day basis to conduct tasks. So I think, as a consumer in the home, you’re able to operate lights, you’re able to use that as command-and-control features of searching for things on your television, to getting alerts and actually talking to some of your smart refrigerators, talking to your car and interacting in that way, changing the temperature control, [and] so forth. It’s essentially a user interface. We’ve been so used to just typing or turning things on and off, [that] we’ll be more and more inclined to use voice as the technology improves. And actually, there are a lot more microphones in the house to capture all that. In addition, I think that there will be a lot more innovation when it comes to the use of voice in the future, of using voice as a good way to interact, like coaching or having an assistant there, like your own personal assistant [and] being able to kind of interact with you via voice. There’s also things in health care, like with health checks [and] being able to diagnose just through the use of voice. I think we’re really on the cusp of voice being able to have a lot more innovative use cases. It’s just we need to get the accuracy there. And we have to essentially get more devices with voice interfaces out there to consumers.
Wood: Certainly, the reaction that is inevitable when you say “more microphones” is a question of privacy, especially if we start using this in doctors’ offices. Is there, do you think, a corresponding effort to do on-device processing or make sure that recordings aren’t kept? Or is that still pretty much the Wild West?
Hong: I think it’s a bit of both. I think because of [personal health information] and HIPAA, there’s a lot of compliance [required] in health care. So a lot of these systems and hospitals, and even those being used by health care insurers, have to meet these very strict guidelines to be able to have the technology rolled out to their patients and to their members. That said, it’ll have to evolve over time. You should probably let the patient know that this is being recorded and they have to have the opt-in capability. All these things will be ironed out as speech becomes more mainstream in health care and in other areas of the consumer’s world.
Wood: In terms of accuracy and the ability to respond to a query, ultimately, is that driven by the database on the back end as much as the actual speech recognition? Like, I’m thinking of how Alexa doesn’t use Google, but Google does. And as a result, I can say to Google, “What’s that movie about the kids in high school that has Matthew McConaughey in it?” And Google will know that, but Alexa won’t. And I just wonder how much of that is driven by the sources they’ve chosen to pull from.
Hong: It is. It’s that. I mean, if you break it down, you’re going to break down what is being said, is one part. And the other part would be, what is this person asking? So there’s kind of two different parts there. Once you get that, then you can identify what this person is asking. And then, you have to match that to the information that they’re looking for or whatever it is that they want to get done. And Google has Google search results. Google has a lot of information that’s synced to the Google speech experience. So I think that could be a reason why you’ll get perhaps more answers on that Google front than some of your Alexa devices.
Related links: More insight from Molly Wood
Microsoft’s digital assistant Cortana is embedded in Windows 10, but Hong told us that several big companies license Cortana for speech-to-text customer service.
Computer Weekly has a nice piece about Microsoft’s possible goals, and to be clear, those definitely include getting a bigger foothold in the health care space. Last September, Microsoft announced its first-ever cloud product for a single industry, Microsoft Cloud for Healthcare, which aims to make it easier for providers to manage care, appointments, patient data, monitoring and analytics. And it seems like its direction was heavily influenced by being announced in the middle of a pandemic.
Nuance was actually a partner already in that product, so in some ways the acquisition is a natural evolution. But, of course, its technology will have applications in other parts of the business because Cortana, while perhaps quietly being licensed for use in big enterprises, isn’t exactly the household name for people (other than Halo players) that Alexa, Siri and Hey Google are.
And if that set off all three of your devices at once, for those listening to this episode out loud, I apologize. But also, maybe pit them against each other with a question about a movie or something. Record it and send us the results: email@example.com.