Google is turning on AI-powered noise cancellation in Google Meet today. Like Microsoft Teams’ upcoming noise suppression functionality, the feature leverages supervised learning, which entails training an AI model on a labeled data set. This is a gradual rollout, so if you are a G Suite customer you may not get noise cancellation until later this month. Noise cancellation will hit the web first, with Android and iOS coming later.
In April, Google announced that Meet’s noise cancellation feature was coming to G Suite Enterprise and G Suite Enterprise for Education customers. Here’s how the company described it: “To help limit interruptions to your meeting, Meet can now intelligently filter out background distractions — like your dog barking or keystrokes as you take meeting notes.” The “denoiser,” as it’s colloquially known, is on by default, though you can turn it off in Google Meet’s settings.
The use of collaboration and video conferencing tools has exploded as the coronavirus crisis forces millions to learn and work from home. Google is one of many companies trying to one-up Zoom, which saw its daily meeting participants soar from 10 million to over 200 million in three months. Google is positioning Meet, which has 100 million daily meeting participants as of April, as the G Suite alternative to Zoom for businesses and consumers alike.
Serge Lachapelle, G Suite director of product management, has been working on video conferencing for 25 years, 13 of those at Google. As most of the company shifted to working from home, Lachapelle’s team got the go-ahead to deploy the denoiser in Google Meet meetings. We discussed how the project started, how his team built noise cancellation, the data required, the AI model, how the denoiser works, what noise it cancels out and what it doesn’t, privacy, and user experience considerations (there is no visual indication that the denoiser is on).
Starting in 2017
When Google rolls out big new features, it typically starts with a small percentage of users and then ramps up the rollout based on the results. Noise cancellation will be no different. “We plan on doing this gradually over the month of June,” Lachapelle said. “But we have been using it a lot within Google over the past year, actually.”
The project goes back further than that, beginning with Google’s acquisition of Limes Audio in January 2017. “With this acquisition, we got some amazing audio experts into our Stockholm office,” Lachapelle said.
The original noise cancellation idea was born out of annoyances while conducting meetings across time zones.
“It started off as a project from our conference rooms,” Lachapelle said. “I’m based out of Stockholm. When we meet with the U.S., it’s usually around this time [morning in the U.S., evening in Europe]. You’ll hear a lot of cling, cling, cling and weird little noises of people eating their breakfast or eating their dinners or taking late meetings at home and kids screaming and all. It was really that that triggered off this project about a year and a half ago.”
The team did a lot of work finding the right data, building AI models, and addressing latency. But the biggest obstacle was forming the idea in the first place, followed by multiple simulations and evaluations.
“It had never been done,” Lachapelle said. “At first, we thought we would require hardware for this, dedicated machine learning hardware chips. It was a very small project. Like how we do things at Google is usually things start very small. I venture a guess to say this started in the fall of 2018. It probably took a month or two or three to build a compelling prototype.”
“And then you get the team excited around it,” he continued. “Then you get your leadership excited around it. Then you get it funded to start exploring this more in depth. And then you start bringing it into a product phase. Since a lot of this has never been done, it can take a year to get things rolled out. We started rolling it out to the company more broadly, I would say around December, January. When people started working at home, at Google, the use of it increased a lot. And then we got a good confirmation that ‘Wow, we’ve got something here. Let’s go.’”
Corpus data
Similar to speech recognition, which requires figuring out what is speech and what is not, this type of feature requires training a machine learning model to understand the difference between noise and speech, and then keep just the speech. At first, the team used thousands of its own meetings to train the model. “We’d say ‘Okay everyone, just so you know we’re recording this, and we’re going to submit it to start training the model.’” The company also relied on audio from YouTube videos “wherever there’s a lot of people talking. So either groups in the same room or back and forth.”
“The algorithm was trained using a mixed data set featuring noise and clean speech,” Lachapelle said. Other Google employees, including from the Google Brain team and the Google Research team, also contributed, though not with audio from their meetings. “The algorithm was not trained on internal recordings, but instead employees submitted feedback extensively about their experiences, which allowed the team to optimize. It is important to say that this project stands on the shoulders of giants. Speech recognition and enhancement has been heavily invested in at Google over the years and much of this work has been reused.”
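To make the “mixed data set featuring noise and clean speech” concrete, here is a minimal sketch of how (noisy, clean) training pairs for a supervised denoiser are commonly assembled: clean speech is overlaid with background noise at a randomly chosen signal-to-noise ratio, and the model learns to map the noisy mix back to the clean recording. The library, mono-audio assumption, and SNR range below are illustrative, not Google’s actual pipeline.

```python
# Sketch: building (noisy, clean) training pairs for a supervised denoiser.
# Assumes mono (1-D) float audio; paths, library, and SNR range are illustrative.
import random
import numpy as np
import soundfile as sf  # assumed dependency for reading audio files

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay noise on clean speech at a target signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it matches the speech length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    speech_power = np.mean(clean ** 2) + 1e-10
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

def make_training_pair(clean_path: str, noise_path: str) -> tuple[np.ndarray, np.ndarray]:
    """Return (noisy_input, clean_target); the model learns to map the first to the second."""
    clean, _ = sf.read(clean_path, dtype="float32")
    noise, _ = sf.read(noise_path, dtype="float32")
    snr_db = random.uniform(0.0, 20.0)  # assumed range, from very noisy to mildly noisy
    return mix_at_snr(clean, noise, snr_db), clean
```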
Nevertheless, a lot of manual validation was still required. “I’ve seen everything from engineers coming to work with maracas, guitars, and accordions to just normal YouTubers doing livestreaming and testing it out on that. The range has been pretty broad.”
The denoiser in action
The feature may be called “noise cancellation,” but that doesn’t mean it cancels all noise. First off, it’s difficult for everyone to agree on what sounds constitute noise. And even if most humans can agree that something is an unwanted noise in a meeting, it’s not easy to get an AI model to concur without overdoing it.
“It works well on a door slamming,” Lachapelle said. “It works well on dogs barking, kids fighting — so-so. We’re taking a softer approach at first, or sometimes we’re not going to cancel everything because we don’t want to go overboard and start canceling things out that shouldn’t be canceled. Sometimes it’s good for you to hear that I’m taking a deep breath, or those more natural noises. So this is going to be a project that’s going to go on for many years as we tune it to become better and better and better.”
On our call, Lachapelle demonstrated a few examples of the feature in action. He knocked a pen around inside a mug, tapped on a can, rustled a plastic bag, and even applauded. Then he did it all again after turning on the denoiser — it worked. You can watch him recreate similar noises (rustling a roasted nut bag, clicking a pen, hitting an Allen key in a glass, snapping a ruler, clapping) in the video up top.
“The applause part was a kind of a strange moment because when we did our first demo of this to the whole team, people broke out in applause and it canceled out the applause,” Lachapelle said. “That’s when we understood, ‘Oh, we’re going to need to have a controller to turn this on and off in the settings because there’s probably going to be some use cases where you really don’t want your noise to be removed.’”
Vocal ranges
The line for what the denoiser does and doesn’t cancel out is blurry. It’s not as simple as detecting human voices and negating everything else.
“The human voice has such a large range,” Lachapelle said. “I would say screaming is a tough one. This is a human voice, but it’s noise. Dogs at certain pitches, that’s also very hard. So some of it sometimes will slip through. On those kinds of things, it’s still a work in progress.”
“Things like vacuum cleaners, we’ve got down really well,” he continued. “I had a big customer meeting the other day with Christina, who’s in Zurich — she leads our support team. And so we were talking with this customer, and all of a sudden I see in the back, her Roomba starts rolling into the room and gets stuck under her desk. She was there trying to talk to the customer and getting rid of the Roomba, and we never heard the Roomba go. It was completely silent. I thought that was kind of the ultimate test. If we can get those kinds of things out — drills, people that have construction next door, people that are sitting in the kitchen and they’ve got the blender going — those kinds of things it’s really, really good at.”
A musical instrument will probably also get filtered out. “To a pretty large degree, it does,” Lachapelle said. “Especially percussion instruments. Sometimes a guitar can sound very much like a voice — you’re starting to touch the limits there. But if you have music playing in the background, usually it’ll cut it all out.”
What about laughter? “I’ve never heard it block laughter.”
What about singing? “Singing works.”
Singing goes through, but the musical instruments don’t, “especially if they’re in the background.”
Crucially, Google Meet’s noise cancellation is being rolled out for all languages. That might seem obvious at first, but Lachapelle said the team discovered it was “super important” to test the system on multiple languages.
“When we speak English, there’s a certain range of voice we use,” Lachapelle said. “There’s a certain way of delivering the consonants and the vowels compared to other languages. So those are big considerations. We did a lot of validation across different languages. We tested this a lot.”
Proximity and amplitude
Another challenge was dealing with proximity. This is not a machine learning problem — it’s a “too much noise too close to the microphone” problem.
“Keyboard typing is tricky,” Lachapelle said. “It’s like a step function in the audio signal. Especially if the keyboard is close to the microphone, that bang of the key right next to the microphone means that we can’t get voice out of the microphone because the microphone got saturated by the keyboard. So there are cases where if I’m overloading the microphone, my voice can’t get through. It becomes more or less impossible.”
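As an illustration of the saturation problem Lachapelle describes, the short sketch below flags frames where the microphone signal is pinned at full scale, the point past which no denoiser can recover the voice. The frame length, clip threshold, and ratio are assumptions for the example, not values Google has published.

```python
# Sketch: flagging audio frames where the microphone has saturated (clipped).
# Once samples are pinned at full scale, the speech information is gone, so
# denoising cannot recover the voice. Thresholds here are illustrative.
import numpy as np

def clipped_frames(samples: np.ndarray, frame_len: int = 480,
                   clip_level: float = 0.99, ratio: float = 0.02) -> list[int]:
    """Return indices of frames where more than `ratio` of samples hit full scale.

    `samples` is mono float audio in [-1.0, 1.0]; 480 samples is 10 ms at 48 kHz.
    """
    flagged = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if np.mean(np.abs(frame) >= clip_level) > ratio:
            flagged.append(start // frame_len)
    return flagged
```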
The team factored in distance from the microphone when determining what to filter out, so the model adapts for amplitude. On our call, Lachapelle played some music from his iPhone. When he put his phone’s speakers right next to the microphone, the music came through faintly while his voice, coming from farther away, was slightly distorted. Google Meet did not cancel out the music completely; it was just muffled. When he turned off the denoiser, the music came through at full volume.
“That’s when you see it find that threshold that we were talking about,” Lachapelle said. “You don’t want to have false positives, so we will err on the side of safety. It’s better to let something go through than to block something that really should go through. That’s what we’re going to start tuning now, once we start releasing this to more and more users. We’ll be able to get a lot of feedback on it. Someone out there is going to have a scenario we didn’t think of, and we’ll have to take that into consideration and further the model.”
Tuning
Tuning the AI model is going to be difficult, given all the different types of noise it has to handle. But the end goal isn’t to get the model to cancel out background noise completely. Nor is it to make sure that all types of laughter get through 100% of the time.
“The goal is to make the conversation better,” Lachapelle said. “So the goal is the intelligibility of what you and I are saying — absolutely. And if the music is playing in the background and we can’t cancel it all out, as long as you and I can have a better conversation with it turned on, then it’s a win. So it’s always about you and I being able to understand each other better.”
Making the conversation more coherent is particularly important in the era of smartphones and people working on the go.
“We have a big chunk of users now that are using mobiles, and we’ve never seen this much mobile usage, percentage-wise,” Lachapelle said. “I know we all talk about billions of minutes and so on going on in the system. But of that big chunk, the percentage of mobile users has never been this high. And mobile users are usually in very noisy environments. So for that use case, it’s going to have a huge impact. Here I’m sitting in my little office in Sweden with my fancy mic and my good headphones, probably not what we designed this for. We designed this for noisy environments because people need to talk wherever they are.”
Privacy
When you’re on a Google Meet call, your voice is sent from your device to a Google datacenter, where it runs through the machine learning model on a TPU, gets re-encrypted, and is then sent back to the meeting. (Media is always encrypted during transport, even when moving within Google’s own networks, computers, and datacenters. There are two exceptions: when you call in on a traditional phone, and when a meeting is recorded.)
“In the case of denoising, the data is read by the denoiser using the key that is shared between all the participants, denoised, and then sent off using the same key,” Lachapelle said. “This is done in a secure service (we call this borg) in our datacenter, and the data is never accessible outside the denoiser process, in order to ensure privacy, confidentiality, and safety. We’re still working on the plumbing in our infrastructure to connect the people that dial in with a phone normally. But that’s going to come a little bit later because they are a very noisy bunch.”
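Put together, the flow described above looks roughly like the sketch below. Every helper name and body is a placeholder for this illustration; Google has not published its internal APIs.

```python
# Rough sketch of the server-side denoising flow described above. The helpers
# are placeholders for illustration, not Google's internal APIs.

def decrypt_audio(frame: bytes, key: bytes) -> bytes:
    raise NotImplementedError("placeholder for decryption with the shared meeting key")

def run_denoiser(pcm: bytes) -> bytes:
    raise NotImplementedError("placeholder for the ML denoiser running on a TPU")

def encrypt_audio(pcm: bytes, key: bytes) -> bytes:
    raise NotImplementedError("placeholder for re-encryption with the same meeting key")

def process_participant_audio(encrypted_frame: bytes, meeting_key: bytes) -> bytes:
    """Decrypt, denoise, and re-encrypt one audio frame inside the secure service."""
    pcm = decrypt_audio(encrypted_frame, meeting_key)  # key shared by all participants
    clean_pcm = run_denoiser(pcm)                      # raw audio never leaves this process
    return encrypt_audio(clean_pcm, meeting_key)       # same key on the way back out
```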
Lachapelle emphasized repeatedly that Google will be improving the feature over time, but not directly using external meetings. Recorded meetings will not be used to train the AI either.
“We don’t look at anything that’s going on in the meetings, unless you decide to record a meeting,” Lachapelle said. “Then, of course, we take the meeting and we put it to Google Drive. So the way we’re going to work is through our customer channels and support and so on and trying to identify cases where things did not work as predicted. Internally at Google, there are meetings that are recorded, and if someone identifies a problem that happened, then hopefully they’ll send it to the team. But we don’t look at recordings for this purpose, unless someone sends us the file manually.”
User experience considerations
If you’re a G Suite enterprise customer, when Google flips the switch for you this month, Meet’s noise cancellation feature will be on by default. You will have to turn it off in settings when you want “noise” to come through. On the web, you’ll click the three dots at the bottom right, then Settings. Under the Audio tab, between microphone and speakers, you’ll see an extra switch that you can turn on or off. It’s labeled “Noise cancellation: Filters out sound that isn’t speech.”
Google decided to put this switch in settings, as opposed to somewhere visible during a call. And there is no visual indication that noise is being canceled out. This means noise will be canceled out on calls and people won’t even be aware it’s happening, let alone that the feature exists. We asked Lachapelle why those decisions were made.
“There’s some people that would perhaps want us to show like ‘Look at how good we are. Right now your noise is being filtered out.’ I guess you could bring it down to user interface considerations,” Lachapelle said. “We’ve done a lot of user testing and interviews of users. We had users in labs last year before confinement, where we tested different models on them. And that combined with — you can see Meet doesn’t have buttons all over the place, it’s a fairly clean UX. Basically, my answer to your question would be, it’s based on the user research we’ve done, and on trying to keep the interface of Meet as clean as possible.”
Who controls the noise cancellation?
On a typical Google Meet call, you can mute yourself and — depending on the settings — mute others. But Google chose to not let users noise-cancel others. The noise cancellation occurs on the sender’s side — where the noise originates — so that’s where the switch is. While that might make sense in most cases, it means the receiver cannot control noise cancellation for what they hear. The team made that decision deliberately, but it wasn’t an easy one.
“I don’t think the off switch is going to be used much at all,” Lachapelle said. “So putting it front and center might be sort of overloading it. This should just be magic and work in the background. But like again, your ideas are spot on. This is exactly what we’ve been talking about. We’ve been testing. So it really shows that you’ve done a lot of homework on this. Because these are the challenges. And I don’t think any of us is 100% sure that this is the right way. Let’s see how it goes.”
If it doesn’t work out, that’s okay. Google has already done the majority of the work. Moving switches around — “I don’t want to say that it’s simple, but it’s simpler than changing the whole machine learning model.” We asked whether alternative solutions could mean having the switch on the receiving end, or even on both ends.
“So we’ll try with this, and we might want to move to what you’re describing, as we get this into the hands of more and more users,” Lachapelle said. “By no means is this work done. This is going to be work that’s going to go on for a while. Also, we’re going to learn a lot of things. Like what controls are the best for the users. How do you make users understand that this is going on? Do they need to understand that this is going on? We think we have an idea of how to get the first step, but beyond that it’ll be a journey with all of our users.”
If the current solution doesn’t work, Lachapelle said the team will probably build a few prototypes, do some more user research, and test them out via G Suite’s alpha program.
Cloud versus edge
Google also made a conscious decision to put the machine learning model in the cloud, which wasn’t the immediately obvious choice.
“There’s a lot of ways to apply these models,” Lachapelle said. “Some require much beefier endpoints — you need a good computer. You’ve seen some of the stuff that has been released, some of it as an extension or some of it requires a more powerful graphics card. We didn’t want to go that way. We wanted to make sure that access to this would be possible on your phones, no matter what phone you have, on your laptops. Laptops are getting thinner — they don’t have fans anymore. Loading them too hard with CPU isn’t a good idea. So we decided to see if we could do this in the cloud.”
Using the cloud simply wasn’t feasible before.
“Manipulating media in the cloud, just five, six, seven years ago could add 200 milliseconds delay, 300 milliseconds delay,” Lachapelle said. “Our job has always been passing through the cloud as quickly as possible. But now with these TensorFlow processors, and basically the way that our infrastructure is built, we discovered that we could do media manipulation in real time and add sometimes only around 20 milliseconds of delay. So that’s the road we took.”
Google did consider using the edge — putting the machine learning model on the actual device, say in the Google Meet app for Android and iOS.
“Of course we thought of it,” Lachapelle said. “But we decided that we wanted to have a more consistent experience across devices. Let’s say that I have an advanced i9 processor and then I get to use [noise cancellation]. But then if I move to my laptop that only has an i3 processor, my voice is so much worse. And so we really tried to see how can we bring this to a large group of people in a consistent way. It’s been about the consistency of the experience.”
Google’s decision to use the cloud means you should have the exact same denoised meeting experience on every device. You won’t have to update anything either, not even the Google Meet app on your phone. Noise cancellation will be turned on server-side.
“We really think it’s going to help out a lot,” Lachapelle said. “I’ve worked on echo cancellation, on cleaning up video artifacts in real time, all these things. And this is the first time we can do our signal processing in the cloud. We’re quite excited about it. I think that this can change a lot of the signal processing paradigms. Whereas it used to be very, very complex math, and math that is often limited by the hardware you have — using machine learning models in the cloud instead of the complex math to achieve the same, or better, results.”
Speed and cost
In addition to training the model on different types of noise, there was another big technical hurdle to overcome: speed.
“Doing this without slowing things down is so important because that’s basically what a big chunk of our team does — try to optimize everything for speed, all the time,” Lachapelle said. “We can’t introduce features that slow things down. And so I would say that just optimizing the code so that it becomes as fast as possible is probably more than half of the work. More than creating the model, more than the whole machine learning part. It’s just like optimize, optimize, optimize. That’s been the hardest hurdle.”
Google seems happy with the latency, but there is also a question of cost. It’s expensive to add an extra processing step for every single attendee in every single meeting hosted in Google Cloud.
“There’s a cost associated with it,” Lachapelle acknowledged. “Absolutely. But in our modeling, we felt that this just moves the needle so much that this is something we need to do. And it’s a feature that we will be bringing at first to our paying G Suite customers. As we see how much it’s being used and we continue to improve it, hopefully we’ll be able to bring it to a larger and larger group of users.”