A year and a half ago I blogged about Nicholas Carlini's voice-command hack, in which he hid voice commands in white noise and used them to control an iPhone.
Now he has souped up his voice-command masking technique to the point where he can hide no fewer than 50 characters per second in songs and other audio signals: he manipulates the waveform minimally, but in a way machines pick up (image above), and smuggles commands along with the audio. Impressive. Especially if you think of these hacked audio signals together with Google's Duplex natural-voice gimmickry.
A group of students from University of California, Berkeley, and Georgetown University showed in 2016 that they could hide commands in white noise played over loudspeakers and through YouTube videos to get smart devices to turn on airplane mode or open a website.
This month, some of those Berkeley researchers published a research paper that went further, saying they could embed commands directly into recordings of music or spoken text. So while a human listener hears someone talking or an orchestra playing, Amazon’s Echo speaker might hear an instruction to add something to your shopping list. […]
With audio attacks, the researchers are exploiting the gap between human and machine speech recognition. Speech recognition systems typically translate each sound to a letter, eventually compiling those into words and phrases. By making slight changes to audio files, researchers were able to cancel out the sound that the speech recognition system was supposed to hear and replace it with a sound that would be transcribed differently by machines while being nearly undetectable to the human ear.
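The "each sound to a letter" pipeline described above is, in systems like DeepSpeech, a CTC-style decoder: the network emits one character (or a blank) per audio frame, and the decoder collapses repeated labels and drops the blanks. A minimal sketch of that greedy collapse step (the frame labels here are made up for illustration; a real decoder works on the network's per-frame probability outputs):

```python
def ctc_greedy_collapse(frame_labels, blank="-"):
    """Collapse a per-frame character sequence into a transcript:
    merge adjacent repeats, then drop the blank symbol."""
    out = []
    prev = None
    for ch in frame_labels:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

# One label per audio frame; repeats and blanks absorb frame-rate slack.
print(ctc_greedy_collapse("hheel--lloo"))  # -> "hello"
```

This framing also explains the "50 characters per second" ceiling quoted below: at most one character can be emitted per decoder frame, so the frame rate bounds the transcription rate.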
We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar, but transcribes as any phrase we choose (recognizing up to 50 characters per second of audio). We apply our white-box iterative optimization-based attack to Mozilla’s implementation DeepSpeech end-to-end, and show it has a 100% success rate.
The feasibility of this attack introduces a new domain to study adversarial examples.
We demonstrate targeted audio adversarial examples are effective on automatic speech recognition. With optimization-based attacks applied end-to-end, we are able to turn any audio waveform into any target transcription with 100% success by only adding a slight distortion. We can cause audio to transcribe up to 50 characters per second (the theoretical maximum), cause music to transcribe as arbitrary speech, and hide speech from being transcribed […].
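The "iterative optimization-based attack" in these quotes works by gradient-descending a perturbation so the recognizer's transcription matches a chosen target while a distortion penalty keeps the perturbation small. A toy sketch of that loop, using a stand-in linear "recognizer" instead of a real speech model (the actual attack differentiates through DeepSpeech's full CTC loss; the function names, model, and constants here are my own illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def make_adversarial(x, W, target, steps=500, lr=0.05, c=0.01):
    """White-box iterative attack sketch: gradient-descend a perturbation
    `delta` so the toy recognizer `W` classifies x + delta as `target`,
    while an L2 penalty (weight c) keeps the distortion small."""
    delta = np.zeros_like(x)
    onehot = np.eye(W.shape[0])[target]
    for _ in range(steps):
        probs = softmax(W @ (x + delta))
        # d(cross-entropy)/d(input) for a linear model: W^T (p - onehot),
        # plus the gradient of the c * ||delta||^2 distortion penalty.
        grad = W.T @ (probs - onehot) + 2 * c * delta
        delta -= lr * grad
    return delta

# Toy "audio": the clean signal is recognized as class 0 ...
x = np.array([1.0, 0.0])
W = np.eye(2)
delta = make_adversarial(x, W, target=1)
# ... the perturbed signal as the attacker's chosen target, class 1.
print(np.argmax(W @ x), np.argmax(W @ (x + delta)))  # -> 0 1
```

The trade-off the paper describes lives in the constant `c`: a larger distortion penalty keeps the audio closer to the original but makes the target transcription harder to reach.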
Can these attacks be played over-the-air? Image-based adversarial examples have been shown to be feasible in the physical world. In the audio space, both hidden voice commands and Dolphin Attack’s inaudible voice commands are effective over-the-air when played by a speaker and recorded by a microphone.