Speech recognition is becoming an increasingly popular way to interact with computers. Voice control works with a growing number of online services, such as ordering food or scheduling a ride. However, little is known about how these systems respond to malicious, covert attacks. It is possible to control them with sounds that are unintelligible to humans, although a person nearby would still hear the noise.
Zhang et al. took this a step further by developing an attack that is inaudible to humans, which they called DolphinAttack. Without being heard by a person, they were able to initiate a FaceTime call on an iPhone, activate Google Now to switch a phone to airplane mode, and even manipulate the navigation system in an Audi automobile.
Humans can hear sounds at frequencies between 20 Hz and 20,000 Hz; anything outside this range is inaudible to people. Unsurprisingly, audio equipment normally filters out sounds outside this range. To overcome the filtering, the researchers exploited ‘non-linear effects’ in the microphone hardware: a loud but inaudible ultrasonic signal produces components within the expected frequency range at the microphone, which the device then processes as ordinary sound.
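The effect can be illustrated with a minimal numerical sketch. It is not the researchers' implementation: the quadratic microphone model, its coefficients, the 30 kHz carrier and the 1 kHz tone standing in for a voice command are all assumptions chosen for illustration. The sketch amplitude-modulates the tone onto the ultrasonic carrier, checks that the transmitted signal has essentially no energy in the audible band, and then shows that a squared term in the microphone response brings the tone back into that band.

```python
import numpy as np

FS = 192_000          # sample rate in Hz (assumed; high enough for ultrasound)
N_SAMPLES = 19_200    # 0.1 s of signal, so every tone falls on an exact FFT bin
CARRIER_HZ = 30_000   # ultrasonic carrier, above the ~20 kHz hearing limit (assumed)
TONE_HZ = 1_000       # stand-in for a voice command (assumed)
A1, A2 = 1.0, 0.1     # hypothetical linear and quadratic microphone gains

def band_peak(signal, fs, f_lo, f_hi):
    """Return (frequency, amplitude) of the strongest component in [f_lo, f_hi] Hz."""
    amplitudes = np.abs(np.fft.rfft(signal)) * 2 / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    idx = np.argmax(amplitudes[mask])
    return freqs[mask][idx], amplitudes[mask][idx]

t = np.arange(N_SAMPLES) / FS
command = np.cos(2 * np.pi * TONE_HZ * t)

# Amplitude modulation: all energy sits at the carrier and carrier +/- 1 kHz.
transmitted = (1 + command) * np.cos(2 * np.pi * CARRIER_HZ * t)

# Assumed microphone model: a linear term plus a small squared term.
# Squaring the modulated carrier recreates the command at its original frequency.
received = A1 * transmitted + A2 * transmitted ** 2

tx_freq, tx_amp = band_peak(transmitted, FS, 20, 20_000)
rx_freq, rx_amp = band_peak(received, FS, 20, 20_000)

print(f"audible-band peak in transmitted signal: {tx_amp:.2e}")
print(f"audible-band peak after microphone: {rx_amp:.3f} at {rx_freq:.0f} Hz")
```

Under these assumptions the transmitted signal contains only numerical noise below 20 kHz, while the signal after the modelled microphone contains a clear component at the tone's frequency. A real microphone also applies a low-pass filter after this stage, which would remove the remaining ultrasonic components and leave the recovered command.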
Using this technique, they were able both to activate the listening mode and to issue instructions, by producing harmonics from snippets of the device owner's voice or by brute-forcing voice tonalities to overcome authentication based on voice pattern matching.
In an experiment to test the feasibility of this kind of attack, they used a specialised speaker placed less than 2 metres from the device. The effectiveness of the attack was reduced by greater background noise, greater distances and lower attack volumes. They validated DolphinAttack in multiple languages on 7 popular speech recognition systems (e.g., Siri, Google Now, Alexa) and across 16 common voice-controllable platforms. The researchers recommend both hardware-based and software-based defence strategies to mitigate attacks, such as limiting the operating range of microphones and better detecting modulated voice commands.
In short, it is possible to generate voice commands that humans cannot hear and use them to make services such as Siri perform actions on a device.