Defeating Spam Callers With Speech Recognition and Call Forwarding

Every day at 11:00 AM my iPhone rings. With the “seamless” “magic” of Apple’s Continuity, the ringing quickly spreads to my personal MacBook, work MacBook, iPad, and Apple Watch.

“This is your last warning regarding your car insurance…”

I don’t own a car. I wouldn’t mind if this happened once or twice a month, but I’m receiving upwards of two spam calls per day. Each call holds my devices hostage until I decline it.

I first tried the most obvious solution: caller ID-based blocking. Apple introduced CallKit in iOS 10, allowing apps to blacklist up to 2,000,000 numbers^[1]. I downloaded the two most popular call blocking apps on the App Store to test the method’s efficacy: Truecaller and Mr. Number. However, neither could identify (much less block) more than 10% of the spam calls I previously received^[2]. Quite disappointing.

Since using commercial off-the-shelf solutions wasn’t an option, I had to look elsewhere. A few days ago I started working on my first attempt to stop the deluge of spam calls. I call it Carbon Call.

Carbon Call

The system acts as a kind of virtual receptionist. It plays a pre-recorded prompt phrase for callers and then validates their answer using speech recognition on a set of keywords. For example:

Prompt: “Please provide the name of the person you are trying to reach.”
Valid keywords: “Andrew” and “Chidden”

Adjusting the prompt and keywords allows the system to block anything from only robocalls to people who dislike Star Trek. For callers providing caller ID, the system automatically whitelists their number if they pass the test^[3].
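A minimal sketch of the keyword check and automatic whitelisting described above, assuming the transcript has already come back from speech recognition. The names `KEYWORDS`, `whitelist`, `passes_prompt`, and `screen_caller` are hypothetical and only illustrate the idea; they are not taken from Carbon Call itself.

```python
import re

# Hypothetical keyword set, taken from the example prompt above.
KEYWORDS = {"andrew", "chidden"}

# Caller IDs that have already passed the test (in-memory for this sketch).
whitelist = set()


def passes_prompt(transcript, keywords=KEYWORDS):
    """Return True if any keyword appears as a whole word in the transcript."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    return not words.isdisjoint(keywords)


def screen_caller(transcript, caller_id=None):
    """Validate a caller's answer and remember their number if they pass."""
    if caller_id is not None and caller_id in whitelist:
        return True  # previously verified callers skip the prompt
    if passes_prompt(transcript):
        if caller_id is not None:
            whitelist.add(caller_id)  # only callers who provide caller ID
        return True
    return False
```

Matching on whole words rather than substrings keeps a short keyword from matching inside an unrelated longer word.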

Once the system verifies that the caller isn’t a spammer, my phone rings and I’m connected in under ten seconds. If I’m not available to answer the call, Carbon Call also handles voicemail recording and notifications.

It’s quite simple but works remarkably well. In the past two days I haven’t received a single spam call. I put together a short audio demo showing the call flow^[4]:

Features

Carbon Call needed to meet very specific requirements which shaped how I designed the system. These are the current marquee features:

Uses my existing phone number. Actual, non-spammer humans call my current number, so switching to a new number isn’t an option.
Free. There’s some utility in blocking spam calls, but I want to avoid paying another monthly subscription and calling rate. Carbon Call costs nothing per month or per minute^[5].
Works without cellular data. I’m usually inside 4G or LTE coverage and have high monthly data limits. However, the quality of service for Voice over IP (VoIP) can degrade on 2G and 3G connections^[6].
Low false positive and negative rate. This is a given for any effective spam-filtering system. Carbon Call filters out robocalls and general spam without blocking legitimate callers.
Integrates with the native iOS phone app. I don’t want to use or maintain a 3rd party call history interface. Calls received behind Carbon Call show up in the phone app’s Recents tab.

Getting Carbon Call

For the proof of concept, I specifically designed the system to accommodate a single user: me. Depending on how much interest there is in Carbon Call, I’ll consider either expanding and optimizing the project or just open sourcing it.

I’ll make a post on this site when I come to a conclusion on how to proceed. I suggest subscribing, or you can just check back in a few weeks.

Technical Implementation

In terms of external dependencies, Carbon Call runs on a 1 GB Linode box (referral link), a single Google Voice account, and the Google Cloud Platform Speech API (GCP).

Linode offers more memory per dollar than DigitalOcean (1 GB vs 512 MB). However, I much prefer DigitalOcean’s control panel design.
Google Voice hides a lot of the complexities inherent in VoIP, which is good for development but not for commercial or enterprise use^[7]. That said, Google Voice offers free domestic calls in the United States along with iOS and web client integrations. Why pay for SIP trunking and a virtual number when you can just sell your soul to Google’s ad machine? :)
Google Cloud Platform Speech API gives decently accurate results without much work. The relatively short and infrequent recordings make the free tier offered by GCP more than adequate^[8].

To understand the architecture, we can trace the path of a Carbon Call-protected conversation between Alice and Bob. Alice is calling Bob’s iPhone.

Alice dials Bob’s phone number. Bob has enabled call forwarding on his iPhone so that all incoming calls forward to his Google Voice number.
The Carbon Call server continuously monitors for incoming calls on Bob’s Google Voice account and answers the forwarded call from Alice.
Alice hears a greeting and Bob’s prompt. If Alice responds incorrectly to the prompt twice then the server kicks her out of the call.
Assuming Alice responds correctly to Bob’s prompt, the Carbon Call server sends a push notification to Bob’s iPhone which rings and prompts Bob to join the ongoing call by dialing his Google Voice number.
Assuming Bob joins the call, the Carbon Call server merges both Alice and Bob’s audio sessions into a single call.
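The steps above can be sketched as a small state walk. This is an illustration under stated assumptions, not the server's actual code: `handle_call`, its parameters, and the state names are all hypothetical, and the telephony itself (answering, playing prompts, sending the push) is reduced to callables passed in by the caller.

```python
MAX_ATTEMPTS = 2  # the post says a caller is dropped after two wrong answers


def handle_call(answers, is_valid, bob_joins):
    """Walk one forwarded call through the screening steps.

    answers:   iterable of the caller's transcribed responses, in order
    is_valid:  callable deciding whether a transcript passes the prompt
    bob_joins: callable returning True if Bob dials in after the push
    Returns the list of states the call passed through.
    """
    states = ["answered", "prompted"]
    attempts = 0
    for transcript in answers:
        attempts += 1
        if is_valid(transcript):
            states.append("verified")
            break
        if attempts >= MAX_ATTEMPTS:
            states.append("rejected")
            return states
    else:
        # The caller hung up or stayed silent before passing the prompt.
        states.append("rejected")
        return states

    states.append("push_sent")
    # If Bob does not dial in, the call falls through to voicemail.
    states.append("merged" if bob_joins() else "voicemail")
    return states
```

Keeping the flow as a pure function of its inputs makes the two-attempt rule easy to test without placing a real call.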

Let’s look at the implementation of each step:

1. Call Forwarding

Call forwarding from the iPhone to Google Voice allows Bob to retain his old number. Most VoIP services do call forwarding in the reverse direction, forwarding calls received by the VoIP service to your iPhone. For iPhones on GSM networks^[9], iOS conveniently has a graphical interface in Settings > Phone > Call Forwarding.

When call forwarding is enabled, iOS displays a status bar icon of a phone with a right-facing arrow.

2. Call Monitoring

Once the call arrives at Google Voice, it propagates to the various audio streaming clients, namely Google Talk and Hangouts. The Carbon Call server uses Selenium, a web automation framework, to headlessly run instances of Hangouts in Google Chrome.

This was the first time I used Selenium, but I found it surprisingly easy to get everything working^[10]. I did encounter a few issues though:

Google Accounts unsurprisingly thought someone hacked my account when I logged in from the server. I needed to install a graphical desktop environment just to get through all of Google’s security prompts.
Google Chrome leaks memory when continuously run with Selenium’s driver over long periods. I observed roughly 20 MB per hour even when forcing Python’s garbage collection. I’ve resorted to auto-restarting the main Chrome driver every so often to ensure that the server doesn’t hang.
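One way to express the auto-restart workaround is a small wrapper that recreates the driver once it passes a maximum age. This is a sketch under assumptions: `RecyclingDriver` is a hypothetical name, the restart interval is not given in the post, and the `factory` argument stands in for whatever creates the real Selenium Chrome driver. The only driver method relied on is `quit()`, which Selenium drivers provide.

```python
import time


class RecyclingDriver:
    """Recreate a browser driver once it has lived longer than max_age seconds."""

    def __init__(self, factory, max_age, clock=time.monotonic):
        self._factory = factory  # callable that builds a fresh driver
        self._max_age = max_age  # seconds before a restart is forced
        self._clock = clock      # injectable so the logic can be tested
        self._driver = factory()
        self._born = clock()

    def get(self):
        """Return a live driver, restarting it first if it is too old."""
        if self._clock() - self._born >= self._max_age:
            self._driver.quit()  # release the old browser and its memory
            self._driver = self._factory()
            self._born = self._clock()
        return self._driver
```

In practice the restart should only happen between calls, since quitting the browser would drop any call in progress.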

3. Speech Recognition

The server plays pre-recorded voice prompts and then records the caller’s response. For speech synthesis, I’m using snippets recorded from Google Translate’s speech output. For speech recognition, I simply send the recordings to GCP’s Speech API to process and return a transcript. A few things I discovered with the Speech API:

Silence padding is required for results. I originally trimmed the leading and trailing silences with sox but wasn’t getting anything back from GCP.
Accuracy isn’t always the greatest, possibly due to compression over the phone network or little context. The Speech API seems to perform better for longer recordings where it can analyze an entire sentence instead of just a few words. To compensate for poor accuracy, I added phonetic variations to the keywords being matched.
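Both findings can be illustrated with a short sketch. It assumes the recording is raw signed 16-bit mono PCM, where zero bytes are silence; the half-second padding length and every entry in `VARIANTS` are made-up illustrations, since the post lists neither the padding used nor the real phonetic variations.

```python
def pad_with_silence(pcm, sample_rate, seconds=0.5, sample_width=2, channels=1):
    """Add `seconds` of digital silence before and after raw signed PCM audio."""
    frames = int(sample_rate * seconds)
    silence = b"\x00" * (frames * sample_width * channels)
    return silence + pcm + silence


# Hypothetical mishearings for each keyword; the real lists are not published.
VARIANTS = {
    "andrew": {"andrew", "andrews", "android"},
    "chidden": {"chidden", "chitin", "hidden"},
}


def match_keyword(transcript):
    """Return the keyword whose variants appear in the transcript, else None."""
    words = set(transcript.lower().split())
    for keyword, variants in VARIANTS.items():
        if words & variants:
            return keyword
    return None
```

Widening the variant lists lowers the chance of rejecting a real caller, at the cost of letting more unrelated phrases through, so the lists are worth tuning against actual transcripts.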

4. Call Joining

Since Bob enabled call forwarding, it’s not possible to directly call his number. The server could bypass Bob’s forwarding rules by initiating a new Hangouts VoIP call, but I want all connections to route over cellular voice.

Instead, Carbon Call uses a custom iOS client able to receive remote push notifications. When Alice passes the tests on the server, Bob receives a push notification causing his iPhone to play a custom vibration pattern and ringtone for thirty seconds.

To accept Alice’s call, the app prompts Bob to call his Google Voice number with a single tap. iOS displays the alert regardless of where Bob is in iOS, including the lock screen.

5. Audio Routing

When Bob calls his Google Voice number, the server needs to create a conference call between Alice and Bob’s separate lines. Conference calling should be trivial and accomplished with a built-in option, but Google Voice no longer supports this feature for web-based Hangouts clients^[11].

Instead, the server initiates two separate Chrome instances (one for Alice and one for Bob), and uses a set of PulseAudio modules to route the audio between them.
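The post does not say which PulseAudio modules are involved, so the following is only one plausible arrangement: a null sink per browser, with each browser recording from the other's monitor source. `module-null-sink` and the `PULSE_SINK` / `PULSE_SOURCE` environment variables are standard PulseAudio features, but the sink names and the `routing_plan` helper are hypothetical, and whether a given Chrome build honors the environment variables is an assumption to verify.

```python
def routing_plan(alice="alice_out", bob="bob_out"):
    """Return pactl commands plus per-browser environments for cross-routing.

    Nothing is executed here; the caller decides how to run the commands
    and how to launch each Chrome instance with its environment.
    """
    commands = [
        ["pactl", "load-module", "module-null-sink", f"sink_name={name}"]
        for name in (alice, bob)
    ]
    env = {
        # Alice's browser plays her voice into her sink and uses the monitor
        # of Bob's sink as its microphone, so she hears Bob.
        "alice": {"PULSE_SINK": alice, "PULSE_SOURCE": f"{bob}.monitor"},
        # Bob's browser is wired the opposite way.
        "bob": {"PULSE_SINK": bob, "PULSE_SOURCE": f"{alice}.monitor"},
    }
    return commands, env
```

Each command list could be passed to `subprocess.run`, and each environment merged into `os.environ` before starting the matching Chrome instance.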

User Experience

There’s a lot that goes on between Alice dialing Bob’s number and Bob joining the call. While I spent the first day on research and development, the second day was spent on streamlining the experience. Small things such as adding two attempts to the caller validation helped tremendously in smoothing out interactions.

Hold Music

After the system verifies that Alice isn’t a spammer (either through Bob’s prompt or via whitelist), Alice needs to wait a moment before Bob connects. Playing some kind of hold music to keep Alice on the line is an obvious solution, but the specific type of music makes a difference.

Due to compression along the phone network and conversion process from digital to analogue, most music sounds utterly horrible. The Strauss Horn Concerto No. 2 becomes the Harmonica Concerto No. 2, and Festive Overture becomes decidedly not festive.

Phone audio compression algorithms optimize for the voice frequency band (300 to 3400 Hz), but music usually covers a much larger range of frequencies. After playing a dozen different tracks over the phone, I finally settled on a modified version of Resignation by Kevin MacLeod. It’s a slow solo piano piece in the middle of the frequency range, making it a good fit for hold music.

Using Audacity’s frequency analysis tool, we can see the difference between the frequency ranges of the first thirty seconds of Festive Overture and Resignation:

80% of Resignation falls between 300 and 3400 Hz, with minor clipping expected in the bass range (< 300 Hz). In contrast, around 80% of Festive Overture will be clipped or heavily distorted starting from the upper mid-range. My version of Resignation still doesn’t sound the greatest when played over the phone, but it doesn’t make your ears bleed either.
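The same kind of comparison can be approximated in code by measuring what fraction of a clip's spectral energy lies inside the voice band. This is a rough sketch, not a reproduction of Audacity's analysis: `band_energy_fraction` is a hypothetical helper, it uses a naive O(n²) discrete Fourier transform to stay dependency-free, and it is only practical for short clips.

```python
import cmath
import math


def band_energy_fraction(samples, sample_rate, low=300.0, high=3400.0):
    """Fraction of spectral energy between `low` and `high` Hz (naive DFT)."""
    n = len(samples)
    total = in_band = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin, stop at Nyquist
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        if low <= k * sample_rate / n <= high:  # bin k sits at k*rate/n Hz
            in_band += energy
    return in_band / total if total else 0.0
```

A track that scores close to 1.0 should survive the phone network mostly intact, while a low score suggests much of it will be clipped or distorted. For full-length audio, an FFT library would be the practical choice.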

Caveats

It’s not all milk and honey in the walled garden of Carbon Call. There are still a few challenges with the current version of the system:

High latency for calls due to Google Voice’s VoIP implementation. On average, I’m seeing roughly 800ms between speaking into one phone and hearing it on the other. For reference, 150ms is widely considered the maximum acceptable latency for real-time voice applications. In my testing, the latency wasn’t horrible, but certainly less than ideal^[12].
High memory requirements to run multiple headless Chrome instances make it difficult to scale the setup. The box sees about 80% memory pressure during active calls. Adding swap memory isn’t sufficient to maintain adequate performance either.
Brittleness due to scripting Selenium on the DOM structure. If Google substantially changes the Google Accounts login flow, Hangouts, or the call frame, then things will start breaking in unexpected ways.

Solving these problems requires building or purchasing access to a custom VoIP network. It’d be a lot of work and essentially a complete re-write. For the time being, I’m satisfied with the current setup as a proof of concept. ♦

[1] You can add a new blocked number in a CXCallDirectoryProvider by calling addBlockingEntryWithNextSequentialPhoneNumber: 2,000,000 times before iOS displays an error. However, attempting to save a smaller (but still large) number of entries (e.g. above 1,000,000) causes iOS to show an indefinite activity spinner when enabling the extension. Subsequent attempts to enable or disable the extension silently fail on iOS 11.2, even after restarting the device and re-installing the app.
[2] I’m guessing most of the numbers were from spoofed caller IDs, making these kinds of apps ineffective for me.
[3] Spammers can spoof caller IDs, but it’s unlikely that they would spoof one of my contacts. Preventing targeted attacks lies outside the scope of the project.
[4] I shortened some sequences for brevity.
[5] Thanks to Under the Radar, the Linode box (see the Technical Implementation section of this post) is free for the next four months.
[6] On 2G and 3G, cellular voice calls will use the Public Switched Telephone Network (PSTN), which should have better optimization for voice-only traffic: ieeexplore.ieee.org.
[7] I initially wanted to build my own VoIP network using Asterisk, a popular framework for VoIP gateways. Although setting up a single VoIP server is fairly trivial, connecting it to the real world requires paying for a SIP trunk and virtual phone number (a Direct Inward Dialing number). Trustworthy SIP trunking services charge both a monthly fee and an inbound / outbound rate. Once I factored in the virtual server costs needed to run an Asterisk instance, the final amount came out to over $10 a month, way past my allocated budget.
[8] I’ve previously used CMUSphinx for offline speech recognition, but GCP is much easier to use and a better fit for Carbon Call.
[9] Apple’s support page says to “Contact your carrier for information” regarding call forwarding on CDMA networks. I use T-Mobile (GSM), so your mileage may vary.
[10] I chose to use Python to script the logic and run the Selenium Chrome driver. I usually use Go or Node.js, but my side goal was to learn the basics of Python for this project.
[11] It may be a documented feature, but conference calling for Google Voice only works for forwarded calls (out of Google Voice) and when “in-bound call options” are enabled.
[12] Some profiling and research show that the problem lies with Google Voice. It takes around 400ms for audio to go from the telephone network to Hangouts. For whatever reason, the opposite direction “only” takes around 100ms.