How iOS and macOS Dictation Can Learn from Voice Control’s Dictation

posted in: September 2020

Speech recognition has long been the holy grail of computer data input. Or, rather, we have mostly wanted to control our computers via voice—see episodes of Star Trek from the 1960s. The problem has always been that what we want to do with our computers doesn’t necessarily lend itself to voice interaction. That’s not to say it can’t be done. The Mac has long had voice control, and the current incarnation in macOS 10.15 Catalina is pretty good for those who rely on it. However, the simple fact is that modern-day computer interfaces are designed to be navigated and manipulated with a pointing device and a keyboard.

More interesting is dictation, where you craft text by speaking to your device rather than by typing on a keyboard. (And yes, I dictated the first draft of this article.) Dictation is a skill, but it’s one that many lawyers and executives of yesteryear managed to pick up. More recently, we’ve become used to dictating short text messages using the dictation capabilities in iOS.

Dictation in iOS is far from perfect, but when the alternative is typing on a tiny virtual keyboard, even imperfect voice input is welcome. Most frustrating is that you cannot fix mistakes with your voice while dictating, so you end up either having to put up with mistakes in your text or use clumsy iOS editing techniques. By the time you’ve edited your text onscreen, you may as well have typed it from scratch.

macOS has also had dictation for years, but the feature has been even less successful and less commonly used than its iOS counterpart, in part because it requires so much more setup than just tapping a button on a virtual keyboard.

With iOS 13 and Catalina, Apple significantly beefed up its voice control capabilities and simultaneously introduced what seems to be an entirely different dictation technology—call it “Voice Control dictation,” which I’ll abbreviate to VCD here. In many ways, VCD is better than the dictation built into iOS and macOS. An amalgamation of the two technologies would be ideal.

What’s Wrong and Right with iOS and macOS Dictation

The big problem with dictation in iOS and macOS is that, when it makes mistakes, there’s no way to fix them. But there are other issues. To start, you have to tap a microphone button on the keyboard (iOS) or press a key on the keyboard twice (Mac, set in System Preferences > Keyboard > Dictation) to initiate dictation. That’s sensible, of course, but it does mean that you have to touch your keyboard every time you want to dictate a new message. And that, in turn, means that you cannot just carry on a conversation in Messages, say, without constant finger interaction, which defeats the purpose.

Enabling dictation in iOS and macOS

Another problem with dictation in iOS and macOS is that it works for only a certain amount of time—about 60 seconds (iOS) or 40 seconds (macOS) in my testing. As a result, you cannot dictate a document, or even more than a paragraph or two, without having to restart dictation by tapping that microphone button.

But the inability to edit spoken text is the real problem. There is little more frustrating than seeing a mistake being made in front of your eyes and knowing that there is no way to fix it until you stop dictating. And once you have stopped, fixing a mistake is tedious at best, even now that you can drag the insertion point directly in iOS. iOS just isn’t built for text editing. Editing after the fact is much easier on the Mac, of course, but you can’t so much as click the mouse while dictating without stopping the dictation.

On the plus side, dictation in iOS and macOS seems to be able to adjust its recognition based on subsequent words that you speak. You can even see it doing this sometimes, changing a word back and forth between two possibilities as you continue to speak. Other times, changes won’t be made until you tap the microphone button to stop or your dictation time runs out. Regardless, it’s good—if a little weird—to see Apple adjusting words based on context rather than relying on brute-force recognition.

What’s Right and Wrong with Voice Control Dictation

The dictation capabilities built into Apple’s new Voice Control system are quite different. First, instead of navigating to Settings > Accessibility > Voice Control (iOS) or System Preferences > Accessibility > Voice Control (macOS), you can enable Voice Control via Siri—just say “Hey Siri, turn on Voice Control.” Once it’s on, whenever a text field or text area has an insertion point, you can simply speak to dictate text into that spot. You can, of course, also speak commands, but that takes more getting used to.

Unlike the standard dictation, however, VCD stays on indefinitely. You just keep talking, and it will keep typing out whatever you say into your document.

The most significant win, however, is that you can edit the mistakes that VCD makes. For instance, in the previous sentence, it initially capitalized the word “However.” (It has a bad habit of capitalizing words that follow commas.) By merely saying the words “lowercase however,” I was able to fix the problem. Those who are paying attention will note that the word “however” has appeared several times in this article. How does Voice Control know what to fix? It prompts you by displaying numbers next to each instance of the word; you then speak the number of the one you want to change. It’s slow but effective.

There is another approach, too, although it works best on the Mac. If you select some text, which you might do with a finger or a keyboard on an iPhone or iPad, or with a mouse or trackpad on a Mac, you can then direct Voice Control to act on that particular text. For instance, in the previous sentence, VCD didn’t initially capitalize the words “voice control.” That wasn’t a mistake; I’m capitalizing those words because I’m talking about a particular feature, but they would not generally be capitalized. Nevertheless, I can select those two words with the mouse and say, “capitalize that,” to achieve the desired effect. This is a surprisingly effective way to edit. It’s easy and intuitive to select with the mouse and then make a change with your voice without having to move your hands back to the keyboard.

Some mistakes are easily fixed. When I said above, “it prompts you,” VCD gave me the word “impromptu.” All I had to do was say, “change impromptu to it prompts you,” and Voice Control immediately fixed its mistake. When that works, it feels like magic, particularly in iOS. Whenever I’m using a Mac, I prefer to select with the mouse and replace using my voice.

Of course, there are situations where voice editing falls down completely. Several times while dictating this article, I used the word “by.” VCD interpreted that as the word “I” most of the time, and no matter how I tried to edit it with my voice, the best I could do was the word “bye” and the command “delete previous character.” Or, when I wanted the word “effect” above, I ended up with “affect.” It was likely my fault for not pronouncing the word clearly enough. But when I tried “change affect to effect,” Voice Control treated me to “eat fact” the first time and “ethernet fact” the second time. Maddening! It’s strange, because if I just say the word “effect” on its own while emphasizing the “ee” sound at the start, it works fine.

There are other annoyances. With all dictation, you must, of course, speak punctuation out loud, which is awkward and requires retraining your brain slightly. If VCD interprets a word as plural instead of possessive, you can move the insertion point in front of the “s” and say, “apostrophe,” but VCD will put a space before the apostrophe, requiring yet more commands to fix the word. And just try getting VCD to write out the word “apostrophe” or “colon” or “period” instead of inserting the punctuation mark.

Another issue that afflicts all dictation systems is the problem of homophones. Without context, there is simply no way to distinguish between “would” and “wood,” or “its” and “it’s,” or “there” and “their” and “they’re,” by sound alone. VCD has no advantage here; standard dictation, which adjusts words based on subsequent context, may even do better.

Careful elocution is essential for recognition success when working with VCD (not that it ever recognizes the word “elocution” correctly). It is probably a good habit to get into. Many of us—myself included—slur our words together while speaking. It’s amazing that speech recognition works at all, given how sloppily we speak.

Unfortunately, VCD doesn’t work everywhere. On the Mac, I can’t get it to work in BBEdit or in Google Docs in a Web browser. In iOS, it has fewer problems, although I’m sure I’ve hit some in the past. I haven’t attempted to produce a comprehensive overview of where it works and where it doesn’t, so suffice it to note that it may not always work when you want.

Another problem, primarily in iOS, is that leaving VCD on all the time is a recipe for confusion because it will pick up other people speaking as well, or even music or other audio playing in the background. Luckily, you can always ask Siri to “turn off voice control” to disable it. Also, if you leave VCD on all the time, it will negatively impact your battery life.

Why Can’t We Have the Best of Both Worlds?

It doesn’t seem as though Apple would have that much work to do to bring the best of VCD’s features to the standard dictation capabilities in iOS and macOS. All that’s necessary is for the company to stop seeing VCD as purely an accessibility feature and start treating it as something that could be of use to everyone.

The most important change would be to enable dictation to be invoked easily and stay on indefinitely. In iOS, I could imagine tapping the microphone button twice, much like tapping the Shift key twice turns on Caps Lock. On the Mac, perhaps tapping the dictation hotkey three times could lock it on until you turn it off again. That would let you dictate longer bits of text without having to leave Voice Control on at all times or rely on Siri to turn it on and off.

Next, all of VCD’s voice editing capabilities need to migrate to the standard dictation feature. I see no reason why only VCD should be so much more capable in this way, and it shouldn’t be hard to reuse the same code.

Finally, you should be able to move the insertion point around and select words while dictating. It’s ridiculous that any such action stops dictation in iOS and macOS now.

If it sounds like I’m suggesting that Apple replace standard dictation with a form of VCD that’s more easily turned on and off, that’s correct. Apart from occasionally improved recognition of words by context as you continue to speak, standard dictation simply doesn’t measure up to VCD.

Unfortunately, as far as I can tell in the current betas of iOS 14 and macOS 11 Big Sur, Apple has made no significant changes to either standard dictation or VCD. So we’ll probably have to wait at least another year before such improvements see the light of day.