
Do you remember how much training time you had to spend on Dragon Naturally Speaking?


... on a Pentium I, using 1990s machine learning algorithms, sure.

Nobody's answered my question as to why The Cloud is the magic pixie dust that solves this problem, and why it could not be solved locally with modern compute power and modern ML techniques.


There are several tremendous advantages to server-based speech recognition.

Firstly, the models (particularly the language models) needed for state of the art performance are huge. It's not atypical for papers to discuss using a billion n-grams, for example ( https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/ListenTerm1201415/s... ). That's several gigabytes of memory and storage at the very least, and you'd need a copy of that for every spoken language you'd want to support. Plus you need to keep that up to date with new words and phrases; it's much easier to keep models fresh on a server than on everyone's computer.
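For a sense of scale, here's a rough back-of-the-envelope sketch in Python. The per-entry layout (word IDs plus a log-probability and backoff weight, uncompressed) is an assumption for illustration only; real toolkits like KenLM use tries and quantization to shrink this a lot, but you still land in the gigabytes per language.

    # Rough size estimate for a 1-billion-entry n-gram language model.
    # The layout below (three 4-byte word IDs, a 4-byte log-prob and a
    # 4-byte backoff weight per entry) is an illustrative assumption.
    NUM_NGRAMS = 1_000_000_000   # "a billion n-grams"
    ORDER = 3                    # assume trigrams
    BYTES_PER_WORD_ID = 4
    BYTES_PER_FLOAT = 4

    bytes_per_entry = ORDER * BYTES_PER_WORD_ID + 2 * BYTES_PER_FLOAT
    total_bytes = NUM_NGRAMS * bytes_per_entry
    print(f"{bytes_per_entry} bytes/entry, ~{total_bytes / 1e9:.0f} GB uncompressed")

Even if compression buys you a factor of four or five, you're still shipping several gigabytes per supported language, and re-shipping it every time the vocabulary is refreshed.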

Power and CPU time are also a concern. Even big, beefy server farms can have trouble keeping up with state-of-the-art speech recognition algorithms; a laptop, tablet, or phone is going to struggle, especially when running off a battery.

But the biggest advantage of server-based speech recognition is indeed that more data is critical to improving accuracy and performance. There's no data like more data. And you don't just need more data, you need a lot more data. You can get big gains just from doing unsupervised training on 20 million utterances rather than 2 million: http://static.googleusercontent.com/media/research.google.co... There's simply no way you're going to get anything like 20 million utterances without collecting data from millions of real-world users.
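To make "unsupervised training" a bit more concrete, the usual trick is confidence-filtered self-training: transcribe a huge pool of unlabeled utterances with the current system, keep only the transcripts it is confident about, and retrain on them. The toy sketch below shows that loop on synthetic data with a generic classifier; it is not the pipeline from the linked paper, just the shape of the idea.

    # Toy confidence-filtered self-training loop on synthetic data.
    # Not the linked paper's pipeline (that trains acoustic models on
    # voice-search logs); this only illustrates the general idea.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
    X_seed, y_seed = X[:500], y[:500]      # small hand-labeled seed set
    X_pool = X[500:]                       # large "unlabeled" pool

    model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
    for it in range(3):
        probs = model.predict_proba(X_pool)
        keep = probs.max(axis=1) > 0.95    # keep only confident predictions
        pseudo = probs.argmax(axis=1)[keep]
        X_train = np.vstack([X_seed, X_pool[keep]])
        y_train = np.concatenate([y_seed, pseudo])
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print(f"iteration {it}: retrained on {len(y_train)} examples")

The catch, as above, is that the gains come from the sheer size of the unlabeled pool, and only a deployed service sees utterances at that volume.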


This isn't actually true.

The large data size affects the training, but the model itself is pretty small now (after some hard work on Google's part).

The thing everyone seems to be missing is that Android's (English) voice recognizer works offline[1]. While you can use the online model, I suspect that's more about continually updating the model (so it understands new words, changing accents, etc.) than about the recognition itself.

[1] http://stackoverflow.com/questions/17616994/offline-speech-r...


Android's speech recognizer has a compact/offline mode, but that's definitely not what's run by default.


Good speech recognition is that expensive?

... and people think sentient AI is on the horizon. :P


Because many people speak similarly. If enough people who speak like you have trained it, it can learn how you'll say words you haven't even said to it yet, because plenty of other people already have.

Machine learning algorithms haven't changed that much since the 90s; what's changed is the amount of data we have access to, and the amount of data we can process.

When you're training it yourself, the data is what's limited. The fact that we can process more data doesn't matter if we can't get more data, and you can only speak so fast.

But if millions of people are speaking to it, we can take advantage of the fact that we can process so much more data.
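Here's a toy sketch of that effect: each synthetic "speaker" produces the same set of "words" shifted by a speaker-specific offset, and accuracy on a brand-new speaker is measured as the number of training speakers grows. Everything here (the offset model, the features, the classifier) is made up for illustration and has nothing to do with real acoustic modelling.

    # Toy illustration: accuracy on a new "speaker" improves as training
    # data is pooled from more speakers. Entirely synthetic; it just shows
    # why many similar voices help with words you personally never said.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.RandomState(0)
    N_WORDS, DIM = 10, 20
    prototypes = rng.randn(N_WORDS, DIM)          # one "ideal" vector per word

    def speaker_utterances(n_per_word=20):
        offset = rng.randn(DIM) * 0.5             # speaker-specific shift
        X = np.vstack([prototypes[w] + offset + rng.randn(n_per_word, DIM) * 0.8
                       for w in range(N_WORDS)])
        y = np.repeat(np.arange(N_WORDS), n_per_word)
        return X, y

    X_test, y_test = speaker_utterances()         # the brand-new user

    for n_speakers in (1, 10, 100):
        batches = [speaker_utterances() for _ in range(n_speakers)]
        X_train = np.vstack([b[0] for b in batches])
        y_train = np.concatenate([b[1] for b in batches])
        model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        print(f"{n_speakers:3d} speakers -> {model.score(X_test, y_test):.2f} accuracy on a new speaker")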



