Originally posted July 2007

So I was playing with this text-to-speech demo earlier. Then I was thinking about this Richard Bartle article on how cool it would be to have an MMO that could run your voice chat through speech-to-text, transcribe it to the chat window, and also run it back through text-to-speech so your voice chat sounds like your character.

Then I remembered this neat video. It’s apparently a face-rendering program where they made a big average face composite out of a lot of different faces, and tagged it with the different ways a face might deform when deviating from the average. Apparently, they can come up with fairly accurate-looking 3D models of faces from single 2D images by mapping the image’s deviation from the average, applying that to the model, and extrapolating from there.

So, I’m wondering if the same type of thing could be done with voice recognition. Record a lot of random people reading through a long passage of text, break their recordings down into pitch, tempo, inflection, and all of that, and build a tagged average voice from the results. There might be some underlying constants you could find across all speech, or you might have to tie it to ways of phrasing individual words.
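To make the "average voice" idea a bit more concrete, here's a rough Python sketch. Everything in it is an assumption on my part: the feature names (`pitch_hz`, `tempo_wpm`, `pitch_range_hz`), the numbers, and the idea that a voice can be boiled down to a handful of values at all. A real system would need far richer measurements. It only shows the bookkeeping: average the speakers, then describe each speaker as a deviation from that average.

```python
from statistics import mean

# Hypothetical features, measured per speaker from the same reading passage.
# The names and values are invented for illustration only.
FEATURES = ("pitch_hz", "tempo_wpm", "pitch_range_hz")

recordings = {
    "speaker_a": {"pitch_hz": 110.0, "tempo_wpm": 150.0, "pitch_range_hz": 40.0},
    "speaker_b": {"pitch_hz": 210.0, "tempo_wpm": 170.0, "pitch_range_hz": 60.0},
    "speaker_c": {"pitch_hz": 160.0, "tempo_wpm": 130.0, "pitch_range_hz": 50.0},
}


def average_voice(recordings):
    """Build the 'average voice' by averaging each feature over all speakers."""
    return {f: mean(r[f] for r in recordings.values()) for f in FEATURES}


def deviation(voice, average):
    """Describe one voice as its per-feature offset from the average voice."""
    return {f: voice[f] - average[f] for f in FEATURES}


avg = average_voice(recordings)
dev_a = deviation(recordings["speaker_a"], avg)
print(avg)
print(dev_a)
```

Adding the deviation back onto the average gives you the original speaker again, which is the same trick the face program seems to be doing with its composite face.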

Then someone sits down with the system and reads some standardized text to establish a baseline, and the system maps that person’s vocal deviation from the average voice. When the person chats online, the system has a tagged way to do speech-to-text. Additionally, the system can take what was said, remove the user’s deviation (leaving the average voice plus any unusual stresses and pacing), and then apply the deviation of another voice that represents the user’s character. By keeping the inflections and stresses that depart from the user’s normal deviation, and applying them to the synthesized voice, you might come up with a much more natural-sounding text-to-speech/voice-masking output.
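Here's the voice-swapping step as a rough Python sketch, under the big assumption that a voice can be treated as a few numbers that simply add together. The function name `remap_utterance`, the feature names, and all the values are hypothetical. The point is just the arithmetic: whatever is left over after subtracting the user's normal voice is the "unusual stress and pacing", and that leftover gets carried onto the character's voice.

```python
def remap_utterance(utterance, average, user_dev, character_dev):
    """Swap the user's voice for the character's, keeping the expressive leftovers."""
    remapped = {}
    for name, value in utterance.items():
        # How the user normally sounds on this feature.
        expected = average[name] + user_dev[name]
        # What was unusual about this particular utterance.
        expression = value - expected
        # Character's normal voice plus the user's unusual stress or pacing.
        remapped[name] = average[name] + character_dev[name] + expression
    return remapped


# Invented example values (Hz and words per minute).
average = {"pitch_hz": 160.0, "tempo_wpm": 150.0}
user_dev = {"pitch_hz": -50.0, "tempo_wpm": 0.0}         # user: lower voice
character_dev = {"pitch_hz": 40.0, "tempo_wpm": -20.0}   # character: higher, slower

# The user speaks 15 Hz above their usual pitch and 10 wpm slower than usual.
utterance = {"pitch_hz": 125.0, "tempo_wpm": 140.0}

result = remap_utterance(utterance, average, user_dev, character_dev)
print(result)
```

In this made-up example, the character's voice also comes out 15 Hz above its own usual pitch and 10 wpm slower than its usual pace, so the emphasis survives even though the voice itself has changed.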