Audiobooks, AI and Me

I am participating in a Beta test using AI voices to generate audiobooks. Interesting. And not altogether a bad experience, but a time-consuming one. Most audiobooks run about 20-26 hours, so there you go. But if it works, it makes the creation of audiobooks available to authors without the loss of an arm and/or a leg.

So, things I’m learning.

Choosing a Voice

Listen to all the voice options: male, female, American or British, then pick the narrator your reader would expect. Right now, the options are limited. I chose a twenty-something female for one book and a late-twenties-sounding male for another.

A real plus is that you can change narrators mid-stream without losing your edits.

Don’t watch the screen — Listen

The AI I’m trying out has a marker on the screen that moves from word to word with the narration. If you watch the screen while listening, the paced rhythm of the marker takes all meaning out of the sentence – dah dah dah dah. So shut your eyes and listen for the modulation in each phrase and where inflection changes the intended meaning of a sentence.

As you listen to the very human voice, it is easy to forget it is machine-produced. Yet, it can’t interpret the narrative like a voice actor or a reader would. So, you will need to intervene. Currently, choices are limited: speed a word up, slow it down, or add long or short pauses. Speeding up and slowing down words can affect modulation in unexpected ways, requiring several passes to get the inflection just right. It would be nice to be able to modulate the timbre of the voice for those occasions when a question needs to end other than on an upbeat, but that is not available.

Adding long and short pauses is critical to pacing and understanding. For instance, rapid banter, easily understood on the printed page, needs pauses between speakers to assist the listener. Without a pause, the exchanged dialog becomes a jumble, losing its spice as the listener struggles to figure out who said what to whom.

Listen to each word — Carefully.

The voice replicates standard English; is there such a thing? So, homographs (words spelled alike but pronounced differently) are an issue. For instance, bow (beau) is consistently pronounced bow (as in bowed before the king) no matter the context. And the verb does is pronounced as though it were more than one female deer.

The pronunciation pop-up doesn’t translate diacritical markings, which means you have to find a set of letters that creates the correct sound. For instance, duz gets you does and avoids a herd of deer roaming your book. The good news is you can apply pronunciation changes to all instances of the word in the text. The bad news, well, read on.

Users are warned to listen to the complete book before accepting the audiobook conversion; heed it. Else, this could happen. Crappie fish may be croppy to you and me, but not to the AI voice who happily asserted that crappy fish swam in a pond. And in what makes no sense, the voice insists that bass is pronounced base, as in bass violin, and will not say bass, as in fish. And the letters bass, which should produce the correct sound by all rules of the English language, don’t. Nor do b ass, baz, or bahz or, well, anything. English is a minefield of weirdness. But as it turns out, the AI voice is very good with French, thus beau.

Then there are em dashes? Well, imagine my surprise when the voice opined: yes, dash, she changed. Using the pronunciation feature, I tried substituting a fast uh, but that—uh — isn’t always appropriate. So, what do you do? Sometimes, I add a word to make a stutter. Sometimes, I fill in the blank with the missing word(s). It is a conundrum. If the dash is set off with spaces, the voice says dash and if it isn’t, it runs the wordstogether.

What I’ve learned — Mostly

Listen. Listen twice. Learn your options for editing (fast, slow, and pauses) and how they affect pace and modulation. Watch out for homographs; some are truly unexpected – as the female deer attest. Watch also for possessives, as the voice tends to hesitate for apostrophes, Eliza s, and needs to be overridden. Watch foreign names and words, unless in French. Be chary with em-dashes, though this issue should be addressed by the programmers. After all, the voice doesn’t have a problem with ellipses. I know. Weird, huh?

And finally, if you find errors in your manuscript while creating the audiobook, don’t be afraid to correct them. The AI I am tinkering with automatically updates the audio text along with the manuscript text. Not bad, that.

See all my books at dzchurch.com where you can also sign up for my newsletter.

12 thoughts on “Audiobooks, AI and Me”

  1. Very helpful, thanks! I just started trying out KDP’s beta audio, but I haven’t gotten very far. The problem I discern is, as you said, the voice can be very emotionless. But I’m so pleased and surprised that the AI pronounces Spanish correctly!

    1. I find that a judicious use of speed and pauses adds emotion, though it takes diligence to get it just right. It gets easier as you go along. Also, there have been some improvements in the voices already.

  2. Pauses do help, and a medium pause has been added. But until a question can end on a downbeat, and banter can be humorous, there is work to be done. Getting the delivery of dialog and emotion correct is the challenge for the programmers and for any users.

  3. There’s nothing like being on the front lines of something to tell those of us on the outside what it’s like. Your post is a revelation for me. I haven’t gone into audio books yet but have been thinking about it. You’ve given me even more to consider. Thanks for the funny but terrifying prospect of life in the audio book world.

  4. Dawn, this is interesting. I think I’ll stick with my narrators. They bring the story to life, and I don’t have to mess with pauses, change words so they are said correctly, or worry that my punctuation will throw a curve ball into things. For me, having to just listen to the book once and give the narrators input is less time-consuming, which is my goal these days. But great insight into something many have been wondering about.

  5. I can’t help but wonder, if we replace voice actors with AI voices and cover artists with AI artwork, how long will it be before authors are replaced by AI as well?

    1. Mollie, I understand they are working on that right now. But no one can write like us. We’re us and they are just AI. Big difference.

    2. As Heather notes in her comments, emotions seem a bit out of reach for the AI voices. A bit, but as I work more with adding pauses, maybe not out of the realm of possibility. Unfortunately, there are authors who are already happily pirating the works of others via AI to produce passages, if not books. The wraith is out of the box.

  6. Dawn, this post couldn’t be more timely for me. I am grappling with trying to get one of my books through the early stages of an AI audio and having a devil of a time with it. In particular, the dashes make me crazy. It is amazing to me the programmers didn’t think of that, because that is the standard way to let the reader know the person has been interrupted. Also, it would be great if a few of the basic human emotions could be duplicated, e.g., surprise, sadness, fear, stuff like that. I would think that would be easy enough. Haven’t fooled around yet with the pauses, but from what you are saying, that might be a useful tool. I did figure out the sounds-like stuff and have worked that out fairly well. Thanks for the tips. I appreciate them.

    1. Pauses seem to help with some of the issues you bring forward. And the medium pause has been added. But until a question can end on a downbeat and dialog, specifically banter, can be managed to embed emotion, the programmers have a way to go.
