I’d been eyeing the Qwen3-TTS blog post for a while. Alibaba had open-sourced a 1.7B parameter text-to-speech model that could clone voices from just a few seconds of audio. The kind of thing that used to require expensive studios and proprietary tech — now running locally on a laptop. I wanted to try it but hadn’t gotten around to it.
Then I saw this on the Forkiverse:
JW tagged Casey Newton and Kevin Roose — the Hardfork podcast hosts — asking if they’d record a voice memo saying happy birthday to their boss Christine. A sweet request, very “old internet” energy.
That was the nudge. What if you didn’t need to ask a podcast host to record a birthday message — what if you could just clone their voice and have the AI do it? And what if the whole thing could be done by just talking to Claude Code?
“Can you read this GitHub page?”
I opened a terminal, started Claude Code, and typed:
can you read this document: https://qwen.ai/blog?id=qwen3tts-0115
I would like to see if I can get this working by training on a voice in a sample file, and have it say happy birthday back to me.
The blog post was JavaScript-rendered, so Claude couldn’t read it. No problem — I gave it the GitHub repo instead:
here is the github page instead: https://github.com/QwenLM/Qwen3-TTS
Claude read through the README, understood the voice cloning flow, and came back with its first round of questions:
Want me to set this up as a working script? I’d need to know:
- Do you have a CUDA GPU available?
- Do you already have a voice sample file?
- Do you want to use conda or a regular venv?
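The first question maps to a device choice the script would bake in. A hedged sketch of how that detection typically looks with PyTorch's standard availability checks (on an M4 the answer would be `mps`; the function name is mine, not from the repo):

```python
def pick_device() -> str:
    """Pick the best available compute device, falling back to CPU.

    Uses PyTorch's standard availability checks; returns "cpu" if
    torch isn't even installed, so the check is safe to run anywhere.
    """
    try:
        import torch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"  # NVIDIA GPU
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"   # Apple Silicon GPU
    return "cpu"
```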
This is one of the things I like about working with Claude Code. It doesn’t just barrel ahead — it figures out what it needs to know first.
I think we should go for regular venv, this is a macbook m4 pro with 24gb ram.
The Python version scare
Claude’s first instinct was to tell me Python 3.14 was too new and suggest downgrading to 3.12. I pushed back:
will installing that version of python in this way leave the default to still be 3.14? And are you sure it does not work with 3.14? I would rather not downgrade.
Claude went and actually checked the pyproject.toml in the repo. The requirement was >=3.9. My 3.14 was fine. No downgrade needed. This is a good reminder: push back when something doesn’t sound right. Claude will go verify.
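Claude's check boils down to comparing the running interpreter against the floor declared in the repo's pyproject.toml. Replicating it locally is a one-liner wrapped for clarity (the `(3, 9)` floor is the `>=3.9` requirement Claude found):

```python
import sys

# Version floor from the repo's pyproject.toml (`>=3.9`).
REQUIRED = (3, 9)

def meets_requirement(version=sys.version_info) -> bool:
    """Return True if the interpreter version satisfies the floor."""
    return tuple(version[:2]) >= REQUIRED
```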
“Place a voice sample and write the transcript”
After creating the venv and installing dependencies, Claude wrote the initial clone_voice.py script. Then it told me what I needed to do:
- Place a voice sample — put a wav file (~3+ seconds of someone speaking) in the project directory named `voice_sample.wav`
- Edit the script — open `clone_voice.py` and update `ref_text` with the transcript of what’s said in your voice sample
- Run it
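The human-supplied pieces reduce to a couple of values at the top of the script. A minimal sketch of just that section — `ref_text` and `voice_sample.wav` come from the steps above, while `ref_audio` and `target_text` are illustrative names of my own, not the repo's:

```python
from pathlib import Path

# Reference clip of the voice to clone (filename from the setup steps).
ref_audio = Path("voice_sample.wav")

# Transcript of exactly what is said in the clip -- it has to match
# the audio, since the model conditions on both together.
ref_text = "..."  # fill in after listening to the clip

# What the cloned voice should say (variable name is illustrative).
target_text = "Happy birthday, Christine!"

# Fail softly if the sample hasn't been dropped in yet.
if not ref_audio.exists():
    print(f"warning: put your reference clip at {ref_audio} first")
```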
This is the part people sometimes miss about working with AI coding assistants. It’s not one-directional. Claude handles the engineering, but the human provides the creative input and raw materials. I needed to go find a clip of Casey Newton speaking, save it as a WAV, and transcribe what he said. No AI shortcut for that — I know what voice I want, and I know what it’s saying.
I asked about the audio requirements:
tell me more about the requirement for the wave, what is the format, and how long should it be?
Claude dug into the library source code and came back with the specs: ~5-10 seconds, clear speech, any common format. I grabbed a short clip of Casey talking and dropped it in.
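Those specs are easy to sanity-check before handing a clip to the model. A small stdlib-only checker — the 5–10 second window is the spec Claude reported, and the function name is mine:

```python
import wave

def check_sample(path: str, min_s: float = 5.0, max_s: float = 10.0) -> dict:
    """Report a WAV file's duration and sample rate, and whether the
    duration falls inside the reference-clip window."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
    duration = frames / float(rate)
    return {
        "duration_s": round(duration, 2),
        "sample_rate": rate,
        "ok": min_s <= duration <= max_s,
    }
```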
First run: almost
I ran the script and pasted the output back:
okay, I ran it and got this:
`/bin/sh: sox: command not found`

Can we try and install the flash attention? maybe that will speed things up.
Two issues at once. Claude installed SoX via brew, then explained that flash attention is CUDA-only — the warning was harmless on my M4. On the next run, it worked:
That’s the moment. From “can you read this GitHub page” to a working voice clone in maybe 20 minutes of actual prompting.
Happy birthday, Christine. From AI-Casey.