# Step-Audio-TTS-3B
**Repository Path**: hf-models/Step-Audio-TTS-3B
## Basic Information
- **Project Name**: Step-Audio-TTS-3B
- **Description**: Mirror of https://huggingface.co/stepfun-ai/Step-Audio-TTS-3B
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 4
- **Forks**: 0
- **Created**: 2025-02-19
- **Last Updated**: 2025-09-23
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
---
license: apache-2.0
pipeline_tag: text-to-speech
---
# Step-Audio-TTS-3B
Step-Audio-TTS-3B is the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset using the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a variety of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.
This repository provides the model weights for Step-Audio-TTS-3B, an LLM (Large Language Model) trained with a dual-codebook approach for text-to-speech synthesis. It also includes a vocoder trained with the same dual-codebook approach and a specialized vocoder optimized for humming generation. Together, these resources enable high-quality speech synthesis and humming generation.
## Performance comparison of content consistency (CER/WER) between Step-Audio, GLM-4-Voice, and MinMo.
| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |
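
For readers unfamiliar with the metrics, the sketch below shows how CER and WER are conventionally computed: the edit distance between a reference transcript and a hypothesis transcript, divided by the reference length. It is an illustration only, not the SEED TTS Eval pipeline, which transcribes the synthesized audio with an ASR model and applies its own text normalization. The function names and the simple whitespace and lowercase normalization are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored (assumed normalization)."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


def wer(reference, hypothesis):
    """Word error rate: tokens are lowercase whitespace-separated words (assumed normalization)."""
    return error_rate(reference.lower().split(), hypothesis.lower().split())


if __name__ == "__main__":
    # One substituted character out of six, and one substituted word out of six.
    print(f"CER: {cer('今天天气很好', '今天天汽很好'):.2f}%")
    print(f"WER: {wer('the cat sat on the mat', 'the cat sat on a mat'):.2f}%")
```

Lower is better for both metrics, which is what the ↓ in the table headers indicates.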
## Results of TTS Models on SEED Test Sets.
*Step-Audio-TTS-3B-Single denotes the dual-codebook backbone with a single-codebook vocoder.*

| Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
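
The SS columns are not defined in this README. Assuming SS stands for speaker similarity, measured in the usual way as the cosine similarity between a speaker embedding of the prompt audio and one of the generated audio, the computation can be sketched as follows. The embedding vectors here are toy values; in practice they would come from a speaker-verification model, which this example does not include.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical 4-dimensional embeddings, for illustration only.
    prompt_embedding = [0.2, 0.7, 0.1, 0.4]
    generated_embedding = [0.25, 0.6, 0.05, 0.5]
    print(f"SS: {cosine_similarity(prompt_embedding, generated_embedding):.3f}")
```

Under this assumption, a higher value means the generated voice is closer to the prompt speaker, which matches the ↑ in the table headers.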
## Performance comparison of Dual-codebook Resynthesis with CosyVoice.
| Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |
# More information
For more information, please refer to our repository: [Step-Audio](https://github.com/stepfun-ai/Step-Audio).