Voice Studio: Text-to-Speech & Speech-to-Text

Categories: AI, Audio, ASR, TTS, Streamlit, MLX

Author: Tony D

Published: January 22, 2026

Voice Studio is a Streamlit application that combines Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) in one interface. It provides a convenient way to compare cloud models from NVIDIA and Google with MLX models that run locally on Apple Silicon.

[Screenshot: Speech-to-Text (STT) Interface]

[Screenshot: Text-to-Speech (TTS) Interface]

Online Demo

https://jcwinning-speech-text-model.share.connect.posit.cloud/

✨ Core Features

🎤 Speech-to-Text (STT)

  • Google Gemini 2.5 Flash Lite: High-speed, accurate cloud transcription provided via OpenRouter.
  • NVIDIA Parakeet-CTC: Cloud ASR served through NVIDIA Riva.
  • Local MLX Models: Private, local transcription optimized specifically for Apple Silicon.
    • GLM-ASR-Nano: Lightweight and efficient.
    • Whisper-Large-v3-Turbo: Top-tier, high-precision transcription model.
  • Dual Input Modes: Supports real-time microphone recording or audio file uploads (WAV, MP3, M4A).
  • Instant Display: Each model’s result is displayed as soon as that model finishes, without waiting for the others.
  • Auto-Normalization: Automatically converts audio to 16 kHz mono WAV to improve recognition accuracy.
  • Results Download: Each model’s transcription can be saved as a local .md file.
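
The Instant Display behavior in the list above can be approximated by running each backend in its own thread and handling results in completion order. This is a minimal sketch, not the app's actual code: `transcribe_all`, the `backends` mapping, and the two dummy backends are hypothetical names used only for illustration, and the sleeps stand in for real API latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterator, Tuple


def transcribe_all(
    audio_path: str,
    backends: Dict[str, Callable[[str], str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (model_name, transcript) pairs in the order the models finish."""
    with ThreadPoolExecutor(max_workers=max(1, len(backends))) as pool:
        # Map each future back to the model name that produced it.
        futures = {pool.submit(fn, audio_path): name for name, fn in backends.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                yield name, future.result()
            except Exception as exc:
                # A failing backend should not hide the other results.
                yield name, f"[error] {exc}"


# Hypothetical stand-ins for real transcription calls.
def slow_model(path: str) -> str:
    time.sleep(0.3)
    return f"slow transcript of {path}"


def fast_model(path: str) -> str:
    time.sleep(0.05)
    return f"fast transcript of {path}"


if __name__ == "__main__":
    models = {"slow": slow_model, "fast": fast_model}
    for name, text in transcribe_all("sample.wav", models):
        print(f"{name}: {text}")
```

In a Streamlit app, the body of that final loop would presumably write each result into its own placeholder rather than print it, so the faster model's text appears while the slower one is still running.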

🔊 Text-to-Speech (TTS)

  • Qwen TTS (DashScope): Natural speech synthesis from Alibaba’s Tongyi Qwen, featuring 7 distinct voices.
  • NVIDIA Riva (Magpie): Professional-grade multilingual synthesis using the latest Magpie-Multilingual model.
  • Dynamic Voice Selection: Offers a wide range of speaker options for both Chinese (Mandarin) and English.

🚀 Quick Start

Requirements

  • Python 3.10+
  • Apple Silicon (for local MLX functionality)
  • API Keys:
    • OpenRouter (https://openrouter.ai/)
    • NVIDIA NIM (https://build.nvidia.com/)
    • Alibaba Cloud DashScope (https://dashscope.console.aliyun.com/)

Installation Steps

  1. Clone the repository:

    git clone <repository-url>
    cd ARS

  2. Install dependencies:

    pip install -r requirements.txt

  3. Create a .env file in the project root and fill in the keys:

    OPENROUTER_API_KEY=your_key_here
    DASHSCOPE_API_KEY=your_key_here
    NVIDIA_API_KEY=your_key_here
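
Before launching, it can help to check that the three keys from step 3 are actually filled in. The sketch below parses the .env file with the standard library only. Projects like this often use the python-dotenv package instead, and the source does not say which approach the app takes. The helper names `read_env_file` and `missing_keys` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

# Key names taken from the .env template in the installation steps.
REQUIRED_KEYS = ["OPENROUTER_API_KEY", "DASHSCOPE_API_KEY", "NVIDIA_API_KEY"]


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent, empty, or still the placeholder."""
    return [k for k in REQUIRED_KEYS if values.get(k, "") in ("", "your_key_here")]


if __name__ == "__main__":
    missing = missing_keys(read_env_file())
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

Treating `your_key_here` as missing catches the common case where the template was copied but never edited.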

Running the App

streamlit run app.py

☁️ Cloud Deployment

This project is pre-configured for Streamlit Cloud:

  • Automatically detects the runtime environment and disables local models (MLX) during cloud deployment to ensure system stability.
  • API keys can be securely managed via Streamlit’s “Secrets” panel.
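
One plausible way to implement the environment check described above is to enable local models only on macOS with an arm64 CPU and an importable `mlx` package. The source does not show how the app actually detects its environment, so treat this as an assumption. `is_apple_silicon` and `local_models_enabled` are hypothetical names.

```python
import importlib.util
import platform
from typing import Optional


def is_apple_silicon(system: str, machine: str) -> bool:
    """MLX targets Apple Silicon, i.e. macOS running on an arm64 CPU."""
    return system == "Darwin" and machine == "arm64"


def local_models_enabled(
    system: Optional[str] = None,
    machine: Optional[str] = None,
    mlx_installed: Optional[bool] = None,
) -> bool:
    """Decide whether to offer local MLX models.

    Arguments default to values probed from the current host; passing them
    explicitly makes the decision easy to test.
    """
    if system is None:
        system = platform.system()
    if machine is None:
        machine = platform.machine()
    if mlx_installed is None:
        # find_spec returns None when the top-level package is not installed.
        mlx_installed = importlib.util.find_spec("mlx") is not None
    return is_apple_silicon(system, machine) and mlx_installed


if __name__ == "__main__":
    print("Local MLX models enabled:", local_models_enabled())
```

On a Linux cloud host the function returns False, so the app could hide the MLX options and offer only the cloud backends.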

🛠️ Technology Stack

  • Interface: Streamlit
  • Local Inference: MLX (optimized for Mac M-series chips)
  • Cloud Services: NVIDIA Riva, OpenRouter (Gemini), Alibaba Cloud DashScope (Qwen)
  • Audio Processing: Wave, SoundFile, Streamlit Mic Recorder
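
To illustrate the Auto-Normalization step listed under Core Features, here is a sketch that converts a WAV file to 16 kHz mono using only Python's standard `wave` and `array` modules. It is not the app's implementation. It assumes 16-bit PCM input and uses plain linear interpolation with no anti-aliasing filter. Compressed formats such as MP3 or M4A would need to be decoded by another tool first. `to_16k_mono` is a hypothetical name.

```python
import array
import sys
import wave


def to_16k_mono(src_path: str, dst_path: str, target_rate: int = 16000) -> None:
    """Convert a 16-bit PCM WAV file to mono at target_rate (default 16 kHz)."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM input")
        channels = src.getnchannels()
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Downmix: average the interleaved channel samples of each frame.
    frames = len(samples) // channels
    mono = [
        sum(samples[i * channels:(i + 1) * channels]) // channels
        for i in range(frames)
    ]

    # Resample by linear interpolation between neighboring samples.
    out_len = int(frames * target_rate / rate)
    out = array.array("h")
    for j in range(out_len):
        pos = j * rate / target_rate
        i = min(int(pos), frames - 1)
        frac = pos - i
        nxt = mono[min(i + 1, frames - 1)]
        out.append(int(round(mono[i] * (1 - frac) + nxt * frac)))
    if sys.byteorder == "big":
        out.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(out.tobytes())
```

A production pipeline would more likely rely on a dedicated resampler with proper filtering. The point here is only the shape of the step: read, downmix, resample, write.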
