Speech Solutions✨

Hosted on 🤗 Hugging Face Spaces

This Gradio app combines AI-powered speech and language processing. It supports the following features:

  • Speech-to-text (OpenAI Whisper)
  • Language translation (GPT-4) (in progress)
  • Improved transcription (GPT-4) (in progress)
  • Text-to-speech (in progress)

UPDATE: The app now includes YouTube metadata extraction features (title, URL, ID, subtitles, and tag checking).

NOTE: Additional AI solutions for other use cases are currently being added to this app.

OpenAI / Whisper + stable-ts

OpenAI's Whisper is a versatile speech recognition model trained on diverse audio for tasks such as multilingual transcription, translation, and language identification. With the help of stable-ts, it provides accurate word-level timestamps in chronological order without extra processing.

Note: The default values are chosen to balance accuracy and processing speed. For more accuracy, you can set MODEL SIZE to large, large v2, or large v3, but these models may take longer to process.
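As a rough illustration of how Whisper and stable-ts fit together, the sketch below loads a model of a chosen size and collects word-level timestamps. It is not the app's own code: the helper name `transcribe_words`, the `MODEL_SIZES` tuple, and the "base" default are assumptions made for this example, and running a transcription requires the stable-ts package plus a local audio or video file.

```python
from typing import List, Optional, Tuple

# Assumed list of Whisper model sizes; the app's actual options may differ.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large", "large-v2", "large-v3")


def transcribe_words(
    path: str,
    model_size: str = "base",        # assumed default, not taken from the app
    language: Optional[str] = None,  # None lets Whisper detect the language
) -> List[Tuple[str, float, float]]:
    """Return (word, start_seconds, end_seconds) tuples for one media file."""
    if model_size not in MODEL_SIZES:
        raise ValueError(f"Unknown model size: {model_size!r}")

    # Imported here so the helper can be defined without stable-ts installed.
    import stable_whisper

    model = stable_whisper.load_model(model_size)
    result = model.transcribe(path, language=language)

    words = []
    for segment in result.segments:
        for word in segment.words:
            words.append((word.word.strip(), word.start, word.end))
    return words
```

A call such as `transcribe_words("interview.mp3", model_size="large-v3", language="en")` (with a hypothetical file name) would trade longer processing time for the larger model's accuracy.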

Source Language
Model Type
Model Size

These settings allow you to customize the segmentation of the audio or video file. Adjust these parameters to control how the segments are created based on characters, words, and lines.

Note: The values shown are the defaults. You can adjust them to your needs, but be aware that changing them changes how the audio or video file is segmented.
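To make the effect of character and word limits concrete, here is a minimal pure-Python sketch that groups timestamped words into segments. It only illustrates the idea and is not how stable-ts or this app implements segmentation; the `Word` and `Segment` classes, the `split_words` function, and the default limits are all hypothetical, and line-based limits are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Segment:
    words: List[Word]

    @property
    def text(self) -> str:
        return " ".join(w.text for w in self.words)

    @property
    def start(self) -> float:
        return self.words[0].start

    @property
    def end(self) -> float:
        return self.words[-1].end


def split_words(words: List[Word], max_chars: int = 40, max_words: int = 8) -> List[Segment]:
    """Group words into segments that respect both limits where possible."""
    segments: List[Segment] = []
    current: List[Word] = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        # Close the current segment if adding this word would break a limit.
        if current and (len(current) >= max_words or len(candidate) > max_chars):
            segments.append(Segment(current))
            current = []
        current.append(word)
    if current:
        segments.append(Segment(current))
    return segments
```

Lowering `max_chars` or `max_words` produces more, shorter segments; a single word longer than `max_chars` still becomes its own segment rather than being cut.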