
This is ... not what I expected. It's basically wiring up pre-trained models to ChatGPT via a router and "modality transformations" (a.k.a. speech-to-text and text-to-speech).

I expected it to be a GPT-style model that processes audio directly to perform a ton of speech and maybe speech-text tasks in a zero-shot manner.



Take a look at AudioLDM (https://github.com/haoheliu/AudioLDM), it might be more what you expected:

- Text-to-Audio Generation: Generate audio given text input.

- Audio-to-Audio Generation: Given an audio, generate another audio that contains the same type of sound.

- Text-guided Audio-to-Audio Style Transfer: Transfer the sound of an audio into another one using the text description.


so then the training data is text, not audio?


you might be interested in suno-ai/bark



