This is ... not what I expected. It's basically wiring up pre-trained models to ChatGPT via a router and "modality transformations" (a.k.a. speech-to-text and text-to-speech).
I expected it to be a GPT-style model that processes audio directly to perform a ton of speech and maybe speech-text tasks in a zero-shot manner.