Good question. In my experience, combining generic descriptors works best. This is probably because the text captions used during training mostly consist of generic instrument names, genre names, and adjectives.
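To make that concrete, here is a tiny hypothetical example of what I mean by combining generic descriptors into a prompt (the word lists and the exact format are just placeholders, not what the training captions actually look like):

```python
import random

# Hypothetical descriptor pools, mirroring the kinds of words the captions contain.
instruments = ["acoustic guitar", "electric piano", "violin"]
genres = ["jazz", "ambient", "folk"]
adjectives = ["warm", "bright", "mellow"]

# Combining one descriptor from each pool tends to work better than long,
# specific descriptions the model never saw during training.
prompt = f"{random.choice(adjectives)} {random.choice(instruments)}, {random.choice(genres)}"
print(prompt)  # e.g. "mellow violin, ambient"
```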
I can't say for sure, but I think the reverb sounds off in that particular example because the reverb in the recording has a longer decay than the maximum reverb duration I set for the experiments (1 s). I will set it longer in future experiments.
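To illustrate why that 1 s cap matters, here is a minimal sketch of a DDSP-style reverb where a fixed-length impulse response (learned in the real model, random here) is convolved with the dry signal. All names and values are placeholders, not the actual implementation; the point is that any room decay longer than `ir_seconds` simply cannot be represented:

```python
import numpy as np
from scipy.signal import fftconvolve

sample_rate = 16000
ir_seconds = 1.0  # the 1 s cap mentioned above: decay beyond this is cut off
ir = np.random.randn(int(sample_rate * ir_seconds)) * 0.01  # stand-in for the learned IR

def apply_reverb(dry: np.ndarray, ir: np.ndarray) -> np.ndarray:
    """Convolve the dry signal with the (fixed-length) impulse response."""
    wet = fftconvolve(dry, ir)[: len(dry)]
    return dry + wet  # simplified dry/wet mix

dry = np.random.randn(sample_rate * 2)  # 2 s of placeholder audio
out = apply_reverb(dry, ir)
```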
Yes, that kind of visualisation could certainly be done. The authors of the prior work we are building on (DDSP) have made some great visualisations that I think you will find useful: https://storage.googleapis.com/ddsp/index.html
Oh nice, I have not. Thanks, I've got some catching up to do! I see that they used the NSynth dataset; I didn't realize it was publicly available. I recall that the NSynth paper came out just as I was finishing that work, so you can imagine I felt a bit scooped ;) (but NSynth was far more impressive, so what could I say..)