In the same vein, we recently released v0.1 of our humor benchmark [1]. We use human answers from a Cards Against Humanity-style game called Bad Cards [2] as ground truth for what is funny. The models choose a card from a hand of 3-6 cards, so it is not quite de novo joke creation.
[1] https://goodstartlabs.com/leaderboards/lol-arena
[2] https://bad.cards/