I tried to do this last year, but every single twss-related handle was taken. Don't believe me? Look for yourself. I can't wait for the day when Twitter starts reclaiming unused/squatted names (and offers a paid service that lets you claim one for life).
I think your negative sample set is a little biased. Since all the negative phrases start with verbs ("was in the car", "went to the park"), any phrase of that shape is given a lower probability.
For example:
> twss.prob("was on a stiff pole");
0.016050826334564946
Only 1.6% chance of that's what she said?!?
EDIT: Counter example:
> twss.prob("that's one stiff pole");
0.9767718880285885
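For what it's worth, here's a toy sketch (my own, not twss.js's actual code) of why a verb-heavy negative set drags these scores around: in a Naive Bayes-style word classifier, a token like "was" that only ever appears in the negative samples gets a strong not-twss weight, no matter what follows it.

```javascript
// Toy word-level Naive Bayes with add-one smoothing (my own sketch,
// not twss.js's actual implementation).
function train(positives, negatives) {
  const counts = { pos: {}, neg: {} };
  const totals = { pos: 0, neg: 0 };
  const add = (label, phrase) => {
    for (const tok of phrase.toLowerCase().split(/\s+/)) {
      counts[label][tok] = (counts[label][tok] || 0) + 1;
      totals[label]++;
    }
  };
  positives.forEach(p => add('pos', p));
  negatives.forEach(n => add('neg', n));
  return phrase => {
    let logPos = 0, logNeg = 0; // uniform class priors
    for (const tok of phrase.toLowerCase().split(/\s+/)) {
      logPos += Math.log(((counts.pos[tok] || 0) + 1) / (totals.pos + 2));
      logNeg += Math.log(((counts.neg[tok] || 0) + 1) / (totals.neg + 2));
    }
    return 1 / (1 + Math.exp(logNeg - logPos)); // P(twss | phrase)
  };
}

const prob = train(
  ["that's one stiff pole", "it's so big"],
  ["was in the car", "went to the park", "was at the store"]
);
// "was" only ever appears in the negatives, so it drags the score
// down even when the rest of the phrase looks very twss:
console.log(prob("was on a stiff pole") < prob("that's one stiff pole")); // true
```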
A while back I was interested in implementing a much less naive algorithm for classifying TWSS expressions, based on this [1] paper. Never actually got around to finishing the work.
There was a similar project before this that implemented it as an IRC bot. You can also train that bot by telling it which jokes are good and which are bad. :)
I was wondering if anyone knew of a place where I could learn about this stuff in general.
I know nothing about unigrams, bigrams, trigrams, tf-idf, Bayesian filtering, etc. Maths - while not awful - is not my strongest point, but I think I could grok a well-written tutorial to this stuff (with code examples!).
Does anyone know of sites where I could start learning about this? I find it very interesting, and I'm sure it could be highly useful and applicable to many different kinds of problems...
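If it helps demystify the jargon a bit: an n-gram is just a sliding window of n tokens, and tf-idf weighs a term by how often it appears in one document against how many documents contain it at all. A quick sketch of both (my own toy code, not from any particular library):

```javascript
// Unigrams/bigrams/trigrams are sliding windows of 1/2/3 tokens.
function ngrams(text, n) {
  const toks = text.toLowerCase().split(/\s+/);
  const out = [];
  for (let i = 0; i + n <= toks.length; i++) {
    out.push(toks.slice(i, i + n).join(' '));
  }
  return out;
}

console.log(ngrams("that is what she said", 2));
// [ 'that is', 'is what', 'what she', 'she said' ]

// tf-idf: term frequency in this doc, discounted by how many docs
// contain the term at all (rare terms score higher).
function tfidf(term, doc, docs) {
  const tf = ngrams(doc, 1).filter(t => t === term).length;
  const df = docs.filter(d => ngrams(d, 1).includes(term)).length;
  return tf * Math.log(docs.length / (1 + df));
}
```

Bayesian filtering is then roughly "multiply up per-token probabilities learned from labeled examples"; Paul Graham's "A Plan for Spam" essay is the classic readable intro to that.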
DanielRapp: in twss.js/lib/classifier/knn.js, the number of nearest neighbors should be odd to prevent ties. [EDIT: also, k should be large enough to prevent over-fitting; a small k means the decision boundary between twss and not-twss becomes highly non-linear. You'd want to implement cross-validation to find the best k.]
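The cross-validation part can be this simple: hold out one example at a time, classify it with the rest, and pick the k that gets the most held-out examples right. A leave-one-out sketch against a toy 1-D kNN (hypothetical data and functions, not the repo's actual API):

```javascript
// Toy 1-D kNN: majority vote among the k nearest points.
// With two classes, an odd k can never tie.
function classify(x, pts, lbls, k) {
  const nearest = pts
    .map((p, i) => ({ d: Math.abs(p - x), l: lbls[i] }))
    .sort((a, b) => a.d - b.d)
    .slice(0, k);
  const ones = nearest.filter(n => n.l === 1).length;
  return ones > k / 2 ? 1 : 0;
}

// Leave-one-out cross-validation: try each candidate k, score it by
// how many held-out points it classifies correctly.
function looPickK(pts, lbls, ks) {
  let best = { k: ks[0], acc: -1 };
  for (const k of ks) {
    let correct = 0;
    for (let i = 0; i < pts.length; i++) {
      const rest = pts.filter((_, j) => j !== i);
      const restLbls = lbls.filter((_, j) => j !== i);
      if (classify(pts[i], rest, restLbls, k) === lbls[i]) correct++;
    }
    const acc = correct / pts.length;
    if (acc > best.acc) best = { k, acc }; // ties go to the smaller k
  }
  return best.k;
}

const pts = [1, 2, 3, 10, 11, 12];
const lbls = [0, 0, 0, 1, 1, 1];
console.log(looPickK(pts, lbls, [1, 3, 5])); // 1
```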
Note to self: machine learning using node.js. What's the speed of calculations? What's memory management like in node.js? Can I find a pure JS implementation of SVM?
Thanks. I did do a simple analysis[1] and changed it[2] to 5 neighbors. Though when I look at the graph now, I see that 4 is actually the optimal value...
I don't think that could reasonably be done with a classifier. You could have "In Soviet Russia X Y you" for each X, Y as your classes, but that would be unreasonable.
Yakov Smirnoff's is a structural joke. You would need to parse the sentence, pattern match, transform it, and then do some kind of regression on the resulting phrase to get its humor quotient.
The Stanford Parser for structural parsing, then some custom pattern-matching and transformation code, might get you somewhere.
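To give a feel for how crude the pattern-matching step can start out, here's a regex stand-in for the parse-match-transform pipeline (a real version would use the parser's subject/verb/object roles instead of word positions, and proper verb morphology):

```javascript
// Naive "In Soviet Russia" reverser: match "you <verb> (the) <object>"
// and swap the roles. The regex stands in for a real parse tree, and
// verb inflection is a hard-coded "+s".
function sovietRussia(sentence) {
  const m = sentence.match(/^you (\w+) (?:the )?(\w+)$/i);
  if (!m) return null;
  const [, verb, object] = m;
  return `In Soviet Russia, ${object} ${verb}s you!`;
}

console.log(sovietRussia("you eat the borscht"));
// "In Soviet Russia, borscht eats you!"
console.log(sovietRussia("completely unrelated sentence")); // null
```

The "humor quotient" regression at the end is the genuinely hard part; the transform above only generates candidates.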
I tried adding dropdowns to change the algorithm and threshold, but switching to knn crashed out ("ReferenceError: trainingPrompt is not defined"), so I scrapped that and just left the demo running the defaults.
MRI's have shown that humans are able to do this because of a dedicated site in the brain called "Scott's region". Once activated, this linguistic region is constantly searching for linguistic cues, surfacing signals to our conscious thoughts when the cues are strong enough.
We made our IRC bot respond to TWSS jokes, but ours was just a dumb match against a set of a few thousand jokes that we'd scraped. You can look at the code at:
https://github.com/jfriedly/jenni
Now that I've taken Stanford's Machine Learning class, though, I think I might just duplicate what this guy did for our bot.
While on the surface it seems like a waste of time (albeit an amusing one), I actually expect this is a great project to learn from because of its use of Bayesian classifiers.
In other words, I'm TOTALLY going to be using this on my next project.
I just hate it when people release JavaScript libraries that needlessly depend on specific platforms. For a while that dependency was usually jQuery, then with the rise of server-side JavaScript it was the DOM in general, and now it appears to be Node.js.
Just write "X for JavaScript", dammit.
That said, this doesn't appear to have any Node.js-specific dependencies; it could be used in any CommonJS environment.
The reason I chose node over "browser-js" is that it was originally going to be a Twitter bot, but I decided to simplify the GitHub repo into just a node module to make it more useful.
But you're totally correct. This could've easily been written in any language.
The source is open and it's really easy to port to the browser, so this shouldn't warrant a complaint. Everyone works with what they feel most comfortable with.