Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

That was my initial goal, but I had a lot of trouble with vanilla MeCab not understanding a lot of the text. But this was before neologd, so i think it would work better now.

I don’t have the source code on me, but I scraped it from a website that publishes subtitles. The scraping was easy, the cleaning not, and I believe this spreadsheet is generated from my first attempt at cleaning.

A lot of sources in Japanese nlp and linguistics have a bad habit of changing url often, so it bitrots easily. Sorry.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: