That was my initial goal, but I had a lot of trouble with vanilla MeCab failing to parse much of the text. That was before neologd, though, so I think it would work better now.
I don’t have the source code on me, but I scraped it from a website that publishes subtitles. The scraping was easy; the cleaning was not. I believe this spreadsheet was generated from my first attempt at cleaning.
A lot of sources in Japanese NLP and linguistics have a bad habit of changing their URLs often, so links bitrot easily. Sorry.