
The only reason to use Hadoop is if you need a Shuffle phase, i.e. the intermediate data between Map and Reduce is too big to fit on one machine. Otherwise:

If you have big input but small append-only output, use a work queue (SQS or MySQL/Postgres will let you set this up in minutes), have each worker dump its results to a file, and merge the files with something like gzcat part-*.gz | sort | uniq | gzip > output.txt.gz (a worker sketch follows below).

If you have big input but only a small volume of update-based output (roughly under 1,000 outputs/sec), have your workers update an RDBMS directly.

If you have small input but big output, write the output in chunks, progressively upload each chunk to S3, and delete it locally.

If both your input and output fit on disk, use UNIX pipelines and command-line tools.

If they fit in RAM, just load them with Pandas or equivalent and manipulate them in your favorite interactive programming language.
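
For the work-queue case, here is a minimal sketch of an SQS-backed worker, assuming boto3 and an existing queue; QUEUE_URL, process_chunk, and the part-file name are placeholders, not anything from a real project. Each worker drains the queue and appends its results to a gzipped part file:

    # Minimal work-queue worker sketch. QUEUE_URL, process_chunk, and the
    # part-file name below are placeholders.
    import gzip
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"
    sqs = boto3.client("sqs")

    def process_chunk(body):
        # Placeholder: turn one work item into output lines.
        return [body.upper()]

    def worker(out_path):
        # Append results to a local gzipped part file; the part files from
        # all workers get merged with the gzcat pipeline above.
        with gzip.open(out_path, "at") as out:
            while True:
                resp = sqs.receive_message(
                    QueueUrl=QUEUE_URL,
                    MaxNumberOfMessages=10,
                    WaitTimeSeconds=20,  # long polling
                )
                messages = resp.get("Messages", [])
                if not messages:
                    break  # queue drained
                for msg in messages:
                    for line in process_chunk(msg["Body"]):
                        out.write(line + "\n")
                    # Delete the message only after its output is written,
                    # so a crashed worker's items get redelivered.
                    sqs.delete_message(
                        QueueUrl=QUEUE_URL,
                        ReceiptHandle=msg["ReceiptHandle"],
                    )

    if __name__ == "__main__":
        worker("part-0001.txt.gz")

Run one copy per core or per machine; once the queue is drained, the part files are combined with the one-line merge above.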

