
Spark doesn’t have a hard dependency on Hadoop. Spark doesn’t have a storage engine, but you don’t necessarily need one.
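To illustrate the "no hard dependency" point for the trivial case: a local-mode session can build, write, and read data against the local filesystem with no HDFS or YARN anywhere. This is a minimal sketch in Scala, assuming a stock Spark 3.x distribution on the classpath; the output path is arbitrary.

    import org.apache.spark.sql.SparkSession

    object LocalSparkSketch {
      def main(args: Array[String]): Unit = {
        // local[*] runs Spark inside this JVM: no YARN, no HDFS, no cluster at all.
        val spark = SparkSession.builder()
          .appName("local-no-hadoop-cluster")
          .master("local[*]")
          .getOrCreate()

        import spark.implicits._

        // Build a DataFrame in memory and round-trip it through the local filesystem.
        val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
        df.write.mode("overwrite").parquet("/tmp/spark-sketch/")

        spark.read.parquet("/tmp/spark-sketch/").show()
        spark.stop()
      }
    }

Whether the Hadoop jars sitting on the classpath underneath this count as a "dependency" is exactly what the replies below argue about.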


Spark still depends on Hadoop for a lot:

- Using Parquet files = parquet-mr, which is tied to Hadoop MR https://github.com/apache/spark/tree/master/sql/core/src/mai...

- Using S3 instead of HDFS = Hadoop S3a connector

Even if you don't run HDFS and YARN, you aren't escaping Hadoop. And if some configuration goes wrong, you'll probably need to look into the Hadoop conf files.
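Concretely, reading Parquet from S3 in a "Hadoop-free" setup still goes through Hadoop classes: the s3a connector lives in hadoop-aws, and the fs.s3a.* keys you tune are Hadoop configuration keys that Spark forwards via its "spark.hadoop." prefix. A hedged sketch, assuming Spark 3.x plus hadoop-aws and the AWS SDK bundle on the classpath (they are not in the default Spark download); the bucket name is made up for illustration.

    import org.apache.spark.sql.SparkSession

    object S3aParquetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("s3a-parquet-sketch")
          .master("local[*]")
          // The "spark.hadoop." prefix forwards these keys into Hadoop's Configuration,
          // where the S3AFileSystem from hadoop-aws picks them up.
          .config("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
          .config("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
          .getOrCreate()

        // The s3a:// scheme is Hadoop's S3 connector; the Parquet read itself is
        // implemented on top of parquet-mr and Hadoop input/output abstractions.
        val df = spark.read.parquet("s3a://example-bucket/some/table/")
        df.show(5)

        spark.stop()
      }
    }

When any of those fs.s3a.* settings is wrong, the stack traces and docs you end up reading are Hadoop's, not Spark's.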

The original comment was about the mass of libraries that Hadoop brings in. Spark isn't a way out of that mess. If you try to dockerize Spark, you'll still end up with 300 MB images full of JARs that came from who knows where.


Yes, but my comment was about a serious, production-grade setup.



