The lightning-fast, open-source distributed computing engine for big data processing, streaming, machine learning, and graph analytics.
Core Modules
Spark provides a unified platform covering all major big-data workloads through specialized libraries built on its core engine.
The foundation providing distributed task scheduling, memory management, fault recovery, and the RDD (Resilient Distributed Dataset) abstraction.
Query structured data using SQL or the DataFrame/Dataset API. Integrates seamlessly with Hive, Parquet, JSON, JDBC, and more.
Build real-time pipelines with the same DataFrame API used for batch processing. Supports exactly-once semantics and event-time windowing.
Distributed machine learning with algorithms for classification, regression, clustering, collaborative filtering, and pipeline building.
Graph computation and analysis at scale using the property graph model. Includes PageRank, connected components, triangle counting, and more.
Run pandas code at scale without changes. Drop-in replacement for distributed DataFrame workloads using familiar pandas syntax.
Internals
Click on any component to learn what it does in the Spark execution model.
Data Abstractions
The foundational low-level API. Immutable, fault-tolerant collections partitioned across nodes. Full control, but verbose. Best for unstructured data.
Distributed table with named columns, schema-aware. SQL-like operations, Catalyst optimizer, and Tungsten execution. The go-to for structured data.
Combines DataFrame's optimizer benefits with RDD's compile-time type safety. The best of both worlds for JVM users.
Examples
Real code snippets across languages and APIs — from word count classics to MLlib pipelines.
# PySpark DataFrame – Word Count from pyspark.sql import SparkSession from pyspark.sql.functions import explode, split, lower, col spark = SparkSession.builder \ .appName("WordCount") \ .getOrCreate() # Read text file from S3 df = spark.read.text("s3://my-bucket/data/*.txt") words = df.select( explode(split(lower(col("value")), r"\W+")).alias("word") ).filter(col("word") != "") result = words.groupBy("word").count().orderBy("count", ascending=False) result.show(20) # Write result as Parquet result.write.mode("overwrite").parquet("s3://my-bucket/output/wordcount")
// Scala – Compute average salary by department import org.apache.spark.sql.SparkSession import org.apache.spark.sql.functions._ val spark = SparkSession.builder() .appName("SalaryAnalysis") .getOrCreate() import spark.implicits._ val employees = spark.read .option("header", "true") .option("inferSchema", "true") .csv("hdfs:///data/employees.csv") val avgSalary = employees .groupBy("department") .agg(avg("salary").alias("avg_salary"), count("*").alias("headcount")) .orderBy(desc("avg_salary")) avgSalary.show()
-- Spark SQL – Window functions & CTEs CREATE OR REPLACE TEMP VIEW sales AS SELECT * FROM parquet.`s3://bucket/sales/`; WITH ranked AS ( SELECT region, product, revenue, RANK() OVER ( PARTITION BY region ORDER BY revenue DESC ) AS rnk FROM sales WHERE year = 2024 ) SELECT * FROM ranked WHERE rnk <= 3 ORDER BY region, rnk;
# Structured Streaming – Kafka → Aggregation → Console from pyspark.sql.functions import from_json, col, window from pyspark.sql.types import StructType, StringType, DoubleType schema = StructType() \ .add("user", StringType()) \ .add("event", StringType()) \ .add("amount", DoubleType()) \ .add("ts", StringType()) raw = spark.readStream \ .format("kafka") \ .option("kafka.bootstrap.servers", "broker:9092") \ .option("subscribe", "transactions") \ .load() events = raw.select(from_json(col("value").cast("string"), schema).alias("d")).select("d.*") agg = events \ .withWatermark("ts", "10 minutes") \ .groupBy(window(col("ts"), "5 minutes"), "event") \ .sum("amount") query = agg.writeStream.format("console").start() query.awaitTermination()
# MLlib – Classification Pipeline from pyspark.ml import Pipeline from pyspark.ml.feature import Tokenizer, HashingTF, IDF from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import BinaryClassificationEvaluator # Training data: (label, text) data = spark.createDataFrame([ (1.0, "spark is fast and scalable"), (0.0, "hadoop mapreduce is slow"), (1.0, "in-memory computation wins"), ], ["label", "text"]) tokenizer = Tokenizer(inputCol="text", outputCol="words") tf = HashingTF(inputCol="words", outputCol="rawFeatures") idf = IDF(inputCol="rawFeatures", outputCol="features") lr = LogisticRegression(maxIter=10) pipeline = Pipeline(stages=[tokenizer, tf, idf, lr]) model = pipeline.fit(data) evaluator = BinaryClassificationEvaluator() print(f"AUC: {evaluator.evaluate(model.transform(data)):.3f}")
Benchmarks
Spark's in-memory engine delivers order-of-magnitude speedups over disk-based alternatives.
Spark builds a Directed Acyclic Graph of operations, optimizing the entire query plan before executing — minimizing shuffles and redundant reads.
Rule-based and cost-based optimizer rewrites query plans using algebraic transformations, predicate pushdown, and join reordering.
Whole-stage code generation compiles query plans to optimized JVM bytecode, leveraging CPU cache locality and SIMD instructions.
Dynamically adjusts query plans at runtime based on statistics collected during shuffles — no manual tuning needed.
Applications
From real-time fraud detection to petabyte-scale ETL, Spark is the engine behind modern data infrastructure.
Process millions of transactions per second, applying ML models to flag anomalies in real time.
Streaming + MLlibNetflix, Spotify, and YouTube use Spark's ALS algorithm to power personalized recommendations at scale.
MLlib · ALSTransform terabytes of raw logs, clickstreams, and IoT sensor data into clean, queryable warehouses.
Spark SQL · ParquetAnalyze whole-genome sequencing data, variant calling, and drug discovery pipelines at population scale.
Hail · ADAMRisk modeling, backtesting trading strategies, regulatory reporting, and market data aggregation.
Spark SQL · PythonIngest billions of server logs per day, detect patterns, build dashboards, and trigger alerts.
Streaming · ELKPreprocess and tokenize training corpora for large language models, embedding generation at petabyte scale.
Mosaic · Spark NLPSocial network analysis, PageRank for SEO, supply chain optimization, and knowledge graph construction.
GraphX · GraphFramesTools & Integrations
Spark sits at the center of a rich ecosystem of data tools, platforms, and cloud services.
Managed Spark platform by the creators of Spark. Delta Lake, notebooks, MLflow.
HDFS storage and YARN cluster manager commonly used with Spark.
High-throughput message broker for ingesting real-time data into Structured Streaming.
ACID transactions, schema enforcement, and time travel on top of Parquet files.
Upserts and incremental processing for data lakes on S3/HDFS.
Open table format with snapshot isolation, schema evolution, and partition pruning.
Metastore and HQL; Spark can read/write Hive tables natively.
Complementary streaming engine; sometimes used alongside Spark.
Managed Spark clusters on EC2 with S3 integration.
Integrated analytics service with Spark pools and SQL DW.
Managed Spark/Hadoop on GCP with BigQuery connector.
ML lifecycle management: tracking, model registry, deployment.
Timeline
From a research paper at UC Berkeley to the most widely deployed big data engine in the world.
Comparison
How Spark compares to other popular data processing technologies.
| Feature | Apache Spark | Hadoop MapReduce | Apache Flink | Dask (Python) | DuckDB |
|---|---|---|---|---|---|
| Processing Model | Batch + Streaming | Batch only | True Streaming | Batch + Streaming | Single-node batch |
| Speed | Very Fast (in-memory) | Slow (disk I/O) | Very Fast | Fast (single node) | Extremely fast (OLAP) |
| Fault Tolerance | RDD lineage + checkpointing | Replication + re-execution | Checkpointing | Task retry | Transactions (WAL) |
| ML Support | Built-in MLlib | None | FlinkML (limited) | Via sklearn | None |
| Graph Analytics | GraphX built-in | None | Gelly (limited) | None | None |
| SQL Support | Full Spark SQL | Via Hive only | Flink SQL | None native | Full ANSI SQL |
| Scale | Petabyte+ | Petabyte+ | Petabyte+ | Up to ~TB | Up to ~100GB |
| Ease of Use | High (DataFrame API) | Low (verbose Java) | Moderate | High (pandas-like) | Very High (SQL) |
| Streaming Latency | Seconds (micro-batch) | N/A | Milliseconds (true stream) | Seconds | N/A |
| Languages | Python, Scala, Java, R, SQL | Java, Python (limited) | Java, Scala, Python, SQL | Python | SQL, Python, R, Java, C++ |
Test Your Knowledge
Answer 8 questions to test how well you know Apache Spark.