Apache Spark — Complete Guide

Core Modules

Six Pillars of Spark

Spark provides a unified platform covering all major big-data workloads through specialized libraries built on its core engine.

🔥

Spark Core

The foundation providing distributed task scheduling, memory management, fault recovery, and the RDD (Resilient Distributed Dataset) abstraction.

📊

Spark SQL

Query structured data using SQL or the DataFrame/Dataset API. Integrates seamlessly with Hive, Parquet, JSON, JDBC, and more.

🌊

Structured Streaming

Build real-time pipelines with the same DataFrame API used for batch processing. Supports event-time windowing and end-to-end exactly-once semantics when paired with replayable sources and idempotent sinks.

🤖

MLlib

Distributed machine learning with algorithms for classification, regression, clustering, collaborative filtering, and pipeline building.

🕸️

GraphX

Graph computation and analysis at scale using the property graph model. Includes PageRank, connected components, triangle counting, and more.

🐼

Pandas API on Spark

Run pandas-style code at scale with few changes. The pyspark.pandas module mirrors much of the pandas API for distributed DataFrame workloads, though not every pandas feature is supported.

Internals

How Spark Works

How the driver, cluster manager, executors, and storage systems fit together in Spark's distributed execution model.

🖥️ Driver Program (SparkContext)

↓ submits jobs ↓

⚙️ Cluster Manager

↓ allocates resources ↓

🔧 Executor 1
Tasks · Cache

🔧 Executor 2
Tasks · Cache

🔧 Executor 3
Tasks · Cache

🔧 Executor N
Tasks · Cache

↓ read/write ↓

🗄️ HDFS

🪣 S3

☁️ GCS

🐘 Hive

🔵 Kafka

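The flow in the diagram above (driver splits a job into tasks, executors run them, driver combines the results) can be sketched in plain Python. This is a toy model, not Spark code: the thread pool merely stands in for executors granted by a cluster manager, and the names `split_into_partitions`, `run_task`, and `run_job` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the input into roughly equal partitions."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(data[start:end])
        start = end
    return partitions


def run_task(partition):
    """Executor side: one task processes exactly one partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=3, num_partitions=6):
    """Driver side: schedule one task per partition, then merge the results."""
    partitions = split_into_partitions(data, num_partitions)
    # The pool plays the role of the executors allocated to this application.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(run_task, partitions))
    # Only the small per-task results travel back to the driver.
    return sum(partial_results)


total = run_job(list(range(10)))
print(total)
```

Having more partitions than executors, as here, means each executor works through several tasks in turn, which is also how a real cluster evens out uneven partition sizes.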

Data Abstractions

RDD → DataFrame → Dataset

📦

RDD (Resilient Distributed Dataset)

The foundational low-level API. Immutable, fault-tolerant collections partitioned across nodes. Full control, but verbose. Best for unstructured data.

📋

DataFrame

Distributed table with named columns, schema-aware. SQL-like operations, Catalyst optimizer, and Tungsten execution. The go-to for structured data.

🎯

Dataset (Scala/Java)

Combines DataFrame's optimizer benefits with RDD's compile-time type safety. The best of both worlds for JVM users.
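To make "immutable, fault-tolerant, partitioned" more concrete, here is a minimal pure-Python sketch of the lineage idea behind RDDs. `ToyRDD` is a hypothetical class for illustration only and is not part of any Spark API: each transformation returns a new object that remembers its parent, so any single partition can be recomputed from the source on demand.

```python
class ToyRDD:
    """A toy stand-in for an RDD: immutable, partitioned, and lazy."""

    def __init__(self, compute_partition, num_partitions, parent=None, op="source"):
        self._compute_partition = compute_partition
        self.num_partitions = num_partitions
        self.parent = parent
        self.op = op

    @classmethod
    def from_list(cls, data, num_partitions):
        # Round-robin the elements across partitions.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(lambda i: list(chunks[i]), num_partitions)

    def map(self, fn):
        # Nothing runs here; we only record how to derive partition i.
        return ToyRDD(lambda i: [fn(x) for x in self.compute(i)],
                      self.num_partitions, parent=self, op="map")

    def filter(self, pred):
        return ToyRDD(lambda i: [x for x in self.compute(i) if pred(x)],
                      self.num_partitions, parent=self, op="filter")

    def compute(self, i):
        """Recompute partition i from lineage (what happens after a failure)."""
        return self._compute_partition(i)

    def collect(self):
        """Action: evaluate every partition and gather the results."""
        return [x for i in range(self.num_partitions) for x in self.compute(i)]

    def lineage(self):
        node, ops = self, []
        while node is not None:
            ops.append(node.op)
            node = node.parent
        return list(reversed(ops))


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=2)
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(evens_squared.lineage())
print(evens_squared.collect())
```

If the machine holding partition 1 were lost, calling `evens_squared.compute(1)` would rebuild just that partition by replaying `source → map → filter`, with no replicated copy required.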

Examples

Spark in Action

Real code snippets across languages and APIs — from word count classics to MLlib pipelines.

# PySpark DataFrame – Word Count
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, col

spark = SparkSession.builder \
    .appName("WordCount") \
    .getOrCreate()

# Read text file from S3
df = spark.read.text("s3://my-bucket/data/*.txt")

words = df.select(
    explode(split(lower(col("value")), r"\W+")).alias("word")
).filter(col("word") != "")

result = words.groupBy("word").count().orderBy("count", ascending=False)
result.show(20)

# Write result as Parquet
result.write.mode("overwrite").parquet("s3://my-bucket/output/wordcount")

// Scala – Compute average salary by department
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("SalaryAnalysis")
  .getOrCreate()

import spark.implicits._

val employees = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///data/employees.csv")

val avgSalary = employees
  .groupBy("department")
  .agg(avg("salary").alias("avg_salary"),
       count("*").alias("headcount"))
  .orderBy(desc("avg_salary"))

avgSalary.show()

-- Spark SQL – Window functions & CTEs
CREATE OR REPLACE TEMP VIEW sales AS
SELECT * FROM parquet.`s3://bucket/sales/`;

WITH ranked AS (
  SELECT
    region,
    product,
    revenue,
    RANK() OVER (
      PARTITION BY region
      ORDER BY revenue DESC
    ) AS rnk
  FROM sales
  WHERE year = 2024
)
SELECT * FROM ranked
WHERE rnk <= 3
ORDER BY region, rnk;
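One subtlety of the query above is how RANK() treats ties. The following pure-Python sketch emulates `RANK() OVER (PARTITION BY ... ORDER BY ... DESC)` on a handful of made-up rows; the helper `rank_within_partition` and the sample data are hypothetical and exist only to show the semantics.

```python
from itertools import groupby
from operator import itemgetter


def rank_within_partition(rows, partition_key, order_key):
    """Emulate RANK() OVER (PARTITION BY partition_key ORDER BY order_key DESC)."""
    ranked = []
    by_partition = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(by_partition, key=itemgetter(partition_key)):
        ordered = sorted(group, key=itemgetter(order_key), reverse=True)
        rank, previous = 0, object()  # sentinel that equals no real value
        for position, row in enumerate(ordered, start=1):
            if row[order_key] != previous:
                # A new distinct value takes its row position as rank,
                # so ranks skip ahead after a tie.
                rank = position
                previous = row[order_key]
            ranked.append({**row, "rnk": rank})
    return ranked


# Made-up sample rows, for illustration only.
sales = [
    {"region": "EU", "product": "A", "revenue": 100},
    {"region": "EU", "product": "B", "revenue": 90},
    {"region": "EU", "product": "C", "revenue": 80},
    {"region": "EU", "product": "D", "revenue": 80},
    {"region": "EU", "product": "E", "revenue": 50},
    {"region": "US", "product": "A", "revenue": 70},
]

ranked = rank_within_partition(sales, "region", "revenue")
top3 = [row for row in ranked if row["rnk"] <= 3]
```

Because products C and D tie at rank 3, the `rnk <= 3` filter returns four rows for the EU region. When exactly three rows per region are required, ROW_NUMBER() is the usual alternative; DENSE_RANK() keeps ties but does not skip ranks.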

# Structured Streaming – Kafka → Aggregation → Console
from pyspark.sql.functions import from_json, col, window
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

schema = StructType() \
    .add("user", StringType()) \
    .add("event", StringType()) \
    .add("amount", DoubleType()) \
    .add("ts", TimestampType())  # withWatermark/window require a timestamp column

raw = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "transactions") \
    .load()

events = raw.select(from_json(col("value").cast("string"), schema).alias("d")).select("d.*")

agg = events \
    .withWatermark("ts", "10 minutes") \
    .groupBy(window(col("ts"), "5 minutes"), "event") \
    .sum("amount")

query = agg.writeStream.format("console").start()
query.awaitTermination()
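What the window and watermark in the streaming example do to individual events can be modeled in a few lines of plain Python. This is a simplified sketch under stated assumptions: it advances the watermark after every event, whereas Spark updates it between micro-batches, and the function names and sample events are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)      # tumbling window length
LATENESS = timedelta(minutes=10)   # watermark delay threshold
EPOCH = datetime(1970, 1, 1)


def window_start(ts):
    """Align a timestamp to the start of its 5-minute tumbling window."""
    return ts - (ts - EPOCH) % WINDOW


def aggregate(events):
    """Sum amounts per (window, event type), dropping rows behind the watermark."""
    totals = defaultdict(float)
    dropped = []
    max_seen = None
    for ts, event, amount in events:
        watermark = max_seen - LATENESS if max_seen is not None else None
        start = window_start(ts)
        if watermark is not None and start + WINDOW <= watermark:
            # The window closed before the watermark, so the row is too late.
            dropped.append((ts, event, amount))
            continue
        totals[(start, event)] += amount
        max_seen = ts if max_seen is None else max(max_seen, ts)
    return dict(totals), dropped


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute)


# Made-up events: (event time, event type, amount).
events = [
    (at(12, 1), "purchase", 10.0),
    (at(12, 3), "purchase", 5.0),
    (at(12, 21), "refund", 2.0),     # pushes the watermark to 12:11
    (at(12, 2), "purchase", 99.0),   # window [12:00, 12:05) is already closed
    (at(12, 14), "purchase", 1.0),   # late, but window [12:10, 12:15) is still open
]

totals, dropped = aggregate(events)
```

The 12:02 event is discarded because its window ended before the watermark, while the 12:14 event is late yet still counted. A longer delay threshold tolerates more lateness at the cost of keeping window state in memory for longer.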

# MLlib – Classification Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Training data: (label, text)
data = spark.createDataFrame([
    (1.0, "spark is fast and scalable"),
    (0.0, "hadoop mapreduce is slow"),
    (1.0, "in-memory computation wins"),
], ["label", "text"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf        = HashingTF(inputCol="words", outputCol="rawFeatures")
idf       = IDF(inputCol="rawFeatures", outputCol="features")
lr        = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, tf, idf, lr])
model    = pipeline.fit(data)

evaluator = BinaryClassificationEvaluator()
print(f"AUC: {evaluator.evaluate(model.transform(data)):.3f}")
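The feature stages of the pipeline above (Tokenizer, HashingTF, IDF) can be approximated in plain Python to show what each one contributes. This is only a sketch: Spark's HashingTF uses MurmurHash3 with 2^18 buckets by default, whereas this version assumes `zlib.crc32` and a small bucket count for readability. The IDF formula, log((m + 1) / (df + 1)), follows the one documented for MLlib; the function names are hypothetical.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 1024  # deliberately small; an assumption for this sketch


def tokenize(text):
    """Like Tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Map each token to a bucket index and count occurrences per bucket."""
    buckets = (zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    return dict(Counter(buckets))


def fit_idf(tf_vectors):
    """Compute idf = log((m + 1) / (df + 1)) for every bucket seen in the corpus."""
    m = len(tf_vectors)
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((m + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Scale raw term frequencies by the fitted inverse document frequencies."""
    return {idx: tf * idf.get(idx, 0.0) for idx, tf in tf_vector.items()}


corpus = [
    "spark is fast and scalable",
    "hadoop mapreduce is slow",
    "in-memory computation wins",
]
tf_vectors = [hashing_tf(tokenize(text)) for text in corpus]
idf = fit_idf(tf_vectors)
features = [tf_idf(vec, idf) for vec in tf_vectors]
```

Hashing avoids building a vocabulary, which suits distributed training, but distinct words can collide in the same bucket. Terms that appear in most documents receive a weight near zero, so the classifier sees mostly the distinguishing words.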

Benchmarks

Performance at Scale

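Spark's in-memory engine can deliver order-of-magnitude speedups over disk-based MapReduce, most visibly on iterative workloads that reuse cached data. The 100× (memory) and 10× (disk) figures are best-case numbers relative to Hadoop MapReduce; actual ratios vary with workload, data size, and cluster configuration.

Relative Processing Speed (higher = faster)

Engine	Relative speed
Spark (memory)	100×
Spark (disk)	10×
Flink	~75×
Hadoop MR	1×
Hive (MR)	~2×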

⚡

DAG Execution

Spark builds a Directed Acyclic Graph of operations, optimizing the entire query plan before executing — minimizing shuffles and redundant reads.

🧠

Catalyst Optimizer

Rule-based and cost-based optimizer rewrites query plans using algebraic transformations, predicate pushdown, and join reordering.

🚀

Tungsten Engine

Whole-stage code generation compiles query plans to optimized JVM bytecode, improving CPU cache locality and, where the JIT compiler can apply them, making use of SIMD instructions.

🗃️

Adaptive Query Execution

Dynamically adjusts query plans at runtime based on statistics collected during shuffles, reducing the need for manual tuning.
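One of the runtime adjustments AQE makes is coalescing small shuffle partitions once their real sizes are known. The sketch below is a simplified pure-Python model of that idea, not Spark's actual algorithm: `coalesce_partitions` is a hypothetical helper, and the partition sizes are made up. In Spark itself the behavior is governed by settings such as `spark.sql.adaptive.enabled`, `spark.sql.adaptive.coalescePartitions.enabled`, and `spark.sql.adaptive.advisoryPartitionSizeInBytes`.

```python
def coalesce_partitions(sizes_bytes, target_bytes):
    """Greedily merge adjacent shuffle partitions up to roughly target_bytes.

    Returns a list of groups, each a list of original partition indices
    that would be read together by a single downstream task.
    """
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes_bytes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


MB = 1024 * 1024
# Made-up sizes observed after a shuffle, for illustration only.
observed = [5 * MB, 10 * MB, 60 * MB, 2 * MB, 3 * MB, 70 * MB, 1 * MB]
plan = coalesce_partitions(observed, target_bytes=64 * MB)
print(plan)
```

Only adjacent partitions are merged, which keeps each group a contiguous range that one task can fetch. An oversized partition (index 5 here) is left alone by this step; handling skew is a separate AQE feature.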

Applications

What Spark Powers

From real-time fraud detection to petabyte-scale ETL, Spark is the engine behind modern data infrastructure.

🛡️

Fraud Detection

Process millions of transactions per second, applying ML models to flag anomalies in real time.

Streaming + MLlib

🎬

Recommendation Engines

MLlib's ALS (alternating least squares) algorithm is a common choice for building collaborative-filtering recommenders that serve personalized suggestions at scale.

MLlib · ALS

🏗️

ETL Pipelines

Transform terabytes of raw logs, clickstreams, and IoT sensor data into clean, queryable warehouses.

Spark SQL · Parquet

🧬

Genomics & Life Sciences

Analyze whole-genome sequencing data, variant calling, and drug discovery pipelines at population scale.

Hail · ADAM

📈

Financial Analytics

Risk modeling, backtesting trading strategies, regulatory reporting, and market data aggregation.

Spark SQL · Python

🌐

Log Analytics

Ingest billions of server logs per day, detect patterns, build dashboards, and trigger alerts.

Streaming · ELK

🤖

LLM & AI Pipelines

Preprocess and tokenize training corpora for large language models, embedding generation at petabyte scale.

Mosaic · Spark NLP

🔗

Graph Analytics

Social network analysis, PageRank for SEO, supply chain optimization, and knowledge graph construction.

GraphX · GraphFrames

Tools & Integrations

The Spark Ecosystem

Spark sits at the center of a rich ecosystem of data tools, platforms, and cloud services.

🏔️

Databricks

Managed Spark platform by the creators of Spark. Delta Lake, notebooks, MLflow.

🐘

Apache Hadoop

HDFS storage and YARN cluster manager commonly used with Spark.

🔵

Apache Kafka

High-throughput message broker for ingesting real-time data into Structured Streaming.

🏗️

Delta Lake

ACID transactions, schema enforcement, and time travel on top of Parquet files.

🪣

Apache Hudi

Upserts and incremental processing for data lakes on S3/HDFS.

🧊

Apache Iceberg

Open table format with snapshot isolation, schema evolution, and partition pruning.

🐝

Apache Hive

Metastore and HQL; Spark can read/write Hive tables natively.

🌊

Apache Flink

Complementary streaming engine; sometimes used alongside Spark.

☁️

AWS EMR

Managed Spark clusters on EC2 with S3 integration.

🌩️

Azure Synapse

Integrated analytics service with Spark pools and SQL DW.

🔮

Google Dataproc

Managed Spark/Hadoop on GCP with BigQuery connector.

🧪

MLflow

ML lifecycle management: tracking, model registry, deployment.

Timeline

The Story of Spark

From a research paper at UC Berkeley to the most widely deployed big data engine in the world.

2009

Born at AMPLab, UC Berkeley

Matei Zaharia starts Spark as a research project to address Hadoop MapReduce's latency limitations, introducing the RDD abstraction.

2010

First Paper & Open Source Release

Spark is open-sourced under BSD license. The paper "Spark: Cluster Computing with Working Sets" is published at HotCloud.

2013

Apache Incubator · Databricks Founded

Spark enters Apache Incubator. Matei Zaharia and colleagues found Databricks to commercialize Spark.

2014

Apache Top-Level Project · Spark 1.0

Spark graduates to a top-level Apache project. Spark 1.0 ships with Spark SQL and MLlib. Spark sets a world record by sorting 100 TB in 23 minutes.

2016

Spark 2.0 — Structured APIs

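Major release introducing Structured Streaming, unifying the DataFrame and Dataset APIs, and shipping the second-generation Tungsten engine with whole-stage code generation, all building on the Catalyst optimizer from the 1.x line. The DataFrame API becomes the default.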

2020

Spark 3.0 — Adaptive Query Execution

AQE, dynamic partition pruning, and Pandas UDF improvements arrive, and Python 2 support is deprecated. Spark handles exabyte-scale workloads in production.

2022

Spark 3.3 — Pandas API on Spark

The pandas API on Spark, first shipped in Spark 3.2 (2021), gains broader API coverage, letting data scientists scale much of their existing pandas code with few rewrites.

2024

Spark 4.0 — Python Dataframe Client

Spark Connect, introduced in Spark 3.4, matures, enabling lightweight Python clients to talk to remote Spark servers. ANSI SQL mode is enabled by default. Preview releases shipped in 2024, with the general release following in 2025.

Comparison

Spark vs. The World

How Spark compares to other popular data processing technologies.

Feature	Apache Spark	Hadoop MapReduce	Apache Flink	Dask (Python)	DuckDB
Processing Model	Batch + Streaming	Batch only	True Streaming	Batch + Streaming	Single-node batch
Speed	Very Fast (in-memory)	Slow (disk I/O)	Very Fast	Fast (single node)	Extremely fast (OLAP)
Fault Tolerance	RDD lineage + checkpointing	Replication + re-execution	Checkpointing	Task retry	Transactions (WAL)
ML Support	Built-in MLlib	None	FlinkML (limited)	Via sklearn	None
Graph Analytics	GraphX built-in	None	Gelly (limited)	None	None
SQL Support	Full Spark SQL	Via Hive only	Flink SQL	None native	Full ANSI SQL
Scale	Petabyte+	Petabyte+	Petabyte+	Up to ~TB	Up to ~100GB
Ease of Use	High (DataFrame API)	Low (verbose Java)	Moderate	High (pandas-like)	Very High (SQL)
Streaming Latency	Seconds (micro-batch)	N/A	Milliseconds (true stream)	Seconds	N/A
Languages	Python, Scala, Java, R, SQL	Java, Python (limited)	Java, Scala, Python, SQL	Python	SQL, Python, R, Java, C++
