Set Up Apache Spark with Jupyter Notebook on macOS
Are you interested in exploring the world of big data and machine learning? Look no further! In this article, we'll take you through a quick and easy guide to installing and configuring Apache Spark with Jupyter Notebook on your macOS machine.
Before we dive into the Apache Spark & Jupyter setup process, let's make sure we have the necessary prerequisites installed. We'll need Python 3, pip and Java.
Install Python 3
The first step is to install Python 3 using Homebrew, a popular package manager for MacOS. Open your terminal and run the following command:
brew install python@3.11
Once installed, verify the version by running:
python3 -V
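If python3 on your PATH still points at an older interpreter (Homebrew's versioned formulas don't always take over the python3 name), you can call this build explicitly:
python3.11 -V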
Install pip
Next, make sure pip, the package installer for Python, is up to date (it ships with Homebrew's Python). Run the following command:
python3.11 -m pip install --upgrade pip
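To confirm pip is present and tied to the right interpreter, check its version:
python3.11 -m pip --version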
Install Java
Install Java using Homebrew. Open your terminal and run the following command:
brew install java
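Note that Homebrew's openjdk is keg-only, so the macOS java wrappers (including /usr/libexec/java_home, which we rely on later to set JAVA_HOME) won't find it by default. Homebrew's own install caveat suggests symlinking the JDK into the system location; the path below is for Apple Silicon, and brew info openjdk prints the right one for your machine:
sudo ln -sfn /opt/homebrew/opt/openjdk/libexec/openjdk.jdk /Library/Java/JavaVirtualMachines/openjdk.jdk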
Once installed, verify the version by running:
java -version
openjdk version "22.0.1" 2024-04-16
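You can also print the JDK's home directory, which is the value JAVA_HOME will pick up later:
/usr/libexec/java_home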
Install Apache Spark
Next, we'll install Apache Spark itself using Homebrew.
brew install apache-spark
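To see which version Homebrew installed and where it lives (we'll need the install prefix for SPARK_HOME shortly), run:
brew info apache-spark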
Install Scala
Spark 3.5.x pre-built binaries target Scala 2.12 by default, so install the matching Scala version:
brew install scala@2.12
scala -version
Scala code runner version 2.12.19
Install pyspark
pip3 install pyspark
# If pip refuses because Homebrew's Python is an "externally managed environment" (PEP 668), override the check:
# pip3 install pyspark --break-system-packages
pyspark --version
version 3.5.1
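As a quick smoke test before touching Jupyter, you can open the interactive PySpark shell, which starts with a SparkSession already bound to the name spark:
pyspark
>>> spark.range(3).count()
3
>>> exit()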
Set Up Environment Variables
Depending on your shell (zsh or bash), add the environment variables below to ~/.zshrc or ~/.bashrc:
# Java, Spark, pyspark
export JAVA_HOME=$(/usr/libexec/java_home)
export SPARK_HOME=/opt/homebrew/Cellar/apache-spark/3.5.1/libexec
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
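Note that the SPARK_HOME path above pins the exact Cellar version; if you'd rather not edit it on every upgrade, pointing at Homebrew's stable opt symlink should work as well:
export SPARK_HOME=$(brew --prefix apache-spark)/libexec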
Apply the environment variables in the current shell via source ~/.zshrc or source ~/.bashrc.
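You can verify the variables took effect in the new shell:
echo $JAVA_HOME
echo $SPARK_HOME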
Check Spark Version
Any of the following commands prints the installed Spark version:
spark-submit --version
spark-shell --version
spark-sql --version
version 3.5.1
Test Apache Spark
To test our setup, let's create a simple Spark application using Python. Create a new file called spark_demo_app.py and add the following code:
spark_demo_app.py
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import avg, sum, count, max
# Initialize a Spark session
spark = SparkSession.builder.appName("SparkDemoApp").getOrCreate()
# Create a Dummy Dataframe
df = spark.createDataFrame([
    Row(name='Robert', location='Berlin', salary=5000), Row(name='Peter', location='Frankfurt', salary=6000),
    Row(name='Harry', location='Dresden', salary=4000), Row(name='Sunny', location='Berlin', salary=4800),
    Row(name='Sam', location='Dresden', salary=3200), Row(name='Roger', location='Berlin', salary=4700),
], schema='name string, location string, salary integer')

df.groupBy('location').agg(
    avg('salary').alias('avg_sal'),
    sum('salary').alias('sum_sal'),
    max('salary').alias('max_sal'),
    count('salary').alias('people_count'),
).show()
# Stop the Spark session
spark.stop()
Run this application using:
spark-submit spark_demo_app.py
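Amid Spark's log output, you should see an aggregation result like the one below (row order may differ):
+---------+------------------+-------+-------+------------+
| location|           avg_sal|sum_sal|max_sal|people_count|
+---------+------------------+-------+-------+------------+
|   Berlin| 4833.333333333333|  14500|   5000|           3|
|Frankfurt|            6000.0|   6000|   6000|           1|
|  Dresden|            3600.0|   7200|   4000|           2|
+---------+------------------+-------+-------+------------+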
Install Jupyter Lab
Finally, let's install Jupyter Lab to create and run notebooks:
brew install jupyterlab
Test Jupyter Lab
Create a new directory called ~/spark_notebooks and navigate into it. Then, start Jupyter Lab using:
mkdir ~/spark_notebooks
cd ~/spark_notebooks
# jupyter lab
jupyter lab --notebook-dir=~/spark_notebooks --preferred-dir ~/spark_notebooks
Access the server at http://localhost:8888/lab to create and run notebooks.
Add a new notebook to test the environment.
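For a first cell, a minimal check along these lines should do (the app name is arbitrary):
from pyspark.sql import SparkSession

# Build (or reuse) a local SparkSession from inside the notebook
spark = SparkSession.builder.appName("JupyterSparkTest").getOrCreate()

print(spark.version)   # should print the installed version, e.g. 3.5.1
spark.range(5).show()  # run a tiny job to confirm the session works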
While a Spark application is running, the Spark UI is accessible at http://127.0.0.1:4040/
In this article, we've walked you through the process of setting up Apache Spark with Jupyter Notebook on macOS. With these steps, you should now have a fully functional Spark environment ready for use. Happy coding!