By Nick Cotes
Let’s learn how to install Apache Spark on a Linux-based Ubuntu server; the same steps can be used on CentOS, Debian, etc. In real-world deployments, Spark applications typically run on a Linux-based OS, so it is good to know how to install and run Spark applications on a Unix-like OS such as Ubuntu server.
Though this article uses Ubuntu, you can follow these steps to install Spark on any Linux-based OS such as CentOS or Debian. I followed the steps below to set up my Apache Spark cluster on an Ubuntu server.
If you just want to run Spark in standalone mode, proceed with this article.
Java Installation On Ubuntu
Apache Spark is written in Scala, a language that runs on the Java Virtual Machine (JVM), so you need Java installed to run Spark. Since Oracle Java requires a license, I am using OpenJDK here. If you want to use Java from Oracle or another vendor, feel free to do so. Here I will be using JDK 8.
sudo apt-get -y install openjdk-8-jdk-headless
After installing the JDK, check that it installed successfully by running java -version
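If you want to confirm from a script that the installed JDK really is version 8, the check can be sketched as below. This is a minimal sketch, not part of the original steps: the helper name java_major_version is my own, and it assumes the usual version "x.y.z" format in the first line of the java -version output.

```shell
# Extract the major Java version from `java -version` output.
# Handles both the legacy "1.8.0_292" scheme and the newer "11.0.2" scheme.
# java_major_version is a hypothetical helper name used for illustration.
java_major_version() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;  # 1.8.0_292 -> 8
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;  # 11.0.2    -> 11
  esac
}

# java -version prints to stderr, so redirect it to stdout before parsing.
if command -v java >/dev/null 2>&1; then
  java_major_version "$(java -version 2>&1)"
fi
```

With OpenJDK 8 installed as above, this should print the major version on a line by itself.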
Python Installation On Ubuntu
You can skip this section if you only want to run Spark with Scala or Java on the Ubuntu server.
Python is needed only if you want to run PySpark examples (Spark with Python) on the Ubuntu server.
sudo apt install python3
Apache Spark Installation on Ubuntu
To install Apache Spark on Ubuntu, go to the Apache Spark download site, find the Download Apache Spark section, and click the link in point 3. This takes you to a page with mirror URLs; copy the link from one of the mirror sites.
If you want to use a different version of Spark and Hadoop, select the ones you want from the drop-downs (points 1 and 2); the link in point 3 changes to match the selected version and gives you an updated download link.
Use the wget command to download Apache Spark to your Ubuntu server.
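As a rough sketch, the download step might look like the following. The Spark version, the Hadoop profile, and the use of the Apache archive URL are all assumptions made for illustration; use the link you copied from the download page instead. The function name download_spark is also my own.

```shell
# Assumed versions; substitute the ones you selected on the download page.
SPARK_VERSION="3.0.1"
HADOOP_PROFILE="hadoop2.7"

# Build the package name and download URL from the selected versions.
SPARK_PACKAGE="spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz"

# Download the archive and unpack it into the current directory.
# Defined as a function so nothing is downloaded until you call it.
download_spark() {
  wget "$SPARK_URL" && tar -xzf "${SPARK_PACKAGE}.tgz"
}

# After extraction, point SPARK_HOME at the unpacked directory, for example:
# export SPARK_HOME="$PWD/$SPARK_PACKAGE"
```

Run download_spark once you are happy with the URL, then set SPARK_HOME so the later commands can find the installation.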
The Apache Spark binary distribution comes with an interactive spark-shell. To start a shell for the Scala language, go to your $SPARK_HOME/bin directory and run spark-shell. This command loads Spark and displays the version of Spark you are using.
Note: spark-shell only runs Spark with Scala. To run PySpark, open the pyspark shell by running $SPARK_HOME/bin/pyspark. Make sure you have Python installed before running the pyspark shell.
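Before launching either shell, it can help to confirm that SPARK_HOME points at a real installation. The helper below is only a sketch; check_spark_shells is a name I made up, and it simply looks for the two launcher scripts mentioned above.

```shell
# Report which interactive shells exist under a Spark install directory.
# $1: Spark home directory (defaults to $SPARK_HOME).
# check_spark_shells is a hypothetical helper name used for illustration.
check_spark_shells() {
  home="${1:-$SPARK_HOME}"
  for launcher in spark-shell pyspark; do
    if [ -x "$home/bin/$launcher" ]; then
      echo "$launcher: found"
    else
      echo "$launcher: missing"
    fi
  done
}
```

If both launchers are reported as found, start the Scala shell with "$SPARK_HOME/bin/spark-shell" or the Python shell with "$SPARK_HOME/bin/pyspark".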
By default, spark-shell provides the spark (SparkSession) and sc (SparkContext) objects for you to use. Let’s see some examples.
Apache Spark provides a suite of Web UIs (Jobs, Stages, Tasks, Storage, Environment, Executors, and SQL) to monitor the status of your Spark application, the resource consumption of the Spark cluster, and the Spark configuration. On the Spark Web UI, you can see how Spark actions and transformations are executed. You can access it by opening http://ip-address:4040/, replacing ip-address with your server’s IP address.
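On a headless server you may want to check that the Web UI is reachable without opening a browser. The sketch below assumes curl is installed and that a Spark application (for example, a running spark-shell) is active, since the UI on port 4040 only exists while an application is running. Both function names are hypothetical.

```shell
# Build the Spark Web UI URL for a host; the port defaults to 4040.
# $1: server IP or hostname, $2: optional port
spark_ui_url() {
  printf 'http://%s:%s/\n' "$1" "${2:-4040}"
}

# Print the HTTP status code returned by the UI.
# Needs curl and a running Spark application; takes the same arguments as spark_ui_url.
check_spark_ui() {
  curl -s -o /dev/null -w '%{http_code}\n' "$(spark_ui_url "$@")"
}
```

For example, check_spark_ui ip-address (with your server IP in place of ip-address) prints the status code the UI answered with.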
Create a $SPARK_HOME/conf/spark-defaults.conf file and add the configurations below.
# Enable storing the event log
# Location where the event log is stored
# Location from which the history server reads the event log
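For reference, here is a sketch of how these three settings could be written to the file from the shell. The property names spark.eventLog.enabled, spark.eventLog.dir, and spark.history.fs.logDirectory are standard Spark settings, but the file:///tmp/spark-events location is my assumption (chosen to match Spark's default), and write_history_conf is a hypothetical helper name.

```shell
# Assumed location for event logs; adjust to your environment.
EVENT_LOG_DIR="file:///tmp/spark-events"

# Append the event log settings to a spark-defaults.conf file.
# $1: path to the conf file, e.g. "$SPARK_HOME/conf/spark-defaults.conf"
write_history_conf() {
  cat >> "$1" <<EOF
# Turn on event logging
spark.eventLog.enabled true
# Where applications write their event logs
spark.eventLog.dir ${EVENT_LOG_DIR}
# Where the history server looks for event logs
spark.history.fs.logDirectory ${EVENT_LOG_DIR}
EOF
}
```

Calling write_history_conf "$SPARK_HOME/conf/spark-defaults.conf" appends the settings; both directory properties point at the same location so the history server reads what the applications write.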
Create the Spark event log directory. Spark keeps logs for all the applications you submit.
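A minimal sketch of this step follows. The /tmp/spark-events path is an assumption; it has to match whatever event log location you configured in spark-defaults.conf. The start_history_server wrapper is a hypothetical name around Spark's sbin/start-history-server.sh script.

```shell
# Assumed local directory; it must match the path in spark-defaults.conf.
EVENT_LOG_PATH="/tmp/spark-events"

# Create the directory (no error if it already exists).
mkdir -p "$EVENT_LOG_PATH"

# Start the history server so the stored logs can be browsed.
# Defined as a function; call it once SPARK_HOME is set.
start_history_server() {
  "$SPARK_HOME/sbin/start-history-server.sh"
}
```

Once the history server is running, its UI is served on port 18080 by default.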