Advent of 2021, Day 4 – Spark Architecture – Local and cluster mode

Before diving into IDEs, let’s look at the architecture options available in Apache Spark.

Architecture

The best way to write Spark code depends on the language flavour. As mentioned, Spark runs on Windows as well as on macOS and Linux (both UNIX-like systems), and you will need Java installed to run the clusters. Spark runs on Java 8/11, Scala 2.12, Python 2.7+/3.4+ and R 3.1+ (the exact versions depend on the Spark release). The language flavour can also determine which IDE you will use.

Spark ships with several sample scripts, and you can run them from the CLI, for example with the following commands for R or Python:

R:

sparkR --master local[2]
spark-submit examples/src/main/r/dataframe.R

And for Python:

pyspark --master local[2]
spark-submit examples/src/main/python/pi.py 10

Each time, however, Spark has to be initialised and started before the commands can run.

Running in local mode and running in a cluster

Local mode

Local mode is available immediately upon starting the shell. Simply run sc and you will get the context information.

In addition, in spark-shell (Scala) you can also run:

sc.isLocal
sc.master

The first returns whether the context runs locally, and the second returns the master URL in use, which tells you the execution mode.

Local mode is the default mode and does not require any resource manager. When you start the spark-shell command, it is already up and running. Local mode is also good for testing and quick-setup scenarios; with local[*], the default number of partitions equals the number of CPU cores on the local machine. You can start in local mode with any of the following commands:

spark-shell
spark-shell --master local
spark-shell --master local[*]
spark-shell --master local[3]

By default, spark-shell executes in local mode. With the --master argument you can pass local[N] to choose how many threads the Spark application runs with; remember, Spark is optimised for parallel computation. With plain local, Spark runs with a single thread; local[N] runs N threads, and local[*] uses as many threads as there are logical cores on the machine.
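To make the thread counts concrete, here is a minimal sketch in plain Python (no Spark required) that maps a local master string to the number of worker threads it implies. The function name local_threads is hypothetical, and the logic only mirrors the documented meaning of these strings; it is not Spark's own parser.

```python
import os
import re


def local_threads(master: str) -> int:
    """Return the number of worker threads implied by a local master URL.

    Illustrative helper (hypothetical name), not part of Spark.
    """
    if master == "local":
        # Plain "local" means a single worker thread.
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master URL: {master!r}")
    value = match.group(1)
    if value == "*":
        # local[*] uses as many threads as there are logical cores.
        return os.cpu_count() or 1
    return int(value)


# Print the thread count for each of the forms discussed above.
for m in ("local", "local[3]", "local[*]"):
    print(m, "->", local_threads(m))
```

The result for local[*] depends on the machine the sketch runs on.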

Cluster Mode

When it comes to cluster mode, consider the following components: the driver program, the cluster manager, and the executors running on the worker nodes.

Running Spark in cluster mode means running a Spark application as a set of processes on a cluster, coordinated by the driver. The driver program uses the SparkContext object to connect to one of several types of cluster managers. These managers can be:
– Standalone cluster manager (Spark’s own manager that is deployed on private cluster)
– Apache Mesos cluster manager
– Hadoop YARN cluster manager
– Kubernetes cluster manager.
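Which of these managers the driver connects to is selected by the form of the master URL. As a rough sketch in plain Python (the function name cluster_manager is hypothetical, and the host names and ports in the usage loop are placeholders), the documented URL forms can be told apart like this:

```python
def cluster_manager(master: str) -> str:
    """Name the cluster manager implied by a Spark master URL.

    Illustrative helper (hypothetical name), not part of Spark.
    """
    if master == "local" or master.startswith("local["):
        return "local (no cluster manager)"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("mesos://"):
        return "mesos"
    if master == "yarn":
        return "yarn"
    if master.startswith("k8s://"):
        return "kubernetes"
    raise ValueError(f"unrecognised master URL: {master!r}")


# Placeholder hosts and ports, for illustration only.
for url in ("spark://host:7077", "mesos://host:5050", "yarn", "k8s://https://host:443"):
    print(url, "->", cluster_manager(url))
```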

The cluster manager is responsible for allocating resources across Spark applications. This architecture has several advantages. Each application is isolated from the others, because each gets its own executor processes. Each driver schedules its own tasks, and tasks from different applications run in different JVMs. The downside is that data cannot be shared across different Spark applications without being written (for example, as an RDD) to a storage system outside the application.

To run Spark in cluster mode, we need the spark-submit command rather than the spark-shell command. The general form is:

spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

The commonly used options are:
--class: The entry point for your application
--master: The master URL for the cluster
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
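The options above can also be assembled programmatically. The following is a minimal sketch in plain Python, assuming spark-submit is on the PATH when the command is eventually run; the helper name build_spark_submit and its parameters are hypothetical, and it only covers the flags listed here.

```python
import shlex
from typing import Dict, List, Optional, Sequence


def build_spark_submit(
    application: str,
    master: str,
    deploy_mode: str = "client",
    main_class: Optional[str] = None,
    conf: Optional[Dict[str, str]] = None,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    cmd = ["spark-submit"]
    if main_class is not None:
        # --class is only needed for JVM applications.
        cmd += ["--class", main_class]
    cmd += ["--master", master, "--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        cmd += ["--conf", f"{key}={value}"]
    # The application file comes last, followed by its own arguments.
    cmd.append(application)
    cmd.extend(app_args)
    return cmd


# Same example script as earlier in the post, run locally with two threads.
cmd = build_spark_submit(
    "examples/src/main/python/pi.py",
    master="local[2]",
    app_args=["10"],
)
print(shlex.join(cmd))
```

To actually launch the job, the list can be passed to subprocess.run(cmd, check=True). Building a list rather than a single string avoids shell quoting issues with values such as local[2].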

Running spark-submit --help will give you additional information.

Tomorrow we will look into spark-submit and cluster installation.

The complete set of code, documents, notebooks, and all of the materials is available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers

Happy Spark Advent of 2021! 🙂
