Advent of 2021, Day 8 – Creating RDD files


Spark is built around the concept of resilient distributed datasets (RDD). An RDD is a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created in two ways:
– parallelising an existing data collection in the driver program
– referencing a dataset in external storage (HDFS, blob storage, a shared filesystem, any Hadoop InputFormat, …)

Put simply, a Spark RDD supports two kinds of operations:
– transformations – create a new RDD dataset on top of an already existing one
– actions – run a computation on the dataset and return a value to the driver program

Map and reduce are the classic pair: map is a transformation that applies a function to each element of the dataset and returns a new RDD holding the results, while reduce is an action that aggregates all the elements of the RDD using a function and returns the final result to the driver program.
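
As a quick, minimal sketch of how this looks in PySpark (the session, app name and numbers below are just illustrative, not part of the day 7 puzzle):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("MapReduceSketch").getOrCreate()

numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# map is a transformation: nothing is computed yet, it only describes a new RDD
doubled = numbers.map(lambda x: x * 2)

# reduce is an action: it triggers the computation and returns a value to the driver
total = doubled.reduce(lambda a, b: a + b)
print(total)   # 30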

By default, each transformed RDD may be recomputed every time you run an action on it. However, you can also persist an RDD in memory, which makes access to the data much faster on subsequent actions, because it is cached.
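
A minimal sketch of caching, continuing with the doubled RDD from the snippet above (the storage level in the comment is just one option):

from pyspark import StorageLevel

# cache() keeps the RDD in memory once the first action has computed it
doubled.cache()
doubled.count()                              # first action: computes and caches the partitions
print(doubled.reduce(lambda a, b: a + b))    # subsequent actions reuse the cached data

# persist() lets you choose a storage level explicitly instead of the default, e.g.:
# doubled.persist(StorageLevel.MEMORY_AND_DISK)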

(Image: Spark RDD (Low Level API) Basics using Pyspark, by Sercan Karagoz, Analytics Vidhya, Medium)

Using Python, we can start the cluster and begin tinkering with a simple file (using my Advent of Code day 7 puzzle input data, because 🙂 ):

from pyspark.sql import SparkSession

# create (or reuse) a local Spark session with a single core
spark: SparkSession = (
    SparkSession.builder
    .master("local[1]")
    .appName("UsingAoCData")
    .getOrCreate()
)

And we prepare the data for parallelisation into an RDD:

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
rdd = spark.sparkContext.parallelize(data)
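
To check that the RDD really holds the data, a few basic actions can be run on it (a small sketch using the rdd created above):

print(rdd.count())     # 12
print(rdd.first())     # 1
print(rdd.collect())   # brings the whole dataset back to the driver as a Python list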

RDDs can be created from data on many different platforms (HDFS, local files, …) and will still have the same characteristics.

Furthermore, you can also create an empty RDD and populate it later, control how an RDD is partitioned, read in whole text files, and use many other options.
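
A few of those options in a short sketch, reusing the spark session from above (the file path is a made-up placeholder, adjust it to your own data):

# an empty RDD that can be populated (e.g. via union with other RDDs) later
empty_rdd = spark.sparkContext.emptyRDD()

# parallelize with an explicit number of partitions
partitioned_rdd = spark.sparkContext.parallelize(range(100), numSlices=4)

# wholeTextFiles() reads a directory of files into (filename, content) pairs
files_rdd = spark.sparkContext.wholeTextFiles("data/aoc/")   # hypothetical directory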

When you use the parallelize(), textFile() or wholeTextFiles() methods to load data into an RDD, the data is automatically split into partitions (within the limits of the available resources). The number of partitions – as we have already discussed – is by default based on the number of cores available in the system.
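
You can inspect this with getNumPartitions(), a quick sketch on the rdd and data from above:

print(rdd.getNumPartitions())       # with master("local[1]") this defaults to 1

# the number of partitions can also be set explicitly when creating the RDD
rdd4 = spark.sparkContext.parallelize(data, numSlices=4)
print(rdd4.getNumPartitions())      # 4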

Tomorrow we will look into RDD operations (transformations and actions) 🙂

Complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers

Happy Spark Advent of 2021! 🙂
