Advent of 2021, Day 12 – Spark SQL

Posted on December 12, 2021 by tomaztsql — 14 Comments

Series of Apache Spark posts:

Dec 01: What is Apache Spark
Dec 02: Installing Apache Spark
Dec 03: Getting around CLI and WEB UI in Apache Spark
Dec 04: Spark Architecture – Local and cluster mode
Dec 05: Setting up Spark Cluster
Dec 06: Setting up IDE
Dec 07: Starting Spark with R and Python
Dec 08: Creating RDD files
Dec 09: RDD Operations
Dec 10: Working with data frames
Dec 11: Working with packages and spark DataFrames

Spark SQL is one of the Spark modules for processing and analysing structured data. Spark offers both SQL itself and an API for executing SQL queries. Spark SQL can read data from an existing Hive instance, as well as from datasets and dataframes. Whatever the entry point, a query handed from Spark SQL to the execution engine always returns a dataset or dataframe.

These formats are interchangeable, so you can run SQL against a result produced by a different API, and vice versa. Plugging in the Java JDBC or standard ODBC drivers also gives your SQL interface access to different sources. This unification means that developers can easily switch back and forth between different APIs, depending on which provides the most natural way to express a given transformation.
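One way to read the JDBC remark above is Spark's built-in "jdbc" data source, which loads a database table as a dataframe. The following is only a minimal sketch under that assumption: the helper names jdbc_options and read_jdbc_table are hypothetical, spark is assumed to be an existing SparkSession (as in the pyspark shell), and the connection details in the comments are placeholders.

```python
from typing import Any, Dict


def jdbc_options(url: str, table: str, user: str, password: str) -> Dict[str, str]:
    """Collect the options understood by Spark's "jdbc" data source."""
    return {
        "url": url,          # e.g. "jdbc:postgresql://localhost:5432/demo" (placeholder)
        "dbtable": table,    # e.g. "public.users" (placeholder)
        "user": user,
        "password": password,
    }


def read_jdbc_table(spark: Any, options: Dict[str, str]) -> Any:
    """Load a JDBC table as a DataFrame; `spark` is an existing SparkSession."""
    # The matching JDBC driver jar has to be on Spark's classpath for load() to succeed.
    return spark.read.format("jdbc").options(**options).load()
```

In the pyspark shell this could be called as read_jdbc_table(spark, jdbc_options(url, table, user, password)). The returned dataframe can then be registered as a temporary view and queried with SQL like any other.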

With this API unification, users can access Spark SQL from the Scala spark-shell, from Python with pyspark, or from R with the sparkR shell.
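The switching between APIs described above can be sketched in PySpark roughly as follows. The function name named_favorite_colors and the view name users are made up for illustration, and spark is assumed to be the SparkSession that the pyspark shell already provides.

```python
from typing import Any


def named_favorite_colors(spark: Any, parquet_path: str) -> Any:
    """Mix the DataFrame API and SQL on the same data."""
    # 1. DataFrame API: load the Parquet file.
    users = spark.read.parquet(parquet_path)
    # 2. Register the DataFrame under a name that SQL can see.
    users.createOrReplaceTempView("users")
    # 3. SQL: the result of the query is again a DataFrame.
    with_color = spark.sql(
        "SELECT name, favorite_color FROM users WHERE favorite_color IS NOT NULL"
    )
    # 4. DataFrame API again: keep transforming the SQL result.
    return with_color.orderBy("name")
```

In the pyspark shell, named_favorite_colors(spark, "examples/src/main/resources/users.parquet").show() would print the rows that have a favorite colour. Each step could equally be written in the other API; the point is only that the hand-over between SQL and dataframes needs no conversion.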

Loading data – comparison SQL, R, Python

As in the previous blog post, we will use the users.parquet file to compare how data is imported and loaded in each language. The raw content of the file looks like this:

PAR1"&,@AlyssaBen,
0red88,
@	\Hexample.avro.User%name%%favorite_color%5favorite_numbers%array<&%nameDH&&P5favorite_color<@&P&ê%(favorite_numbersarray
ZZ&ê⁄avro.schema⁄{"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string"},{"name":"favorite_color","type":["string","null"]},{"name":"favorite_numbers","type":{"type":"array","items":"int"}}]}parquet-mr version 1.4.3ÍPAR1

For Python:

df = spark.read.parquet("examples/src/main/resources/users.parquet")
(df.write.format("parquet")
    .option("parquet.bloom.filter.enabled#favorite_color", "true")
    .option("parquet.bloom.filter.expected.ndv#favorite_color", "1000000")
    .option("parquet.enable.dictionary", "true")
    .option("parquet.page.write-checksum.enabled", "false")
    .save("users_with_options.parquet"))

For R:

df <- read.df("examples/src/main/resources/users.parquet", "parquet")
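# Option names containing '#' or '-' must be quoted with backticks in R;
# values are passed as strings so that R does not reformat them (e.g. 1e+06).
write.parquet(df, "users_with_options.parquet",
              `parquet.bloom.filter.enabled#favorite_color` = "true",
              `parquet.bloom.filter.expected.ndv#favorite_color` = "1000000",
              `parquet.enable.dictionary` = "true",
              `parquet.page.write-checksum.enabled` = "false")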

And using SQL:

CREATE TABLE users_with_options (
  name STRING,
  favorite_color STRING,
  favorite_numbers array<integer>
) USING parquet
OPTIONS (
  `parquet.bloom.filter.enabled#favorite_color` true,
  `parquet.bloom.filter.expected.ndv#favorite_color` 1000000,
  parquet.enable.dictionary true,
  `parquet.page.write-checksum.enabled` false
)

The same file can also be queried directly in its Parquet format with SQL, without first persisting the content as a table.

For Python:

df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")

And for R:

df <- sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
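To make the parquet.`path` pattern reusable, the query text can be built by a small helper. This is a sketch rather than an established API: parquet_file_query and query_parquet_file are hypothetical names, and spark is again assumed to be an existing SparkSession.

```python
from typing import Any, Sequence


def parquet_file_query(path: str, columns: Sequence[str] = ("*",)) -> str:
    """Build a SELECT that reads a Parquet file in place via parquet.`path`."""
    if "`" in path:
        # A backtick would end the quoted path early and break the query.
        raise ValueError("path must not contain a backtick")
    return f"SELECT {', '.join(columns)} FROM parquet.`{path}`"


def query_parquet_file(spark: Any, path: str, columns: Sequence[str] = ("*",)) -> Any:
    """Run the query; no table is created and nothing is persisted."""
    return spark.sql(parquet_file_query(path, columns))
```

With the default columns, parquet_file_query("examples/src/main/resources/users.parquet") produces exactly the query used in the Python and R examples above. Passing columns=["name", "favorite_color"] narrows the selection without loading the file into a table first.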

Tomorrow we will look further into SQL bucketing and partitioning.

The complete set of code, documents, notebooks, and all other materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers

Happy Spark Advent of 2021! 🙂

Tagged with: cluster, Hive, Parquet, Python, R, Scala, Spark, SQL, SQL Server
Posted in Spark, Uncategorized
