Advent of 2021, Day 15 – Introduction to Spark Streaming

This post is part of a series of Apache Spark posts.

Spark Streaming, or rather Structured Streaming, is a scalable, fault-tolerant, end-to-end stream processing engine built on the Spark SQL engine. The Spark SQL engine is responsible for incrementally running the query and updating the result set as new data keeps arriving, handling streaming data in the same way it handles static data.

Structured Streaming exposes the DataFrame (and Dataset) API in Scala, Java, Python and R for handling data ingestion, building streaming analytics and running all the computations. All of these workloads are executed by the Spark SQL engine.

With Spark 2.3, the Spark SQL engine gained a low-latency processing mode for structured streaming queries, called continuous processing. This mode can achieve end-to-end latencies as low as 1 millisecond (with at-least-once guarantees) for query operations on a DataFrame/Dataset.
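As a quick, hedged sketch (not part of the walkthrough below): in Python you opt into continuous processing through the trigger. The mode currently supports only a subset of sources, sinks and map-like operations (no aggregations, and not the socket source used later in this post), so the example uses the built-in rate source; the checkpoint path is an arbitrary placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ContinuousDemo").getOrCreate()

# The rate source generates rows continuously; aggregations are not
# supported in continuous mode, so we only project the generated columns
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
    .format("console")
    .trigger(continuous="1 second")  # "1 second" is the checkpoint interval, not a batch size
    .option("checkpointLocation", "/tmp/continuous-demo")  # placeholder path, an assumption
    .start())

query.awaitTermination()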

Quick setup using R

Assuming you have completed the installation, we start by bringing up the master (and a worker).
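With a standalone cluster, these are typically started with the scripts shipped in Spark's sbin directory (a sketch assuming a default standalone setup; the actual master URL is printed in the master's log, and in releases before Spark 3.1 the worker script is named start-slave.sh):

./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077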

Before starting, we need a Netcat (nc) server listening on localhost. Netcat is a command-line utility that reads and writes data across network connections using the TCP or UDP protocols; here it will generate and mimic the streaming data. To run the Netcat server on localhost, port 9999, run the following CLI command:

nc -lk 9999
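Anything you now type into this terminal is sent, line by line, to whatever connects to port 9999 and will serve as the streaming input. For example (any text works; these two lines are purely illustrative):

apache spark
spark streaming with spark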

Using R, we connect to the master and create a session:

library(SparkR)
sparkR.session(appName = "StructuredStreamApp")

Next, we define a DataFrame that represents the incoming streaming data:

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines <- read.stream("socket", host = "localhost", port = 9999)

# Split the lines into words
words <- selectExpr(lines, "explode(split(value, ' ')) as word")

# Generate running word count
wordCounts <- count(group_by(words, "word"))

Copy the full script into an R file (name it Stream-word-count.R):

library(SparkR)
sparkR.session(appName = "StructuredStreamApp")

# Read host and port from the command-line arguments passed to spark-submit
args <- commandArgs(trailingOnly = TRUE)
hostname <- args[[1]]
port <- as.integer(args[[2]])

# Create a streaming DataFrame of input lines from the socket connection
lines <- read.stream("socket", host = hostname, port = port)

# Split the lines into words
words <- selectExpr(lines, "explode(split(value, ' ')) as word")

# Generate a running word count
wordCounts <- count(groupBy(words, "word"))

# Write the complete counts table to the console and block until the query terminates
query <- write.stream(wordCounts, "console", outputMode = "complete")
awaitTermination(query)
sparkR.session.stop()

The "complete" output mode means the entire updated word-count table is written to the console on every trigger. Run the script from the CLI using spark-submit, passing the host and port of the Netcat server that you have already started:

./bin/spark-submit /Rsample/Stream-word-count.R localhost 9999

Quick setup using Python

Similar to R, you can do this with Python (or Scala) as well. This assumes that you already have the Spark SQL engine installed and that nc is up and running on localhost, port 9999.

Create a Python file (Stream-word-count.py) with the following content:

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: Stream-word-count.py <hostname> <port>", file=sys.stderr)
        sys.exit(-1)

    host = sys.argv[1]
    port = int(sys.argv[2])

    spark = SparkSession\
        .builder\
        .appName("StructuredStreamApp")\
        .getOrCreate()

    # Create a streaming DataFrame of input lines from the socket connection
    lines = spark\
        .readStream\
        .format('socket')\
        .option('host', host)\
        .option('port', port)\
        .load()

    # Split the lines into words
    words = lines.select(
        # explode turns each item in an array into a separate row
        explode(
            split(lines.value, ' ')
        ).alias('word')
    )

    # Generate a running word count
    wordCounts = words.groupBy('word').count()

    # Write the complete counts table to the console
    # and block until the query terminates
    query = wordCounts\
        .writeStream\
        .outputMode('complete')\
        .format('console')\
        .start()

    query.awaitTermination()

And run the file from the CLI:

./bin/spark-submit /Pysample/Stream-word-count.py localhost 9999

In both cases, you get the running word counts back as a dataset/DataFrame that is ready to be analysed.
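For illustration only (the batch numbers and counts depend entirely on what you type into the Netcat terminal), the console sink output looks roughly like this:

-------------------------------------------
Batch: 1
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|    spark|    3|
|   apache|    1|
|streaming|    1|
+---------+-----+

To stop a running query gracefully, call query.stop() in Python or stopQuery(query) in SparkR before stopping the session.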

Tomorrow we will look into DataFrame operations for Spark Streaming.

The complete set of code, documents, notebooks and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers

Happy Spark Advent of 2021! 🙂

