Advent of 2021, Day 15 – Introduction to Spark Streaming

This post is part of a series of Apache Spark posts.

Spark Streaming, or rather Structured Streaming, is a scalable, fault-tolerant, end-to-end stream processing engine built on the Spark SQL engine. The Spark SQL engine is responsible for incrementally running the query and updating the result set as new data keeps arriving, handling streaming data in the same way it handles static data.

Structured Streaming exposes the DataFrame (and Dataset) API in Scala, Java, Python and R for handling data ingestion, building streaming analytics and running all the computations. All of these workloads are executed by the Spark SQL engine.

With Spark 2.3, the Spark SQL engine gained a low-latency processing mode for structured streaming queries, called continuous processing. This mode can achieve end-to-end latencies as low as 1 millisecond (with at-least-once guarantees) for query operations on a DataFrame/Dataset.
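As a quick, hedged sketch (not part of the walkthrough below): in Python you opt into continuous processing through the trigger. The mode currently supports only a subset of sources, sinks and map-like operations (no aggregations, and not the socket source used later in this post), so the example uses the built-in rate source; the checkpoint path is an arbitrary placeholder.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ContinuousDemo").getOrCreate()

# The rate source generates rows continuously; aggregations are not
# supported in continuous mode, so we only project the generated columns
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream.writeStream
    .format("console")
    .trigger(continuous="1 second")  # "1 second" is the checkpoint interval, not a batch size
    .option("checkpointLocation", "/tmp/continuous-demo")  # placeholder path, an assumption
    .start())

query.awaitTermination()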

Quick setup using R

Assuming you have completed the installation, we start by bringing up the master (and a worker).
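With a standalone cluster, these are typically started with the scripts shipped in Spark's sbin directory (a sketch assuming a default standalone setup; the actual master URL is printed in the master's log, and in releases before Spark 3.1 the worker script is named start-slave.sh):

./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077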

Before starting, we need a Netcat (nc) server listening on localhost. Netcat is a command-line utility that reads and writes data across network connections using the TCP or UDP protocols; here it will generate and mimic the streaming data. To run the Netcat server on localhost, port 9999, run the following CLI command:

nc -lk 9999
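Anything you now type into this terminal is sent, line by line, to whatever connects to port 9999 and will serve as the streaming input. For example (any text works; these two lines are purely illustrative):

apache spark
spark streaming with spark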

Using R, we connect to the master and create a session:

library(SparkR)
sparkR.session(appName = "StructuredStreamApp")

Next, we define a DataFrame that represents the incoming streaming data:

# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines <- read.stream("socket", host = "localhost", port = 9999)

# Split the lines into words
words <- selectExpr(lines, "explode(split(value, ' ')) as word")

# Generate running word count
wordCounts <- count(group_by(words, "word"))

Copy the full script into an R file (name it Stream-word-count.R):

library(SparkR)
sparkR.session(appName = "StructuredStreamApp")

# Read host and port from the command-line arguments passed to spark-submit
args <- commandArgs(trailingOnly = TRUE)
hostname <- args[[1]]
port <- as.integer(args[[2]])

# Create a streaming DataFrame of input lines from the socket connection
lines <- read.stream("socket", host = hostname, port = port)

# Split the lines into words
words <- selectExpr(lines, "explode(split(value, ' ')) as word")

# Generate a running word count
wordCounts <- count(groupBy(words, "word"))

# Write the complete counts table to the console and block until the query terminates
query <- write.stream(wordCounts, "console", outputMode = "complete")
awaitTermination(query)
sparkR.session.stop()

The "complete" output mode means the entire updated word-count table is written to the console on every trigger. Run the script from the CLI using spark-submit, passing the host and port of the Netcat server that you have already started:

./bin/spark-submit /Rsample/Stream-word-count.R localhost 9999

Quick setup using Python

Similar to R, you can do this with Python (or Scala) as well. This assumes that you already have the Spark SQL engine installed and that nc is up and running on localhost, port 9999.

Create a Python file (Stream-word-count.py) with the following content:

import sys

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: Stream-word-count.py <hostname> <port>", file=sys.stderr)
        sys.exit(-1)

    host = sys.argv[1]
    port = int(sys.argv[2])

    spark = SparkSession\
        .builder\
        .appName("StructuredStreamApp")\
        .getOrCreate()

    # Create a streaming DataFrame of input lines from the socket connection
    lines = spark\
        .readStream\
        .format('socket')\
        .option('host', host)\
        .option('port', port)\
        .load()

    # Split the lines into words
    words = lines.select(
        # explode turns each item in an array into a separate row
        explode(
            split(lines.value, ' ')
        ).alias('word')
    )

    # Generate a running word count
    wordCounts = words.groupBy('word').count()

    # Write the complete counts table to the console
    # and block until the query terminates
    query = wordCounts\
        .writeStream\
        .outputMode('complete')\
        .format('console')\
        .start()

    query.awaitTermination()

And run the file from the CLI:

./bin/spark-submit /Pysample/Stream-word-count.py localhost 9999

In both cases, you get the running word counts back as a dataset/DataFrame that is ready to be analysed.
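For illustration only (the batch numbers and counts depend entirely on what you type into the Netcat terminal), the console sink output looks roughly like this:

-------------------------------------------
Batch: 1
-------------------------------------------
+---------+-----+
|     word|count|
+---------+-----+
|    spark|    3|
|   apache|    1|
|streaming|    1|
+---------+-----+

To stop a running query gracefully, call query.stop() in Python or stopQuery(query) in SparkR before stopping the session.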

Tomorrow we will look into DataFrame operations for Spark Streaming.

The complete set of code, documents, notebooks and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Spark-for-data-engineers

Happy Spark Advent of 2021! 🙂

