Advent of 2020, Day 27 – Connecting Azure Databricks with on-premises environment

Yesterday we connected the Azure Machine Learning service and the Azure Databricks workspace for tracking and monitoring experiments and models in Azure Machine Learning.

Today we will connect an on-premises development environment (R or Python) with resources in Azure Databricks. In other words, we will run code on a client / on-premises machine and push all the workload to the cloud.

1. Connecting with RStudio and SparkR

Launch your Azure Databricks workspace from Azure portal. Once the Azure Databricks is launched, head to Clusters.

Start the cluster you will be using to connect to RStudio and go to the “Apps” tab under Clusters.

Before you click the “Set up RStudio” button, you will need to disable the auto termination option. By default it is enabled, making a cluster terminate itself after a period of inactivity. Under Configuration, select Edit and disable the termination; the cluster will restart. Then click “Set up RStudio”. Beware: since auto termination is now off, you should stop the cluster yourself after finishing work (!).

Click Set up RStudio and you will get the following credentials:

Click “Open RStudio” and you will be redirected to a web portal with RStudio open.

In order to get the Databricks cluster objects into RStudio, you must also run spark_connect:

# Start a SparkR session on the cluster
SparkR::sparkR.session()

# Connect sparklyr to the same Databricks cluster
library(sparklyr)
sc <- spark_connect(method = "databricks")

And you will see all the DataFrames and CSV files from the previous days.

Please note, you can also connect from the RStudio Desktop version (!). In that case, these are the steps:

Open your RStudio Desktop and install the needed packages:

install.packages("devtools")
devtools::install_github("sparklyr/sparklyr")

Install databricks-connect from the CLI (it is a ~250 MB package):

# databricks-connect bundles its own PySpark, so remove any standalone copy
pip uninstall pyspark
pip install -U databricks-connect

Now set up the connection to Azure Databricks. First, get the directory where the Databricks Connect JARs live (useful for pointing IDEs and sparklyr at the right Spark installation):

databricks-connect get-jar-dir

And after that run the configuration command in the CLI:

databricks-connect configure

The CLI will prompt you for each value as text input.

All the information you need to fill into the CLI can be found in the workspace URL and the cluster settings:

Databricks Host: adb-860xxxxxxxxxx95.15.azuredatabricks.net

Cluster ID: 1128-xxxx-yaws18

Organization ID: 860xxxxxxxxxx95

Port: 15001

Token: /////
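
Once the configuration finishes, databricks-connect stores these values in a .databricks-connect file in your home directory. With the masked values from above, the file should look roughly like this (the token stays redacted here):

{
  "host": "https://adb-860xxxxxxxxxx95.15.azuredatabricks.net",
  "token": "/////",
  "cluster_id": "1128-xxxx-yaws18",
  "org_id": "860xxxxxxxxxx95",
  "port": "15001"
}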

2. Connecting with Python

Go to your cluster in Azure Databricks and open its configuration.

The cluster will need these two items added under Advanced Options -> Spark Config. Set these values on the cluster you want to connect to from the on-premises / client machine:

spark.databricks.service.server.enabled true
spark.databricks.service.port 8787

These are key-value pairs, so there must be a space between the key and its value. Once you save this, the cluster will need to restart (click “Confirm & Restart”)!

Creating a virtual environment is up to you; I will create a new environment for the Databricks connection and Python. In your CLI run the following command:

conda create --name databricksconnect python=3.7

And activate the environment:

conda activate databricksconnect

If you are using an existing Python environment, I strongly suggest you first uninstall PySpark. This is due to the fact that the databricks-connect package ships its own version of PySpark (same as in the R section).

Now install the databricks-connect package. You can also pin the version to match your Databricks cluster runtime, e.g. pip install -U databricks-connect==5.2.*.

# uninstall PySpark first if it is still present in this environment
# pip uninstall pyspark
pip install -U databricks-connect

After the installation, you will need to run the configuration again:

databricks-connect configure

Add all the needed settings as explained in the R section, and test your connection by using:

databricks-connect test
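
If the test passes, a quick additional check is a minimal PySpark session run from plain Python; this is a generic sketch, nothing in it is specific to your workspace:

from pyspark.sql import SparkSession

# With databricks-connect configured, this session is transparently
# routed to the remote Azure Databricks cluster
spark = SparkSession.builder.getOrCreate()

# A trivial job executed on the cluster: count the numbers 0..99
print(spark.range(100).count())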

Once you have all the settings enabled, you are ready to use Databricks from your on-premises / client machine.

You have many ways to use it:

  • Anaconda notebooks / Jupyter notebooks
  • PyCharm
  • Atom
  • Visual Studio Code
  • etc.

In Visual Studio Code (lightweight, open-source, multi-platform) you can set up this connection. Make sure you have the Python extension installed, and run the following script:

from pyspark.sql import SparkSession

# The session is routed to the Azure Databricks cluster by databricks-connect
spark = SparkSession.builder.getOrCreate()

# Import one of the sample datasets shipped with every Databricks workspace
data = (spark.read.format("csv")
        .option("header", "true")
        .load("/databricks-datasets/samples/population-vs-price/data_geo.csv"))

# Display the data; display() exists only inside Databricks notebooks,
# so on a client machine use show() instead
data.show()

And once you run this, you should see the results.

Make sure you also choose the correct Python interpreter in Visual Studio Code.

3. Connecting to Azure Databricks with ODBC

You can also connect to Azure Databricks SQL tables using ODBC from your on-premises Excel, Python, or R.

This way you will only see the SQL tables and views, but it can be done. It requires installing an ODBC driver, which I will not go into in detail; a minimal sketch of the Python side follows below.
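
For illustration only: a minimal sketch using pyodbc, assuming the Simba Spark ODBC driver is installed, reusing the masked host / organization / cluster values from above, and querying a hypothetical table default.my_table:

import pyodbc

# Connection string for the Simba Spark ODBC driver; the host, organization
# and cluster IDs are the masked placeholders from above, and
# <personal-access-token> must be replaced with your own Databricks token
conn = pyodbc.connect(
    "Driver=Simba Spark ODBC Driver;"
    "Host=adb-860xxxxxxxxxx95.15.azuredatabricks.net;"
    "Port=443;"
    "HTTPPath=sql/protocolv1/o/860xxxxxxxxxx95/1128-xxxx-yaws18;"
    "SSL=1;"
    "ThriftTransport=2;"
    "AuthMech=3;"
    "UID=token;"
    "PWD=<personal-access-token>",
    autocommit=True,
)

# Query a hypothetical SQL table and print the rows
cursor = conn.cursor()
cursor.execute("SELECT * FROM default.my_table LIMIT 5")
for row in cursor.fetchall():
    print(row)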

Tomorrow we will look into Infrastructure as Code and how to automate, script and deploy Azure Databricks.

The complete set of code and the notebooks are available at the GitHub repository.

Happy Coding and Stay Healthy!
