Eight R Tidyverse tips for everyday data engineering

The Tidyverse is a collection of R packages, primarily for data engineering and analytics: ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats. All of them share the same underlying design philosophy, grammar and data structures.

Collection of Tidyverse resources. Source: Tidyverse

1. Use pipe “%>%” for nesting functions within the pipe

Piping (or chaining) is a great way to link the data manipulation functions without storing intermediate results.

# inner piping inside mutate
airlines %>%
  mutate(name_short = name %>%
           str_to_upper() %>%
           str_replace_all(" (INC|CO)\\.?$", "") %>%
           str_replace_all(" AIR ?(LINES|WAYS)?( CORPORATION)?$", "") %>%
           str_to_title() %>%
           str_replace_all("\\bUs\\b", "US")
         ) %>%
  mutate(FullName_length = nchar(name)) %>%
  select(name_short, FullName_length) %>%
  arrange(desc(FullName_length))

This code is clearly readable, understandable and easy to maintain, despite the heavy use of regex and str_replace_all(). If we replaced the inner piping with ordinary function nesting, the mutate for “name_short” would be fairly unreadable.

# without inner piping
airlines %>%
  mutate(name_short = str_replace_all(
    str_to_title(
      str_replace_all(
        str_replace_all(str_to_upper(name), " (INC|CO)\\.?$", ""),
        " AIR ?(LINES|WAYS)?( CORPORATION)?$", "")),
    "\\bUs\\b", "US")) %>%
  mutate(FullName_length = nchar(name)) %>%
  select(name_short, FullName_length) %>%
  arrange(desc(FullName_length))

Both code snippets return the same result and use essentially the same functions; the first one is simply far easier to read and maintain thanks to the pipes.
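As a side note, since R 4.1 the base pipe |> can drive the outer chain just as well; a minimal sketch (assuming dplyr and the nycflights13 airlines table used above):

# the same outer chain with the base R native pipe (R >= 4.1)
library(dplyr)
library(nycflights13)

airlines |>
  mutate(FullName_length = nchar(name)) |>
  arrange(desc(FullName_length))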

2. Operations across multiple columns at once

The Tidyverse provides a handful of great functions for operating across multiple columns simultaneously. across() makes it easy to apply the same transformation to several columns inside summarise() and mutate().

across() takes two main arguments: a) the set of columns to operate on and b) a function (or list of functions) to apply to the selected columns.

# Simple across with two selected columns
flights %>%
  mutate(across(c(dep_delay, arr_delay), abs))

# Using function where to select columns
flights %>%
  summarise(across(where(is.integer), n_distinct))

# Using where and a purrr-style lambda function
flights %>%
  summarise(across(where(is.integer), ~ sum(.x, na.rm = TRUE)))

# using a list of functions, defining column names and handling NA values
flights %>%
  group_by(carrier) %>%
  summarise(across(ends_with("time"),
                   list(AVG = ~ mean(.x, na.rm = TRUE),
                        SD = ~ sd(.x, na.rm = TRUE),
                        GrandTotal = ~ sum(is.na(.x))),
                   .names = "{.col}.{.fn}")) %>%
  ungroup()

3. Case statement to create a column based on a condition

Creating case statements is a frequent task, and case_when() makes the transformation simple and fast. The new column is simply added on top of other Tidyverse verbs such as group_by() and count().

flights %>%
  group_by(carrier) %>%
  mutate(new_classification = case_when(
    (origin == "EWR") & (dep_delay <= 0) ~ "EWR with negative delay",
    (origin == "EWR") & (dep_delay > 0) ~ "EWR with positive delay",
    (origin == "JFK") ~ "Stats for JFK Airport",
    (origin == "LGA") &  (air_time <= 220) ~ "La Guardia under 6 hours flights",
    TRUE   ~ "La Guardia above 6 hours flights"
  )) %>%
  count(new_classification) %>%
  ungroup()

4. Using transmute

This function does the same job as mutate(), creating new columns, but transmute() keeps only the newly created variable(s) and drops the existing ones, whereas mutate() preserves them.

library(lubridate)

# Combination of mutate and select
flights %>%
 mutate(date = make_date(year, month, day), carrier, tailnum) %>%
 select(date, carrier, tailnum)

# using transmute
flights %>%
 transmute(date = make_date(year, month, day), carrier, tailnum)

Both examples return essentially the same result set, but transmute() is cleaner and easier to read. If you removed the select() from the first example, you would get the complete dataset along with the new column.
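To see that difference in action, here is a quick sketch of the first variant without the select(); mutate() keeps every original column and simply appends date:

# mutate() without select(): the full flights dataset plus the new date column
flights %>%
  mutate(date = make_date(year, month, day)) %>%
  glimpse()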

5. Lumping levels

Lumping together the levels of a factor variable is extremely useful when there are many levels but you are only interested in the most frequent ones.

Given a vector of 15 letters and their frequencies, one can immediately see that roughly 80% of the observations fall into just 5 of the 15 letters.

# factor with 15 levels of varying frequency
x <- factor(rep(LETTERS[1:15], times = c(20,15,23,2,4,3,1,1,1,5,2,8,3,1,1))) 
x %>% table()

To achieve this, one can use fct_lump_n() or fct_lump_min().

x %>% 
fct_lump_n(5) %>% 
table()

# or alternatively
# x %>% fct_lump_min(5) %>% table()

## result:
>   A     B     C     J     L Other 
>  20    15    23     5     8    19

The long tail of low-frequency levels is now grouped into an “Other” bucket, which is great for visualising data.

# using ggplot to show top 5 frequent carriers and "other" group
flights %>%
  mutate(name = fct_lump_n(carrier, 5)) %>%
  count(name) %>%
  mutate(name = fct_reorder(name, n)) %>%
  ggplot(aes(x=name, y=n)) +
  geom_col()

Quick overview graph using fct_lump_n() function

6. Generating all possible combinations

Generating all possible combinations from a set of variables is not an everyday scenario, but the Tidyverse offers the crossing() and expand() functions to do the job.

The following example creates all possible combinations of the variables age, status, values and temperature.

# pseudo sample
crossing(
  age = c(30,40,50,60,70),
  status = c("New", "Used"),
  values = c("0-100EUR", "101-200E", "201-300", "301-400"),
  temperature = c(30,35,34)
)

Working with the flights dataset, let’s create all possible combinations for a couple of selected variables, using the expand() function.

# generating all possible combinations
flights %>% expand(origin, dest, dep_time, carrier)

To find only the unique combinations that actually appear in the data, use crossing() on the selected columns, or expand() with nesting():

# getting all possible combinations that are present in dataset using crossing
flights %>%
  select(origin, dest, dep_time, carrier) %>%
  crossing()
# > 97,946 more rows

# getting all possible combinations using expand and nesting
flights %>% expand(nesting(origin, dest, dep_time,carrier))
# > 97,946 more rows

7. Reshaping data with pivot and spread

Another important task that regularly comes up during data preparation is data reshaping.

The first function one must (!) understand and be comfortable working with is pivot_wider(). It takes the values of one variable (here: origin), spreads them into new columns, and reshapes (pivots) the data (dep_time and arr_time) for a given statistic (here the mean of the values).

#pivot_wider
flights %>%
  group_by(carrier) %>%
  select(origin, dep_time, arr_time) %>%
  pivot_wider(
          names_from = origin, 
          values_from = c(dep_time,arr_time), 
          values_fn = ~mean(.x, na.rm = TRUE),
          names_glue = "{origin}_{.value}"
          )

#check calculation for carrier UA and origin EWR using summarise
flights %>%
  filter(carrier == 'UA' & origin == 'EWR') %>%
  group_by(carrier) %>%
    summarise(
    avg_dep_time = mean(dep_time, na.rm = TRUE)
  )

The counterpart to pivot_wider() is pivot_longer(), which does the exact opposite: the wider function adds columns and reduces the number of rows, while the longer function reduces the number of columns and creates new rows of data.

Let’s persist the dataframe from the pivot_wider example and name it flights_wider.

#create and persist dataframe called: flights_wider
flights_wider <- flights %>%
  group_by(carrier) %>%
  select(origin, dep_time, arr_time) %>%
  pivot_wider(
    names_from = origin, 
    values_from = c(dep_time,arr_time), 
    values_fn = ~mean(.x, na.rm = TRUE),
    names_glue = "{origin}_{.value}"
  )

# excluding carrier
flights_wider %>%
  pivot_longer(
    !carrier,
    names_to = "origin",
    values_to = "time",
    values_drop_na = TRUE
  )

#or with cols parameter and defining the pattern for column selection
flights_wider %>%
  pivot_longer(
    cols = ends_with("time"),
    names_to = "origin",
    values_to = "time",
    values_drop_na = TRUE
  )

8. Adding a running ID to your dataframe

The last tip is the easiest and yet can save tons of time: adding a running ID to each row of the dataframe. The ID gives you uniqueness across all attributes (columns) of a row, which saves a lot of filtering and code writing; you can simply say “give me the row with ID = 42”.

# Associate ID with every row in dataset
flights %>% 
 mutate(running_id = row_number())

# or using the tibble rowid_to_column function
flights2 <- tibble::rowid_to_column(flights, "ID")
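And the promised pay-off: with the running ID in place, “give me the row with ID = 42” becomes a one-liner.

# fetch a single row by its running ID
flights %>%
  mutate(running_id = row_number()) %>%
  filter(running_id == 42)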

This list is by no means exhaustive, and one could add many honourable mentions. It has, however, been curated over a longer period of time, and the cherry-picking is based on day-to-day work with the Tidyverse.

As always, the code is available at my Github repository.

Happy Tidyversing and stay healthy!

p.s.: Article was originally published on Medium.


Little useless-useful R functions – Plotting the decimal and binary conversion

How does the conversion between decimal and binary (or binary and decimal) behave? With another useless function, I have plotted the points (x = decimal number, y = converted binary number) on a scatter plot, just to find out that the graph resembles the binomial distribution function. 🙂

First we create a useless function to convert from decimal to binary base:

dec2bin <- function(dec){
  dec_start <- dec
  str <- ''
  while (dec > 0) {
      if ((dec %% 2)==1){
        str <- paste0(str, '1')
      } else { #((dec %% 2)==0)
        str <- paste0(str, '0')
      }
    dec <- floor(dec/2)
  }
  splits <- strsplit(str, "")[[1]]
  reversed <- rev(splits)
  f_str <- paste(reversed, collapse = "")
  #return(paste("Decimal", dec_start, "is ",f_str," in binary"))
  return(f_str)
}

Once we have the conversion function, we can create a dataset that holds the conversion results and the length of each binary number.

### Build the dataset for the conversion plot
df <- data.frame(dec_x = 1, bin_y = 1, digit_length = 1)

for (i in 2:100){
    d <- c(dec_x = i, bin_y = dec2bin(i), digit_length = nchar(dec2bin(i)))
    df <- rbind(df, d)
}

# rbind() coerces the columns to character, so convert back to numeric for plotting
df[] <- lapply(df, as.numeric)

The data.frame could equally be built in the other direction, from binary to decimal. The length of the binary number will be used to group the numbers in the scatterplot.
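The reverse helper is not shown in the post, but a minimal sketch of a bin2dec() counterpart (assuming a character string of 0s and 1s, as produced by dec2bin()) could look like this:

bin2dec <- function(bin){
  digits <- as.numeric(strsplit(bin, "")[[1]])
  # weight each digit by its power of two, most significant digit first
  sum(digits * 2^(rev(seq_along(digits)) - 1))
}

bin2dec("1100100")  # 100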

And finally, we create a scatter plot and a line chart with a smoothing line:

library(ggplot2)
library(cowplot)
line <- ggplot(df, aes(x=dec_x, y=bin_y, colour=digit_length)) + geom_point()
line2 <- ggplot(df, aes(x=dec_x, y=bin_y)) + geom_line() + geom_smooth()
plot_grid(line, line2, labels = c("Length of binary digits", "Like a binomial distribution"))

And get final two graphs:

The left graph is the scatterplot with decimal numbers on the x-axis and binary numbers on the y-axis. The colour of the dots represents the length of the binary numbers. We can see that steps appear roughly every 20th number; this is where the length increases and the corresponding values grow exponentially. The right graph shows the same conversion behaviour (binary and decimal numbers) and clearly resembles the binomial distribution function, regardless of how many numbers are converted.

As always, code is available on the Github in the same Useless_R_function repository. Check Github for future updates.

Happy R-coding and stay healthy!


Comparing performances of CSV to RDS, Parquet, and Feather file formats in R

From the previous blogpost:
CSV or alternatives? Exporting data from SQL Server data to ORC, AVRO, Parquet, Feather files and store them into Azure data lake

we created an Azure blob storage account, established a secure connection using Python and started uploading files from SQL Server to the blob store. Alongside that, we compared the performance of different file types: ORC, AVRO, Parquet, CSV and Feather. The conclusion was that CSV is great for its readability, but not suitable (as a file format) for all types of workloads.

We will be doing a similar benchmark with the R language. The goal is to see whether the CSV file format can be replaced by a file type that performs better, both in speed and storage.

Creating random datasets (10K and 1 million rows, each with 30 columns), we tested the speed of writing and reading from the local machine. For the benchmark, we used the microbenchmark package. The file formats were: CSV, Feather, Parquet and RData (or RDS).

This is the result of running each operation (writing and reading) 10 times for the 10K rows.

The test was performed with R version 4.1.2 on an M1 MacOS machine. For the 10,000 rows (with 30 columns), the average CSV size is 3.3 MiB, Feather and Parquet are around 1.4 MiB, and RData and rds take less than 1 MiB. By far the best read and write times go to the Feather file format, followed by Parquet and the data.table package (reading from a CSV file).

The test was conducted using a sample dataframe:

# packages used in the benchmark
library(dplyr)
library(arrow)           # Feather and Parquet
library(stringi)         # random strings
library(microbenchmark)

#file names
file_csv <- 'test_df.csv'
file_csv_readr <- 'test_df_readr.csv'
file_csv_datatable <- 'test_df_datatable.csv'
file_feather <- 'test_df.feather'
file_rdata <- 'test_df.RData'
file_rds <- 'test_df.rds'
file_parquet <- 'test_df.parquet'

files <- file.info(c('test_df.csv','test_df_readr.csv','test_df_datatable.csv', 'test_df.feather', 'test_df.RData', 'test_df.rds', 'test_df.parquet'))

# benchmark parameters (nof_rows was also set to 1e6 for the larger test)
nof_rows <- 10000
nof_repeat <- 10

test_df <- data.frame(
                replicate(10, sample(0:10000, nof_rows, rep = TRUE)),
                replicate(10, runif(nof_rows, 0.0, 1.0)),
                replicate(10, stri_rand_strings(nof_rows, 10))
                 )

and running benchmark:

benchmark_write <- data.frame(summary(microbenchmark(
          "test_df.csv"     = write.csv(test_df, file = file_csv),
          "test_df_readr.csv"     = readr::write_csv(test_df, file = file_csv_readr),
          "test_df_datatable.csv"     = data.table::fwrite(test_df, file = file_csv_datatable),
          "test_df.feather" = write_feather(test_df, file_feather),
          "test_df.parquet" = write_parquet(test_df, file_parquet),
          "test_df.RData"   = save(test_df, file = file_rdata),
          "test_df.rds"     = saveRDS(test_df, file_rds),
  times = nof_repeat)))

Same was repeated for reading operation and results were merged:

benchmark_read <- data.frame(summary(microbenchmark(
                            "test_df.csv" = read.csv(file_csv),
                            "test_df_readr.csv" = readr::read_csv(file_csv_readr),
                            "test_df_datatable.csv" = data.table::fread(file_csv_datatable),
                            "test_df.RData"  = load(file_rdata),
                            "test_df.rds" = readRDS(file_rds),
                            "test_df.feather" = arrow::read_feather(file_feather),
                            "test_df.parquet" = read_parquet(file_parquet), times = nof_repeat)))

colnames(benchmark_read) <- c("names", "read_min", "read_lq", "read_mean", "read_median", "read_uq", "read_max", "read_repeat")
colnames(benchmark_write) <- c("names", "write_min", "write_lq", "write_mean", "write_median", "write_uq", "write_max", "write_repeat")

# turn file.info() row names into a join column and express sizes in MiB
files$names <- rownames(files)
files$size_mb <- files$size / 1024^2

#merge results and create factors
results <- inner_join(inner_join(benchmark_read, files, by = "names"), benchmark_write, by = "names")
results <- results[, c("names", "size_mb", "read_min", "read_max", "read_median", "write_min", "write_max", "write_median")]

And finally, the results are plotted to get the graph above:

library(glue)
library(ggtext)

title_lab_adjusted <- glue::glue(
  "File types comparison on<br><span style = 'color:red;'>read operation</span> and <br><span style = 'color:darkgreen;'>write operation</span>")

ggplot(results, aes(x=names, y=size_mb)) + 
     geom_bar(stat="identity", fill="lightblue") +
     geom_text(aes(label=paste0(format(round(size_mb, 2), nsmall = 2), " MiB", collapse=NULL)), vjust=-0.3, size=3.5)+
     theme(axis.text.x = element_text(angle = 45, hjust = 1.3)) +
     coord_cartesian(ylim = c(0, 5), expand = F) +
     scale_y_continuous(breaks = seq(0, 5, 1),labels = scales::label_comma(accuracy = 1)) +
    theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(size = 0.1, linetype = "solid", colour = "grey50")) +
    ylab(label = 'Time (sec.) + File_Size') +  xlab(label = 'Files') +
  labs(title = title_lab_adjusted) +
  theme(
    plot.title = element_markdown(),
    panel.background = element_rect(color = NA, fill = 'white')) +

  geom_point (aes(y=write_median/100, group=names),
        col = "darkgreen",
        size = 2,
        stat ="identity",
        alpha=.8) +
  geom_point(aes(y=read_median/100, group=names),
           col = "red",
           size = 2,
           stat ="identity",
           alpha=.8 ) 

To put a bit more stress on the CPU, I created a data.frame with 1 million rows.

Comparing file sizes, the CSV files are all around 330 MiB, RData and RDS around 80 MiB, and Feather/Parquet around 140 MiB. So CSV is already more than 2x bigger than Parquet and 4x bigger than the RData file.

Looking at performance (median write/read times), Feather is by far the most efficient file format. Out of 10 runs, reading the complete dataset (1 million rows) took only 0.4 seconds, whereas CSV lagged behind significantly. For write operations, the Feather format was also the fastest.

Exploring the CSV format itself, I used three different R packages (base, readr and data.table), and data.table outperformed the other two significantly.

To conclude: storing data in the Feather format will give you the best performance. If you want to save storage space, use RData or rds, as these two take the least space. And if you are stuck with the CSV format, use data.table, which is by far the fastest for data wrangling and manipulation tasks.

As always, the complete code is available on Github, in the benchmark repository. In the same repository, you will also find the Python code from the previous blog post; if you are looking only for the code described in this blog post, click here. Follow up for more updates.

Happy R coding! And stay healthy!


CSV or alternatives? Exporting data from SQL Server data to ORC, AVRO, Parquet, Feather files and store them into Azure data lake

The CSV data format is an old format and very common for data tasks like import, export and storage. But when it comes to the performance of creating, reading and writing CSV files, how does it stand up against some other formats?

We will benchmark these operations across different data formats, from CSV to ORC, Parquet, AVRO and others, combined with simple Azure data storage operations such as create, write, read and transform.

This blogpost will cover:
1. create Azure storage account and set the permissions
2. create connection to SQL Server database using Python
3. export data from SQL Server database (AdventureWorks database) and upload to Azure blob storage and
4. benchmark the performance of different file formats

1. Create Azure storage account

Create Resource group and storage account in your Azure portal. I will name the resource group “RG_BlobStorePyTest“.

Next, within this resource group, I will create a storage account.

Under the advanced tab, leave the defaults. In particular, I will need “Enable blob public access” to be turned on. And under connectivity, “Public network (all endpoints)” will be selected.

Both tasks can also be scripted, using the Azure CLI for the resource group and storage account and Azure PowerShell for the container, keeping the same resource group and storage account names with all the defaults.

#Connect to your Azure subscription
Connect-AzAccount

# Create the resource group (Azure CLI)
az group create --name RG_BlobStorePyTest --location westus2

# Create the storage account (Azure CLI)
az storage account create -n storeaccountpytest -g RG_BlobStorePyTest

# Create a container object (Azure PowerShell)
$ctx = New-AzStorageContext -StorageAccountName storeaccountpytest -UseConnectedAccount
$container = New-AzStorageContainer -Name blobcontainerpytest -Context $ctx

After creating the Azure storage account and the container, we need access to the container. We can use a SAS (Shared Access Signature). Under the container, click SAS:

And select the validity of SAS, Allowed services, resource types and permissions:

And click Generate SAS and connection string. Save the credentials for the storage account you have just created.

2. Python and Storage account connection

I will be using the Python package pypyodbc to connect to a local installation of SQL Server. With this package I can reach the SQL Server data over an ODBC connection.

import sys
import pypyodbc as pyodbc
import datetime
import pandas as pd

# connection parameters
nHost = r'(localdb)\MSSQLlocalDB'
nBase = 'AdventureWorks2019'

def sqlconnect(nHost,nBase):
    try:
        return pyodbc.connect('DRIVER={SQL Server};SERVER='+nHost+';DATABASE='+nBase+';Trusted_connection=yes')
    except:
        print ("connection failed check authorization parameters")  

con = sqlconnect(nHost,nBase)
cursor = con.cursor()

sql = ('''select * from [Sales].[vIndividualCustomer]''')

query = cursor.execute(sql)
row = query.fetchall()
con.close()

# store results to data.frame
df = pd.DataFrame(data=row)

# add column names
df.columns = ['BusinessEntityID','Title','FirstName','MiddleName','LastName','Suffix','PhoneNumber','PhoneNumberType','EmailAddress','EmailPromotion','AddressType','AddressLine1','AddressLine2','City','StateProvinceName','PostalCode','CountryRegionName','Demographics']

In addition, we also need some Python packages to connect to Azure blob storage. The following packages will do most of the work:

from azure.storage.blob import PublicAccess
from azure.storage.blob import BlobServiceClient
import os

But first, we need to do the export.

3. Exporting data from on-prem SQL Server to Blob Store

Exporting can be done simply by writing to a file, which is then uploaded to the blob store. Continuing the previous Python script block, we add the output to a CSV file.

#to csv
df.to_csv('AWorks2019.csv')

The file is created on the local machine (host) and can now be uploaded to the blob store. Make sure you copy and paste the SAS connection string into the Python code accordingly.

from azure.storage.blob import BlobServiceClient, BlobBlock
from azure.storage.blob import PublicAccess
from azure.storage.blob import BlobServiceClient
import os

# Get Connection string from Container SAS. starts with: BlobEndpoint=https:// ...
connection_string = ""

blob_service_client = BlobServiceClient.from_connection_string(connection_string)
container_client = blob_service_client.get_container_client("blobcontainerpytest") 

And the last part is to upload the file that was just created on the local machine to the blob store:

#get name of the file for blob Store
blob_client = container_client.get_blob_client("AWorks2019_Uploaded.csv")

# get the local file name
with open("AWorks2019.csv", "rb") as data:
    blob_client.upload_blob(data)

You can upload the file into a folder structure or simply into the root folder of the container.

4. Benchmark the performance of different file formats

The last part of this blog post is about the performance of different file formats!

CSV is a relatively old format. It is almost an “industry standard”, but that does not make it efficient in every case, especially when writing to or reading from CSV files. There are other file formats which can be much more convenient for storing data in the cloud or transferring data between blob storage containers. Formats like AVRO, Parquet, ORC and pickle can achieve better performance (for writes or reads) and take up less space on your storage. And while storage is cheap, network traffic can be expensive, both in time and money.

Let’s compare these formats in terms of write speed and file size. I have prepared a couple of Python functions:

import timeit
import time
import numpy as np
import pandas as pd

import feather
import pickle
import pyarrow as pa
import pyarrow.orc as orc 
from fastavro import writer, reader, parse_schema

number_of_runs = 3


def Create_df():
    np.random.seed(2908)
    df_size = 1000000

    df = pd.DataFrame({
        'a': np.random.rand(df_size),
        'b': np.random.rand(df_size),
        'c': np.random.rand(df_size),
        'd': np.random.rand(df_size),
        'e': np.random.rand(df_size)
    })
    return df


def WRITE_CSV_fun_timeIt():
    print('Creating CSV File start')
    df.to_csv('10M.csv')
    print('Creating CSV File  end')
    

def WRITE_ORC_fun_timeIt():
    print('Creating ORC File start')
    table = pa.Table.from_pandas(df, preserve_index=False)
    orc.write_table(table, '10M.orc')
    print('Creating ORC File  end')

def WRITE_PARQUET_fun_timeIt():
    print('Creating PARQUET File start')
    df.to_parquet('10M.parquet')
    print('Creating PARQUET File  end')

def WRITE_PICKLE_fun_timeIt():
    print('Creating PICKLE File start')
    with open('10M.pkl', 'wb') as f:
        pickle.dump(df, f)
    print('Creating PICKLE File  end')

This is sample code created for the purposes of the benchmark. In a real-world scenario, you would have a database connector and export the data from the SQL Server database to files. Running this code shows that CSV is by far the slowest format to write to a file. In terms of file size, CSV is also the biggest.

Run the functions as follows:

df = Create_df()

# results for write
print(timeit.Timer(WRITE_CSV_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_ORC_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PARQUET_fun_timeIt).timeit(number=number_of_runs))
print(timeit.Timer(WRITE_PICKLE_fun_timeIt).timeit(number=number_of_runs))

CLEAN_files()

And you will see how each format performs. Here is a sample test run:

You can see that ORC, pickle and Parquet were among the fastest, and the slowest was the CSV file format.

Conclusion

Engineering data in a fast and scalable way is an important aspect of building a data lake. With a vast variety of raw files and formats, a single centralised data repository can quickly get swampy. Designing and storing data (especially large amounts of structured and semi-structured data) in a way that gives you not only better performance but also the freedom to choose better data formats is something data engineers should do. Ignoring these facts can cripple your performance in the long run.

Sorry for the long post, but as always the complete code is available on Github in the benchmark file formats repository (or click here). The complete Jupyter Notebook is here, and the benchmark Python file is here.

And check Github repository for the future updates.

Happy coding and stay healthy!


Little useless-useful R functions – Finding total sum of all sub-arrays and sum of maximums

When it comes to O(N log(N)) time complexity functions, this is the right section. 🙂

Given an array of natural numbers (later you will see it can just as well be real or rational numbers), we enumerate all contiguous sub-arrays, preserving the original order.

array = c(1,2,3,4)
# sub-arrays with their sums in [] brackets
 1. c(1)        [1]
 2. c(1,2)      [3]
 3. c(1,2,3)    [6]
 4. c(1,2,3,4)  [10]
 5. c(2)        [2]
 6. c(2,3)      [5]
 7. c(2,3,4)    [9]
 8. c(3)        [3]
 9. c(3,4)      [7]
10. c(4)        [4]

with total SUM = 1+3+6+10+2+5+9+3+7+4 = 50

With simple R function, we can achieve this by:

arr = c(1,2,3,4,4,5,6,7,7,6,5,4,3,1)

sumArr <- function(x){
  summ <- 0
  for (i in 1:length(x)) {
    for (j in i:length(x)) {
      midsum <- sum(x[i:j])
      summ <- summ + midsum
    }
  }
  cat(paste0("Total sum of sub-arrays: ", summ))
}

#calling function
sumArr(arr)
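As a cross-check, the same total can be computed in linear time, because element arr[i] takes part in exactly i * (n - i + 1) contiguous sub-arrays; a small sketch using the same arr:

# closed-form check of the total sum of all contiguous sub-arrays
n <- length(arr)
sum(arr * seq_len(n) * (n - seq_len(n) + 1))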

Ok, this was useless and straightforward. What if we instead take the maximum of each sub-array and sum those up:

array = c(1,2,3,4)
# sub-arrays with their maximums in [] brackets
 1. c(1)        [1]
 2. c(1,2)      [2]
 3. c(1,2,3)    [3]
 4. c(1,2,3,4)  [4]
 5. c(2)        [2]
 6. c(2,3)      [3]
 7. c(2,3,4)    [4]
 8. c(3)        [3]
 9. c(3,4)      [4]
10. c(4)        [4]

with total SUM = 1+2+3+4+2+3+4+3+4+4 = 30

sumArrOfMax <- function(x){
  summ <- 0
  for (i in 1:length(x)) {
    for (j in i:length(x)) {
      midsum <- max(x[i:j])
      summ <- summ + midsum
    }
  }
  cat(paste0("Total sum of maximums of all sub-arrays: ", summ))
}

# run function
sumArrOfMax(arr)

As always, code is available on the Github in the same Useless_R_function repository. Check Github for future updates.

Happy R-coding and stay healthy!


Little useless-useful R functions – Animating datasets

I firmly believe that animation and transitions between different data states can give end users much better insight and understanding of the data than a single table of data points or correlation metrics.

With the help of ggplot2 and gganimate, you can quickly create an animation based on your needs. This is a simple example on the iris dataset.

Animation of data points with the transitions

And the useless plot with the transitions between the data points:

library(ggplot2)
library(gganimate)
library(ggthemes)   # for theme_hc()

p <- ggplot(iris, aes(factor(Species), Sepal.Length, colour = Species)) +
  geom_boxplot(show.legend = FALSE) +
  labs(title = 'Petal width: {as.numeric(format(round(frame_time, 2), nsmall = 2))}', x = 'Iris species', y = 'Sepal Length') +
  transition_time(as.numeric(Petal.Width)) +
  ease_aes('sine-in-out') +
  enter_fade() +
  theme_hc()

p
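To actually render and write the animation to a GIF, something along these lines works (a sketch; the frame count, fps and size are arbitrary choices):

# render the animation object p and save it as a GIF
anim <- animate(p, nframes = 100, fps = 10, width = 600, height = 400)
anim_save("iris_transition.gif", anim)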

As always, code is available on the Github in the same Useless_R_function repository. Check Github for future updates.

Happy R-coding and stay healthy!


Simple R merge method and how to compare it with T-SQL

The merge statement in R is a powerful, simple and straightforward method for joining data frames. It also offers some neat features that give R users fast data wrangling.

I will be comparing this feature with T-SQL language, to show the simplicity of the merge method.

Creating data.frames and tables

We will create two tables using T-SQL and populate each table with some sample rows.

CREATE TABLE dbo.Users (
     UserID INT
    ,GroupID CHAR(1)
    ,DateCreated datetime
    ,TotalKM INT
    ,Age INT )

CREATE TABLE dbo.UserRun (
     UserID INT
    ,GroupID CHAR(1)
    ,Run INT
    ,RunName VARCHAR(50) )

INSERT INTO dbo.Users
          SELECT 1, 'X','2022/04/10',22,34
UNION ALL SELECT 1, 'X','2022/04/11',33,34
UNION ALL SELECT 2, 'Y','2022/04/13',33,41
UNION ALL SELECT 2, 'Y','2022/04/14',42,41
UNION ALL SELECT 3, 'X','2022/04/11',6,18
UNION ALL SELECT 4, 'Z','2022/04/13',8,56

INSERT INTO dbo.UserRun
          SELECT 1, 'X',1,'Short'
UNION ALL SELECT 1, 'X',0,'Over Hill'
UNION ALL SELECT 2, 'Y',0,'Short'
UNION ALL SELECT 2, 'Y',0,'Miller'
UNION ALL SELECT 3, 'X',0,'River'
UNION ALL SELECT 4, 'Z',0,'Mountain top'
UNION ALL SELECT 5, 'T',0,'City'

And create data.frames with the same data for R language:

Users = data.frame(UserID = c(1,1,2,2,3,4) 
                     ,GroupID = c("X","X","Y","Y","X","Z")
                     ,DateCreated = c("2022-04-10","2022-04-11","2022-04-13","2022-04-14","2022-04-11","2022-04-13")
                     ,TotalKM = c(22,33,33,42,6,8)
                     ,Age = c(34,34,41,41,18,56)
)

UserRun = data.frame(UserID = c(1,1,2,2,3,4,5)
                     ,GroupID = c("X","X","Y","Y","X","Z","T")
                     ,Run = c(1,0,0,0,0,0,0)
                     ,RunName = c("Short", "Over Hill","Short","Miller","River","Mountain top","City")
                     
)

Merge and T-SQL joins

Merging data is part of daily data engineering work. Handling data and delivering results is crucial, and understanding the merge method matters for delivering correct results.

Simple T-SQL inner join:

SELECT  *
FROM dbo.Users as U 
JOIN dbo.UserRun AS UR
ON U.UserID = UR.UserID
AND U.GroupID = UR.GroupID

And on the other side, R Merge method for inner join:

IJ_Us_UsR <- merge(x=Users, y=UserRun, by = c("UserID", "GroupID"))
IJ_Us_UsR

gives the same result.

Left and right (outer) joins

The order of the tables in a SQL join clause matters for the join operation and determines the output of the query, so with outer joins the table order must be chosen carefully. Taking the SQL query from above and changing the JOIN to a RIGHT JOIN returns one additional row:

SELECT *
FROM dbo.Users as U 
RIGHT JOIN dbo.UserRun AS UR
ON U.UserID = UR.UserID
AND U.GroupID = UR.GroupID

And the “same” result can be achieved by using a LEFT JOIN with the table order reversed:

SELECT *
FROM dbo.UserRun as U 
LEFT JOIN dbo.Users AS UR
ON U.UserID = UR.UserID
AND U.GroupID = UR.GroupID

With R’s merge method, things are simplified when using outer joins, but extra caution is needed.

We can achieve left join with:

# Left join all.x = TRUE
LJ_Us_UsR <- merge(x = Users, y = UserRun, by = c("UserID", "GroupID") , all.x = TRUE) 

but by switching all.x = TRUE to all.y = TRUE, we get the right join:

# right join all.y = TRUE
RJ_Us_UsR <- merge(x = Users, y = UserRun, by = c("UserID", "GroupID") , all.y = TRUE) 

The same goes if we reverse the data.frame order to get the left join:

LJ_Us_UsR <- merge(x = UserRun, y = Users, by = c("UserID", "GroupID") , all.y = TRUE) 

But by omitting all.y = TRUE we can easily slip into an inner join:

# leaving all.y = FALSE (the default) turns this back into an inner join
RJ_Us_UsR <- merge(x = Users, y = UserRun, by = c("UserID", "GroupID") , all.y = FALSE) 

If you leave the defaults on, or ignore these arguments, you might get a different result than expected! Even though the merge statement is short, simple and powerful, you should be careful with the code.
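For completeness (not covered above), setting both sides to TRUE gives a full outer join, keeping unmatched rows from both data frames; a quick sketch:

# full outer join
FJ_Us_UsR <- merge(x = Users, y = UserRun, by = c("UserID", "GroupID"), all = TRUE)
FJ_Us_UsR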

Joins without unique value

Both data frames have one (or two) columns that identify a user, but the UserID and GroupID values are repeated in both data frames. Since there is no uniquely defined column and the columns used in the join condition contain repeated values, rows will be multiplied.

Both data frames have UserID = 1 (from the same GroupID) inserted twice. When joining them, we can specify how to join, but the uniqueness is still missing, and we get a Cartesian product of 2×2 = 4 rows for UserID = 1 in the final result.

Logically, we would want UserID = 1 to appear only twice. With T-SQL joins, we would need a window function to pick the unique rows (a sort of surrogate unique value) and return the result. There are many other ways, but essentially this is a database design question!
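One simple R-side workaround is to de-duplicate on the join keys before merging; a sketch:

# drop duplicate (UserID, GroupID) pairs from Users before the join
Users_unique <- Users[!duplicated(Users[, c("UserID", "GroupID")]), ]
merge(x = Users_unique, y = UserRun, by = c("UserID", "GroupID"))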

With R, the merge statement is powerful enough to do this by using row.names instead of defining the ON clause.

# Join by row names / internal unique value
RowNameJ_Us_UsR <- merge(x=Users, y=UserRun, by = "row.names")
RowNameJ_Us_UsR

The same result can be achieved by specifying the 0th column for the join:

RowNameJ_Us_UsR <- merge(x=Users, y=UserRun, by = 0)  

If we set the by attribute to NULL instead of 0, we get a cross join!

CJ_Us_UsR <- merge(x = Users, y = UserRun, by = NULL )  
CJ_Us_UsR

Joins with more than two data frames

The merge statement can also be used to join three or more data frames in R. Let’s create a third data frame:

# Joining more than two data.frames
Run = data.frame(UserID = c(1,1,2,2,3,4,6,8) 
                   ,GroupID = c("X","X","Y","Y","X","Z","H","K")
                   ,Trainer = c(1,1,1,1,1,0,0,0)
)

A merge statement is designed to work with two data frames at once, but nesting merge statements feeds the next data frame into the previous result set. So if you were joining four data frames, you would end up with 3 (n-1) merge statements.

Three_IJ <- merge(merge(x = Users, y = UserRun, by = c("UserID", "GroupID")), y = Run, by = c("UserID", "GroupID"))
Three_IJ
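If the number of data frames grows, base R’s Reduce() expresses the same nesting more compactly; a sketch assuming all three frames share the same key columns:

# fold merge() over a list of data frames instead of nesting calls by hand
Three_IJ_reduce <- Reduce(function(x, y) merge(x, y, by = c("UserID", "GroupID")),
                          list(Users, UserRun, Run))
Three_IJ_reduce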

As always, you can also use many other R packages to achieve the same result, e.g.: dplyr, data.table, tidyverse, sqldf just to name a few.

Code is available at Github in Useless – Useful R functions repository.

Happy R-Coding and Happy Easter 2022!


Python and R have matrix and simple loops. What can you use in T-SQL?

Many of you have already used a programming or scripting language, and you can tell there are some functionalities that are intuitive (for loops, enumerations, collections, …), along with handy data types and shorthands.

One of these data types is the matrix. It is commonly used in programming languages, yet fairly unknown in T-SQL. You could say a matrix is a list of lists, an array of arrays, or simply a table 😉

Numpy matrix

Scripting languages offer a super simple way to walk through a matrix: using enumerators or loops, you can visit each element. For example, in Python (using the grid from the screenshot above):

# the grid from the screenshot above (same values as the R matrix and SQL table below)
grid = [[12, 11, 23], [23, 34, 35], [51, 55, 56]]

for i in range(0,3):
    for j in range(0,3):
        print(grid[i][j])

And you can even add some logic. For example, if the value equals a predefined value then print the correct result.

for i in range(0,3):
    for j in range(0,3):
        if grid[i][j] == 23:
            print("true", i, j)
        else:
            print("not exists", i, j)

And in this case, you will get a simple result print:

Result of double-loop

The same operation can be simply created also in R.

mat = matrix(c(12,11,23,23,34,35,51,55,56),nrow = 3, ncol = 3,byrow = TRUE)
print(mat)

for (i in 1:3){
  for (j in 1:3){
    print(mat[i,j])
  }
}

And with IF statement embedded.

for (i in 1:3){
  for (j in 1:3){
    if (mat[i,j] == 23){
      print(paste("true", i, j, sep = " "))
    } else {
      print (paste("not exists", i, j, sep = " "))
    }
  }
}

It is obvious that both languages handle these operations with very little syntax and very efficiently.

As you can imagine, doing such an operation in any other scripting or programming language would not take much more additional programming and would be fairly similar.

Why would this operation be so annoying to write in T-SQL? Tables in SQL Server are columns with values stored in rows. Columns are usually listed in the SELECT list, whereas rows are filtered in the WHERE clause. Combining these two query parts is somewhat upside down compared to, for example, the Python code above. Looping through rows is not a problem, because that is the basic logic of SQL, but looping through columns is something that is usually fixed up front and is not part of a dynamic query or a last-second wish list. When writing a query, the user knows up front which information (columns) to retrieve from the table.

So we want to slightly change this logic. Why? Because it makes manipulating rows and columns simpler.

Create an SQL table

Let’s create an SQL Table and store same data as in previous Python matrix:

DROP TABLE IF EXISTS dbo.grid2;
CREATE TABLE dbo.grid2
( ID TINYINT IDENTITY(1,1)
, [1] TINYINT NOT NULL
, [2] TINYINT NOT NULL
, [3] TINYINT NOT NULL)


INSERT INTO dbo.grid2 ([1],[2],[3])
                      SELECT 12,11,23
UNION ALL SELECT 23,34,35
UNION ALL SELECT 51,55,56

The result of a simple SELECT statement against this table is:

Matrix look-a-like with index column ID

Procedure

For the sake of sanity, I “cheated” here 🙂 by adding an indexing column ID for simpler SQL data manipulation, and the column names are enumerated as [1], [2], [3]. I could also have used letters, or any character from the ASCII series, e.g. ASCII(50), ASCII(51), ASCII(52), …, so that the integers can simply be increased or decreased.

Ideally, we are looking for a T-SQL notation get_grid(row, column) that returns a single value. For example, get_grid 2, 3 would return the value 35 (based on the data in the table; row = 2, column = 3).

We want this function to be clean and simple to use with further functions or procedures.

After a couple of iterations:

CREATE OR ALTER PROCEDURE dbo.get_grid
(@row TINYINT
,@col TINYINT
,@result TINYINT OUTPUT
)
AS
BEGIN
	DECLARE @sql NVARCHAR(1000)
	SET @sql = '
		SELECT 	
			 '+CAST(QUOTENAME(@col) as VARCHAR(100))+' 
		FROM dbo.grid2 as g
		WHERE
			g.ID = '+ CAST(@row AS VARCHAR(100)) +' 
	'

	DECLARE @t table (i tinyint)
	INSERT INTO @t
	EXEC sp_executesql @sql

	SET @result = (SELECT i FROM @t)
	 
END;
GO

Calling this procedure with an output parameter looks like this:

DECLARE @v TINYINT; 
EXEC dbo.get_grid 2,3, @v OUT
SELECT @v

Running double loop or nested loop

Now pretend that the table is a matrix, and we are looping through the matrix:

DECLARE @i INT = 1
WHILE @i <= 3
    BEGIN
    DECLARE @j INT = 1
    WHILE @j <= 3
        BEGIN
                DECLARE @v  TINYINT  = 0
                EXEC dbo.get_grid
                    @i
                    ,@j
                    ,@v OUT
                
                IF (@v = 23) 
                    SELECT 'True', @i, @j, @v
                ELSE
                    SELECT 'Not Exists', @i, @j, @v
        SET @j = @j + 1
        END
SET @i = @i + 1 
END

We get the same results as with Python, with slightly more overhead code.

Now you will ask: but why? Well, this is just a helper T-SQL procedure for easier board-game development in T-SQL, wherever matrix-like numbers are needed.

As always, the complete code is available in the Github repository for the Sudoku T-SQL game.

Stay healthy!


Kadane’s algorithm – finding the maximum sum in a contiguous sub-array

A great algorithm for analysing time-series data or arrays of numbers.

An algorithm for finding the contiguous subarray within a one-dimensional array of numbers that has the largest sum. It is called Kadane’s algorithm.

Largest sum

How does the algorithm work? It looks for the global maximum sum over any sub-array, regardless of its starting position or its length. The numbers must be adjacent in the sequence (no number may be left out).

The recurrence is: the local_maximum at index i is the maximum of A[i] and A[i] + local_maximum at index i-1. In other words, for each position you keep the best sum of a sub-array ending exactly there, either extending the previous best run or starting afresh, and the answer is the largest of these local maxima.
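The update step on a tiny example, written as a fold (a sketch with a hypothetical vector v_small, just to illustrate the recurrence):

v_small <- c(-2, 3, -1, 4)
# local_max[i] = max(v[i], v[i] + local_max[i-1])
local_max <- Reduce(function(prev, x) max(x, x + prev), v_small, accumulate = TRUE)
local_max        # -2  3  2  6
max(local_max)   # 6, i.e. the sub-array c(3, -1, 4)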

With a simple function we can iterate through the array (vector) of values:

#sample vector
v <- c(-3,-8, 1, -2, 1, 5, -3, -4, 3, 10, -2, 4, 1)

kadane <- function(v){
 
  max_so_far = -Inf      # start lower than any possible sum
  max_ending_here = 0
  
  for (i in 1:length(v)) {
    max_ending_here = max_ending_here  + v[i]
    if (max_so_far < max_ending_here ){
      max_so_far = max_ending_here
    }
    if (max_ending_here < 0) {
      max_ending_here = 0
    }
  }
  return (max_so_far)
}

Run the function with:

# run the function
kadane(v)

The function can also return the starting and ending positions of the array.

kadane_with_position <- function(v){
  max_so_far = -Inf      # start lower than any possible sum
  max_ending_here = 0
  subarray_start = 0
  subarray_end = 0
  int_s = 0
  
  for (i in 1:length(v)) {
    max_ending_here = max_ending_here  + v[i]
    if (max_so_far < max_ending_here ){
      max_so_far = max_ending_here
      subarray_start = int_s
      subarray_end = i
    }
    if (max_ending_here < 0) {
      max_ending_here = 0
      int_s =  i + 1 
    }
  }
  cat("Sum is: ", max_so_far, " with starting position: ", subarray_start, " and ending: ", subarray_end)
}

# run the function
kadane_with_position(v)

Kadane’s algorithm also has an interesting time complexity profile: O(n) for a one-dimensional array, growing to roughly O(n^3) for the 2D (maximum sub-matrix) version.

The R package adagio offers this functionality, including the ability to work over matrices.
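A quick sketch of cross-checking the result with that package (assuming it is installed; maxsub() covers the 1D case and maxsub2d() the matrix case):

# install.packages("adagio")
library(adagio)
maxsub(v)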

As always, code is available at Github in the same Useless_R_function repository. Check Github for future updates.

Happy R-coding and stay healthy!


Little useless-useful R functions – benchmarking vectors and data.frames on simple GroupBy problem

After an interesting conversation about using data.frames versus strings (as a vector), I decided to put this to a simple benchmark test.

The problem is straightforward and simple: a MapReduce or Count with GroupBy.

Problem

Given a simple string of letters:

BBAARRRDDDAAAA

Return a sorted string (per letter) as key-value pairs, with a count for each letter, like:

6A2B3D3R
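A one-off sanity check of that expected output (outside the benchmark):

s <- "BBAARRRDDDAAAA"
tab <- table(strsplit(s, "")[[1]])
paste0(tab, names(tab), collapse = "")   # "6A2B3D3R"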

The conversation touched on how best to shape the input data: as a single string or as rows of data. Sure, anything that is already chunked will perform better. But from which size onwards?

Data

Let’s create a string of letters and a data.frame with letters, at various lengths, drawing from 20 unique letters.

len <- 100000  # 10^5 (varied across test runs)
lett <- c(LETTERS[1:20])

set.seed(2908)

a_vec <- do.call(paste0, c(as.list(sample(lett,len,replace=TRUE)), sep="")) # a vector
a_df <- data.frame(chr = paste0(sample(lett,len,replace=TRUE), sep="")) # a data frame

Difference between dataframe and vector

Test

With the rbenchmark framework, I used 20 replications for each test. There are many more possible solutions; initially, I wanted to find the break-point where the string becomes slower than the data.frame.

library(dplyr)
library(purrr)

rbenchmark::benchmark(
"table with vector" = {
    res_table <- ""
    a_table <- table(strsplit(a_vec, ""))
    for (i in 1:length(names(a_table))) {
      key<- (names(a_table[i]))
      val<-(a_table[i])
      res_table <- paste0(res_table,key,val)
    }
},
"dplyr with data.frame" = {

res_dplyr <- a_df %>%
    count(chr, sort=TRUE) %>%
    mutate(res = paste0(chr, n, collapse = "")) %>%
    select(-chr, -n) %>%
    group_by(res)

res_dplyr[1,]
},
"purrr with data.frame" = {
  adf_table <- a_df %>% 
    map(~count(data.frame(x=.x), x))
  
    res_purrr <- ""
    for (i in 1:nrow(adf_table$chr)) {
      key<- adf_table$chr[[1]][i]
      val<- adf_table$chr[[2]][i]
      res_purrr <- paste0(res_purrr,key,val)
    }
},
  replications = 20, order = "relative"
)

Conclusion

After running 10 tests with different input lengths on my personal laptop (16 GB RAM, Windows 10, i5 8th gen, 4 cores), the results were:

Results

The results show that the strsplit-plus-table approach stays fast for strings shorter than about 1,000,000 characters. For anything larger than this, the data.frame approaches give much better results.

As always, code is available at the Github in the same Useless_R_function repository. Check Github for future updates.

Happy R-coding and stay healthy!
