NASA has been collecting surface temperature measurements for more than 100 years, and the GISS Surface Temperature analysis covers the period from 1880 onward. It is an estimate of global surface temperature change.
The temperature analysis scheme was defined in the late 1970s by James Hansen, and the complete analysis and method are documented in Hansen and Lebedeff (1987). Put simply, it is a method of estimating global temperature change that is used for comparison with one-dimensional global climate models.
Data
The website (https://data.giss.nasa.gov/gistemp/) offers loads of data, and I am using the Global-mean monthly, seasonal, and annual means, 1880-present, updated through the most recent month (TXT, CSV).
The data needs some prior preparation, namely adding min and max values, as required by the radarchart function.
library(ggradar)
library(fmsb)
library(scales)
library(RColorBrewer)
#data txt and preparation
df <-read.csv("Documents/GLB.Ts+dSST.csv",header = TRUE, sep = ",", skip = 1, dec="." )[1:13]
df <- as.data.frame(sapply(df[1:143,], as.numeric))
df_months <- names(df)[2:13]
df_years <- df$Year
rownames(df) <- df_years
df <- df[,2:13]
# adding max min
max_min <- data.frame(
Jan = c(1.4, -0.85), Feb = c(1.4, -0.85), Mar = c(1.4, -0.85),
Apr = c(1.4, -0.85), May = c(1.4, -0.85), Jun = c(1.4, -0.85),
Jul = c(1.4, -0.85), Aug = c(1.4, -0.85), Sep = c(1.4, -0.85),
Oct = c(1.4, -0.85), Nov = c(1.4, -0.85), Dec = c(1.4, -0.85)
)
rownames(max_min) <- c("Max", "Min")
#merging
df <- rbind(max_min, df)
# Set graphic colors
nb.cols <- length(df_years)
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(nb.cols)
colors_border <- mycolors
colors_in <- alpha(mycolors, 0.3)
Data visualisation
Adding the radarchart and looping through the years:
for (i in 1:length(df_years)){
y <- df_years[1:i]
df_tmp <- df[rownames(df)%in%y,1:12]
df_tmp <- rbind(max_min, df_tmp)
radarchart( df_tmp, maxmin=TRUE, axistype=1,seg=3,vlabels = df_months,
plwd=0.5 , plty=1,centerzero=FALSE,caxislabels = c(-1, 0, 1, 1.4),
cglcol="grey", cglty=2, axislabcol="black",
vlcex=1.2,
title= paste0("GISS Surface temperature for years until ", tail(y,1)) )
legend(x=-0.35, y=0.15, legend = tail(y,1), bty = "n", pch=30 , col=colors_in , text.col ="black", cex=1.3, pt.cex=3)
}
Changes through time stretch from 1880 until 2022. The scale of these radar charts goes from -1ºC to 1.4ºC (with 0 and 1 as reference lines).
Year 1900
Year 1950
Year 2000
Year 2022
The spikes and bursts in temperature over the past 30 years are mind-boggling 😦
As always, code is available on Github in Useless_R_function repository. The sample file in this repository is here (filename: Climate_spiral.R) Check the repository for future updates.
What a nightmare it was to type a short message on these keypads. Imagine writing Hello on such a keypad: you had to press 4433555555666 to get the letters “hello”.
44 = h, 33 = e, 555 = l, 555 = l, 666 = o
So creating a converter would be great for trolling your friends 🙂
By using this useless function, you can now convert text into numbers.
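The function below relies on a keypad lookup matrix mm, which is shipped as a helper dataset in the repository. To keep the snippet self-contained, here is one hypothetical way such a matrix could be built (rows are keys 2-9, columns are the number of presses); the actual helper in the repo may differ:
# Hypothetical keypad matrix: rows = keys 2-9, columns = 1-4 presses (illustrative only)
mm <- matrix(c("a","b","c", NA,
               "d","e","f", NA,
               "g","h","i", NA,
               "j","k","l", NA,
               "m","n","o", NA,
               "p","q","r","s",
               "t","u","v", NA,
               "w","x","y","z"),
             nrow = 8, byrow = TRUE,
             dimnames = list(paste0("k", 2:9), paste0("press", 1:4)))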
SMSconverter <- function(tt){
st <- NULL
tti <- unlist(strsplit(paste0(tt, " "), ""))
# check if input string are letters
if (!grepl("[^A-Za-z]", tti[1]) == TRUE){
for (i in 1:nchar(tt)){
lt <- substr(tt,i,i)
if (lt != " "){
rn <- substr(rownames(which(mm == lt, arr.ind = T)),2,2)
rep <- which(mm == lt, arr.ind = T)[2]
st <- c(st, replicate(rep, rn))
} else { st <- c(st, "0") }
}
print(paste0(st, collapse = ""))
}
# check if input string are numbers
if(!grepl("\\D", tti[1]) == TRUE){
st <- NULL
tti <- unlist(strsplit(as.character(tt), ""))
tmp <- rle(tti)
for (i in 1:length(tmp$lengths)){
rpt <- tmp$lengths[i]
row_cnt <- tmp$values[i]
lt <- mm[as.integer(row_cnt)-1,rpt]
st <- c(st, lt)
}
print(paste(st, collapse=""))
}
}
You can run the function as:
SMSconverter("text")
and it will output the sequence of numbers corresponding to the text: 833998.
The function is created to convert text to numbers and numbers to text. Yet, the ambiguity of repeated digits prevents a correct conversion in the latter case.
Let me give you an example with my name “tomaz”. The corresponding number conversion is 8666629999.
8 = t, 2 = a, 9999 = z, and the middle 6666 can be split as 6 + 666, 66 + 66 or 666 + 6, giving m,o, n,n or o,m.
So the solutions can be tmoaz, tnnaz or tomaz. Well, one might also need a spell checker. Anyway, the conversion becomes ambiguous when there are consecutive letters typed on the same key.
As always, complete code with helper datasets is available on Github in Useless_R_function repository. The sample file in this repository is here (filename: Convert_text_to_number.R) Check the repository for future updates.
This is a great opportunity and I am honoured to be hosting this month’s T-SQL Tuesday blogging invitation. With Steve’s invitation, we have agreed to post a topic on data science.
I will be collecting all of your answers from blog posts and Twitter (make sure to add #tsql2sday).
Data Science in the time of Chat GPT
Instead of writing and asking data science questions, let’s discuss the aspects of data science in the presence of Chat GPT 4.0.
By now, it is known to everyone that Chat GPT is a large language model (LLM) based on the GPT (Generative Pre-trained Transformer) architecture. It uses deep learning algorithms (neural networks with billions of weights and transformers) to generate the sequence of tokens that makes up a piece of text. Transformers introduce the concept of “paying attention” to build better sequences of text. The model operates primarily on probabilities of words and their sequence, which makes it good at human-like responses to natural language queries and great for a conversation-like experience.
There are many caveats hidden in the processing of text: adjustments of weights, activation functions (different and tweaked versions of ReLU), additional corpora and billions of texts used for model training.
I have prepared two groups of questions. I will not go into the debate on whether the end of data science is near, nor whether AGI (artificial general intelligence) will completely replace the role of data scientists. What I want to hear from you is simply how you embraced (if at all) the use of Chat GPT, and what your first impressions were. And mostly, how it helped you (if at all), what you used it for, and whether you encountered any traps.
Usage and working along Chat GPT
Imagine using SQL, R, Python, Julia, or Scala for your daily data science work. You can practically ask Chat GPT anything and it will return a relatively coherent and good answer. If you need an explanation, it will excel. Where and what have you used it for? Here is a short list that might get you started:
Explain a data science algorithm
Help tune or create SQL code to query big data
Prepare R, Python, or Scala code for exploring the data
Help you prepare the training of the model in the desired language
Prepare the code for hyperparameter tuning and cross-validation
Ask for a data visualisation for a given dataset
Help create a dashboard
Create code for model deployment, model re-training or model consumption
Ask for custom functions and algorithm/function adjustments
Now that you have found the places where it helped you, I would like to understand how it helped you. Feel free to make a general comparison and add some explanations. And lastly, of course, add whether this has in any way affected your work as a data scientist (in terms of embracing it in a positive way, or in terms of a negative experience).
Responsible usage
We have seen many controversies around Chat GPT emerge. Some European Union countries have banned it, and some will soon do so too. The question is not only about its use (as the end of humanity and empathy) but also about the misuse of personal data, privacy issues and the leaking of relevant corporate information.
Have you considered responsible usage of Chat GPT? Here again is a short list to help you:
The use of personal data retrieved from the model
Inserting sensitive (personal or company) data
Explaining a section of R, Python or Scala code that is the property of your enterprise
Instead of this, have you tried using it more responsibly:
Using pseudo code for explanation of the algorithm
Using mock data rather than real data
Giving pseudo-code in order to receive the documentation
Skipping sensitive data (SQL schema, model information and other sensitive data)
So which cases have you come across? Did they have any consequences for you? What other responsible uses of Chat GPT have you made?
My takeaways
ChatGPT offers interesting answers (based on my experience and searches), and it is the next step up from a Google search or Stack Overflow. In other words, it gives you a more focused answer. When exploring and searching forums, you might find several different solutions for a single problem, whereas here you have to ask for another solution. Respectively, it can give you an answer faster than browsing the web. Both approaches have their advantages and disadvantages, but neither will assure you that the answer is correct!
I embrace this technology as an additional learning source. But I personally do not use it as my daily driver, despite trying it out a couple of times (with mixed results; working and nonworking/useless/meaningless). It can be super helpful for entry/junior positions, but the more experienced you are, the more abstract your data science work and the more complicated the topics you cover, the less frequently you will presumably use it.
Writing markdown documents outside RStudio (using the usual set of packages) has its benefits and struggles. A huge struggle is transforming dataframe results into a markdown table using hyphens and pipes. Ugghhh…
This useless function takes an R dataframe as input and prints out the dataframe wrapped in a markdown table.
This can be used directly in any editor. (Sidenote: if you decide to use LaTeX, the same kind of function can be created, just with different ASCII chars: ampersand and backslash.)
The super lazy function 🙂
df_2_MD <- function(your_df){
cn <- as.character(names(your_df))
headr <- paste0(c("", cn), sep = "|", collapse='')
sepr <- paste0(c('|', rep(paste0(c(rep('-',3), "|"), collapse=''),length(cn))), collapse ='')
st <- "|"
for (i in 1:nrow(your_df)){
for(j in 1:ncol(your_df)){
if (j%%ncol(your_df) == 0) {
st <- paste0(st, as.character(your_df[i,j]), "|", "\n", "" , "|", collapse = '')
} else {
st <- paste0(st, as.character(your_df[i,j]), "|", collapse = '')
}
}
}
fin <- paste0(c(headr, sepr, substr(st,1,nchar(st)-1)), collapse="\n")
cat(fin)
}
# run function
short_iris <- iris[1:3,1:5]
df_2_MD(short_iris)
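Following the sidenote above, a quick hypothetical LaTeX counterpart (df_2_TeX is a made-up name) could use ampersands and double backslashes instead of pipes; a minimal sketch:
df_2_TeX <- function(your_df){
  cn <- names(your_df)
  align <- paste(rep("l", length(cn)), collapse = "")
  headr <- paste0(paste(cn, collapse = " & "), " \\\\ \\hline")
  rows <- apply(your_df, 1, function(r) paste0(paste(r, collapse = " & "), " \\\\"))
  cat("\\begin{tabular}{", align, "}\n", headr, "\n",
      paste(rows, collapse = "\n"), "\n\\end{tabular}\n", sep = "")
}
df_2_TeX(short_iris)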
As always, code is available on Github in Useless_R_function repository. The sample file in this repository is here (filename: Dataframe_to_markdown.R) Check the repository for future updates.
Keeping your R code organised is not as straightforward as one might think. Just think about the libraries, variables, functions, and many more. All these objects can be defined and later rewritten, and some might become obsolete in the process.
This proves even more crucial when you are part of a larger group of engineers and scientists who collaborate with you.
Motivation
The most important step toward code reproducibility is to keep code organised and atomic, storing it by layers or components, and to keep documentation up to date. Because R is a scripting language, organising files into directories and subdirectories is important for later re-use and for collaboration with different departments in an organisation.
Ideally, the names of files and subdirectories are self-explanatory, so that one can tell at a glance what data files contain, what scripts do, and what came from what.
The following R tips are based on frequent problems many organisations are facing. All R samples and themes are fictional and can be adapted to your organisational environment. The dataset used is the iris dataset, used to show the custom theme and the use of functions. All images, unless otherwise noted, are by the author.
1. Organising R files
You can always call R files, libraries, functions and settings from a different file. This lends itself to creating a folder structure where each developer can clone or access all the necessary files, themes, and functions that the organisation is pushing.
Organising R folders and R scripts
Structuring R files, functions, data and many more is an essential step toward reproducibility.
Using Projects is a great place to start (also available in Posit — RStudio), but you can always create your own structure that will help you with code and file organisation, for example something like the sketch below.
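For illustration only, here is one hypothetical project skeleton created straight from R; the folder names are just an example, not a prescribed standard:
# Illustrative project skeleton (folder names are hypothetical)
dirs <- c("R/functions", "R/themes", "data/raw", "data/processed",
          "docs", "output/figures", "output/reports")
invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))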
2. Installing and attaching R libraries
Installing and attaching R libraries is in almost all cases part of the R code. Whenever you are writing R code, there will be a point where you reference an external library.
You don’t want to install and attach single or multiple libraries one by one in every script, as in the pseudo-code sample below.
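Something along these lines, a pattern that quickly gets repeated in every script (illustrative pseudo-code):
# Repetitive install/attach calls scattered through the script - avoid this
install.packages("dplyr")
library(dplyr)
install.packages("ggplot2")
library(ggplot2)
# ... and so on, for every package, in every script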
In many enterprise environments, you might have issues with installing some packages on your local disk. These packages may contain *.zip or *.exe files and the security policy will deny them. In this case, the best solution is to install packages into dedicated folder(s), as introduced in “Organising R files”. You will have to add the installation path and the loading path for the packages.
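As a sketch (the folder path is hypothetical), both install.packages() and library() accept a dedicated location:
pkg_lib <- "C:/R/organisation_libs"          # hypothetical shared package folder
install.packages("dplyr", lib = pkg_lib)     # install into the dedicated folder
library(dplyr, lib.loc = pkg_lib)            # attach from the dedicated folder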
The next step is to use a TXT file and write down all the packages needed for an R project/script. Let’s create a requirements.txt file (just like with YAML, Python, …) and put the package names inside it.
Consider two R packages to help you achieve installation from the requirements.txt file: requiRements and versions. Both are similar if your package list is stored in requirements.txt, but the versions package will also take the package version as input, which brings a whole new capability. On the other hand, the base function install.packages() gives you the possibility to specify arguments such as repos, lib (path), destdir and many system variables in detail. These will be essential for storing packages at the desired location.
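A minimal base-R sketch of installing from such a requirements.txt (one package name per line) could look like this; the requiRements and versions packages wrap similar logic with extra options:
pkgs <- readLines("requirements.txt")                       # one package name per line
missing <- setdiff(pkgs, rownames(installed.packages()))    # only install what is not there yet
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))     # attach everything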
If you want to simplify the process, you can always create a ZIP file of all working packages and restore and install it at any given time. In this case, it is advised to add the R version in the ZIP file as well.
Protip: use library() instead of require(). The first one will fail with an error straight away, whereas require() will only throw a warning and continue, causing failures later in the code.
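A two-line illustration (the package name is made up):
library(somethingNotInstalled)   # stops the script immediately with an error
require(somethingNotInstalled)   # only warns, returns FALSE and the script keeps running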
3. Use the corporate themes
Every corporate environment should follow a theme, with predefined colours, table design, pixel-perfect diagrams and positions. With the ggplot2 package, you can create a theme that will follow your design guidelines.
Furthermore, adding a condition to your theme so that the same colours always reflect the same KPI can also be achieved with themes.
library(ggplot2)
iris <- iris
ggplot(data = iris, aes(Sepal.Length)) + geom_bar(color="grey", fill="red") +
labs(x = "Length of Sepal",
y = "Count of flowers",
title = "Number of flowers \nby sepal length",
caption = "Source: IRIS Dataset \nBase R Package")
theme_organisation <- function(){
font <- "Times New Roman"
theme(
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.border = element_rect(colour = "black", fill = NA, linetype = 3, size = 0.5),
panel.background = element_rect(fill = "#05a6f0"),
legend.position = "bottom",
plot.title = element_text( family = font, size = 20, face = 'bold', hjust = 0,vjust = 2),
axis.text = element_text(family = font,size = 9),
axis.text.x = element_text(margin=margin(5, b = 10))
)
}
ggplot(data = iris, aes(Sepal.Length)) + geom_bar(color="grey", fill="red") +
labs(x = "Length of Sepal",
y = "Count of flowers",
title = "Number of flowers \nby sepal length",
caption = "Source: IRIS Dataset \nBase R Package") +
theme_organisation()
library(magick)
logo <- image_read("../Useless_R_functions/image/myiriscompany.png")
#adding the logo
grid::grid.raster(logo, x = 0.1, y = 0.02, just = c('left', 'bottom'), width = unit(1.9, 'inches'))
Besides graphs, tables can also follow a similar theme. The R package flextable offers a great framework for creating tables with astonishing formats, layouts, cell formats and plotting capabilities.
With both packages, you will be able to create corporate reports, with the capability to export them to different tools or formats (Word, PDF, PowerPoint, HTML, and others).
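A minimal flextable sketch (the colours and export target are just examples) might look like:
library(flextable)
ft <- flextable(head(iris))
ft <- bg(ft, part = "header", bg = "#05a6f0")    # header colour matching the theme
ft <- bold(ft, part = "header")
ft <- autofit(ft)
# save_as_docx(ft, path = "iris_table.docx")     # export to Word, for example
ft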
4. Use coding practices and never forget to document
There are many practices that will improve your code readability and reusability. I have grouped them into scopes, each of which delivers better code.
Documenting code
Starting your code with an annotated description of what the code does will help you when you have to look at it or change it in the future. Give the author name, date and change log.
Load all of the dependencies, packages and files in accordance with your file structure. Also add the global environment settings and the R engine version. In addition, a nice way to do this is to indicate which packages are necessary to run your code.
Use setwd() to determine the file (script, project or package) location, unless there are standards in your organisation that make this obvious.
Use comments to mark off sections of code.
Comment your code with care. Comments should explain the why, not the what. Add comments to your functions with a description of all input arguments and the result set. A short example of such a header is sketched below.
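A header along these lines (names and dates are made up) covers most of the points above:
# -------------------------------------------------------------------
# Script:       iris_report.R                 (hypothetical example)
# Author:       Jane Doe
# Created:      2023-04-10
# Change log:   2023-04-12 - added corporate theme
# Description:  Prepares the monthly iris summary report
# Depends on:   R 4.2.x; packages: ggplot2, flextable
# -------------------------------------------------------------------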
Syntax practices
Place spaces around all operators (=, +, -, <-, boolean, etc.).
Use <-, not =, for assignment.
When using packages with similar function names, add the package name to the function call: dplyr::filter() versus the Filter() function from base R.
To improve readability, indent the code inside the curly braces. You can also use the formatR package to help you refactor and indent your code.
Factor out common operations rather than repeating them. And keep your code in smaller chunks. If a single function or loop gets too long, consider looking for ways to break it into smaller pieces.
There is an 80-character line limit that will help you comfortably fit code on a printed page at a reasonable size. If you find yourself running out of room, consider encapsulating some of the work in a separate function. A short sketch of these practices follows after this list.
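A tiny, illustrative sketch combining these points (spaces around operators, <- for assignment, explicit package prefixes, indentation); the function name is made up:
avg_sepal <- function(df, species) {
  df_filtered <- dplyr::filter(df, Species == species)   # explicit package prefix
  mean(df_filtered$Sepal.Length)
}
avg_sepal(iris, "setosa")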
Naming convention
There are many naming conventions to choose from, and all are fine as long as you use the selected one consistently. I will just list a few:
alllowercase: e.g. irisdataset
period.separated: e.g. iris.dataset
underscore_separated: e.g. iris_dataset
lowerCamelCase: e.g. addIrisDataset
UpperCamelCase: e.g. AddIrisDataset
Keep names concise and meaningful; use verbs for functions and nouns for variables. Give a function a verb, e.g. add, calculate, reduce, and give a variable a noun, e.g. calculatedNumbers, vectorOfValues.
In general, you can mark helper functions with a prefix of “.”. Also distinguish between local and global variables, data objects and functions.
Also, store your files with meaningful names and always store them as *.R files.
Posit — RStudio tips
Choose your IDE. Consider using Posit — RStudio. My second favourite for writing R code is Visual Studio Code.
There is no need to save the current workspace if you are writing reproducible code. You should be able to reproduce the workspace by re-running your script.
Keep track of data, variable, and function versions, and also use the integrated facilities to access SVN or GitHub.
R projects are a great way to organise your script files and your outputs; consider using Markdown to prepare finalised reports of your analysis.
Check the memory used, use the garbage collector (gc()), and it always helps to keep the session information in your project.
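For instance (the output file name is just an example):
gc()                                          # trigger garbage collection and report memory use
writeLines(capture.output(sessionInfo()),
           "sessionInfo.txt")                 # keep the session information with the project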
The complete code is available on Github in this repository.
When you are working with large datasets, performance comes to everyone’s mind, especially when converting datasets from one data type to another. Choosing the right method can make a huge difference.
So in this case, I will be creating a dummy list, and I will convert the values in the list into data.frame.
A simple function to create a large list (approx. 46MB, with 250.000 elements where each element consists of 10 measurements):
cre_l <- function(len,start,end){ return(round(runif(len,start,end),8)) }
myl2 <- list()
# 250.000 elements is approx 46Mb in Size
# 2.500 elements for demo
for (i in 1:2500){ myl2[[i]] <- (cre_l(10,0,50)) }
The list will be transformed into a data.frame in such a way that the data.frame will have 10 (ten) variables, with the number of observations corresponding to the length of the list. To give you the perspective, this code does exactly that:
for (i in 1:250){ myl2[[i]] <- (cre_l(10,0,50)) }
df <- data.frame(do.call(rbind, myl2))
And you end up from list to data.frame:
Fig 1: from List to Data.frame
There are many ways to convert a list to a data.frame. But it becomes important when your list is a larger object. I have written 8 ways to do the conversion (and I know there are at least 20 more).
By far the fastest methods were the do.call and sapply approaches, both outperforming all other methods.
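Conceptually (the exact benchmarked snippets are in the repository), the two winning approaches look like this:
# do.call + rbind
df_docall <- data.frame(do.call(rbind, myl2))
# sapply + transpose
df_sapply <- as.data.frame(t(sapply(myl2, unlist)))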
Both methods were consistent with larger list conversions.
And the worst were the for loop solution, reduce and as.data.frame. There are no surprises here, except to note that the for loop performed so poorly due to constant row binding to an existing data.frame.
Fig 2 : Comparison results with different converting methods on list with 2500 elements
I have also removed the slowest-performing conversions, created a 250.000-element list, and compared only the fastest methods over 10 consecutive runs (using the microbenchmark library), reporting mean values.
Fig 3 : Comparing the fastest methods on a list with 250000 elements
So, using for loops is super slow, and do.call with rbind or sapply will for sure deliver the best performance.
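If you want to repeat such a comparison yourself, a minimal microbenchmark sketch (reusing the list myl2 created above) could be:
library(microbenchmark)
microbenchmark(
  do_call = data.frame(do.call(rbind, myl2)),
  sapply  = as.data.frame(t(sapply(myl2, unlist))),
  times   = 10
)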
As always, the code is available at Github in repository Useless_R_function and the file is here.
The Mandelbrot set is the set of complex numbers c for which the function f(z) = z^2 + c does not diverge to infinity when iterated from z = 0, and therefore remains bounded in absolute value.
For a little stretching, we can create a Mandelbrot set and draw it with the image function.
MandelBrotImage <- function(){
cols <- colorRampPalette(c("white","black","white","grey","black"))(11)
n <- 400
x <- seq(-2, 1, length.out=250)
y <- seq(-1.5, 1.5, length.out=250)
c <- outer(x,y*1i,"+")
z <- matrix(0.0, nrow=length(x), ncol=length(y))
k <- matrix(0.0, nrow=length(x), ncol=length(y))
for (rep in 1:n) {
for (i in 1:250) {
for (j in 1:250) {
if(Mod(z[i,j]) < 2 && k[i,j] < n) {
z[i,j] <- z[i,j]^2 + c[i,j]
k[i,j] <- k[i,j] + 1
}
}
}
}
image(x,y,k, col=cols, axes = FALSE, xlab = "" , ylab = "" )
}
# run function
MandelBrotImage()
As always, code is available on Github in Useless_R_function repository. The sample file in this repository is here (filename: MandelbrotSet.R) Check the repository for future updates.
The Azure Machine Learning SDK for R was deprecated a year ago (end of 2021). But R can still be used for training and deployment by using Azure Machine Learning CLI 2.0!
Furthermore, R language can be used in Machine Learning Designer, for data preparation, data wrangling and statistical analysis.
Fig 1.: Using R in Azure Machine Learning Designer
Another way to use R is to use Posit (RStudio) on your compute instance. When you create a new compute instance, you will get to the applications available:
Fig 2.: Applications available for compute instance
You can also install RStudio (Posit), but you will need an R Workbench license in order to install it.
On the Posit website, you can purchase the product:
Fig 3.: Posit + Azure
And from there on, you need to add an additional application when configuring a compute instance.
Fig 4.: Installing R Workbench by Posit
You enter the license key and Posit (RStudio) will be available as an application and development environment at your disposal.
So, there are ways to use R in Azure ML, and you can always choose your preferred tool.