# Effect of normalization of data

Clustering (distance-based in particular) can depend on how the data is normalized. Because distance measures are sensitive to the scale of each variable, the same data can produce different cluster assignments, or even different clustering models, once the units change.

A simple everyday example, measuring units, can produce two different results. For this purpose we will create two R data frames.

```
# five people, with height and weight recorded in two unit systems
person <- c("1","2","3","4","5")
age <- c(30,50,30,50,30)
height_cm <- c(180,186,166,165,191)
height_feet <- c(5.91,6.1,5.45,5.42,6.26)
weight_kg <- c(70,90,60,74,104)
weight_stone <- c(11,14.1,9.4,11.6,16.3)

sample1 <- data.frame(person, age, height_cm, weight_kg)
sample2 <- data.frame(person, age, height_feet, weight_stone)
```

I have seen people argue that even a simple scatter plot will produce a different result when the units change. So let us try this:

```
# textxy() comes from the calibrate package
library(calibrate)

# create two simple scatter plots
# sample1
plot(sample1$age, sample1$height_cm, main="Sample1", xlab="Age",
     ylab="Height (cm)", pch=19)
textxy(sample1$age, sample1$height_cm, sample1$person, cex=0.9)

# sample2
plot(sample2$age, sample2$height_feet, main="Sample2", xlab="Age",
     ylab="Height (feet)", pch=19)
textxy(sample2$age, sample2$height_feet, sample2$person, cex=0.9)
```

The result: one can see that there is little or no difference whether height is in centimeters or in feet. So the claim made by the authors of the book Finding Groups in Data: An Introduction to Cluster Analysis, that “changing the measurement units may even lead one to see a very different clustering”, simply does not hold water.

In the excerpt from their book, both figures are scatter plots manipulated to support their theory. If you keep age on the same scale in Figure 3 and Figure 4 and also keep the proportions of height on both graphs, you should not get such a “manipulated” graph.

Adding a line to both graphs and swapping the X-axis with the Y-axis, one still cannot produce such a huge difference!
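One way to see why the two plots look alike is that default axis limits are fitted to the range of the data, so a plot roughly shows each variable rescaled to its own range. The sketch below checks this for height; the helper name `rescale01` and the exact conversion factor 30.48 cm per foot are my additions (the post itself uses rounded feet values).

```r
# Heights of the five people in centimetres (from the post)
height_cm <- c(180, 186, 166, 165, 191)
# Exact conversion to feet (1 ft = 30.48 cm); an assumption for illustration
height_feet <- height_cm / 30.48

# Min-max rescaling to [0, 1], mimicking an axis fitted to the data range
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

pos_cm   <- rescale01(height_cm)
pos_feet <- rescale01(height_feet)

# Relative positions along the axis agree up to floating-point error
all.equal(pos_cm, pos_feet)
```

With the rounded values used in the post the positions differ only slightly, which is why the two scatter plots are hard to tell apart.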

```
# textxy() comes from calibrate, scatterplot() from car
library(calibrate)
library(car)

# what if we switch X with Y, since the conversion of height gives a different ratio?

# sample1
plot(sample1$height_cm, sample1$age, main="Sample1", xlab="Height (cm)",
     ylab="Age", pch=19)
textxy(sample1$height_cm, sample1$age, sample1$person, cex=0.9)

# sample2
plot(sample2$height_feet, sample2$age, main="Sample2", xlab="Height (feet)",
     ylab="Age", pch=19)
textxy(sample2$height_feet, sample2$age, sample2$person, cex=0.9)

# same plots with a fitted line added
scatterplot(age ~ height_cm, data=sample1, xlab="Height (cm)", ylab="age",
            main="Sample 1", labels = person)
scatterplot(age ~ height_feet, data=sample2, xlab="Height (feet)", ylab="age",
            main="Sample 2", labels = person)
```

Comparing both graphs again reveals the same result. So far so good. But what if we are doing distance-based clustering? Well, the story changes drastically. For the sake of the example, I will presume there are two clusters and run k-means, where each of the n observations belongs to the cluster with the nearest mean. So for a given set of observations, k-means partitions them into k sets so as to minimize the distance between each point in a cluster and that cluster's center. The objective being minimized is the within-cluster sum of squares (WCSS); since the total sum of squares is fixed for a given data set, this is equivalent to maximizing the sum of squares between the cluster centers.

In step 1 we will use the columns Age and Height and observe the results. In step 2 we will also add Weight and observe the results.
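To make the objective concrete, here is a minimal sketch that computes the within-cluster and between-cluster sums of squares by hand for one fixed partition of the centimetre data. The partition `c(1, 1, 2, 2, 1)` is one that `kmeans` can return for this data; the helper `ss` is a name I made up for illustration.

```r
age       <- c(30, 50, 30, 50, 30)
height_cm <- c(180, 186, 166, 165, 191)
x <- cbind(age, height_cm)

# One possible partition into two clusters (kmeans may return another one)
cluster <- c(1, 1, 2, 2, 1)

# Sum of squared deviations of the rows of m from their column means
ss <- function(m) sum(scale(m, center = TRUE, scale = FALSE)^2)

# WCSS per cluster: apply ss() to the rows belonging to each cluster
wcss <- sapply(split(seq_len(nrow(x)), cluster),
               function(idx) ss(x[idx, , drop = FALSE]))

total_ss   <- ss(x)                 # fixed for a given data set
between_ss <- total_ss - sum(wcss)  # what is left once WCSS is removed

wcss
between_ss / total_ss
```

Because `total_ss` does not depend on the partition, a smaller `sum(wcss)` always means a larger `between_ss`, which is the ratio `kmeans` prints as between_SS / total_SS.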

```
# kmeans() starts from random centers; fix the seed so runs are repeatable
set.seed(1)

##########
# step 1
##########
sample1 <- data.frame(age, height_cm)
sample2 <- data.frame(age, height_feet)

# sample1
fit1 <- kmeans(sample1, 2)
# sample2
fit2 <- kmeans(sample2, 2)

# compare
fit1
fit2

##########
# step 2
##########
sample1 <- data.frame(age, height_cm, weight_kg)
sample2 <- data.frame(age, height_feet, weight_stone)

# sample1
fit1 <- kmeans(sample1, 2)
# sample2
fit2 <- kmeans(sample2, 2)

# compare
fit1
fit2
```

In step 1, the within-cluster sums of squares for Sample1 are 200 and 327, whereas for Sample2 they are 0.33 and 0.23. Calculating Between_SumSqr / Total_SumSqr gives 48.7% for Sample1 and 99% for Sample2, a difference that calls for prior normalization.

In step 2 the same difference is present, though relatively smaller: Between_SumSqr / Total_SumSqr is 63.1% for Sample1 and 92.4% for Sample2.

In both cases (step 1 and step 2), some observations end up belonging to a different cluster depending on the unit used.
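When comparing two fits, keep in mind that cluster numbers are arbitrary: `1 1 2 2 1` and `2 2 1 1 2` describe the same grouping. A small helper along these lines (the name `same_partition` is hypothetical, not part of any package) checks whether two cluster vectors really differ:

```r
# TRUE when two label vectors describe the same grouping, ignoring label names
same_partition <- function(a, b) {
  stopifnot(length(a) == length(b))
  tab <- table(a, b)
  # In the cross-table, every row and every column must have exactly
  # one non-empty cell for the groupings to match one-to-one
  all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
}

same_partition(c(1, 1, 2, 2, 1), c(2, 2, 1, 1, 2))  # TRUE: labels swapped only
same_partition(c(1, 1, 2, 2, 1), c(2, 1, 2, 1, 2))  # FALSE: different groups
```

It could be applied to the fits above as `same_partition(fit1$cluster, fit2$cluster)`.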

Now we will introduce normalization of both weight and height, since both can be measured in different units.

```
# data.Normalization() comes from the clusterSim package;
# type "n1" is standardization, (x - mean) / sd
library(clusterSim)

#~~~~~~~~~~~~~~~~~~~
# normalizing data
# height & weight
#~~~~~~~~~~~~~~~~~~~
height_cm_z   <- data.Normalization(sample1$height_cm,type="n1",normalization="column")
height_feet_z <- data.Normalization(sample2$height_feet,type="n1",normalization="column")

weight_kg_z    <- data.Normalization(sample1$weight_kg,type="n1",normalization="column")
weight_stone_z <- data.Normalization(sample2$weight_stone,type="n1",normalization="column")

# fix the seed so the random starting centers are repeatable
set.seed(1)

##########
# step 1
##########
sample1 <- data.frame(age, height_cm_z)
sample2 <- data.frame(age, height_feet_z)

# sample1
fit1 <- kmeans(sample1, 2)
# sample2
fit2 <- kmeans(sample2, 2)

# compare
fit1
fit2

##########
# step 2
##########
sample1 <- data.frame(age, height_cm_z, weight_kg_z)
sample2 <- data.frame(age, height_feet_z, weight_stone_z)

# sample1
fit1 <- kmeans(sample1, 2)
# sample2
fit2 <- kmeans(sample2, 2)

# compare
fit1
fit2
```
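The reason normalization removes the unit problem can be checked directly in base R. The sketch below assumes that standardization means (x - mean) / sd, which is what `scale()` does by default, and uses an exact cm-to-feet conversion rather than the rounded values from the data frames; the helper name `z` is mine.

```r
height_cm   <- c(180, 186, 166, 165, 191)
height_feet <- height_cm / 30.48   # exact conversion, an assumption for illustration

# Standardization: subtract the mean, divide by the standard deviation
z <- function(x) (x - mean(x)) / sd(x)

z_cm   <- z(height_cm)
z_feet <- z(height_feet)

# The conversion factor cancels out, so both unit systems give the same z-scores
all.equal(z_cm, z_feet)

# Same values as base R's scale()
all.equal(z_cm, as.vector(scale(height_cm)))
```

Any change of unit that is a multiplication by a constant cancels in the same way, so the distances computed on z-scores no longer depend on the unit.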

With normalized data, both samples now return essentially the same result: the cluster vectors of the two samples are the same up to the numbering of the clusters. Some minor differences can occur, for example because the feet and stone values are rounded. In the end, the fit function returns the same cluster vector for both.

Let us go back to the original question of whether the measuring unit can give different results. In terms of plotting the data, we have shown that it does not. In terms of cluster analysis we have done only half of the proof. As a reverse check, we will test one sample, for example Sample1, on original and on normalized data. Here we will see the importance of design size, the test of variance and variance inequality.

```
#~~~~~~~~~~~~~~~~~~~
#
# comparing normalized
# and non-normalized
# data sample
#~~~~~~~~~~~~~~~~~~~

# sample1 with cm
sample1 <- data.frame(age, height_cm)
sample1_z <- data.frame(age, height_cm_z)

# sample2 with feet
sample2 <- data.frame(age, height_feet)
sample2_z <- data.frame(age, height_feet_z)

# design size
var(sample1$height_cm)
var(sample1_z$height_cm_z)

var(sample2$height_feet)
var(sample2_z$height_feet_z)

# test variance of sample1 height and sample2 height (original units)
var.test(sample1$height_cm, sample2$height_feet)

# test variance of sample1 height and sample2 height (normalized)
var.test(sample1_z$height_cm_z, sample2_z$height_feet_z)
```

The test of equality of variances shows the difference between non-normalized and normalized data. On the original data the variances of height in cm and in feet are far apart; in the second test, on normalized data, the ratio of the variances is 1, meaning that the variances are equal (also p-value = 1). This reveals that in distance-based clustering a variable expressed in smaller values (here feet, compared with cm) weighs differently in the distance than the same variable expressed in bigger values, so the unit alone changes the outcome. Therefore normalization is recommended. To finalize the test, let’s rerun the fit function for clustering with 2 clusters:
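The size of the variance ratio is not an accident: multiplying a variable by a constant multiplies its variance by the square of that constant. A short sketch, again assuming an exact conversion of 30.48 cm per foot instead of the rounded values (the variable names are mine):

```r
height_cm   <- c(180, 186, 166, 165, 191)
cm_per_foot <- 30.48                      # exact conversion factor
height_feet <- height_cm / cm_per_foot

# Raw variances differ by the squared conversion factor
ratio_raw <- var(height_cm) / var(height_feet)
ratio_raw
cm_per_foot^2

# After standardization both variances are 1, so the ratio is 1
ratio_z <- var(as.vector(scale(height_cm))) / var(as.vector(scale(height_feet)))
ratio_z
```

Since squared Euclidean distance adds up squared differences per column, a column whose variance is several hundred times larger also contributes several hundred times more to the distance, unless it is standardized first.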

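```
# comparison of fit function for clustering with 2 clusters (on Sample1)
# fix the seed so the random starting centers are repeatable
set.seed(1)

# sample1: original vs. normalized height
fit1 <- kmeans(sample1, 2)
fit1_z <- kmeans(sample1_z, 2)

# compare
fit1
fit1_z
```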

The result is:

Non-normalized: clustering vector 1 1 2 2 1; within cluster sum of squares by cluster 327.3333 and 200.5000 (between_SS / total_SS = 48.7 %).

Normalized: clustering vector 2 1 2 1 2; within cluster sum of squares by cluster 1.605972 and 2.286963 (between_SS / total_SS = 99.2 %).

To summarize: the result of the normalized clustering looks much better, since the within-cluster sum of squares is much lower relative to the total, so a far larger share of the total variance (between_SS / total_SS) is explained by the clusters. We can assume that cluster membership is much better when height is normalized, in the sense of minimizing the differences within the groups and maximizing them between the groups.

But to make things more complicated, you should also normalize age, and you will get completely different cluster centers 🙂
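A minimal sketch of that last step, using base R's `scale()` on every column instead of `data.Normalization`. The seed value and `nstart = 25` are arbitrary choices of mine to make the run repeatable and less dependent on the starting centers; no particular clustering result is claimed here.

```r
age       <- c(30, 50, 30, 50, 30)
height_cm <- c(180, 186, 166, 165, 191)
weight_kg <- c(70, 90, 60, 74, 104)

# Standardize every column, age included, so no variable dominates the distance
sample_z <- scale(data.frame(age, height_cm, weight_kg))

set.seed(42)   # arbitrary seed, only for reproducibility
fit_all_z <- kmeans(sample_z, centers = 2, nstart = 25)

fit_all_z$cluster                       # cluster membership per person
fit_all_z$centers                       # centers in standardized units
fit_all_z$betweenss / fit_all_z$totss   # share of variance between clusters
```

Note that the between_SS / total_SS ratio of this fit is not directly comparable with the earlier ones, because the total sum of squares changes once age is standardized too.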

Complete code is available here: Effect of normalization of data.
