Robust IRIS Dataset?

This blog post was born out of pure curiosity about the robustness of the IRIS Dataset. Biological datasets do not need to be that big in comparison to datasets of customers, consumption, stock and anything that might be volatile.

When still at the university, on one occasion I can remember, we were measuring the length of the frog legs and other frogy features. And after just a couple of measures, the further prediction was steady. Also, any kind of sampling was (RS and SRS, cluster/stratified sampling, sampling with replacements and many other creative ways of sampling) proven to be rigid, robust and would converge quickly to a good result.

Therefore, I have decided to put the IRIS dataset to the test, using a simple classification method. Calculating first the simple euclidian distance, following by finding the neighbour and based on that checking the membership of the type of flowers with the labels.

Accuracy of the prediction was tested by mapping the original species with predicted ones. And the test was, how large can a train dataset be in order to still get a good result.

After some Python and R code, the results were in.

I have tested following pairs (train:test sample size):

  • 80% – 20%
  • 60% – 40%
  • 50% – 50%
  • 30% – 70%
  • 10% – 90%

Note, that the IRIS dataset has 150 observations, each evenly distributed among three species. Following Python code loop through the calculation of euclidean distance.

for x in range(3000):
    exec(open("./classification.py").read(), globals())
    x += 1

At the end I have generated the file:

predictions

With these results, simple R code to generate the scatter plot was used:

library(ggplot2)
setwd("C:\\Predictions\\")
df_pred <- data.frame(read.table("results_split.txt", sep=";"))
p <- ggplot(df_pred, aes(df_pred$V3, df_pred$V1)) 
p <- p + geom_point(aes(df_pred$V3))
p <- p + labs(x="Accuracy (%) of predictions", y="Size of training subset")
p

Which resulted as:

Plot

The graph clearly shows that 10% of training set (10% out of 150 observations) can generate very accurate predictions every 1,35x times.

Other pairs, when taking 30% or 50% of training set, will for sure give close to 100% accuracy almost every time.

Snippet of Python code to generate euclidean distance:

def eucl_dist(set1, set2, length):
    distance = 0
    for x in range(length):
        distance += pow(set1[x] - set2[x], 2)
    return math.sqrt(distance)

and neighbours:

def find_neighbors(train, test, nof_class):
    distances = []
    length_dist = len(test) - 1
    for x in range(len(train)):
        dist = eucl_dist(test, train[x], length_dist)
        distances.append((train[x],dist))
    distances.sort(key=operator.itemgetter(1))
    neighbour = []
    for x in range(nof_class):
        neighbour.append(distances[x][0])
    return neighbour

 

Conclusion, IRIS dataset is – due to the nature of the measurments and observations – robust and rigid; one can get very good accuracy results on a small training set. Everything beyond 30% for training the model, is for this particular case, just additional overload.

Happy R & Python coding! 🙂

Advertisements
Tagged with: , , , ,
Posted in Uncategorized
3 comments on “Robust IRIS Dataset?
  1. […] article was first published on R – TomazTsql, and kindly contributed to […]

    Like

  2. […] article was first published on R – TomazTsql, and kindly contributed to […]

    Like

  3. […] Tomaz Kastrun shows how robust the iris data set is: […]

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Categories
Follow TomazTsql on WordPress.com
Programs I Use
Programs I Use
Programs I Use
Rdeči Noski – Charity

Rdeči noski

100% of donations made here go to charity, no deductions, no fees. For CLOWNDOCTORS - encouraging more joy and happiness to children staying in hospitals (http://www.rednoses.eu/red-noses-organisations/slovenia/)

€2.00

Top SQL Server Bloggers 2018
Discover

A daily selection of the best content published on WordPress, collected for you by humans who love to read.

Revolutions

Tomaz doing BI and DEV with SQL Server and R

tenbulls.co.uk

attaining enlightenment with sql server, .net, biztalk, windows and linux

SQL DBA with A Beard

He's a SQL DBA and he has a beard

Reeves Smith's SQL & BI Blog

A blog about SQL Server and the Microsoft Business Intelligence stack with some random Non-Microsoft tools thrown in for good measure.

SQL Server

for Application Developers

Clocksmith Games

We make games we love to play

Business Analytics 3.0

Data Driven Business Models

SQL Database Engine Blog

Tomaz doing BI and DEV with SQL Server and R

Search Msdn

Tomaz doing BI and DEV with SQL Server and R

R-bloggers

Tomaz doing BI and DEV with SQL Server and R

Ms SQL Girl

Julie Koesmarno's Journey In Data, BI and SQL World

R-bloggers

R news and tutorials contributed by (750) R bloggers

Data Until I Die!

Data for Life :)

Paul Turley's SQL Server BI Blog

sharing my experiences with the Microsoft data platform, SQL Server BI, Data Modeling, SSAS Design, Power Pivot, Power BI, SSRS Advanced Design, Power BI, Dashboards & Visualization since 2009

Grant Fritchey

Intimidating Databases and Code

Madhivanan's SQL blog

A modern business theme

Alessandro Alpi's Blog

SQL Server, Azure and .net in a nutshell :D

Paul te Braak

Business Intelligence Blog

Sql Server Insane Asylum (A Blog by Pat Wright)

Information about SQL Server from the Asylum.

Gareth's Blog

A blog about Life, SQL & Everything ...

SQLPam's Blog

Life changes fast and this is where I occasionally take time to ponder what I have learned and experienced. A lot of focus will be on SQL and the SQL community – but life varies.

William Durkin

William Durkin a blog on SQL Server, Replication, Performance Tuning and whatever else.

%d bloggers like this: