Advent of 2022, Day 20 – Handling kernels, python packages, YAML files in notebooks and keeping structure and good practices

This post is part of the Advent of 2022 series of Azure Machine Learning posts.

When creating notebooks, it is always good practice to include the dependencies, whether that is a particular version of a package, a separate script file, or an installation requirement.

Selecting an environment or kernel can be an issue if it is not correctly initiated alongside the code. You can check the available kernels with a simple command from within the notebook:

!jupyter kernelspec list

# or activate the preferred kernel
!conda activate python38-azureml
#!jupyter kernelspec install python38-azureml
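
If the kernel is missing from the list, you can also register a conda environment as a notebook kernel. A minimal sketch, assuming ipykernel is installed in that environment (the display name below is illustrative):

# register the conda environment as a Jupyter kernel (sketch)
!python -m ipykernel install --user --name python38-azureml --display-name "Python 3.8 - AzureML"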

Handling packages is another mandatory step, especially if you are adding packages on top of the base environment. Check the installed packages or install a single one with:

!pip list

# install a single package
!pip install pandas

The best approach is to use a separate text file (requirements.txt) that lists the packages (pinned to versions or not) to install:

!pip install -r Day20-requirements.txt

Creating the requirements file itself can also be done from within the notebook, which is good practice and keeps the installation reproducible.
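
A minimal sketch of what such a requirements file might contain (the pinned versions here are illustrative, not taken from the original environment):

%%writefile Day20-requirements.txt
pandas==1.5.2
scikit-learn==1.1.3
mlflow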

Fig 1: Installing dependencies with TXT file

Python also offers a simple way to structure your code, using folders and naming conventions.

import os

dependencies_dir = "./Day20-dependencies"
os.makedirs(dependencies_dir, exist_ok=True)

You can store YAML or TXT files here and keep your code traceable, reusable, and structured.

Fig 2: Definition of a YAML file
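
A minimal sketch of such a YAML file, assuming a conda environment definition stored in the dependencies folder (the file name and package list are illustrative):

%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pip
  - pip:
      - mlflow
      - pandas
      - scikit-learn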

Another way to keep your code organised is to split it into smaller classes or functions for later reuse. Here is an example of an MLflow-instrumented Python file that is stored separately and called each time model training takes place.

%%writefile {train_src_dir}/main.py
import os
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def main():
    """Main function of the script."""

    # input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", type=str, help="path to input data")
    parser.add_argument("--test_train_ratio", type=float, required=False, default=0.25)
    parser.add_argument("--n_estimators", required=False, default=100, type=int)
    parser.add_argument("--learning_rate", required=False, default=0.1, type=float)
    parser.add_argument("--registered_model_name", type=str, help="model name")
    args = parser.parse_args()
   
    # Start Logging
    mlflow.start_run()

    # enable autologging
    mlflow.sklearn.autolog()

    ###################
    # Prepare the data
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print("input data:", args.data)
    
    credit_df = pd.read_excel(args.data, header=1, index_col=0)

    mlflow.log_metric("num_samples", credit_df.shape[0])
    mlflow.log_metric("num_features", credit_df.shape[1] - 1)

    train_df, test_df = train_test_split(
        credit_df,
        test_size=args.test_train_ratio,
    )
    ###################
    # Train the model
    ###################
    # Extracting the label column
    y_train = train_df.pop("default payment next month")

    # convert the dataframe values to array
    X_train = train_df.values

    # Extracting the label column
    y_test = test_df.pop("default payment next month")

    # convert the dataframe values to array
    X_test = test_df.values

    print(f"Training with data of shape {X_train.shape}")

    clf = GradientBoostingClassifier(
        n_estimators=args.n_estimators, learning_rate=args.learning_rate
    )
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    print(classification_report(y_test, y_pred))
    ##########################
    # Save and register model
    ##########################
    # Registering the model to the workspace
    print("Registering the model via MLFlow")
    mlflow.sklearn.log_model(
        sk_model=clf,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=clf,
        path=os.path.join(args.registered_model_name, "trained_model"),
    )
    
    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()
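
Once written, the script can be smoke-tested from the notebook before wiring it into a training job. A minimal sketch, assuming the dataset file exists locally (the data path and model name are placeholders):

# quick local run of the training script (paths/names are placeholders)
!python {train_src_dir}/main.py \
    --data ./data/credit_card_clients.xls \
    --test_train_ratio 0.25 \
    --registered_model_name credit_defaults_model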

In addition, you can store sample data in files for testing purposes. Keeping test data logged and stored (in this case as a JSON file) is another good practice.

%%writefile {test_inference_dir}/sample-request.json
{
  "input_data": {
    "columns": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
    "index": [0, 1],
    "data": [
            [200,2,2,1,24,2,2,-1,-1,-2,-2,33,31,69,0,0,0,0,689,0,0],
            [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 10]
        ]
  }
}
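
Such a request file can later be used to test a deployed online endpoint, for example through the Azure ML Python SDK v2. A minimal sketch, assuming an authenticated MLClient instance named ml_client (the endpoint and deployment names are placeholders):

import os

# invoke a deployed online endpoint with the stored sample request
# (endpoint and deployment names are placeholders)
response = ml_client.online_endpoints.invoke(
    endpoint_name="credit-default-endpoint",
    deployment_name="blue",
    request_file=os.path.join(test_inference_dir, "sample-request.json"),
)
print(response)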

The complete set of code, documents, notebooks, and all of the materials will be available at the GitHub repository: https://github.com/tomaztk/Azure-Machine-Learning

Happy Advent of 2022!
