SageMaker Regression on Abalone Data

Back to Blog | LearnableLoopAI.com | Portfolio of Projects |

1. Introduction

What is Abalone? It is a large marine gastropod mollusk that lives in coastal saltwater and is a member of the Haliotidae family. Abalone is often found around the waters of South Africa, Australia, New Zealand, Japan, and the west coast of North America. The abalone shell is flat and spiral-shaped with several small holes around the edges. It has a single shell on top with a large foot to cling to rocks and lives on algae. Sizes range from 4 to 10 inches. The interior of the shell has an iridescent mother of pearl appearance (Figure 1).

As a highly prized culinary delicacy (Figure 2), it has a rich, flavorful taste that is sweet buttery, and salty. Abalone is often sold live in the shell, but also frozen, or canned. It is among the world’s most expensive seafood. For preparation it is often cut into thick steaks and pan-fried. It can also be eaten raw.

2. Data Understanding

There is more information on the Abalone Dataset available at UCI data repository.

The dataset has 9 features:

Rings (number of)
sex (M, F, Infant)
Length (Longest shell measurement in mm)
Diameter (in mm)
Height (with meat in shell, in mm)
Whole Weight (whole abalone, in grams)
Shucked Weight (weight of meat, in grams)
Viscera Weight (gut weight after bleeding, in grams)
Shell Weight (after being dried, in grams)

The number of rings indicates the age of the abalone. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope. Not only is this a boring and time-consuming task but it is also relatively expensive in terms of waste. The remaining measurements, on the other hand, are readily achievable with the correct tools, and with much less effort. The purpose of this model is to estimate the abalone age, specifically the number of rings, based on the other features.

2.0 Setup

import urllib.request
import pandas as pd
import seaborn as sns
import random
# from IPython.core.debugger import set_trace
import boto3
import sagemaker
from sagemaker.image_uris import retrieve
from time import gmtime, strftime
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer
# import json
# from itertools import islice
# import math
# import struct
!pip install smdebug
from smdebug.trials import create_trial
import matplotlib.pyplot as plt
import re

Requirement already satisfied: smdebug in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (1.0.9)
Requirement already satisfied: pyinstrument>=3.1.3 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from smdebug) (3.4.2)
Requirement already satisfied: protobuf>=3.6.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from smdebug) (3.15.2)
Requirement already satisfied: numpy>=1.16.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from smdebug) (1.19.5)
Requirement already satisfied: boto3>=1.10.32 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from smdebug) (1.17.75)
Requirement already satisfied: packaging in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from smdebug) (20.9)
Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.10.32->smdebug) (0.10.0)
Requirement already satisfied: s3transfer<0.5.0,>=0.4.0 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.10.32->smdebug) (0.4.2)
Requirement already satisfied: botocore<1.21.0,>=1.20.75 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from boto3>=1.10.32->smdebug) (1.20.75)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.75->boto3>=1.10.32->smdebug) (2.8.1)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from botocore<1.21.0,>=1.20.75->boto3>=1.10.32->smdebug) (1.26.4)
Requirement already satisfied: six>=1.9 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from protobuf>=3.6.0->smdebug) (1.15.0)
Requirement already satisfied: pyinstrument-cext>=0.2.2 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from pyinstrument>=3.1.3->smdebug) (0.2.4)
Requirement already satisfied: pyparsing>=2.0.2 in /home/ec2-user/anaconda3/envs/python3/lib/python3.6/site-packages (from packaging->smdebug) (2.4.7)
[2021-05-24 13:28:27.782 ip-172-16-88-149:11789 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None

2.1 Download

The Abalone data is available in the libsvm format. Next, we will download it.

%%time
# Load the dataset
SOURCE_DATA = "abalone_libsvm.txt"
urllib.request.urlretrieve(
    "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression/abalone", SOURCE_DATA
)

CPU times: user 18.7 ms, sys: 0 ns, total: 18.7 ms
Wall time: 1.69 s

('abalone_libsvm.txt', <http.client.HTTPMessage at 0x7f1b70d0c198>)

!head -10 ./{SOURCE_DATA}

15 1:1 2:0.455 3:0.365 4:0.095 5:0.514 6:0.2245 7:0.101 8:0.15
7 1:1 2:0.35 3:0.265 4:0.09 5:0.2255 6:0.0995 7:0.0485 8:0.07
9 1:2 2:0.53 3:0.42 4:0.135 5:0.677 6:0.2565 7:0.1415 8:0.21
10 1:1 2:0.44 3:0.365 4:0.125 5:0.516 6:0.2155 7:0.114 8:0.155
7 1:3 2:0.33 3:0.255 4:0.08 5:0.205 6:0.0895 7:0.0395 8:0.055
8 1:3 2:0.425 3:0.3 4:0.095 5:0.3515 6:0.141 7:0.0775 8:0.12
20 1:2 2:0.53 3:0.415 4:0.15 5:0.7775 6:0.237 7:0.1415 8:0.33
16 1:2 2:0.545 3:0.425 4:0.125 5:0.768 6:0.294 7:0.1495 8:0.26
9 1:1 2:0.475 3:0.37 4:0.125 5:0.5095 6:0.2165 7:0.1125 8:0.165
19 1:2 2:0.55 3:0.44 4:0.15 5:0.8945 6:0.3145 7:0.151 8:0.32

2.2 Read data into dataframe

df = pd.read_csv(
    SOURCE_DATA,
    sep=" ",
    encoding="latin1",
    names=[
        "age",
        "sex",
        "Length",
        "Diameter",
        "Height",
        "Whole.weight",
        "Shucked.weight",
        "Viscera.weight",
        "Shell.weight",
    ],
); df

	age	sex	Length	Diameter	Height	Whole.weight	Shucked.weight	Viscera.weight	Shell.weight
0	15	1:1	2:0.455	3:0.365	4:0.095	5:0.514	6:0.2245	7:0.101	8:0.15
1	7	1:1	2:0.35	3:0.265	4:0.09	5:0.2255	6:0.0995	7:0.0485	8:0.07
2	9	1:2	2:0.53	3:0.42	4:0.135	5:0.677	6:0.2565	7:0.1415	8:0.21
3	10	1:1	2:0.44	3:0.365	4:0.125	5:0.516	6:0.2155	7:0.114	8:0.155
4	7	1:3	2:0.33	3:0.255	4:0.08	5:0.205	6:0.0895	7:0.0395	8:0.055
...	...	...	...	...	...	...	...	...	...
4172	11	1:2	2:0.565	3:0.45	4:0.165	5:0.887	6:0.37	7:0.239	8:0.249
4173	10	1:1	2:0.59	3:0.44	4:0.135	5:0.966	6:0.439	7:0.2145	8:0.2605
4174	9	1:1	2:0.6	3:0.475	4:0.205	5:1.176	6:0.5255	7:0.2875	8:0.308
4175	10	1:2	2:0.625	3:0.485	4:0.15	5:1.0945	6:0.531	7:0.261	8:0.296
4176	12	1:1	2:0.71	3:0.555	4:0.195	5:1.9485	6:0.9455	7:0.3765	8:0.495

4177 rows × 9 columns

2.3 Convert from libsvm to csv format

The libsvm format is not suitable to explore the data with pandas. Next we will convert the data to csv format:

# Extracting the features values from  the libvsm format
features = [
    "sex",
    "Length",
    "Diameter",
    "Height",
    "Whole.weight",
    "Shucked.weight",
    "Viscera.weight",
    "Shell.weight",
]
for f in features:
    if f == "sex":
        df[f] = (df[f].str.split(":", n=1, expand=True)[1])
    else:
        df[f] = (df[f].str.split(":", n=1, expand=True)[1])
df

	age	sex	Length	Diameter	Height	Whole.weight	Shucked.weight	Viscera.weight	Shell.weight
0	15	1	0.455	0.365	0.095	0.514	0.2245	0.101	0.15
1	7	1	0.35	0.265	0.09	0.2255	0.0995	0.0485	0.07
2	9	2	0.53	0.42	0.135	0.677	0.2565	0.1415	0.21
3	10	1	0.44	0.365	0.125	0.516	0.2155	0.114	0.155
4	7	3	0.33	0.255	0.08	0.205	0.0895	0.0395	0.055
...	...	...	...	...	...	...	...	...	...
4172	11	2	0.565	0.45	0.165	0.887	0.37	0.239	0.249
4173	10	1	0.59	0.44	0.135	0.966	0.439	0.2145	0.2605
4174	9	1	0.6	0.475	0.205	1.176	0.5255	0.2875	0.308
4175	10	2	0.625	0.485	0.15	1.0945	0.531	0.261	0.296
4176	12	1	0.71	0.555	0.195	1.9485	0.9455	0.3765	0.495

4177 rows × 9 columns

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             4177 non-null   int64 
 1   sex             4177 non-null   object
 2   Length          4177 non-null   object
 3   Diameter        4177 non-null   object
 4   Height          4177 non-null   object
 5   Whole.weight    4177 non-null   object
 6   Shucked.weight  4177 non-null   object
 7   Viscera.weight  4177 non-null   object
 8   Shell.weight    4177 non-null   object
dtypes: int64(1), object(8)
memory usage: 293.8+ KB

To understand the data better, we need to convert all the string types to numeric types.

df = df.astype({
    'age':'int32', 
    'sex':'int32',
    'Length':'float32',
    'Diameter':'float32',
    'Height':'float32',
    'Whole.weight':'float32',
    'Shucked.weight':'float32',
    'Viscera.weight':'float32',
    'Shell.weight':'float32',
    })

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             4177 non-null   int32  
 1   sex             4177 non-null   int32  
 2   Length          4177 non-null   float32
 3   Diameter        4177 non-null   float32
 4   Height          4177 non-null   float32
 5   Whole.weight    4177 non-null   float32
 6   Shucked.weight  4177 non-null   float32
 7   Viscera.weight  4177 non-null   float32
 8   Shell.weight    4177 non-null   float32
dtypes: float32(7), int32(2)
memory usage: 147.0 KB

df.isnull().values.any()

False

df.isnull().sum().sum()

sns.set(style="ticks", color_codes=True)
# g = sns.pairplot(df)
g = sns.pairplot(df, diag_kind='kde')

The data is now clean with no missing values. We will write this clean data to a file:

CLEAN_DATA = "abalone_clean.txt"
# df = df.sample(n=10, random_state=1)
df.to_csv(CLEAN_DATA, sep=',', header=None, index=False)

3. Data Preparation

What remains is to split the data into suitable partitions for modeling.

def split_data(
    FILE_TOTAL,
    FILE_TRAIN,
    FILE_VALID,
    FILE_TESTG,
    FRAC_TRAIN,
    FRAC_VALID,
    FRAC_TESTG,
):
    total = [row for row in open(FILE_TOTAL, "r")]
    train_file = open(FILE_TRAIN, "w")
    valid_file = open(FILE_VALID, "w")
    testg_file = open(FILE_TESTG, "w")

    num_total = len(total)
    num_train = int(FRAC_TRAIN*num_total)
    num_valid = int(FRAC_VALID*num_total)
    num_testg = int(FRAC_TESTG*num_total)

    sizes = [num_train, num_valid, num_testg]
    splits = [[], [], []]

    rand_total_ind = 0
    #set_trace()
    for split_ind,size in enumerate(sizes):
        for _ in range(size):
            if len(total)<1:
                print('ERROR. Make sure fractions are decimals.')
                break
            rand_total_ind = random.randint(0, len(total) - 1)
            #print('len(total) - 1',len(total) - 1)
            #print('rand_total_ind:',rand_total_ind)
            #print('total[rand_total_ind]:',total[rand_total_ind])
            splits[split_ind].append(total[rand_total_ind])
            total.pop(rand_total_ind)

    for row in splits[0]:
        train_file.write(row)
    print(f'Training data: {len(splits[0])} rows ({len(splits[0])/num_total})')

    for row in splits[1]:
        valid_file.write(row)
    print(f'Validation data: {len(splits[1])} rows ({len(splits[1])/num_total})')

    for row in splits[2]:
        testg_file.write(row)
    print(f'Testing data: {len(splits[2])} rows ({len(splits[2])/num_total})')

    train_file.close()
    valid_file.close()
    testg_file.close()

# Load the dataset
FILE_TOTAL = "abalone_clean.txt"
FILE_TRAIN = "abalone_train.csv"
FILE_VALID = "abalone_valid.csv"
FILE_TESTG = "abalone_testg.csv"
FRAC_TRAIN = .70
FRAC_VALID = .15
FRAC_TESTG = .15
split_data(
    FILE_TOTAL,
    FILE_TRAIN,
    FILE_VALID,
    FILE_TESTG,
    FRAC_TRAIN,
    FRAC_VALID,
    FRAC_TESTG,
)

Training data: 2923 rows (0.6997845343548001)
Validation data: 626 rows (0.1498683265501556)
Testing data: 626 rows (0.1498683265501556)

4. Modeling

Before we build a model, we want to position the data on S3.

4.1 Position data on S3

def write_to_s3(fobj, bucket, key):
    return (
        boto3.Session(region_name=region)
        .resource("s3")
        .Bucket(bucket)
        .Object(key)
        .upload_fileobj(fobj)
    )

def upload_to_s3(bucket, prefix, channel, filename):
    fobj = open(filename, "rb")
    key = f"{prefix}/{channel}/{filename}"
    url = f"s3://{bucket}/{key}"
    print(f"Writing to {url}")
    write_to_s3(fobj, bucket, key)

# upload the files to the S3 bucket
upload_to_s3(bucket, prefix, "train", FILE_TRAIN)
upload_to_s3(bucket, prefix, "valid", FILE_VALID)
upload_to_s3(bucket, prefix, "testg", FILE_TESTG)

Writing to s3://learnableloopai-blog/abalone/train/abalone_train.csv
Writing to s3://learnableloopai-blog/abalone/valid/abalone_valid.csv
Writing to s3://learnableloopai-blog/abalone/testg/abalone_testg.csv

4.2 Setup data channels

s3_train_data = f"s3://{bucket}/{prefix}/train"
print(f"training files will be taken from: {s3_train_data}")
s3_valid_data = f"s3://{bucket}/{prefix}/valid"
print(f"validation files will be taken from: {s3_valid_data}")
s3_testg_data = f"s3://{bucket}/{prefix}/testg"
print(f"testing files will be taken from: {s3_testg_data}")

s3_output = f"s3://{bucket}/{prefix}/output"
print(f"training artifacts output location: {s3_output}")

# generating the session.s3_input() format for fit() accepted by the sdk
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)
valid_data = sagemaker.inputs.TrainingInput(
    s3_valid_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)
testg_data = sagemaker.inputs.TrainingInput(
    s3_testg_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)

training files will be taken from: s3://learnableloopai-blog/abalone/train
validation files will be taken from: s3://learnableloopai-blog/abalone/valid
testing files will be taken from: s3://learnableloopai-blog/abalone/testg
training artifacts output location: s3://learnableloopai-blog/abalone/output

4.3 Training a Linear Learner model

First, we retrieve the image for the Linear Learner Algorithm according to the region.

Then we create an estimator from the SageMaker Python SDK using the Linear Learner container image and we setup the training parameters and hyperparameters configuration.

# 
# get the linear learner image
image_uri = retrieve("linear-learner", boto3.Session().region_name, version="1")

%%time
from sagemaker.debugger import rule_configs, Rule, DebuggerHookConfig, CollectionConfig
save_interval = 3
sess = sagemaker.Session()
job_name = "abalone-regression-" + strftime("%H-%M-%S", gmtime())
print("Training job: ", job_name)
linear = sagemaker.estimator.Estimator(
    image_uri=image_uri, 
    role=role, 
    instance_count=1, 
    instance_type="ml.m4.xlarge", 
    #instance_type="local", 
    input_mode="File", 
    output_path=s3_output, 
    base_job_name="abalone-regression-sagemaker",
    sagemaker_session=sess, 
    #hyperparameters=hyperparameters,
    #train_max_run=100
    debugger_hook_config=DebuggerHookConfig(
        #s3_output_path="s3://learnableloopai-blog/abalone/output_debugger",
        s3_output_path=s3_output,
        collection_configs=[
            CollectionConfig(
                name="metrics",
                parameters={
                    "save_interval": str(save_interval)
                }
            ),
#             CollectionConfig(
#                 name="feature_importance",
#                 parameters={
#                     "save_interval": str(save_interval)
#                 }
#             ),
#             CollectionConfig(
#                 name="full_shap",
#                 parameters={
#                     "save_interval": str(save_interval)
#                 }
#             ),
#             CollectionConfig(
#                 name="average_shap",
#                 parameters={
#                     "save_interval": str(save_interval)
#                 }
#             ),            
#             CollectionConfig(
#                 name="mini_batch_size",
#                 parameters={
#                     "save_interval": str(save_interval)
#                 }
#             )
        ]
    ),
    rules=[
        Rule.sagemaker(
            rule_configs.loss_not_decreasing(),
            rule_parameters={
                "collection_names": "metrics",
                "num_steps": str(save_interval*2),
            },
        ),
#         Rule.sagemaker(
#             rule_configs.overtraining(),
#             rule_parameters={
#                 "collection_names": "metrics",
#                 "patience_validation": str(10),
#             },
#         ),
#         Rule.sagemaker(
#             rule_configs.overfit(),
#             rule_parameters={
#                 "collection_names": "metrics",
#                 "patience": str(10),
#             },
#         )
    ]
)

Training job:  abalone-regression-13-02-13
CPU times: user 13.3 ms, sys: 45 µs, total: 13.4 ms
Wall time: 12.5 ms

linear.set_hyperparameters(
    feature_dim=8,
    epochs=16,
    wd=0.01,
    loss="absolute_loss",
    predictor_type="regressor",
    normalize_data=True,
    optimizer="adam",
    mini_batch_size=100,
    lr_scheduler_step=100,
    lr_scheduler_factor=0.99,
    lr_scheduler_minimum_lr=0.0001,
    learning_rate=0.1,
)

%%time
linear.fit(inputs={
    "train": train_data, 
    "validation": valid_data,
    #"test": testg_data
    }, 
    wait=False) #cell won't block until done

CPU times: user 35.6 ms, sys: 130 µs, total: 35.7 ms
Wall time: 231 ms

6. Deployment

%%time
linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
print(f"\nEndpoint: {linear_predictor.endpoint_name}")

---------------!
Endpoint: linear-learner-2021-05-21-19-45-14-793
CPU times: user 270 ms, sys: 12.6 ms, total: 283 ms
Wall time: 7min 32s

6.1 Test Inference

Now that the trained model is deployed at an endpoint that is up-and-running, we can use this endpoint for inference. To do this, we are going to configure the predictor object to parse contents of type text/csv and deserialize the reply received from the endpoint to json format.

# configure the predictor to accept to serialize csv input and parse the reponse as json
linear_predictor.serializer = CSVSerializer()
linear_predictor.deserializer = JSONDeserializer()

We use the test file containing the records of the data that we kept to test the model prediction. Run the following cell multiple times to perform inference:

%%time
# get a testing sample from the test file
test_data = [row for row in open(FILE_TESTG, "r")]
sample = random.choice(test_data).split(",")
actual_age = sample[0]
payload = sample[1:]  # removing actual age from the sample
payload = ",".join(map(str, payload))

# invoke the predicor and analyise the result
result = linear_predictor.predict(payload)

# extract the prediction value
result = round(float(result["predictions"][0]["score"]), 2)

accuracy = str(round(100 - ((abs(float(result) - float(actual_age)) / float(actual_age)) * 100), 2))
print(f"Actual age: {actual_age}\nPrediction: {result}\nAccuracy: {accuracy}")

Actual age: 9
Prediction: 8.83
Accuracy: 98.11
CPU times: user 4.66 ms, sys: 0 ns, total: 4.66 ms
Wall time: 19.6 ms

6.2 Delete the Endpoint

Having an endpoint running will incur some costs. Therefore as a clean-up job, we should delete the endpoint.

sagemaker.Session().delete_endpoint(linear_predictor.endpoint_name)
print(f"Deleted {linear_predictor.endpoint_name} successfully!")

Deleted linear-learner-2021-05-21-19-45-14-793 successfully!