GitHub - thomaschoi143/ml-deployment-workshop

Creating ML Model

This section demonstrates how to source data from the web, clean it in a pandas DataFrame using string manipulation, and convert all attributes to numeric values for model training with scikit-learn (sklearn).

Prerequisites

Jupyter Notebook installed (install Anaconda: https://www.anaconda.com/products/navigator)
Access to a website that hosts datasets to source from: https://www.kaggle.com/datasets, https://archive.ics.uci.edu/

Instructions

1. Determine a theme or goal that the model will predict, e.g. movie ratings, wine quality, cost of a product, etc.
2. Find a dataset with attributes suitable for predicting that goal, from Kaggle, the UC Irvine Machine Learning Repository, or other sources online.
3. Create a new Jupyter notebook file and load the CSV file of the dataset into a DataFrame.
4. Remove attributes that carry no predictive information, such as ‘product ID’, ‘Person names’, ‘Index’, etc., and remove data points with missing values.
5. Clean the data. For non-numeric attributes, use simple string manipulation to ensure all values make sense and have a consistent meaning. For numeric attributes, ensure all values contain only integer/continuous numbers and nothing else, e.g. $400→400.
6. Use encoding techniques to transform categorical attributes into numeric ones: https://medium.com/@brandon93.w/converting-categorical-data-into-numerical-form-a-practical-guide-for-data-science-99fdf42d0e10
7. Split the data into train and test sets to evaluate model performance (optional/advanced).
8. Using sklearn, import and train a Linear Regression model when the ‘class’ (predicted) variable is continuous, e.g. house prices. Import and train a Decision Tree model when the class labels are categorical, e.g. wine quality (0-10).
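
Steps 3 to 8 can be sketched roughly as follows. This is a minimal illustration, not the workshop's own notebook: the inline CSV, its column names (`product_id`, `brand`, `price`, `rating`, `quality`) and the choice of one-hot encoding are all assumptions made for the example. With a real dataset you would pass the path of the downloaded CSV file to `pd.read_csv` instead.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical inline CSV standing in for a downloaded dataset file.
RAW_CSV = """product_id,brand,price,rating,quality
1,Acme,$400,4.5,good
2,acme ,$250,3.0,bad
3,Zenith,$520,4.8,good
4,Zenith,,4.1,good
5,Acme,$310,3.6,bad
6,zenith,$480,4.4,good
7,Acme,$275,3.2,bad
8,Zenith,$505,4.7,good
"""

# Step 3: load the CSV into a DataFrame.
df = pd.read_csv(io.StringIO(RAW_CSV))

# Step 4: drop the identifier column, then drop rows with missing values.
df = df.drop(columns=["product_id"]).dropna()

# Step 5: make text values consistent and strip symbols from numeric columns.
df["brand"] = df["brand"].str.strip().str.lower()
df["price"] = df["price"].str.replace("$", "", regex=False).astype(float)

# Step 6: one-hot encode the categorical attribute (one of several encodings).
features = pd.get_dummies(df[["brand", "rating"]], columns=["brand"], dtype=int)

# Step 7: hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features, df["price"], test_size=0.25, random_state=0
)

# Step 8a: linear regression for a continuous target (price).
regressor = LinearRegression().fit(X_train, y_train)
price_predictions = regressor.predict(X_test)

# Step 8b: decision tree for a categorical target (quality).
# Trained on all rows here only because the toy dataset is so small.
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(df[["price", "rating"]], df["quality"])
new_item = pd.DataFrame({"price": [500.0], "rating": [4.6]})
quality_prediction = classifier.predict(new_item)[0]
```

The toy data is far too small to give meaningful scores; the point is only the order of operations. One-hot encoding is used here because the example has a single low-cardinality attribute; other encodings may suit attributes with many distinct values better.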

Improvement (advanced)

Instead of just removing data points with missing values, fill in each missing value with the most common or average value of that attribute. For non-numeric attributes this can be challenging, as filling in the most common value doesn’t always make sense. Research and manual input of such missing values may be required.
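
As a rough sketch of this idea, the snippet below fills a numeric column with its mean and a non-numeric column with its most common value. The DataFrame and its column names (`price`, `colour`) are made up for illustration, and whether the mean or the most common value is a sensible replacement depends on the dataset.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame(
    {
        "price": [400.0, None, 250.0, 550.0],
        "colour": ["red", "blue", None, "red"],
    }
)

# Numeric attribute: replace missing values with the column average.
df["price"] = df["price"].fillna(df["price"].mean())

# Non-numeric attribute: replace missing values with the most common value.
# mode() can return several values on a tie; this takes the first one.
df["colour"] = df["colour"].fillna(df["colour"].mode().iloc[0])
```

Here the missing price becomes 400.0 (the mean of 400, 250 and 550) and the missing colour becomes "red".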