Encoding Multi-label Columns on Tabular Data
The Lore
Aloha ( ^_^)
Sooo, I found a dataset not long ago from Alexey Grigorev's free ML course. It's pretty interesting: check out the course here, and the link to the dataset is right here.
While I was playing around with the dataset, I noticed that a particular column called Market Category had a lot of missing values, and filling them in with a measure of central tendency (mean, median, mode) would not be a good approach.
So, being a little bit bored, I decided to predict the missing values.
After a couple of hours trying to search for how to encode this column, I gave up and went to seek help from my friend.
When he was done explaining (and by explaining I mean sending me links to articles lol)... I knew I was in trouble... But I couldn't give up now (plus I really had nothing better to do).
What Multi-label Classification is
Now, before we go into the main stuff, I'll first try to explain what multi-label classification is.
According to Wikipedia, multi-label classification is a variant of the classification problem where multiple labels may be assigned to each instance.
So, basically, it involves predicting one or more labels for a single record, and technically you could have 2 or more "true values". I'll try to explain what I mean.
( ^_^)
You could think of the encoding process for multi-label columns like this.
Let's say you have a couple of balls, and they're all of different colours: some blue, some red, some black, some white, etc. If you wanted to show/represent the colour of each ball you could do something like this:
# | blue | red | black | white |
---|---|---|---|---|
blue-ball | 1 | 0 | 0 | 0 |
red-ball | 0 | 1 | 0 | 0 |
black-ball | 0 | 0 | 1 | 0 |
white-ball | 0 | 0 | 0 | 1 |
The matrix shows each ball and its colour. Note that 1 represents true while 0 represents false. So you can see that for each ball, its corresponding colour column is 1 while the rest of the columns in that row are 0.
Now imagine we have a dark-red ball and we want to represent it using the matrix. If we all agree that dark-red is a mix of red and black, we could do something like this:
# | blue | red | black | white |
---|---|---|---|---|
blue-ball | 1 | 0 | 0 | 0 |
red-ball | 0 | 1 | 0 | 0 |
black-ball | 0 | 0 | 1 | 0 |
white-ball | 0 | 0 | 0 | 1 |
dark-red-ball | 0 | 1 | 1 | 0 |
I'm sure you get the idea by now :)
You could do the same for a dark-blue ball, a sky-blue ball, a grey ball and all kinds of balls.
So basically, instead of creating a new column (feature), you just combine 2 existing columns (features -> red and black) to make a sorta new feature.
Even though this is not a perfect illustration, it does describe the concept of multi-label classification and encoding simply : )
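Here is a minimal sketch of the ball table in plain Python. The `balls` dictionary and the `colours` list are made up just for this illustration, they are not part of the dataset.

```python
# Hypothetical toy data: each ball maps to the list of base colours it is made of.
balls = {
    "blue-ball": ["blue"],
    "red-ball": ["red"],
    "black-ball": ["black"],
    "white-ball": ["white"],
    "dark-red-ball": ["red", "black"],
}
colours = ["blue", "red", "black", "white"]  # column order of the table

# One row per ball: 1 if the ball contains that colour, 0 otherwise.
encoded = {
    name: [1 if colour in parts else 0 for colour in colours]
    for name, parts in balls.items()
}

for name, row in encoded.items():
    print(name, row)
```

The dark-red ball ends up with two 1s in its row, which is exactly what makes it multi-label rather than plain one-hot.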
Some real-world applications of this could be:
- classifying the genre of a song or a movie
- adding tags to a product in a shop/store, etc.
The Main Stuff - Multi-label Encoding
The dataset is pretty simple and easy to work with. I'll go quickly through the basic steps I took to solve this challenge.
Let's import the libraries we'd need:
import numpy as np
import pandas as pd
So what we really need to do is to prep the data and format it so that it would be easy for us to train a model on.
Next step is to read the data and check for missing values:
df = pd.read_csv('data.csv')
print(df.isna().sum())
Output:
Make 0
Model 0
Year 0
Engine Fuel Type 3
Engine HP 69
Engine Cylinders 30
Transmission Type 0
Driven_Wheels 0
Number of Doors 6
Market Category 3742
Vehicle Size 0
Vehicle Style 0
highway MPG 0
city mpg 0
Popularity 0
MSRP 0
You can see that the number of missing values in the Market Category column is 3,742.
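If you'd rather see the share of missing values than the raw count, something like this should work. It is only a sketch on a tiny made-up frame (`toy` is hypothetical, not the real data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset, just to show the calculation.
toy = pd.DataFrame({
    "Make": ["BMW", "Audi", "FIAT", "Volvo"],
    "Market Category": ["Luxury", np.nan, np.nan, "Luxury,Performance"],
})

# isna() gives True/False per cell; the mean of that is the fraction missing.
missing_share = toy.isna().mean()
print(missing_share)
```

On the real dataframe you would call the same thing on `df`.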
I personally do not think it's a good idea to fill these missing values with a measure of central tendency, nor is it a really good idea to drop the rows.
But for the other columns with missing values, you can easily fill them in with a central tendency or a simple forward fill. So let's do that.
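# assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas and may not modify df
df['Engine Fuel Type'] = df['Engine Fuel Type'].ffill()
df['Engine HP'] = df['Engine HP'].fillna(df['Engine HP'].median())
df['Engine Cylinders'] = df['Engine Cylinders'].fillna(df['Engine Cylinders'].mean())
df['Number of Doors'] = df['Number of Doors'].ffill()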
With this you've filled the missing values. You might be wondering why I chose the median for some columns and the mean for others. Well, it was simply a choice I made based on the data distribution in each column (checking for skewness, etc.).
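A rough sketch of that kind of skewness check is below. The numbers are made up, and the cut-off of 1 is just a rule of thumb I'm assuming here, not something taken from the dataset.

```python
import pandas as pd

# Hypothetical numeric column with one large outlier and one missing value.
engine_hp = pd.Series([100.0, 110.0, 120.0, 130.0, 1000.0, None])

skewness = engine_hp.skew()  # missing values are skipped by default

# Assumed rule of thumb: strongly skewed -> median, otherwise mean.
if abs(skewness) > 1:
    fill_value = engine_hp.median()
else:
    fill_value = engine_hp.mean()

filled = engine_hp.fillna(fill_value)
print(skewness, fill_value)
```

The outlier drags the mean upwards, so for a column like this the median is the safer fill.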
The next thing to do is to extract the unique labels from our Market Category column. But before that, let me show you what the Market Category values look like (first 4 rows):
0 Factory Tuner,Luxury,High-Performance
1 Luxury,Performance
2 Luxury,High-Performance
3 Luxury,Performance
Basically the values in this column are strings but if you look closer you'll see that they have a list-like structure.
Let's write some code to extract the unique labels:
lst = df['Market Category'].unique()
category_list = []
# loop extracts unique labels for the market category column
for i in lst:
    if isinstance(i, str):  # skips NaN values
        if ',' in i:
            new_list = i.split(',')
            for j in new_list:
                if j in category_list:
                    print(j, 'already found')
                    continue
                else:
                    category_list.append(j)
                    print(j, 'added')
        else:
            if i in category_list:
                print(i, 'already found')
                continue
            else:
                category_list.append(i)
                print(i, 'added')
This gets the work done. It's pretty simple code: it checks if there is a comma in the current value, and if there is, it splits the string and adds each unique label to a list (category_list). There is also an if-else statement to make sure the same value is not added to the list twice. Let me show you the final output list.
output:
['Factory Tuner',
'Luxury',
'High-Performance',
'Performance',
'Flex Fuel',
'Hatchback',
'Hybrid',
'Diesel',
'Exotic',
'Crossover']
Ten unique labels (It's a good sign 🙏🏿).
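For what it's worth, pandas can do the same extraction in a few chained calls. This is a sketch on a small made-up sample (`market_category` is hypothetical), so the output below is for that sample and not the full ten labels.

```python
import numpy as np
import pandas as pd

# Hypothetical sample shaped like the Market Category column.
market_category = pd.Series([
    "Factory Tuner,Luxury,High-Performance",
    "Luxury,Performance",
    np.nan,
    "Hybrid",
])

# Drop missing rows, split on commas, flatten to one label per row,
# then keep the first occurrence of each label.
unique_labels = (
    market_category.dropna()
    .str.split(",")
    .explode()
    .unique()
    .tolist()
)
print(unique_labels)
```

`unique()` keeps labels in order of first appearance, so the result lines up with what the loop version would collect.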
Okay, so now we have our unique labels. Let's quickly create our prediction dataframe and move on to the encoding.
# create a new df containing the rows with a missing market category (test_df)
# .copy() gives an independent frame, so modifying it won't raise SettingWithCopyWarning
prediction_df = df.loc[df['Market Category'].isna()].copy()
# drop rows with a missing market category
df.drop(df.index[df['Market Category'].isna()], inplace=True)
# drop the market category column
prediction_df.drop('Market Category', axis=1, inplace=True)
Now that that's done, let's get to the encoding part.
At first I pretty much assumed a simple one-hot encoder could do the trick... it didn't work. After checking online for a while I finally found something interesting: a class from sklearn called MultiLabelBinarizer. It encodes categorical columns but, unlike the one-hot encoder, it can encode multi-label columns. It would've been a perfect solution, but it could not detect the 10 unique labels, even after I manually specified them. Another reason was the data type of the column: MultiLabelBinarizer expects each row to be a collection of labels (a list, set or tuple), and a raw comma-separated string just looks like a sequence of characters to it. But even after I changed the data type, it still did not encode it correctly for me.
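For reference, here is a minimal sketch of how MultiLabelBinarizer is usually fed, assuming scikit-learn is installed. The sample column is made up, and this is not the route I ended up taking. The key step is splitting each string into a list before fitting.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample shaped like the Market Category column (no missing values).
market_category = pd.Series(["Luxury,Performance", "Hybrid", "Luxury"])

# Split each string into a list of labels first; one list per row.
label_lists = market_category.str.split(",").tolist()

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(label_lists)

print(mlb.classes_)  # labels in sorted order, one per output column
print(encoded)
```

Rows with missing values would need to be dropped or replaced with empty lists before this step, since `NaN` is not an iterable of labels.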
At this point I was pretty tired, so I decided to create my own encoder.
It's just a simple encoder, how hard could it be?
I should've known better lol
Creating the encoder took some time ngl.
It was ultimately a skill issue sha.
Anyways, the way the encoding function ultimately works is simple. We take each unique label and map it to a number, stored in a dictionary. Then, for each row, a list filled with zeros is created, and the indexes corresponding to the numbers of that row's labels are set to 1.
def encoder(df_col, categories):
    # df_col is iterated as rows of cells, so pass a 2-D structure
    # (e.g. df[['Market Category']].values), not a plain Series
    # create a dictionary map
    label_to_int = dict((c, i) for i, c in enumerate(categories))
    # encode to integer
    label_encoded = [[[label_to_int[label] for label in cell.split(',')] for cell in row] for row in df_col]
    # create one hot list
    oh_list = list()
    for row in label_encoded:
        for cell in row:
            cell_enc = [0 for _ in range(len(categories))]
            for label in cell:
                cell_enc[label] = 1
            oh_list.append(cell_enc)
    return oh_list
The function takes the column you want to encode and the unique labels of that column as arguments, and returns the encoded rows as a list of lists.
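To make the two steps concrete, here is a tiny trace of the same idea on a flat list of strings. The `categories` and `rows` values are made up for the illustration.

```python
categories = ["Luxury", "Performance", "Hybrid"]  # hypothetical label list
rows = ["Luxury,Performance", "Hybrid"]           # hypothetical column values

# Step 1: map each label to a position.
label_to_int = {label: i for i, label in enumerate(categories)}

# Step 2: start from all zeros and switch on the positions of the row's labels.
encoded_rows = []
for row in rows:
    vector = [0] * len(categories)
    for label in row.split(","):
        vector[label_to_int[label]] = 1
    encoded_rows.append(vector)

print(label_to_int)
print(encoded_rows)
```

The first row gets 1s at the positions for Luxury and Performance, the second only at the position for Hybrid.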
After this I used these functions to create an encoder class, like so:
class MultiLabelEncoder():
    def __init__(self,
                 multilabel_column:pd.Series,
                 unique_labels:set=None,
                 delimiter:str=','):
        # we don't want no missing values over here
        self.mlc = multilabel_column.dropna()
        # build a fresh set per instance (a mutable default would be shared)
        self.unique_labels = set(unique_labels) if unique_labels else set()
        self.delimiter = delimiter
        self.encodings = None

    def extract_unique_labels(self,):
        unique_categories = self.mlc.unique()
        for cat in unique_categories:
            if isinstance(cat, str):
                if self.delimiter in cat:
                    row_cats = cat.split(self.delimiter)
                    for row_cat in row_cats:
                        self.unique_labels.add(row_cat)
                else:
                    # single-label cell: add the cell value itself
                    self.unique_labels.add(cat)
        return

    def multilabel_oh_encode(self):
        # fail early, before any label lookup
        if len(self.unique_labels) == 0:
            raise ValueError('No unique labels, have you tried calling the `extract_unique_labels` method?')
        # sets are unordered, so sort to get the same column order on every run
        label_to_int = dict((label, index) for index, label in enumerate(sorted(self.unique_labels)))
        label_encoded = [[label_to_int[label] for label in row.split(self.delimiter)] for row in self.mlc]
        # create one hot list
        ohe_list = list()
        for row in label_encoded:
            row_enc = [0 for _ in range(len(self.unique_labels))]
            for label in row:
                row_enc[label] = 1
            ohe_list.append(row_enc)
        # keep the original row index so the encodings line up with the data
        self.encodings = pd.Series(ohe_list, index=self.mlc.index)
        return
This makes it super easy to use anywhere like so:
df_y = df.pop('Market Category')
encoder = MultiLabelEncoder(multilabel_column=df_y)
encoder.extract_unique_labels()
encoder.multilabel_oh_encode()
df_y_encoded = encoder.encodings
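If you want something to cross-check the encoder against, pandas has a built-in, `Series.str.get_dummies`, that does a similar job for delimiter-separated strings. This is a sketch on a made-up sample, not a replacement for the class above.

```python
import pandas as pd

# Hypothetical sample shaped like the Market Category column.
market_category = pd.Series(["Luxury,Performance", "Hybrid", "Luxury"])

# One 0/1 column per label; columns come out in sorted label order.
dummies = market_category.str.get_dummies(sep=",")
print(dummies)
```

It returns a DataFrame with one named column per label, which can be handy when you want to see which position belongs to which label.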
There, now we all can encode our multi-label columns anytime we encounter them
( ^_^)
And that's basically it
These articles were a lot of help to me:
Ohh yeah, you can get the full source code here
That's all for now, until next time : )