Encoding Multi-label Columns on Tabular Data

The Lore

Aloha ( ^_^)
Sooo, not long ago I found a dataset from Alexey Grigorev's free ML course. It's pretty interesting: check out the course here, and the link to the dataset is right here.

While I was playing around with the dataset, I noticed that a particular column called Market Category had a lot of missing values, and filling them in with a measure of central tendency (mean, mode, median) would not be a good approach.
So, being a little bit bored, I decided to predict the missing values.

PS: I had no idea there was such a thing as multi-label classification, prediction or encoding

So after a couple of hours trying to search for how to encode this column, I gave up and went to seek help from my friend.
When he was done explaining (and by explaining I mean sending me links to articles lol)... I knew I was in trouble... But I couldn't give up now (plus I really had nothing better to do).

What Multi-label Classification is

Now, before we go into the main stuff, I'll first try to explain what multi-label classification is.
According to Wikipedia, multi-label classification is a variant of the classification problem where multiple labels may be assigned to each instance.

So, basically, it involves predicting one or more labels for a single record, and technically you could have 2 or more "true values". I'll try to explain what I mean.

The concept of having multiple "true" values for a record was strange to me initially.

( ^_^)

You could think about the encoding process of multi-label columns like this.
Let's say you have a couple of balls, and they're all different colours: some blue, some red, some black, some white, etc. If you wanted to show/represent the colour of each ball you could do something like this:

#              blue  red  black  white
blue-ball      1     0    0      0
red-ball       0     1    0      0
black-ball     0     0    1      0
white-ball     0     0    0      1

The matrix shows each ball and its colour. Note that 1 represents true while 0 represents false. So you can see that for each ball, its corresponding colour column is 1 while the rest of the columns in that row are 0.
Now imagine we have a dark-red ball and we want to represent it using the matrix. If we all agree that dark-red is a mix of red and black, we could do something like this:

#              blue  red  black  white
blue-ball      1     0    0      0
red-ball       0     1    0      0
black-ball     0     0    1      0
white-ball     0     0    0      1
dark-red-ball  0     1    1      0

I'm sure you get the idea by now :)
You could do the same for a dark-blue ball, a sky-blue ball, a grey ball and all kinds of balls.

So basically, instead of creating a new column (feature), you just combine 2 existing columns (features -> red and black) to make a sorta new feature.

Even though this is not a perfect illustration, it does describe the concept of multi-label classification and encoding simply : )
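
To make the ball example concrete, here's a minimal sketch in plain Python. The names columns, balls and encoded are made up for this illustration; the idea is just that each ball gets a 1 in every colour column it belongs to.

```python
# Colour columns, in the same order as the ball matrix
columns = ["blue", "red", "black", "white"]

# Each ball maps to the set of colours it "contains"
# (dark-red = red + black, as agreed in the example)
balls = {
    "blue-ball": {"blue"},
    "red-ball": {"red"},
    "black-ball": {"black"},
    "white-ball": {"white"},
    "dark-red-ball": {"red", "black"},
}

# 1 if the ball has that colour, 0 otherwise
encoded = {
    name: [int(colour in colours) for colour in columns]
    for name, colours in balls.items()
}

for name, row in encoded.items():
    print(name, row)
```

The dark-red ball ends up with two 1s in its row, which is exactly what makes the column multi-label rather than plain one-hot.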

Some real-world applications of this could be in:

The Main Stuff - Multi-label Encoding
PS: This write-up will focus on data processing, specifically encoding the multi-label column. We will not train any model for now.

The dataset is pretty simple and easy to work with. I'll go quickly through the basic steps I took to solve this challenge.

Let's import the libraries we'll need:

import numpy as np
import pandas as pd

So what we really need to do is to prep the data and format it so that it would be easy for us to train a model on.

Next step is to read the data and check for missing values:

df = pd.read_csv('data.csv')
print(df.isna().sum())
Output:  
	Make                    0
	Model                   0
	Year                    0
	Engine Fuel Type        3
	Engine HP              69
	Engine Cylinders       30
	Transmission Type       0
	Driven_Wheels           0
	Number of Doors         6
	Market Category      3742
	Vehicle Size            0
	Vehicle Style           0
	highway MPG             0
	city mpg                0
	Popularity              0
	MSRP                    0

You can see that the number of missing values in the Market Category column is 3,742.
I personally do not think it's a good idea to fill these missing values with a measure of central tendency, nor is it a good idea to drop the rows.
But for the other columns with missing values, you can easily fill them with something simple like a central tendency or a forward fill. So let's do that.

#assign the result back instead of using inplace=True on a column selection,
#which is unreliable in recent pandas versions
df['Engine Fuel Type'] = df['Engine Fuel Type'].ffill()
df['Engine HP'] = df['Engine HP'].fillna(df['Engine HP'].median())
df['Engine Cylinders'] = df['Engine Cylinders'].fillna(df['Engine Cylinders'].mean())
df['Number of Doors'] = df['Number of Doors'].ffill()

With this you've filled the missing values. You might be wondering why I filled some columns with the median, some with the mean and some with a forward fill. Well, it was simply a choice I made based on the data distribution in each column (check for skewness etc).
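
If you want a rough way to make the mean-vs-median call, one option is to look at the skewness of the column. This is only a sketch on made-up numbers (not values from the dataset), and the cut-off of 1 is a common rule of thumb assumed here, not a hard rule.

```python
import pandas as pd

# Toy column (made-up numbers): mostly modest values, one extreme, one missing
hp = pd.Series([100.0, 110.0, 120.0, 130.0, 1000.0, None])

# Positive skew means a long right tail, which drags the mean upwards
skewness = hp.skew()

# Assumed rule of thumb: strongly skewed -> median, otherwise -> mean
fill_value = hp.median() if abs(skewness) > 1 else hp.mean()

hp_filled = hp.fillna(fill_value)
print(skewness, fill_value)
```

Here the single extreme value makes the column strongly right-skewed, so the median (120) is used instead of the mean.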

The next thing to do is to extract the unique labels from our Market Category column. But before that, let me show you what the Market Category values look like (first 4 rows):

0    Factory Tuner,Luxury,High-Performance
1                       Luxury,Performance
2                  Luxury,High-Performance
3                       Luxury,Performance

Basically, the values in this column are strings, but if you look closer you'll see that they have a list-like structure.
Let's write some code to extract the unique labels:

lst = df['Market Category'].unique()

category_list = []
#loop extracts unique label for the market category column 
for i in lst:
	if isinstance(i, str):     #prevents 'nan' values
		if ',' in i:
			new_list = i.split(',')
			for j in new_list:
				if j in category_list:
					print(j, 'already found')
					continue
				else:
					category_list.append(j)
					print(j, 'added')
		else:
			if i in category_list:
				print(i, 'already found')
				continue
			else:
				category_list.append(i)
				print(i, 'added')

This gets the work done. It's pretty simple code: it checks whether there is a comma in the current value and, if there is, splits the string and adds each unique label to a list (category_list). There is also an if-else statement to make sure the same value is not added to the list twice. Let me show you the final output list.

output:
   ['Factory Tuner',
	'Luxury',
	'High-Performance',
	'Performance',
	'Flex Fuel',
	'Hatchback',
	'Hybrid',
	'Diesel',
	'Exotic',
	'Crossover']

Ten unique labels (It's a good sign 🙏🏿).
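
For what it's worth, the same extraction can be written more compactly, since split(',') already returns a one-item list when there is no comma. A minimal sketch in plain Python, using the four sample values shown earlier plus a NaN as stand-in data (the variable names are just for illustration):

```python
# Stand-in data: the first four Market Category values, plus a missing one
cells = [
    "Factory Tuner,Luxury,High-Performance",
    "Luxury,Performance",
    "Luxury,High-Performance",
    "Luxury,Performance",
    float("nan"),
]

# dict keys keep insertion order, so this removes duplicates
# while preserving the order in which labels were first seen
unique_labels = list(dict.fromkeys(
    label
    for cell in cells
    if isinstance(cell, str)          # skips the NaN
    for label in cell.split(",")
))

print(unique_labels)
# ['Factory Tuner', 'Luxury', 'High-Performance', 'Performance']
```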
Okay, so now we have our unique labels. Let's quickly create our prediction dataframe and move on to the encoding.

#create a new df containing the rows with missing market category (test_df)
#.copy() gives an independent frame, so the drop below doesn't trigger SettingWithCopyWarning
prediction_df = df.loc[df['Market Category'].isna()].copy()
#drop rows with missing market category
df.drop(df.index[df['Market Category'].isna()], inplace=True)
#drop market category column
prediction_df.drop('Market Category', axis=1, inplace=True)

Now that that's done, let's get to the encoding part.

At first I pretty much assumed a simple one-hot encoder could do the trick... it didn't work. After checking online for a while I finally found something interesting: a class from sklearn called MultiLabelBinarizer. It encodes categorical columns but, unlike the one-hot encoder, it can encode multi-label columns. It would've been a perfect solution, but I could not get it to detect the 10 unique labels, even after manually specifying them. I suspected the data type of the column was part of the problem, but even after I changed the data type it still did not encode it correctly.
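
Side note: one likely cause (an assumption, since it depends on exactly what was passed in) is that MultiLabelBinarizer expects every row to be a collection of labels. A raw string like 'Luxury,Performance' is itself iterable, so it gets treated as a collection of single characters. Here's a minimal sketch where each string is split into a list first, using two sample values from the column:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Two sample values from the Market Category column
rows = ["Factory Tuner,Luxury,High-Performance", "Luxury,Performance"]

# Split each comma-separated string into a list of labels first
split_rows = [row.split(",") for row in rows]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(split_rows)

print(mlb.classes_)  # the labels found, one per output column
print(encoded)       # one 0/1 row per input row
```

Without the classes argument, classes_ comes back in sorted order, so the columns here are Factory Tuner, High-Performance, Luxury, Performance.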

At this point I was pretty tired, so I decided to create my own encoder.

Creating the encoder took some time ngl.
It was ultimately a skill issue sha.

what_did_it_cost.jpg

Anyways, the way the encoding function ultimately works is simple. We take each unique label and map it to a number, stored in a dictionary. Then, for every cell, a list is created, filled with zeros, except at the indexes corresponding to the numbers of the labels in that cell, which are set to 1.

def encoder(df_col, categories):
    #df_col must be 2D (rows of cells), e.g. df[['Market Category']].values
    #create a dictionary map
    label_to_int = dict((c, i) for i, c in enumerate(categories))

    #encode to integer
    label_encoded = [[[label_to_int[label] for label in cell.split(',')] for cell in row] for row in df_col]

    #create one hot list
    oh_list = list()

    for row in label_encoded:
        for cell in row:
            #start with all zeros, then switch on the position of every label in the cell
            cell_enc = [0 for _ in range(len(categories))]
            for label in cell:
                cell_enc[label] = 1
            oh_list.append(cell_enc)

    return oh_list

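The function takes the column you want to encode and the unique labels of that column as arguments, and returns the encoded rows as a list of 0/1 lists. Note that it loops over rows and then over the cells in each row, so it expects 2D input such as df[['Market Category']].values rather than a plain Series.

After this I used these functions to build an encoder class, like so: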

class MultiLabelEncoder():
    def __init__(self,
                 multilabel_column:pd.Series,
                 unique_labels:set=None,
                 delimiter:str=','):
        #we don't want no missing values over here
        self.mlc = multilabel_column.dropna()

        #build a fresh set per instance (a mutable default argument would be shared between instances)
        self.unique_labels = set(unique_labels) if unique_labels else set()
        self.delimiter = delimiter
        self.encodings = None

    def extract_unique_labels(self):
        unique_categories = self.mlc.unique()

        for cat in unique_categories:
            if isinstance(cat, str):
                if self.delimiter in cat:
                    row_cats = cat.split(self.delimiter)
                    for row_cat in row_cats:
                        self.unique_labels.add(row_cat)
                else:
                    #single-label cell: add it as is
                    self.unique_labels.add(cat)
        return

    def multilabel_oh_encode(self):
        #check first, otherwise the lookup below fails with a KeyError instead
        if len(self.unique_labels) == 0:
            raise ValueError('No unique labels, have you tried calling the `extract_unique_labels` method?')

        #sets have no fixed order, so sort the labels to get the same column order on every run
        labels = sorted(self.unique_labels)
        label_to_int = dict((label, index) for index, label in enumerate(labels))
        label_encoded = [[label_to_int[label] for label in row.split(self.delimiter)] for row in self.mlc]

        #create one hot list
        ohe_list = list()

        for row in label_encoded:
            row_enc = [0 for _ in range(len(labels))]
            for label in row:
                row_enc[label] = 1
            ohe_list.append(row_enc)

        #keep the original row index so the encodings line up with the rows they came from
        self.encodings = pd.Series(ohe_list, index=self.mlc.index)
        return

This makes it super easy to use anywhere like so:

df_y = df.pop('Market Category')
encoder = MultiLabelEncoder(multilabel_column=df_y)
encoder.extract_unique_labels()
encoder.multilabel_oh_encode()
df_y_encoded = encoder.encodings

There, now we can all encode our multi-label columns any time we encounter them
( ^_^)
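
If you'd rather have one 0/1 column per label instead of a single column of lists, the Series can be expanded into a DataFrame. A small sketch with stand-in data: labels and encodings below are hypothetical, and labels has to list the labels in exactly the order that was used when encoding.

```python
import pandas as pd

# Hypothetical stand-ins for the label order and the encoder output
labels = ["Hybrid", "Luxury", "Performance"]
encodings = pd.Series([[0, 1, 1], [1, 0, 0]])

# One column per label, keeping the same row index as the encodings
encoded_df = pd.DataFrame(encodings.tolist(), columns=labels, index=encodings.index)
print(encoded_df)
```

In this shape the labels can be used directly as a multi-column target when it's time to train a model.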

And that's basically it

These articles were a lot of help to me:

Ohh yeah, you can get the full source code here

That's all for now, until next time : )