Machine Learning Coding Tutorial 3. What Makes a Good Feature?
In the previous tutorial, we used a decision tree as the classifier. Classifiers are only as good as the features you provide.
That means coming up with good features is one of your most important jobs in machine learning.
1. Dog Classifier
Imagine we want to write a classifier to tell the difference between two types of dogs: greyhounds and Labradors.
Here we’ll use one feature: the dog’s height in inches. Greyhounds are usually taller than Labradors.
2. Coding
Let’s head into Python for a programmatic example.
Create a Python file named dogs.py and add the following code.
Please read the comments carefully to understand what the code does.
""" GoodTecher Machine Learning Coding Tutorial http://72.44.43.28 Machine Learning Coding Tutorial 3. What Makes a Good Feature? The program takes a measurements (dogs height) as input and display normal distribution of types of dogs """ import numpy as np import matplotlib.pyplot as plt # creates 500 Greyhounds and 500 Labradors number_of_greyhounds = 500 number_of_labradors = 500 # Greyhounds on average 28 inches tall # Labradors on average 24 inches tall # Let's say height is normally distributed, # so we'll make both of these plus or minus 4 inches # the following code generates arrays of 500 Greyhound heights # and 500 Labradors heights greyhounds_heights = 28 + 4 * np.random.randn(number_of_greyhounds) labradors_heights = 24 + 4 * np.random.randn(number_of_labradors) # visualize in histogram, # Greyhounds are in red, Labradors are in blue plt.hist([greyhounds_heights, labradors_heights], stacked=True, color=['r', 'b']) plt.show()
Run the program with the following command in Terminal (Mac) or Command Prompt (Windows):
python dogs.py
You should see a popup window of a histogram.
3. Explanation
On the left side of the histogram, a dog is most likely a Labrador. On the other hand, if we go all the way to the right of the histogram and look at a dog that is 35 inches tall, we can be confident it is a greyhound.
In the middle, the two types of dog are about equally likely.
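To put rough numbers on what the histogram shows, here is a minimal sketch that computes the probability that a dog is a greyhound given its height. It assumes the same setup as dogs.py (means of 28 and 24 inches, a standard deviation of 4 inches, and equal numbers of each breed). The names normal_pdf and probability_greyhound are made up for this illustration and are not part of dogs.py.

```python
import math

# Same assumptions as dogs.py: heights are normally distributed,
# with equal numbers of greyhounds and Labradors.
GREYHOUND_MEAN = 28.0
LABRADOR_MEAN = 24.0
STD_DEV = 4.0


def normal_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def probability_greyhound(height):
    """P(greyhound | height) under the equal-numbers assumption."""
    greyhound_density = normal_pdf(height, GREYHOUND_MEAN, STD_DEV)
    labrador_density = normal_pdf(height, LABRADOR_MEAN, STD_DEV)
    return greyhound_density / (greyhound_density + labrador_density)


# left side, middle, and right side of the histogram
for height in (20, 26, 35):
    print(f"{height} inches: P(greyhound) = {probability_greyhound(height):.2f}")
```

Under these assumptions, a 20-inch dog comes out at about 0.18, a 26-inch dog at exactly 0.50, and a 35-inch dog at about 0.90, which matches what the histogram suggests.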
So height is a useful feature, but it’s not perfect. That’s why in machine learning, you almost always need multiple features. Otherwise, we could just write an if statement instead of bothering with a classifier.
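To see how far a single if statement gets us, here is a small sketch that classifies simulated dogs by height alone. The threshold of 26 inches (the midpoint between the two averages), the fixed random seed, and the function name classify_by_height are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# same simulated heights as in dogs.py
greyhounds_heights = 28 + 4 * rng.standard_normal(500)
labradors_heights = 24 + 4 * rng.standard_normal(500)


def classify_by_height(height, threshold=26.0):
    """A one-feature 'classifier': just an if statement on height."""
    if height > threshold:
        return "greyhound"
    return "labrador"


# count how many simulated dogs the rule labels correctly
correct = sum(classify_by_height(h) == "greyhound" for h in greyhounds_heights)
correct += sum(classify_by_height(h) == "labrador" for h in labradors_heights)

accuracy = correct / (len(greyhounds_heights) + len(labradors_heights))
print(f"Accuracy of the height-only rule: {accuracy:.2f}")
```

With these distributions the rule should be right roughly 70% of the time: clearly better than guessing, but far from perfect, because the two height distributions overlap so much.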
Ideal features are:
- informative
- independent
- easy to understand
informative
For example, a dog’s eye color is useless for telling which type of dog it is.
independent
For example, height in inches and height in centimeters are redundant: the second carries no information that the first does not already provide.
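One quick way to spot this kind of redundancy is to check how strongly two features are correlated. The sketch below uses simulated Labrador heights (an assumption, reusing the numbers from dogs.py) and shows that the two versions of height are perfectly correlated.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# simulated heights in inches, then the same heights in centimeters
height_in = 24 + 4 * rng.standard_normal(500)
height_cm = height_in * 2.54

# correlation coefficient between the two features
correlation = np.corrcoef(height_in, height_cm)[0, 1]
print(f"Correlation between inches and centimeters: {correlation:.3f}")
```

A correlation of 1 means the second feature is just a rescaled copy of the first, so keeping both adds nothing new for the classifier.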
easy to understand
For example, to estimate the flight time from one city to another, the distance between the two cities is a better feature than the latitude and longitude of each city.
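As an illustration, the four coordinate values can be turned into a single distance feature before they are given to a classifier. The sketch below uses the haversine formula for great-circle distance. The function name, the Earth radius of 6371 km, and the approximate city coordinates are choices made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def great_circle_distance_km(lat1, lon1, lat2, lon2):
    """Haversine distance between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# approximate (latitude, longitude) of two example cities
new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)

distance_km = great_circle_distance_km(*new_york, *los_angeles)
print(f"Distance feature: {distance_km:.0f} km")
```

A single number like this relates to flight time in an obvious way, while four raw coordinates would leave the classifier to work out that relationship on its own.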