Not So Mellow Mushrooms: A Classification Task

On my way home, I noticed a few mushrooms that had sprung up after the rain. They were perfect and intact because everyone knew they were poisonous.

— Paulo Coelho

With all of the rain lately, I’ve noticed quite a few mushrooms on my daily walks. Some are tall , balancing caps on top of slender stalks, resembling an art instillation along the trail.

Others hug close to the ground, their gills pressed to the soil as though they are afraid of heights.

Some wear brightly-colored caps that stand out on the woodland floor.

Many are beautiful.

All are interesting.

And some are deadly.

Do YOU know which ones are poisonous?

Starting With the Data

I started by downloading the mushroom.csv from Kaggle. This dataset contains 8,124 hypothetical observations of 23 different mushroom species found in North America. Each observation is described by 22 different features (including shape, color, rings, odor, etc.) and is classified as either edible (yum!) or poisonous (uh oh!).

My goal is to use this labeled dataset to create a model that can be used to determine if you can make a mushroom omelet or if you should keep your distance.

I am going to be using be using a logistic regression model. Before I go any further, I need to check to see if the target variable (class) is balanced. If it is not, I will need to apply a resampling technique in order to have accurate results.

df['class'].value_counts(normalize=True)

We’re in luck! 52% are classified as edible and 48% are deemed poisonous. This means the classes are already balanced.

Next, I have some decisions to make. I decide that I only want to focus on features that are ones that a non-expert could distinguish in the field. Based on this, I’m going to limit my features to only cap-shape, cap-color, gill-color, odor (although I do have some doubt on this one when I consult the documentation and see descriptors of “fishy” and “foul”) and habitat.

I also start to think about what preprocessing I will need to do. All of the data is currently categorical and encoded as strings. In order to perform logistic regression, I will need to make these numerical.

I will use SciKit-learn’s LabelEncoder on the target in order to transform it into 0s and 1s before splitting the data because it will not lead to any data leakage. With the features, I will need to include OneHotEncoder into my pipeline for cross validation in order to prevent data leakage.

#split the features and target
y = df['class']
X = df.drop('class', axis=1)

#select only the desired columns
X = X[['cap-shape', 'cap-color','gill-color','odor','habitat']]

#transform the target
le = LabelEncoder()
y = le.fit_transform(y)

Once encoded, 1 indicates poisonous and 0 corresponds to edible.

Then, before anything else is done, it’s time to train_test_split and set aside the validation data. It’s always a good idea to check the shape of the data to make sure that it’s what you expect.

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state=44, 
                                                    stratify=y)
X.shape, y.shape

((8124, 5), (8124,))


X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6093, 5), (2031, 5), (6093,), (2031,))

Training the Model

Because all of the predictors will be one hot encoded, there is no need to scale this data. As a result, the pipeline is quite simple.

pipeline = Pipeline(steps = [['ohe', OneHotEncoder(handle_unknown='ignore')],
                             ['classifier', LogisticRegression(random_state=44,
                             max_iter=1000)]])

Next, I will use GridSearch and cross validation in order to access the performance of the model and select the best value for C. This is a regularization hyperparameter, where the smaller the value, the stronger the regularization.

stratified_kfold = StratifiedKFold(n_splits=5,
                                       shuffle=True,
                                       random_state=44)
    
    
param_grid = {'classifier__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000]}

grid_search = GridSearchCV(estimator=pipeline,
                           param_grid=param_grid,
                           scoring=['neg_log_loss', 'f1'],
                           cv=stratified_kfold,
                           n_jobs=-1,
                           refit='neg_log_loss',
                           return_train_score=True)

grid_search.fit(X_train, y_train)

This told me that the ideal value for C is 1000, indicating that not much regularization needs to be done. I quickly examine the metrics for the cross-validated model with C=1000:

param_classifier__C                             1000
params                       {'classifier__C': 1000}
split0_test_neg_log_loss                 -0.00642688
split1_test_neg_log_loss                  -0.0074723
split2_test_neg_log_loss                  -0.0204453
split3_test_neg_log_loss                  -0.0160083
split4_test_neg_log_loss                  -0.0124653
mean_test_neg_log_loss                    -0.0125636
std_test_neg_log_loss                     0.00524552
rank_test_neg_log_loss                             1
split0_train_neg_log_loss                  -0.011294
split1_train_neg_log_loss                 -0.0112726
split2_train_neg_log_loss                -0.00849542
split3_train_neg_log_loss                -0.00930276
split4_train_neg_log_loss                -0.00986136
mean_train_neg_log_loss                   -0.0100452
std_train_neg_log_loss                    0.00110025
split0_test_f1                              0.998296
split1_test_f1                              0.997451
split2_test_f1                              0.993139
split3_test_f1                              0.995723
split4_test_f1                              0.997438
mean_test_f1                                0.996409
std_test_f1                               0.00183693
rank_test_f1                                       1
split0_train_f1                             0.996583
split1_train_f1                             0.996157
split2_train_f1                             0.997226
split3_train_f1                              0.99744
split4_train_f1                             0.997012
mean_train_f1                               0.996884
std_train_f1                              0.00046093

Looking at the log-loss, I notice that the values are consistent across the splits and that the loss is slightly larger for the test set, which is not surprising (remember to consider the opposite of these values as they’re listed). All of the log-loss seems quite small, indicating a small error rate between the predicted class and the actual class.

Next, I look at the f1 scores, which are the harmonic mean of precision and recall, giving a quick idea of overall performance. Again, these are consistent between folds (which is also verified by the very small standard deviations). Additionally, the f1 scores are quite high for both the train and test sets. I decide to proceed with this model.

#Fitting a final model on the entire train set

pipeline_final = Pipeline(steps = [['ohe', OneHotEncoder(handle_unknown='ignore')],
                             ['classifier', LogisticRegression(random_state=44,
                             max_iter=1000, C=1000)]])

best_model = pipeline_final.fit(X_train, y_train)

y_pred = best_model.predict(X_test)
pd.Series(y_pred).value_counts()

#predictions
0    0.519449 #edible
1    0.480551 #poisonous

#actual
e    0.517971 #edible
p    0.482029 #poisonous

Wow! On first glance, that seems like an awesome result! But with data, as with mushrooms, you don’t want to rush to judgment too quickly as the results could be disastrous. So, let’s look a little deeper.

Evaluating the Model

My first step is to make a quick histogram of the probabilities (not just the actual classes) that the model assigned to each data point.

These values are clustered around 0 and 1, which means that the model is quite certain about its predictions. We can also see the roughly 50-50 split that we saw with the actual classifications. Looking good so far.

Next, I dig deeper into the metrics by plotting a confusion matrix.

plot_confusion_matrix(pipeline_final, X_test, y_test);

Let’s use this to get three important metrics.

Accuracy

Accuracy is the total number of correct predictions out of the total number of predictions.

2024/(2024+7) = 99.66%

This means the model was correct was correct 99.66% of the time.

Precision

Precision is how many are actually positive (which in this case, is poisonous or 1) out of the ones that the model predicted were positive.

974/(974+2)= 99.80%

This means that of the mushrooms the model indicated were poisonous, 99.8% actually are.

Recall

Recall is out of the actual positive cases (again, poisonous or 1), how many the model recognized as positive.

974/(974+5) = 99.48%

This means that the model picked up on 99.48% of the poisonous mushrooms. Which doesn’t seem so bad unless you happen to be one of those 0.52% of people that inadvertently makes a deadly omelet.

Precision/Recall Tradeoff

This is one of those situations where it is much more preferable to have false positives (a perfectly harmless mushroom is labeled as poisonous) than false negatives (you think a deadly mushroom is harmless). It’s okay if you skip eating a mushroom that won’t hurt you. but it’s a pretty bad day if you eat one that will cause your limbs to fall off of your body.

Adjust the Threshold

As the model is written, it will interpret any probability below 0.5 as a harmless blob of fungi. But because a false negative here is so dangerous, I’m going to set that threshold to 0.1 so that any data points with above a 10% probability of being poisonous will be put into the “Don’t eat!” bin.

def final_model_func(model, X):
    probs = model.predict_proba(X)[:,1]
    return [int(prob > 0.01) for prob in probs]

threshold_adjusted_probs = pd.Series(final_model_func(pipeline_final, X_test))
threshold_adjusted_probs.value_counts(normalize=True)

#new predictions
1    0.520433 #edible
0    0.479567 #poisonous

#original predictions
0    0.519449 #edible
1    0.480551 #poisonous

We can see that by doing that, the percentage of mushrooms classified as poisonous increased slightly. Now, let’s see what happened to our metrics.

print(f"Accuracy: {accuracy_score(y_test, threshold_adjusted_probs)}")
print(f"Precision: {precision_score(y_test, threshold_adjusted_probs)}")
print(f"Recall: {recall_score(y_test, threshold_adjusted_probs)}")

Accuracy: 0.9615952732644018
Precision: 0.9262062440870388
Recall: 1.0

By changing the threshold, the accuracy and the precision both dropped. This is because the model is now classifying some harmless fungi as bad guys. But the tradeoff is worth it because that recall value of 1 says that we’re not going to accidentally ingest a poisonous mushroom (as long as we listen to the model, that is!)

Interpreting the Model

Let’s look at the features with the largest or smallest coefficients, as these are the characteristics that are the most impactful on the decision.

That’s a little tricky to interpret since the features have been one hot encoded. Let’s remind ourselves what the general categories are.

x0 = ‘cap-shape’

x1 = ‘cap-color’

x2 = ‘gill-color’

x3 = ‘odor’

x4 = ‘habitat’

Looking at the most impactful coefficients, x0 does not appear, so cap shape is not a great predictor of the presence of poison. The rest can be attributed to cap-color (10%), gill-color (20%), odor (60%) and habitat (10%). So, it looks like your nose knows if a mushroom will harm you.

This can be confirmed with the following graph, which clearly shows that odor alone is a pretty good separator of poisonous vs. tasty shrooms.

But that’s still not super helpful. After all, what DO the dangerous ones smell like?

Well, that’s actually pretty handy After all, if a mushroom smells pungent, foul, fishy, tar-like (creosote) or musty, it’s not exactly begging to be put on your pizza. Buy you might need to look out for those spicy ones and the ones with no discernible odor.

And in case you want to become your own mushroom classification machine, here are the other characteristics.