A Love Letter to Functions

“Programming isn’t about what you know; it’s about what you can figure out.” 

– Chris Pine

It wasn’t love at first sight.

When I was first introduced to functions, they seemed perfectly fine, but a little bland. Sort of like the person your grandmother would be happy for you to bring home – steady, predictable and, let’s face it, a little boring.

But, oh how wrong I was.

In the beginning, I would tend to forget about them until after the fact, reworking code to include them only after I’d accomplished my goals for the task at hand. I felt something shift as I worked on my most recent project, which involved iteratively searching for the best linear regression model for a data set.

For the first time, I found myself coding proactively instead of reactively (a goal I set for myself after getting into a tangled, reactive mess in my first project). After exploring the data but before doing any real work, I thought about what tasks would need to be repeated throughout the project:

  • creating barplots of categorical features vs. the median of the target variable for that category
  • performing a train-test split on dataframes, further splitting the features into numerical and categorical, scaling and one-hot-encoding them, respectively, and returning the transformed dataframes
  • creating scatterplots of the residuals
  • creating histograms and boxplots of the residuals
  • calculating (and adding to a dictionary) the R², adjusted R², MAE, MSE, RMSE, Durbin-Watson score, Jarque-Bera (J-B) score and VIF for each model (a sketch of this helper appears just after this list)
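
To give a flavor of that last task, here’s a minimal sketch of what such a metrics helper could look like. This isn’t my actual project code: the name score_model, the use of a fitted statsmodels OLS results object and the choice to summarize VIF as a single mean are all placeholder assumptions.

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson, jarque_bera

def score_model(name, results, X, scores):
    """Adds one model's validation metrics to the scores dictionary.
    Assumes results is a fitted statsmodels OLS results object and X is
    the (processed) feature dataframe the model was fit on."""
    resid = results.resid
    scores[name] = {
        'R2': results.rsquared,
        'adj_R2': results.rsquared_adj,
        'MAE': np.mean(np.abs(resid)),
        'MSE': np.mean(resid ** 2),
        'RMSE': np.sqrt(np.mean(resid ** 2)),
        'Durbin-Watson': durbin_watson(resid),
        'Jarque-Bera': jarque_bera(resid)[0],  # the J-B test statistic
        # Summarized here as one number per model: the mean VIF across
        # all features.
        'mean_VIF': np.mean([variance_inflation_factor(X.values, i)
                             for i in range(X.shape[1])]),
    }
    return scores

Passing the same scores dictionary to every call keeps all of the models’ metrics in one place, which makes comparing models side by side painless.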

I started by using my first model attempt to write and tweak each block of code. Then, once I was satisfied with the output, I modified each block and turned it into a function. Functions that I could then easily reuse for the rest of the project.

That’s when I fell in love. It took mere moments for me to select new features and then create and validate a model using those parameters. As I worked, I decided that some things in the graphs or model validation needed changing, and in seconds the changes were applied to every model I had created (which, at one point, was up to 14!). It was fast. It was easy. And it allowed me to focus on the models rather than on typing out a bunch of code.

It’s still a new relationship, and I have much left to learn about all that functions have to offer. My next goal is to get better at writing smaller functions that can then be assembled into larger ones (there’s a sketch of what that might look like right after the code below). I also still end up repeating some code that could be written as a function, and I’m still learning how to use global variables to my advantage.

But that’s okay. This is love; I have a lifetime to get to know them better:)

Here’s the function I wrote to handle the train-test-split process.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def split_and_process(df, target, test_size=0.25, random_state=100):
    """Takes in a dataframe and the name of the target. Splits the dataframe
    into X_train, X_test, y_train, y_test. Next, the features are divided into
    numerical and categorical and are scaled and one-hot-encoded,
    respectively. Finally, these are changed back into dataframes and the
    transformed X_train, X_test, y_train, y_test dataframes are returned."""

    df = df.copy()
    X = df.drop(target, axis=1)
    y = df[target]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)

    # Separate the numerical and categorical feature names.
    num = X_train.select_dtypes('number').columns
    cat = X_train.select_dtypes('object').columns

    # One-hot-encode the categorical features, fitting on the training set
    # only. (In scikit-learn versions before 1.2, use sparse=False and
    # get_feature_names instead of sparse_output and get_feature_names_out.)
    ohe = OneHotEncoder(drop='first', sparse_output=False)
    X_train_cat = pd.DataFrame(ohe.fit_transform(X_train[cat]),
                               columns=ohe.get_feature_names_out(cat))
    X_test_cat = pd.DataFrame(ohe.transform(X_test[cat]),
                              columns=ohe.get_feature_names_out(cat))

    # Scale the numerical features, again fitting on the training set only.
    scale = StandardScaler()
    X_train_num = pd.DataFrame(scale.fit_transform(X_train[num]), columns=num)
    X_test_num = pd.DataFrame(scale.transform(X_test[num]), columns=num)

    # Recombine; the new dataframes already share a fresh 0-based index.
    X_train_processed = pd.concat([X_train_num, X_train_cat], axis=1)
    X_test_processed = pd.concat([X_test_num, X_test_cat], axis=1)

    # Reset the targets' indices so they line up with the processed features.
    y_train = y_train.reset_index(drop=True)
    y_test = y_test.reset_index(drop=True)

    return X_train_processed, X_test_processed, y_train, y_test
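
Looking at it now, this function is also a good candidate for the smaller-functions goal I mentioned above. Here’s one way it could be broken apart. This is just a sketch (using the same imports as above), and the helper names are ones I made up for the example:

def encode_categoricals(train, test, cols):
    """One-hot-encodes the categorical columns, fitting on train only."""
    ohe = OneHotEncoder(drop='first', sparse_output=False)
    train_cat = pd.DataFrame(ohe.fit_transform(train[cols]),
                             columns=ohe.get_feature_names_out(cols))
    test_cat = pd.DataFrame(ohe.transform(test[cols]),
                            columns=ohe.get_feature_names_out(cols))
    return train_cat, test_cat

def scale_numericals(train, test, cols):
    """Standard-scales the numerical columns, fitting on train only."""
    scaler = StandardScaler()
    train_num = pd.DataFrame(scaler.fit_transform(train[cols]), columns=cols)
    test_num = pd.DataFrame(scaler.transform(test[cols]), columns=cols)
    return train_num, test_num

def split_and_process_small(df, target, test_size=0.25, random_state=100):
    """The same behavior as split_and_process, assembled from helpers."""
    X = df.drop(target, axis=1)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    num = X_train.select_dtypes('number').columns
    cat = X_train.select_dtypes('object').columns
    X_train_num, X_test_num = scale_numericals(X_train, X_test, num)
    X_train_cat, X_test_cat = encode_categoricals(X_train, X_test, cat)
    X_train_processed = pd.concat([X_train_num, X_train_cat], axis=1)
    X_test_processed = pd.concat([X_test_num, X_test_cat], axis=1)
    return (X_train_processed, X_test_processed,
            y_train.reset_index(drop=True), y_test.reset_index(drop=True))

The nice thing about this shape is that each small piece can be tested (and reused) on its own, while the big function shrinks down to a readable summary of the steps.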

Lesson of the Day

I just used the Scikit Learn Pipeline tool for the first time in this project. It, like so many other tools in Scikit Learn, makes me appreciate code that is well written for its intended purpose.
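
For anyone curious, here’s a minimal sketch of how the preprocessing from split_and_process could be expressed with Pipeline and ColumnTransformer. This is an illustration of the tool, not the pipeline from my project:

from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Scale the numerical columns and one-hot-encode the categorical ones,
# then feed the transformed features into a linear regression.
preprocess = ColumnTransformer([
    ('num', StandardScaler(),
     make_column_selector(dtype_include='number')),
    ('cat', OneHotEncoder(drop='first'),
     make_column_selector(dtype_include=object)),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('regression', LinearRegression()),
])

# The whole thing then fits and predicts like a single estimator:
# model.fit(X_train, y_train)
# model.score(X_test, y_test)

Because the scaler and encoder live inside the pipeline, they are fit only on the training data during model.fit, which rules out the train-test leakage that split_and_process has to guard against by hand.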

Frustration of the Day

It’s hard to know when something is “good enough” and it’s time to call it a day.

Win of the Day

I had a MUCH easier time with my 2nd project – I planned better, I had to look up fewer things and I broke through roadblocks much faster. I’m learning!

Current Standing on the Imposter Syndrome Scale

2/5

Feeling pretty good today:)
