Linear Regression with PySpark

19 minute read

Intro

I’m running a Linear Regression on credit card data using PySpark. The objective is to create a regression that predicts credit card debt using several descriptive features. In a previous notebook I explored the dataset and reduced the features down to the few I wanted to test. The dataset itself is only 5000 rows but in the case that the dataset were too large to run on a single machine the data would need to be processed across several machines. Therefore I handle the feature transformations and regressions with PySpark as if this were the case.

Set Up

# importing packages
import findspark
findspark.init()
import pyspark
import pyspark.sql.functions as F
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import StringType, DoubleType

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Use this to change pandas row settings. Good for when I need to see entire dataset
# pd.set_option("display.max_rows", None)
pd.set_option('display.max_rows', 100)

# Selected features from earlier data exploration
columns = ['inccat', 'carcatvalue', 'default', 'empcat', 'agecat', 'cardtenurecat', 'card2tenurecat', 'retire', 'carown', 'jobsat', 'creddebt', 'income', 'othdebt', 'carvalue', 'debtinc', 'cardspent', 'card2spent', 'employ', 'tollten', 'wireten', 'cardtenure']

Below I inital Spark, import the data into the spark session and create a temporary view so that I can also query the data with SQL.

sc = pyspark.SparkContext()
# sc.stop()

spark = pyspark.sql.SparkSession(sc)

data = spark.read.csv('credit_card_regression_set.csv', header = True, inferSchema = True)

data.createOrReplaceTempView('data')

These are the variables decided during my initial exploration of the 130 features.

y = 'creddebt'

nums = ['income',
 'othdebt',
 'carvalue',
 'debtinc',
 'cardspent',
 'card2spent',
 'employ',
 'tollten',
 'wireten',
 'cardtenure']

cats = ['inccat',
 'carcatvalue',
 'default',
 'empcat',
 'agecat',
 'cardtenurecat',
 'card2tenurecat',
 'retire',
 'carown',
 'jobsat']

binary = ['default', 'retire']

Feature Engineering

Before I run the model I need to perform transformations that prepare the data for modeling. In this case It is mainly handling null values.

spark.sql("""SELECT inccat, carcatvalue,default, empcat, agecat, cardtenurecat, card2tenurecat, retire,carown, jobsat FROM data""" ).show()

+------+-----------+-------+------+------+-------------+--------------+------+------+------+
|inccat|carcatvalue|default|empcat|agecat|cardtenurecat|card2tenurecat|retire|carown|jobsat|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
|   2.0|        1.0|    1.0|   1.0|   2.0|          2.0|           2.0|   0.0|   1.0|   1.0|
|   1.0|        1.0|    1.0|   1.0|   2.0|          2.0|           2.0|   0.0|   1.0|   1.0|
|   2.0|        1.0|    0.0|   5.0|   6.0|          5.0|           5.0|   0.0|   1.0|   4.0|
|   1.0|        1.0|    1.0|   1.0|   2.0|          2.0|           2.0|   0.0|   1.0|   2.0|
|   1.0|        1.0|    0.0|   1.0|   3.0|          3.0|           3.0|   0.0|   0.0|   1.0|
|   4.0|       -1.0|    0.0|   5.0|   5.0|          5.0|           3.0|   0.0|  -1.0|   2.0|
|   4.0|        2.0|    0.0|   3.0|   5.0|          2.0|           2.0|   0.0|   1.0|   2.0|
|   4.0|        3.0|    0.0|   4.0|   4.0|          5.0|           5.0|   0.0|   1.0|   5.0|
|   1.0|        1.0|    0.0|   4.0|   6.0|          5.0|           5.0|   1.0|   1.0|   2.0|
|   4.0|        3.0|    0.0|   5.0|   4.0|          2.0|           2.0|   0.0|   1.0|   4.0|
|   2.0|        2.0|    0.0|   3.0|   5.0|          5.0|           5.0|   0.0|   1.0|   3.0|
|   1.0|        1.0|    0.0|   2.0|   3.0|          3.0|           2.0|   0.0|   1.0|   4.0|
|   3.0|        2.0|    0.0|   4.0|   4.0|          1.0|           1.0|   0.0|   1.0|   5.0|
|   3.0|       -1.0|    0.0|   2.0|   5.0|          5.0|           5.0|   0.0|  -1.0|   3.0|
|   1.0|        1.0|    0.0|   5.0|   6.0|          5.0|           5.0|   1.0|   1.0|   4.0|
|   1.0|        1.0|    0.0|   5.0|   6.0|          5.0|           5.0|   1.0|   1.0|   5.0|
|   5.0|        3.0|    0.0|   5.0|   5.0|          5.0|           4.0|   0.0|   1.0|   5.0|
|   5.0|        3.0|    0.0|   5.0|   5.0|          5.0|           5.0|   0.0|   1.0|   4.0|
|   1.0|       -1.0|    0.0|   2.0|   3.0|          3.0|           3.0|   0.0|  -1.0|   1.0|
|   1.0|        1.0|    0.0|   3.0|   6.0|          3.0|           3.0|   1.0|   1.0|   3.0|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
only showing top 20 rows

The categorical variables have null values listed as ‘-1.0’. These null values need to be handled in order for the regression to run correctly. In this case I decided to handle them to reassigning them to the most frequent value. The code below iterates through the categorical columns, queries the column for the most frequent value and replaces all null cases with that value.

for column in cats:
#     column = 'carcatvalue'
    data = data.withColumn(column, data[column].cast(StringType()))
    query = """SELECT {0}, COUNT({0}) as count FROM data GROUP BY {0} ORDER BY count DESC LIMIT 1 """.format(column)
    frequent = str(spark.sql(query).first()[0])
    data = data.withColumn(column, F.regexp_replace(column, '-1.0', frequent))

data.select(cats).show()

+------+-----------+-------+------+------+-------------+--------------+------+------+------+
|inccat|carcatvalue|default|empcat|agecat|cardtenurecat|card2tenurecat|retire|carown|jobsat|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
|   2.0|        1.0|    1.0|   1.0|   2.0|          2.0|           2.0|   0.0|   1.0|   1.0|
|   1.0|        1.0|    1.0|   1.0|   2.0|          2.0|           2.0|   0.0|   1.0|   1.0|
|   2.0|        1.0|    0.0|   5.0|   6.0|          5.0|           5.0|   0.0|   1.0|   4.0|
|   1.0|        1.0|    1.0|   1.0|   2.0|          2.0|           2.0|   0.0|   1.0|   2.0|
|   1.0|        1.0|    0.0|   1.0|   3.0|          3.0|           3.0|   0.0|   0.0|   1.0|
|   4.0|        1.0|    0.0|   5.0|   5.0|          5.0|           3.0|   0.0|   1.0|   2.0|
|   4.0|        2.0|    0.0|   3.0|   5.0|          2.0|           2.0|   0.0|   1.0|   2.0|
|   4.0|        3.0|    0.0|   4.0|   4.0|          5.0|           5.0|   0.0|   1.0|   5.0|
|   1.0|        1.0|    0.0|   4.0|   6.0|          5.0|           5.0|   1.0|   1.0|   2.0|
|   4.0|        3.0|    0.0|   5.0|   4.0|          2.0|           2.0|   0.0|   1.0|   4.0|
|   2.0|        2.0|    0.0|   3.0|   5.0|          5.0|           5.0|   0.0|   1.0|   3.0|
|   1.0|        1.0|    0.0|   2.0|   3.0|          3.0|           2.0|   0.0|   1.0|   4.0|
|   3.0|        2.0|    0.0|   4.0|   4.0|          1.0|           1.0|   0.0|   1.0|   5.0|
|   3.0|        1.0|    0.0|   2.0|   5.0|          5.0|           5.0|   0.0|   1.0|   3.0|
|   1.0|        1.0|    0.0|   5.0|   6.0|          5.0|           5.0|   1.0|   1.0|   4.0|
|   1.0|        1.0|    0.0|   5.0|   6.0|          5.0|           5.0|   1.0|   1.0|   5.0|
|   5.0|        3.0|    0.0|   5.0|   5.0|          5.0|           4.0|   0.0|   1.0|   5.0|
|   5.0|        3.0|    0.0|   5.0|   5.0|          5.0|           5.0|   0.0|   1.0|   4.0|
|   1.0|        1.0|    0.0|   2.0|   3.0|          3.0|           3.0|   0.0|   1.0|   1.0|
|   1.0|        1.0|    0.0|   3.0|   6.0|          3.0|           3.0|   1.0|   1.0|   3.0|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
only showing top 20 rows

The numerical columns have a similar issue. Null values are listed as ‘-1.0’. I’ll use similar code except I will use the mean rather than the mode in the case of numeric features.

for column in nums:
    data = data.withColumn(column, data[column].cast(StringType()))
    query = """SELECT MEAN({0}) FROM data""".format(column)
    mean = str(spark.sql(query).first()[0])
    data = data.withColumn(column, data[column].cast(StringType()))
    data = data.withColumn(column, F.regexp_replace(column, '-1.0', mean))
    data = data.withColumn(column, data[column].cast(DoubleType()))

data.select(nums).show()

+------+-------+------------------+-------+---------+----------+------+-------+-------+----------+
|income|othdebt|          carvalue|debtinc|cardspent|card2spent|employ|tollten|wireten|cardtenure|
+------+-------+------------------+-------+---------+----------+------+-------+-------+----------+
|  31.0|   2.24|              14.3|   11.1|    81.66|      67.8|   0.0| 161.05|    0.0|       2.0|
|  15.0|   1.57|               6.8|   18.6|     42.6|     34.94|   0.0|    0.0|1683.55|       4.0|
|  35.0|   2.54|              18.8|    9.9|   184.22|    175.75|  16.0|    0.0|    0.0|      35.0|
|  20.0|   1.12|               8.7|    5.7|   340.99|     18.42|   0.0|    0.0|    0.0|       5.0|
|  23.0|   0.18|              10.6|    1.7|    255.1|    252.73|   1.0|  387.7|  410.8|       8.0|
| 107.0|   4.93|23.232580000000013|    5.6|   228.27|       0.0|  22.0|  726.6|    0.0|      18.0|
|  77.0|   0.96|              25.6|    1.9|   822.32|    130.14|  10.0|    0.0|    0.0|       3.0|
|  97.0|   8.02|              55.5|   14.4|    592.7|     712.1|  11.0| 1110.4|    0.0|      25.0|
|  16.0|   0.31|               8.6|    2.6|   326.59|    141.24|  15.0|    0.0|    0.0|      26.0|
|  84.0|   1.67|              41.0|    4.1|   199.64|    111.17|  19.0|    0.0|    0.0|       2.0|
|  47.0|   2.68|              28.0|    8.6|   488.97|    322.07|   8.0|    0.0|    0.0|      16.0|
|  19.0|   0.04|               9.3|    0.9|   338.26|     55.17|   4.0|    0.0|    0.0|       6.0|
|  73.0|   0.79|              38.5|    2.8|   534.36|    198.39|  12.0|    0.0|    0.0|       1.0|
|  63.0|   4.69|23.232580000000013|   10.5|    593.5|    384.94|   3.0| 1776.7|    0.0|      25.0|
|  17.0|   0.39|               9.3|    9.8|   233.17|     39.93|  27.0|    0.0|    0.0|      33.0|
|  23.0|    1.2|              11.6|    9.3|   297.47|     98.65|  31.0| 1919.0|3159.25|      40.0|
| 171.0|  14.44|              75.8|    9.5|   305.94|     96.23|  24.0|1444.65|1840.45|      19.0|
| 424.0|  32.26|              88.6|   10.7|   495.75|    798.31|  29.0|3165.45|    0.0|      36.0|
|  23.0|   0.91|23.232580000000013|    4.8|   442.09|    144.69|   4.0|1099.65|    0.0|      10.0|
|  22.0|   2.65|              10.8|   15.2|     8.11|       0.0|  10.0|    0.0|    0.0|      10.0|
+------+-------+------------------+-------+---------+----------+------+-------+-------+----------+
only showing top 20 rows

data_train, data_test = data.randomSplit([.80, .20], seed = 4)

Creating the Pipeline

With the data clean enough for modeling I need to create a pipeline that runs the regression. Techincally this process includes additional feature engineering as the categorical variables need to be indexed and one hot encoded. Finally, all of the features need to be assembled into a single vector that PySpark is capable of reading.

# Index and encoding objects
index_list = [cat +'_idx' for cat in cats]
string_index = StringIndexer(inputCols = cats, outputCols = index_list)

encoded_list = [cat + '_encoded' for cat in cats]
one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)

# This code block allows me to decide features. I comment the extend methods that I wish to exclude when running the model
inputs = []
inputs.extend(nums)
inputs.extend(encoded_list)
inputs

['income',
 'othdebt',
 'carvalue',
 'debtinc',
 'cardspent',
 'card2spent',
 'employ',
 'tollten',
 'wireten',
 'cardtenure',
 'inccat_encoded',
 'carcatvalue_encoded',
 'default_encoded',
 'empcat_encoded',
 'agecat_encoded',
 'cardtenurecat_encoded',
 'card2tenurecat_encoded',
 'retire_encoded',
 'carown_encoded',
 'jobsat_encoded']

# This code creates the assembler and regression objects.

assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')

regression = LinearRegression(labelCol='creddebt')

# pipeline object contains previously created objects
pipepline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])

# model object fits the pipeline to the training data. predictions
model = pipepline.fit(data_train)

predictions = model.transform(data_test)

Evaluation

The table below shows the model predictions compared to the actual credit debt. ‘Creddebt’ is listed in the thousands. Therefore in the case of the first observation, the model predicted that this observation with an actual debt of 140 dollars would actually be about 600 dollars in debt.

predictions.select(['creddebt','prediction']).show()

+--------+--------------------+
|creddebt|          prediction|
+--------+--------------------+
|    0.14|  0.6093043679675625|
|    0.24| 0.43905437552333204|
|    0.52|0.012254495737605176|
|    0.39|   1.507075165380026|
|    1.13|   3.206200072134242|
|    0.02| -0.7176964758160933|
|    0.45|  0.8612760486636706|
|    2.46|  2.5064292029472672|
|    6.61|   4.050840955195323|
|   44.25|   13.44364556197883|
|    0.15| -0.5992770881744303|
|    8.15|    5.87001498551057|
|    0.39|-0.26548513694924303|
|    0.78|  1.7494149226058622|
|    0.34| -0.4892234737470511|
|   29.47|   9.280357979257039|
|    0.11| 0.22478571931758262|
|    4.73|   6.389449128368286|
|    0.78|  1.3024720830240581|
|    1.77|   1.021364413293012|
+--------+--------------------+
only showing top 20 rows

# Evaluation checks the total error of the model based on an evaluation metric in this case the metric is  root mean squared error
evaluation = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt').evaluate(predictions)

evaluation

2.2154065173974407

Documentation

I created a dataframe and a function to add the results of each prediction to my model to document the models improvement as I change parameters and features.

metrics = pd.DataFrame({'model': [], 'rmse':[], 'mse' : []})
# metrics.to_csv('metrics.csv')
# metrics = pd.read_csv('metrics.csv')
# metrics = metrics.iloc[:,1:]

# These functions evaluate the rmse and mse of the column and add them to my metrics dataframe.
def add_row(df, row):
    df.loc[len(df)] = row

def add_model_metrics(df, model_name, predictions):
    rmse = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt').evaluate(predictions)
    mse = RegressionEvaluator(metricName = 'mse', labelCol='creddebt').evaluate(predictions)

    row = [model_name, rmse, mse]

    add_row(df, row)

# add_model_metrics(metrics, 'nums_and_cats', predictions)

These are the first two models I ran one with only the numerical variables and a second with both the numerical and categorical variables. The first model has a rmse of 2.275 which suggests that the model is off by about 2270 dollars on average. With the categorical variables, the model is 60 dollars more effective on average when predicted on the test data.

metrics

	model	rmse	mse
0	num_only	2.275009	5.175667
1	nums_and_cats	2.215407	4.908026

Additional Features

Cross Validation

Adding cross validation (5 fold) to potentially increase model performance. Code below is similar to previous pipeline code with some additions for cross validation.

index_list = [cat +'_idx' for cat in cats]
string_index = StringIndexer(inputCols = cats, outputCols = index_list)

encoded_list = [cat + '_encoded' for cat in cats]

one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)

assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')

regression = LinearRegression(labelCol='creddebt')

pipeline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])

evaluator = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt')

paramGrid = ParamGridBuilder().build()

cvModel = CrossValidator(estimator = pipeline, evaluator = evaluator, numFolds = 4, estimatorParamMaps = paramGrid)

cvModel = CrossValidator.fit(cvModel, data_train)

predictions = cvModel.transform(data_test)

add_model_metrics(metrics, 'all_vars_cv_1', predictions)
metrics

	model	rmse	mse
0	num_only	2.275009	5.175667
1	nums_and_cats	2.215407	4.908026
2	all_vars_cv_1	2.215407	4.908026

It appears that cross validation had no effect on the model performance.

Log Transformation

There are a few feature adjustments that I could potentially improve the model. First, some of the continuous variables aren’t normally distributed. Applying a log transformation to some of the continuous variables notably changed the distribution to be more normal.

to_be_logged = ['income',
 'othdebt',
 'carvalue',
 'cardspent',
 'card2spent',
 'employ',
 'tollten',
 'wireten']

reg_nums = ['debtinc',
 'cardtenure']

log_nums = ['log_' + column for column in to_be_logged]

# This code goes through the to_be_logged list and applies the log transformation.
from pyspark.sql.functions import log
for column in to_be_logged:
    data = data.withColumn('log_' + column, log(data[column] + 1))

# Have to reinitialize training data to include new log values.
data_train, data_test = data.randomSplit([.80, .20], seed = 4)

# Ran this three times once with only continuous and a second time with categorical variables third time with both log and continuous
inputs = []
inputs.extend(log_nums)
inputs.extend(reg_nums)
index_list = [cat +'_idx' for cat in cats]
inputs.extend(index_list)
inputs.extend(to_be_logged)


string_index = StringIndexer(inputCols = cats, outputCols = index_list)

encoded_list = [cat + '_encoded' for cat in cats]

one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)

assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')

regression = LinearRegression(labelCol='creddebt')

pipeline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])

evaluator = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt')

paramGrid = ParamGridBuilder().build()

cvModel = CrossValidator(estimator = pipeline, evaluator = evaluator, numFolds = 4, estimatorParamMaps = paramGrid)

cvModel = CrossValidator.fit(cvModel, data_train)

predictions = cvModel.transform(data_test)

# log only
predictions.select(['creddebt','prediction']).show()

+--------+--------------------+
|creddebt|          prediction|
+--------+--------------------+
|    0.14|   1.647075423659162|
|    0.24|  0.6398959925978822|
|    0.52|  1.0997906096953702|
|    0.39|  0.3504364866826766|
|    1.13|  2.9820424472769513|
|    0.02| -1.9022647714346022|
|    0.45|  0.7227840152447431|
|    2.46|  3.6401236181581247|
|    6.61|   4.143968229124148|
|   44.25|   9.636200858653792|
|    0.15|   -1.18055814271729|
|    8.15|   7.317615336208826|
|    0.39| -0.4845952073333475|
|    0.78|  1.3886183932839593|
|    0.34|  -0.342778877833398|
|   29.47|   8.245049772042242|
|    0.11|-0.11961579436530378|
|    4.73|   4.785692639501264|
|    0.78|-0.01416733325522...|
|    1.77|   1.573471743150824|
+--------+--------------------+
only showing top 20 rows

# logs_and_cats
predictions.select(['creddebt','prediction']).show()

+--------+-------------------+
|creddebt|         prediction|
+--------+-------------------+
|    0.14| 1.2884720002611854|
|    0.24|-0.1333820961933423|
|    0.52| 0.6504041999093033|
|    0.39| 1.1970963328240192|
|    1.13| 2.7496023407588517|
|    0.02|-1.4263460324400405|
|    0.45|  -0.19447120947091|
|    2.46|  3.573529458318564|
|    6.61|   5.61117129623733|
|   44.25| 10.727084238308777|
|    0.15|-2.0566687769911294|
|    8.15|  7.613715198840104|
|    0.39|-1.2405597564490094|
|    0.78| 2.6866694377778852|
|    0.34| -1.012504694987344|
|   29.47|  9.559866729767748|
|    0.11|-0.5206746091395491|
|    4.73| 5.0217777453217884|
|    0.78| 0.7543957441303526|
|    1.77|   0.81080977472668|
+--------+-------------------+
only showing top 20 rows

# nums_logs_and_cats
predictions.select(['creddebt','prediction']).show()

+--------+-------------------+
|creddebt|         prediction|
+--------+-------------------+
|    0.14|0.49792826424154235|
|    0.24|0.41489351806923125|
|    0.52| 1.1100405323315137|
|    0.39| 1.0485681882122222|
|    1.13| 1.8889272517262299|
|    0.02|-0.6776597182744579|
|    0.45|0.48076946156178746|
|    2.46| 3.4513634556554464|
|    6.61|  4.029615100979548|
|   44.25| 14.408801210838694|
|    0.15|-0.8155181450877169|
|    8.15|  7.228276573264578|
|    0.39|-0.1515153538643217|
|    0.78| 0.9778442365562707|
|    0.34|0.41171528082751685|
|   29.47| 10.902899075433812|
|    0.11|-0.5355067880359905|
|    4.73|  5.778526907345183|
|    0.78|  1.628078117979622|
|    1.77| 1.4989848161401254|
+--------+-------------------+
only showing top 20 rows

add_model_metrics(metrics, 'nums_logs_and_cats', predictions)
metrics

	model	rmse	mse
0	num_only	2.275009	5.175667
1	nums_and_cats	2.215407	4.908026
2	all_vars_cv_1	2.215407	4.908026
3	num_only_log	2.335687	5.455432
4	logs_and_cats	2.252368	5.073160
5	nums_logs_and_cats	2.085990	4.351353

Running all of the continous, log adjusted variables, and categorical variables together seems to have gained the best overall performance.

Regularization

The last thing I will try is a grid search with some hyperparameter adjustments. This will include the regularization weight along with the elastic net which balances between the l1 and l2 regularization types. I will add the best iteration to my metrics dataframe.

index_list = [cat +'_idx' for cat in cats]
string_index = StringIndexer(inputCols = cats, outputCols = index_list)

encoded_list = [cat + '_encoded' for cat in cats]

one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)

assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')

regression = LinearRegression(labelCol='creddebt')

pipeline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])

evaluator = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt')

paramGrid = ParamGridBuilder().addGrid(regression.regParam, [0.1, 0.01, .001]).addGrid(regression.fitIntercept, [True])
paramGrid = paramGrid.addGrid(regression.elasticNetParam, [0.0, 0.5, 1.0]).build()

cvModel = CrossValidator(estimator = pipeline, evaluator = evaluator, numFolds = 4, estimatorParamMaps = paramGrid)

cvModel = CrossValidator.fit(cvModel, data_train)

predictions = cvModel.transform(data_test)

add_model_metrics(metrics, 'original_logs_cats_hyper_tuned', predictions)
metrics

	model	rmse	mse
0	num_only	2.275009	5.175667
1	nums_and_cats	2.215407	4.908026
2	all_vars_cv_1	2.215407	4.908026
3	num_only_log	2.335687	5.455432
4	logs_and_cats	2.252368	5.073160
5	nums_logs_and_cats	2.085990	4.351353
6	original_logs_cats_hyper_tuned	2.101103	4.414633

print('Best Regularization Parameter: ', cvModel.bestModel.stages[-1]._java_obj.parent().getRegParam())
print('Best ElasticNet Parameter: ',cvModel.bestModel.stages[-1]._java_obj.parent().getElasticNetParam())

Best Regularization Parameter:  0.1
Best ElasticNet Parameter:  0.0

The model chosen was a ridge regression with a strength of .1. Although it performs worse than the previous model it is likely that the ridge regression is more generalizable. I’ll stick with this as the final model when interpreting variables.

Best Model Interpretation

In order to interpret the model I will run a final regression in pandas using the determined parameters. This will allow me to visualize and interpret the coefficients at a more individual level.

pd_data_train = data_train.toPandas()
pd_data_test = data_test.toPandas()

inputs = []
inputs.extend(log_nums)
inputs.extend(reg_nums)
inputs.extend(cats)
inputs.extend(to_be_logged)
inputs.append('creddebt')
inputs = set(inputs)

pd_data_train = pd_data_train[inputs]
pd_data_test = pd_data_test[inputs]

pd_data_train = pd.get_dummies(pd_data_train)
X_train = pd_data_train.drop('creddebt', axis = 1)
y_train = pd_data_train['creddebt']
pd_data_test = pd.get_dummies(pd_data_test)
X_test = pd_data_test.drop('creddebt', axis = 1)
y_test = pd_data_test['creddebt']

pd_data_train

	debtinc	card2spent	income	log_card2spent	log_income	cardtenure	wireten	carvalue	tollten	log_tollten	...	jobsat_1.0	jobsat_2.0	jobsat_3.0	jobsat_4.0	jobsat_5.0	empcat_1.0	empcat_2.0	empcat_3.0	empcat_4.0	empcat_5.0
0	2.0	114.70	85.0	4.751001	4.454347	15.0	621.90	43.10000	554.75	6.320319	...	0	0	0	1	0	0	0	0	1	0
1	13.1	66.60	39.0	4.213608	3.688879	31.0	0.00	10.00000	0.00	0.000000	...	0	0	0	1	0	0	0	0	1	0
2	6.2	394.15	138.0	5.979265	4.934474	1.0	30.05	66.20000	20.50	3.068053	...	0	0	0	1	0	0	1	0	0	0
3	25.1	197.63	63.0	5.291444	4.158883	17.0	0.00	39.30000	0.00	0.000000	...	1	0	0	0	0	0	0	1	0	0
4	9.2	214.91	37.0	5.374862	3.637586	19.0	0.00	15.80000	0.00	0.000000	...	1	0	0	0	0	0	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3982	5.9	839.40	84.0	6.733878	4.442651	15.0	887.50	30.60000	557.55	6.325344	...	0	1	0	0	0	0	0	1	0	0
3983	0.8	156.35	47.0	5.058473	3.871201	19.0	0.00	22.40000	0.00	0.000000	...	0	0	1	0	0	0	0	0	1	0
3984	1.3	84.25	31.0	4.445588	3.465736	9.0	0.00	17.10000	0.00	0.000000	...	0	1	0	0	0	0	1	0	0	0
3985	19.5	56.98	26.0	4.060098	3.295837	6.0	0.00	11.60000	0.00	0.000000	...	0	0	1	0	0	0	1	0	0	0
3986	13.4	211.04	36.0	5.356775	3.610918	3.0	380.15	23.23258	108.85	4.699116	...	0	0	1	0	0	0	0	1	0	0

3987 rows × 58 columns

from sklearn.linear_model import ElasticNet
lr = ElasticNet(.1, 0)
lr.fit(X_train, y_train)

C:\Users\Allen\anaconda3\lib\site-packages\sklearn\utils\validation.py:71: FutureWarning: Pass l1_ratio=0 as keyword args. From version 0.25 passing these as positional arguments will result in an error
  FutureWarning)
C:\Users\Allen\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:531: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 8504.175821644329, tolerance: 4.811267459974919
  positive)

ElasticNet(alpha=0.1, l1_ratio=0)

predictions = lr.predict(X_test)

Below I create a dataframe with the coefficient data. I create a column ‘sign’ that is meant to record whether or not the variable is positive or negative. I also create a ‘coefficient_abs’ column which transforms the variable into its absolute value. These columns will be useful for visualizing and comparing all of the features based on their impact and usefulness to the model.

coefficient_data = pd.DataFrame({'feature' : X_train.columns, 'coefficient' : lr.coef_})

coefficient_data = coefficient_data.sort_values('coefficient', ascending = False)

coefficient_data['sign'] = coefficient_data['coefficient'].apply(lambda n: 'positive' if n > 0 else 'negative')

coefficient_data['coefficient_abs'] = coefficient_data['coefficient'].apply(lambda n: np.abs(n))

coefficient_data

	feature	coefficient	sign	coefficient_abs
34	default_1.0	0.362657	positive	0.362657
14	log_carvalue	0.204723	positive	0.204723
0	debtinc	0.202457	positive	0.202457
15	othdebt	0.126545	positive	0.126545
20	inccat_3.0	0.123089	positive	0.123089
19	inccat_2.0	0.094148	positive	0.094148
10	log_employ	0.083809	positive	0.083809
51	jobsat_5.0	0.076830	positive	0.076830
54	empcat_3.0	0.076712	positive	0.076712
30	agecat_6.0	0.067036	positive	0.067036
25	carcatvalue_3.0	0.062675	positive	0.062675
21	inccat_4.0	0.052980	positive	0.052980
45	card2tenurecat_4.0	0.049983	positive	0.049983
55	empcat_4.0	0.049716	positive	0.049716
2	income	0.049407	positive	0.049407
13	log_wireten	0.027120	positive	0.027120
50	jobsat_4.0	0.026380	positive	0.026380
24	carcatvalue_2.0	0.025407	positive	0.025407
37	cardtenurecat_1.0	0.018748	positive	0.018748
28	agecat_4.0	0.018213	positive	0.018213
35	retire_0.0	0.016799	positive	0.016799
42	card2tenurecat_1.0	0.016073	positive	0.016073
41	cardtenurecat_5.0	0.015285	positive	0.015285
32	carown_1.0	0.011392	positive	0.011392
38	cardtenurecat_2.0	0.009653	positive	0.009653
16	cardspent	0.000938	positive	0.000938
5	cardtenure	0.000441	positive	0.000441
1	card2spent	0.000240	positive	0.000240
8	tollten	0.000107	positive	0.000107
6	wireten	-0.000143	negative	0.000143
12	employ	-0.004097	negative	0.004097
39	cardtenurecat_3.0	-0.010948	negative	0.010948
31	carown_0.0	-0.011392	negative	0.011392
47	jobsat_1.0	-0.013124	negative	0.013124
36	retire_1.0	-0.016799	negative	0.016799
43	card2tenurecat_2.0	-0.016905	negative	0.016905
44	card2tenurecat_3.0	-0.020578	negative	0.020578
26	agecat_2.0	-0.020901	negative	0.020901
27	agecat_3.0	-0.022236	negative	0.022236
3	log_card2spent	-0.026547	negative	0.026547
9	log_tollten	-0.027404	negative	0.027404
56	empcat_5.0	-0.027905	negative	0.027905
46	card2tenurecat_5.0	-0.028574	negative	0.028574
7	carvalue	-0.031053	negative	0.031053
40	cardtenurecat_4.0	-0.032739	negative	0.032739
48	jobsat_2.0	-0.040003	negative	0.040003
52	empcat_1.0	-0.041940	negative	0.041940
29	agecat_5.0	-0.042112	negative	0.042112
49	jobsat_3.0	-0.050082	negative	0.050082
53	empcat_2.0	-0.056583	negative	0.056583
18	inccat_1.0	-0.069737	negative	0.069737
23	carcatvalue_1.0	-0.088082	negative	0.088082
4	log_income	-0.145739	negative	0.145739
22	inccat_5.0	-0.200480	negative	0.200480
17	log_cardspent	-0.205127	negative	0.205127
33	default_0.0	-0.362657	negative	0.362657
11	log_othdebt	-1.202853	negative	1.202853

coefficient_data.describe()

	coefficient	coefficient_abs
count	57.000000	57.000000
mean	-0.016617	0.082216
std	0.188638	0.170250
min	-1.202853	0.000107
25%	-0.028574	0.016799
50%	0.000107	0.031053
75%	0.049407	0.076712
max	0.362657	1.202853

Looking at the coefficient data I decide that using the 75% marker on the magnitude makes the most sense and filter at coefficients that don’t have a beta value larger than .07. I then visualize these ‘impactful coefficients’ below.

impactful_coefs = coefficient_data[coefficient_data['coefficient_abs'] > .07]
non_impactful_coefs = coefficient_data[coefficient_data['coefficient_abs'] < .07]

impactful_coefs

	feature	coefficient	sign	coefficient_abs
34	default_1.0	0.362657	positive	0.362657
14	log_carvalue	0.204723	positive	0.204723
0	debtinc	0.202457	positive	0.202457
15	othdebt	0.126545	positive	0.126545
20	inccat_3.0	0.123089	positive	0.123089
19	inccat_2.0	0.094148	positive	0.094148
10	log_employ	0.083809	positive	0.083809
51	jobsat_5.0	0.076830	positive	0.076830
54	empcat_3.0	0.076712	positive	0.076712
23	carcatvalue_1.0	-0.088082	negative	0.088082
4	log_income	-0.145739	negative	0.145739
22	inccat_5.0	-0.200480	negative	0.200480
17	log_cardspent	-0.205127	negative	0.205127
33	default_0.0	-0.362657	negative	0.362657
11	log_othdebt	-1.202853	negative	1.202853

impactful_coefs = impactful_coefs.sort_values('coefficient_abs', ascending = False)

fig, ax = plt.subplots(figsize = (10,5))
plot = sns.barplot(data = impactful_coefs.head(15), x = 'feature', y = 'coefficient_abs', hue = 'sign', hue_order = ['positive', 'negative'])
plt.title('Feature Weights')
ticks = plot.set_xticklabels(plot.get_xticklabels(),rotation = 90)

linearly separable data

non_impactful_coefs

	feature	coefficient	sign	coefficient_abs
30	agecat_6.0	0.067036	positive	0.067036
25	carcatvalue_3.0	0.062675	positive	0.062675
21	inccat_4.0	0.052980	positive	0.052980
45	card2tenurecat_4.0	0.049983	positive	0.049983
55	empcat_4.0	0.049716	positive	0.049716
2	income	0.049407	positive	0.049407
13	log_wireten	0.027120	positive	0.027120
50	jobsat_4.0	0.026380	positive	0.026380
24	carcatvalue_2.0	0.025407	positive	0.025407
37	cardtenurecat_1.0	0.018748	positive	0.018748
28	agecat_4.0	0.018213	positive	0.018213
35	retire_0.0	0.016799	positive	0.016799
42	card2tenurecat_1.0	0.016073	positive	0.016073
41	cardtenurecat_5.0	0.015285	positive	0.015285
32	carown_1.0	0.011392	positive	0.011392
38	cardtenurecat_2.0	0.009653	positive	0.009653
16	cardspent	0.000938	positive	0.000938
5	cardtenure	0.000441	positive	0.000441
1	card2spent	0.000240	positive	0.000240
8	tollten	0.000107	positive	0.000107
6	wireten	-0.000143	negative	0.000143
12	employ	-0.004097	negative	0.004097
39	cardtenurecat_3.0	-0.010948	negative	0.010948
31	carown_0.0	-0.011392	negative	0.011392
47	jobsat_1.0	-0.013124	negative	0.013124
36	retire_1.0	-0.016799	negative	0.016799
43	card2tenurecat_2.0	-0.016905	negative	0.016905
44	card2tenurecat_3.0	-0.020578	negative	0.020578
26	agecat_2.0	-0.020901	negative	0.020901
27	agecat_3.0	-0.022236	negative	0.022236
3	log_card2spent	-0.026547	negative	0.026547
9	log_tollten	-0.027404	negative	0.027404
56	empcat_5.0	-0.027905	negative	0.027905
46	card2tenurecat_5.0	-0.028574	negative	0.028574
7	carvalue	-0.031053	negative	0.031053
40	cardtenurecat_4.0	-0.032739	negative	0.032739
48	jobsat_2.0	-0.040003	negative	0.040003
52	empcat_1.0	-0.041940	negative	0.041940
29	agecat_5.0	-0.042112	negative	0.042112
49	jobsat_3.0	-0.050082	negative	0.050082
53	empcat_2.0	-0.056583	negative	0.056583
18	inccat_1.0	-0.069737	negative	0.069737

metrics

	model	rmse	mse
0	num_only	2.275009	5.175667
1	nums_and_cats	2.215407	4.908026
2	all_vars_cv_1	2.215407	4.908026
3	num_only_log	2.335687	5.455432
4	logs_and_cats	2.252368	5.073160
5	nums_logs_and_cats	2.085990	4.351353
6	original_logs_cats_hyper_tuned	2.101103	4.414633

sc.stop()
# metrics.to_csv('metrics.csv')

Analysis and Conclusion

The model results mostly seem intuitive. Other debt, whether or not you’ve ever defaulted, and amount spent on card last month are sensible indicators for credit card debt. On the lower end, it seems to be interesting that variables related to employment and job security indicate more credit debt.

There’s likely alot more hyper parameter tuning that can be done to develop a model with less error. Along with that features excluded in my analysis could be included and features determined to be useless could be excluded in a new model. A gradient boosting model could also be run with regression trees to see if they are generally more efffective.

Share on

Twitter Facebook Google+ LinkedIn

Allen Jackson