Linear Regression with PySpark
Intro
I’m running a Linear Regression on credit card data using PySpark. The objective is to create a regression that predicts credit card debt using several descriptive features. In a previous notebook I explored the dataset and reduced the features down to the few I wanted to test. The dataset itself is only 5000 rows but in the case that the dataset were too large to run on a single machine the data would need to be processed across several machines. Therefore I handle the feature transformations and regressions with PySpark as if this were the case.
Set Up
# importing packages
import findspark
findspark.init()
import pyspark
import pyspark.sql.functions as F
from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import OneHotEncoder, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import StringType, DoubleType
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Use this to change pandas row settings. Good for when I need to see entire dataset
# pd.set_option("display.max_rows", None)
pd.set_option('display.max_rows', 100)
# Selected features from earlier data exploration
columns = ['inccat', 'carcatvalue', 'default', 'empcat', 'agecat', 'cardtenurecat', 'card2tenurecat', 'retire', 'carown', 'jobsat', 'creddebt', 'income', 'othdebt', 'carvalue', 'debtinc', 'cardspent', 'card2spent', 'employ', 'tollten', 'wireten', 'cardtenure']
Below I inital Spark, import the data into the spark session and create a temporary view so that I can also query the data with SQL.
sc = pyspark.SparkContext()
# sc.stop()
spark = pyspark.sql.SparkSession(sc)
data = spark.read.csv('credit_card_regression_set.csv', header = True, inferSchema = True)
data.createOrReplaceTempView('data')
These are the variables decided during my initial exploration of the 130 features.
y = 'creddebt'
nums = ['income',
'othdebt',
'carvalue',
'debtinc',
'cardspent',
'card2spent',
'employ',
'tollten',
'wireten',
'cardtenure']
cats = ['inccat',
'carcatvalue',
'default',
'empcat',
'agecat',
'cardtenurecat',
'card2tenurecat',
'retire',
'carown',
'jobsat']
binary = ['default', 'retire']
Feature Engineering
Before I run the model I need to perform transformations that prepare the data for modeling. In this case It is mainly handling null values.
spark.sql("""SELECT inccat, carcatvalue,default, empcat, agecat, cardtenurecat, card2tenurecat, retire,carown, jobsat FROM data""" ).show()
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
|inccat|carcatvalue|default|empcat|agecat|cardtenurecat|card2tenurecat|retire|carown|jobsat|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
| 2.0| 1.0| 1.0| 1.0| 2.0| 2.0| 2.0| 0.0| 1.0| 1.0|
| 1.0| 1.0| 1.0| 1.0| 2.0| 2.0| 2.0| 0.0| 1.0| 1.0|
| 2.0| 1.0| 0.0| 5.0| 6.0| 5.0| 5.0| 0.0| 1.0| 4.0|
| 1.0| 1.0| 1.0| 1.0| 2.0| 2.0| 2.0| 0.0| 1.0| 2.0|
| 1.0| 1.0| 0.0| 1.0| 3.0| 3.0| 3.0| 0.0| 0.0| 1.0|
| 4.0| -1.0| 0.0| 5.0| 5.0| 5.0| 3.0| 0.0| -1.0| 2.0|
| 4.0| 2.0| 0.0| 3.0| 5.0| 2.0| 2.0| 0.0| 1.0| 2.0|
| 4.0| 3.0| 0.0| 4.0| 4.0| 5.0| 5.0| 0.0| 1.0| 5.0|
| 1.0| 1.0| 0.0| 4.0| 6.0| 5.0| 5.0| 1.0| 1.0| 2.0|
| 4.0| 3.0| 0.0| 5.0| 4.0| 2.0| 2.0| 0.0| 1.0| 4.0|
| 2.0| 2.0| 0.0| 3.0| 5.0| 5.0| 5.0| 0.0| 1.0| 3.0|
| 1.0| 1.0| 0.0| 2.0| 3.0| 3.0| 2.0| 0.0| 1.0| 4.0|
| 3.0| 2.0| 0.0| 4.0| 4.0| 1.0| 1.0| 0.0| 1.0| 5.0|
| 3.0| -1.0| 0.0| 2.0| 5.0| 5.0| 5.0| 0.0| -1.0| 3.0|
| 1.0| 1.0| 0.0| 5.0| 6.0| 5.0| 5.0| 1.0| 1.0| 4.0|
| 1.0| 1.0| 0.0| 5.0| 6.0| 5.0| 5.0| 1.0| 1.0| 5.0|
| 5.0| 3.0| 0.0| 5.0| 5.0| 5.0| 4.0| 0.0| 1.0| 5.0|
| 5.0| 3.0| 0.0| 5.0| 5.0| 5.0| 5.0| 0.0| 1.0| 4.0|
| 1.0| -1.0| 0.0| 2.0| 3.0| 3.0| 3.0| 0.0| -1.0| 1.0|
| 1.0| 1.0| 0.0| 3.0| 6.0| 3.0| 3.0| 1.0| 1.0| 3.0|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
only showing top 20 rows
The categorical variables have null values listed as ‘-1.0’. These null values need to be handled in order for the regression to run correctly. In this case I decided to handle them to reassigning them to the most frequent value. The code below iterates through the categorical columns, queries the column for the most frequent value and replaces all null cases with that value.
for column in cats:
# column = 'carcatvalue'
data = data.withColumn(column, data[column].cast(StringType()))
query = """SELECT {0}, COUNT({0}) as count FROM data GROUP BY {0} ORDER BY count DESC LIMIT 1 """.format(column)
frequent = str(spark.sql(query).first()[0])
data = data.withColumn(column, F.regexp_replace(column, '-1.0', frequent))
data.select(cats).show()
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
|inccat|carcatvalue|default|empcat|agecat|cardtenurecat|card2tenurecat|retire|carown|jobsat|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
| 2.0| 1.0| 1.0| 1.0| 2.0| 2.0| 2.0| 0.0| 1.0| 1.0|
| 1.0| 1.0| 1.0| 1.0| 2.0| 2.0| 2.0| 0.0| 1.0| 1.0|
| 2.0| 1.0| 0.0| 5.0| 6.0| 5.0| 5.0| 0.0| 1.0| 4.0|
| 1.0| 1.0| 1.0| 1.0| 2.0| 2.0| 2.0| 0.0| 1.0| 2.0|
| 1.0| 1.0| 0.0| 1.0| 3.0| 3.0| 3.0| 0.0| 0.0| 1.0|
| 4.0| 1.0| 0.0| 5.0| 5.0| 5.0| 3.0| 0.0| 1.0| 2.0|
| 4.0| 2.0| 0.0| 3.0| 5.0| 2.0| 2.0| 0.0| 1.0| 2.0|
| 4.0| 3.0| 0.0| 4.0| 4.0| 5.0| 5.0| 0.0| 1.0| 5.0|
| 1.0| 1.0| 0.0| 4.0| 6.0| 5.0| 5.0| 1.0| 1.0| 2.0|
| 4.0| 3.0| 0.0| 5.0| 4.0| 2.0| 2.0| 0.0| 1.0| 4.0|
| 2.0| 2.0| 0.0| 3.0| 5.0| 5.0| 5.0| 0.0| 1.0| 3.0|
| 1.0| 1.0| 0.0| 2.0| 3.0| 3.0| 2.0| 0.0| 1.0| 4.0|
| 3.0| 2.0| 0.0| 4.0| 4.0| 1.0| 1.0| 0.0| 1.0| 5.0|
| 3.0| 1.0| 0.0| 2.0| 5.0| 5.0| 5.0| 0.0| 1.0| 3.0|
| 1.0| 1.0| 0.0| 5.0| 6.0| 5.0| 5.0| 1.0| 1.0| 4.0|
| 1.0| 1.0| 0.0| 5.0| 6.0| 5.0| 5.0| 1.0| 1.0| 5.0|
| 5.0| 3.0| 0.0| 5.0| 5.0| 5.0| 4.0| 0.0| 1.0| 5.0|
| 5.0| 3.0| 0.0| 5.0| 5.0| 5.0| 5.0| 0.0| 1.0| 4.0|
| 1.0| 1.0| 0.0| 2.0| 3.0| 3.0| 3.0| 0.0| 1.0| 1.0|
| 1.0| 1.0| 0.0| 3.0| 6.0| 3.0| 3.0| 1.0| 1.0| 3.0|
+------+-----------+-------+------+------+-------------+--------------+------+------+------+
only showing top 20 rows
The numerical columns have a similar issue. Null values are listed as ‘-1.0’. I’ll use similar code except I will use the mean rather than the mode in the case of numeric features.
for column in nums:
data = data.withColumn(column, data[column].cast(StringType()))
query = """SELECT MEAN({0}) FROM data""".format(column)
mean = str(spark.sql(query).first()[0])
data = data.withColumn(column, data[column].cast(StringType()))
data = data.withColumn(column, F.regexp_replace(column, '-1.0', mean))
data = data.withColumn(column, data[column].cast(DoubleType()))
data.select(nums).show()
+------+-------+------------------+-------+---------+----------+------+-------+-------+----------+
|income|othdebt| carvalue|debtinc|cardspent|card2spent|employ|tollten|wireten|cardtenure|
+------+-------+------------------+-------+---------+----------+------+-------+-------+----------+
| 31.0| 2.24| 14.3| 11.1| 81.66| 67.8| 0.0| 161.05| 0.0| 2.0|
| 15.0| 1.57| 6.8| 18.6| 42.6| 34.94| 0.0| 0.0|1683.55| 4.0|
| 35.0| 2.54| 18.8| 9.9| 184.22| 175.75| 16.0| 0.0| 0.0| 35.0|
| 20.0| 1.12| 8.7| 5.7| 340.99| 18.42| 0.0| 0.0| 0.0| 5.0|
| 23.0| 0.18| 10.6| 1.7| 255.1| 252.73| 1.0| 387.7| 410.8| 8.0|
| 107.0| 4.93|23.232580000000013| 5.6| 228.27| 0.0| 22.0| 726.6| 0.0| 18.0|
| 77.0| 0.96| 25.6| 1.9| 822.32| 130.14| 10.0| 0.0| 0.0| 3.0|
| 97.0| 8.02| 55.5| 14.4| 592.7| 712.1| 11.0| 1110.4| 0.0| 25.0|
| 16.0| 0.31| 8.6| 2.6| 326.59| 141.24| 15.0| 0.0| 0.0| 26.0|
| 84.0| 1.67| 41.0| 4.1| 199.64| 111.17| 19.0| 0.0| 0.0| 2.0|
| 47.0| 2.68| 28.0| 8.6| 488.97| 322.07| 8.0| 0.0| 0.0| 16.0|
| 19.0| 0.04| 9.3| 0.9| 338.26| 55.17| 4.0| 0.0| 0.0| 6.0|
| 73.0| 0.79| 38.5| 2.8| 534.36| 198.39| 12.0| 0.0| 0.0| 1.0|
| 63.0| 4.69|23.232580000000013| 10.5| 593.5| 384.94| 3.0| 1776.7| 0.0| 25.0|
| 17.0| 0.39| 9.3| 9.8| 233.17| 39.93| 27.0| 0.0| 0.0| 33.0|
| 23.0| 1.2| 11.6| 9.3| 297.47| 98.65| 31.0| 1919.0|3159.25| 40.0|
| 171.0| 14.44| 75.8| 9.5| 305.94| 96.23| 24.0|1444.65|1840.45| 19.0|
| 424.0| 32.26| 88.6| 10.7| 495.75| 798.31| 29.0|3165.45| 0.0| 36.0|
| 23.0| 0.91|23.232580000000013| 4.8| 442.09| 144.69| 4.0|1099.65| 0.0| 10.0|
| 22.0| 2.65| 10.8| 15.2| 8.11| 0.0| 10.0| 0.0| 0.0| 10.0|
+------+-------+------------------+-------+---------+----------+------+-------+-------+----------+
only showing top 20 rows
data_train, data_test = data.randomSplit([.80, .20], seed = 4)
Creating the Pipeline
With the data clean enough for modeling I need to create a pipeline that runs the regression. Techincally this process includes additional feature engineering as the categorical variables need to be indexed and one hot encoded. Finally, all of the features need to be assembled into a single vector that PySpark is capable of reading.
# Index and encoding objects
index_list = [cat +'_idx' for cat in cats]
string_index = StringIndexer(inputCols = cats, outputCols = index_list)
encoded_list = [cat + '_encoded' for cat in cats]
one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)
# This code block allows me to decide features. I comment the extend methods that I wish to exclude when running the model
inputs = []
inputs.extend(nums)
inputs.extend(encoded_list)
inputs
['income',
'othdebt',
'carvalue',
'debtinc',
'cardspent',
'card2spent',
'employ',
'tollten',
'wireten',
'cardtenure',
'inccat_encoded',
'carcatvalue_encoded',
'default_encoded',
'empcat_encoded',
'agecat_encoded',
'cardtenurecat_encoded',
'card2tenurecat_encoded',
'retire_encoded',
'carown_encoded',
'jobsat_encoded']
# This code creates the assembler and regression objects.
assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')
regression = LinearRegression(labelCol='creddebt')
# pipeline object contains previously created objects
pipepline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])
# model object fits the pipeline to the training data. predictions
model = pipepline.fit(data_train)
predictions = model.transform(data_test)
Evaluation
The table below shows the model predictions compared to the actual credit debt. ‘Creddebt’ is listed in the thousands. Therefore in the case of the first observation, the model predicted that this observation with an actual debt of 140 dollars would actually be about 600 dollars in debt.
predictions.select(['creddebt','prediction']).show()
+--------+--------------------+
|creddebt| prediction|
+--------+--------------------+
| 0.14| 0.6093043679675625|
| 0.24| 0.43905437552333204|
| 0.52|0.012254495737605176|
| 0.39| 1.507075165380026|
| 1.13| 3.206200072134242|
| 0.02| -0.7176964758160933|
| 0.45| 0.8612760486636706|
| 2.46| 2.5064292029472672|
| 6.61| 4.050840955195323|
| 44.25| 13.44364556197883|
| 0.15| -0.5992770881744303|
| 8.15| 5.87001498551057|
| 0.39|-0.26548513694924303|
| 0.78| 1.7494149226058622|
| 0.34| -0.4892234737470511|
| 29.47| 9.280357979257039|
| 0.11| 0.22478571931758262|
| 4.73| 6.389449128368286|
| 0.78| 1.3024720830240581|
| 1.77| 1.021364413293012|
+--------+--------------------+
only showing top 20 rows
# Evaluation checks the total error of the model based on an evaluation metric in this case the metric is root mean squared error
evaluation = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt').evaluate(predictions)
evaluation
2.2154065173974407
Documentation
I created a dataframe and a function to add the results of each prediction to my model to document the models improvement as I change parameters and features.
metrics = pd.DataFrame({'model': [], 'rmse':[], 'mse' : []})
# metrics.to_csv('metrics.csv')
# metrics = pd.read_csv('metrics.csv')
# metrics = metrics.iloc[:,1:]
# These functions evaluate the rmse and mse of the column and add them to my metrics dataframe.
def add_row(df, row):
df.loc[len(df)] = row
def add_model_metrics(df, model_name, predictions):
rmse = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt').evaluate(predictions)
mse = RegressionEvaluator(metricName = 'mse', labelCol='creddebt').evaluate(predictions)
row = [model_name, rmse, mse]
add_row(df, row)
# add_model_metrics(metrics, 'nums_and_cats', predictions)
These are the first two models I ran one with only the numerical variables and a second with both the numerical and categorical variables. The first model has a rmse of 2.275 which suggests that the model is off by about 2270 dollars on average. With the categorical variables, the model is 60 dollars more effective on average when predicted on the test data.
metrics
model | rmse | mse | |
---|---|---|---|
0 | num_only | 2.275009 | 5.175667 |
1 | nums_and_cats | 2.215407 | 4.908026 |
Additional Features
Cross Validation
Adding cross validation (5 fold) to potentially increase model performance. Code below is similar to previous pipeline code with some additions for cross validation.
index_list = [cat +'_idx' for cat in cats]
string_index = StringIndexer(inputCols = cats, outputCols = index_list)
encoded_list = [cat + '_encoded' for cat in cats]
one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)
assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')
regression = LinearRegression(labelCol='creddebt')
pipeline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])
evaluator = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt')
paramGrid = ParamGridBuilder().build()
cvModel = CrossValidator(estimator = pipeline, evaluator = evaluator, numFolds = 4, estimatorParamMaps = paramGrid)
cvModel = CrossValidator.fit(cvModel, data_train)
predictions = cvModel.transform(data_test)
add_model_metrics(metrics, 'all_vars_cv_1', predictions)
metrics
model | rmse | mse | |
---|---|---|---|
0 | num_only | 2.275009 | 5.175667 |
1 | nums_and_cats | 2.215407 | 4.908026 |
2 | all_vars_cv_1 | 2.215407 | 4.908026 |
It appears that cross validation had no effect on the model performance.
Log Transformation
There are a few feature adjustments that I could potentially improve the model. First, some of the continuous variables aren’t normally distributed. Applying a log transformation to some of the continuous variables notably changed the distribution to be more normal.
to_be_logged = ['income',
'othdebt',
'carvalue',
'cardspent',
'card2spent',
'employ',
'tollten',
'wireten']
reg_nums = ['debtinc',
'cardtenure']
log_nums = ['log_' + column for column in to_be_logged]
# This code goes through the to_be_logged list and applies the log transformation.
from pyspark.sql.functions import log
for column in to_be_logged:
data = data.withColumn('log_' + column, log(data[column] + 1))
# Have to reinitialize training data to include new log values.
data_train, data_test = data.randomSplit([.80, .20], seed = 4)
# Ran this three times once with only continuous and a second time with categorical variables third time with both log and continuous
inputs = []
inputs.extend(log_nums)
inputs.extend(reg_nums)
index_list = [cat +'_idx' for cat in cats]
inputs.extend(index_list)
inputs.extend(to_be_logged)
string_index = StringIndexer(inputCols = cats, outputCols = index_list)
encoded_list = [cat + '_encoded' for cat in cats]
one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)
assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')
regression = LinearRegression(labelCol='creddebt')
pipeline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])
evaluator = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt')
paramGrid = ParamGridBuilder().build()
cvModel = CrossValidator(estimator = pipeline, evaluator = evaluator, numFolds = 4, estimatorParamMaps = paramGrid)
cvModel = CrossValidator.fit(cvModel, data_train)
predictions = cvModel.transform(data_test)
# log only
predictions.select(['creddebt','prediction']).show()
+--------+--------------------+
|creddebt| prediction|
+--------+--------------------+
| 0.14| 1.647075423659162|
| 0.24| 0.6398959925978822|
| 0.52| 1.0997906096953702|
| 0.39| 0.3504364866826766|
| 1.13| 2.9820424472769513|
| 0.02| -1.9022647714346022|
| 0.45| 0.7227840152447431|
| 2.46| 3.6401236181581247|
| 6.61| 4.143968229124148|
| 44.25| 9.636200858653792|
| 0.15| -1.18055814271729|
| 8.15| 7.317615336208826|
| 0.39| -0.4845952073333475|
| 0.78| 1.3886183932839593|
| 0.34| -0.342778877833398|
| 29.47| 8.245049772042242|
| 0.11|-0.11961579436530378|
| 4.73| 4.785692639501264|
| 0.78|-0.01416733325522...|
| 1.77| 1.573471743150824|
+--------+--------------------+
only showing top 20 rows
# logs_and_cats
predictions.select(['creddebt','prediction']).show()
+--------+-------------------+
|creddebt| prediction|
+--------+-------------------+
| 0.14| 1.2884720002611854|
| 0.24|-0.1333820961933423|
| 0.52| 0.6504041999093033|
| 0.39| 1.1970963328240192|
| 1.13| 2.7496023407588517|
| 0.02|-1.4263460324400405|
| 0.45| -0.19447120947091|
| 2.46| 3.573529458318564|
| 6.61| 5.61117129623733|
| 44.25| 10.727084238308777|
| 0.15|-2.0566687769911294|
| 8.15| 7.613715198840104|
| 0.39|-1.2405597564490094|
| 0.78| 2.6866694377778852|
| 0.34| -1.012504694987344|
| 29.47| 9.559866729767748|
| 0.11|-0.5206746091395491|
| 4.73| 5.0217777453217884|
| 0.78| 0.7543957441303526|
| 1.77| 0.81080977472668|
+--------+-------------------+
only showing top 20 rows
# nums_logs_and_cats
predictions.select(['creddebt','prediction']).show()
+--------+-------------------+
|creddebt| prediction|
+--------+-------------------+
| 0.14|0.49792826424154235|
| 0.24|0.41489351806923125|
| 0.52| 1.1100405323315137|
| 0.39| 1.0485681882122222|
| 1.13| 1.8889272517262299|
| 0.02|-0.6776597182744579|
| 0.45|0.48076946156178746|
| 2.46| 3.4513634556554464|
| 6.61| 4.029615100979548|
| 44.25| 14.408801210838694|
| 0.15|-0.8155181450877169|
| 8.15| 7.228276573264578|
| 0.39|-0.1515153538643217|
| 0.78| 0.9778442365562707|
| 0.34|0.41171528082751685|
| 29.47| 10.902899075433812|
| 0.11|-0.5355067880359905|
| 4.73| 5.778526907345183|
| 0.78| 1.628078117979622|
| 1.77| 1.4989848161401254|
+--------+-------------------+
only showing top 20 rows
add_model_metrics(metrics, 'nums_logs_and_cats', predictions)
metrics
model | rmse | mse | |
---|---|---|---|
0 | num_only | 2.275009 | 5.175667 |
1 | nums_and_cats | 2.215407 | 4.908026 |
2 | all_vars_cv_1 | 2.215407 | 4.908026 |
3 | num_only_log | 2.335687 | 5.455432 |
4 | logs_and_cats | 2.252368 | 5.073160 |
5 | nums_logs_and_cats | 2.085990 | 4.351353 |
Running all of the continous, log adjusted variables, and categorical variables together seems to have gained the best overall performance.
Regularization
The last thing I will try is a grid search with some hyperparameter adjustments. This will include the regularization weight along with the elastic net which balances between the l1 and l2 regularization types. I will add the best iteration to my metrics dataframe.
index_list = [cat +'_idx' for cat in cats]
string_index = StringIndexer(inputCols = cats, outputCols = index_list)
encoded_list = [cat + '_encoded' for cat in cats]
one_h_encode = OneHotEncoder(inputCols = index_list, outputCols = encoded_list)
assembler = VectorAssembler(inputCols = inputs, outputCol = 'features')
regression = LinearRegression(labelCol='creddebt')
pipeline = Pipeline(stages = [string_index, one_h_encode, assembler, regression])
evaluator = RegressionEvaluator(metricName = 'rmse', labelCol='creddebt')
paramGrid = ParamGridBuilder().addGrid(regression.regParam, [0.1, 0.01, .001]).addGrid(regression.fitIntercept, [True])
paramGrid = paramGrid.addGrid(regression.elasticNetParam, [0.0, 0.5, 1.0]).build()
cvModel = CrossValidator(estimator = pipeline, evaluator = evaluator, numFolds = 4, estimatorParamMaps = paramGrid)
cvModel = CrossValidator.fit(cvModel, data_train)
predictions = cvModel.transform(data_test)
add_model_metrics(metrics, 'original_logs_cats_hyper_tuned', predictions)
metrics
model | rmse | mse | |
---|---|---|---|
0 | num_only | 2.275009 | 5.175667 |
1 | nums_and_cats | 2.215407 | 4.908026 |
2 | all_vars_cv_1 | 2.215407 | 4.908026 |
3 | num_only_log | 2.335687 | 5.455432 |
4 | logs_and_cats | 2.252368 | 5.073160 |
5 | nums_logs_and_cats | 2.085990 | 4.351353 |
6 | original_logs_cats_hyper_tuned | 2.101103 | 4.414633 |
print('Best Regularization Parameter: ', cvModel.bestModel.stages[-1]._java_obj.parent().getRegParam())
print('Best ElasticNet Parameter: ',cvModel.bestModel.stages[-1]._java_obj.parent().getElasticNetParam())
Best Regularization Parameter: 0.1
Best ElasticNet Parameter: 0.0
The model chosen was a ridge regression with a strength of .1. Although it performs worse than the previous model it is likely that the ridge regression is more generalizable. I’ll stick with this as the final model when interpreting variables.
Best Model Interpretation
In order to interpret the model I will run a final regression in pandas using the determined parameters. This will allow me to visualize and interpret the coefficients at a more individual level.
pd_data_train = data_train.toPandas()
pd_data_test = data_test.toPandas()
inputs = []
inputs.extend(log_nums)
inputs.extend(reg_nums)
inputs.extend(cats)
inputs.extend(to_be_logged)
inputs.append('creddebt')
inputs = set(inputs)
pd_data_train = pd_data_train[inputs]
pd_data_test = pd_data_test[inputs]
pd_data_train = pd.get_dummies(pd_data_train)
X_train = pd_data_train.drop('creddebt', axis = 1)
y_train = pd_data_train['creddebt']
pd_data_test = pd.get_dummies(pd_data_test)
X_test = pd_data_test.drop('creddebt', axis = 1)
y_test = pd_data_test['creddebt']
pd_data_train
debtinc | card2spent | income | log_card2spent | log_income | cardtenure | wireten | carvalue | tollten | log_tollten | ... | jobsat_1.0 | jobsat_2.0 | jobsat_3.0 | jobsat_4.0 | jobsat_5.0 | empcat_1.0 | empcat_2.0 | empcat_3.0 | empcat_4.0 | empcat_5.0 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2.0 | 114.70 | 85.0 | 4.751001 | 4.454347 | 15.0 | 621.90 | 43.10000 | 554.75 | 6.320319 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 13.1 | 66.60 | 39.0 | 4.213608 | 3.688879 | 31.0 | 0.00 | 10.00000 | 0.00 | 0.000000 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 6.2 | 394.15 | 138.0 | 5.979265 | 4.934474 | 1.0 | 30.05 | 66.20000 | 20.50 | 3.068053 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 25.1 | 197.63 | 63.0 | 5.291444 | 4.158883 | 17.0 | 0.00 | 39.30000 | 0.00 | 0.000000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 9.2 | 214.91 | 37.0 | 5.374862 | 3.637586 | 19.0 | 0.00 | 15.80000 | 0.00 | 0.000000 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3982 | 5.9 | 839.40 | 84.0 | 6.733878 | 4.442651 | 15.0 | 887.50 | 30.60000 | 557.55 | 6.325344 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3983 | 0.8 | 156.35 | 47.0 | 5.058473 | 3.871201 | 19.0 | 0.00 | 22.40000 | 0.00 | 0.000000 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3984 | 1.3 | 84.25 | 31.0 | 4.445588 | 3.465736 | 9.0 | 0.00 | 17.10000 | 0.00 | 0.000000 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3985 | 19.5 | 56.98 | 26.0 | 4.060098 | 3.295837 | 6.0 | 0.00 | 11.60000 | 0.00 | 0.000000 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3986 | 13.4 | 211.04 | 36.0 | 5.356775 | 3.610918 | 3.0 | 380.15 | 23.23258 | 108.85 | 4.699116 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3987 rows × 58 columns
from sklearn.linear_model import ElasticNet
lr = ElasticNet(.1, 0)
lr.fit(X_train, y_train)
C:\Users\Allen\anaconda3\lib\site-packages\sklearn\utils\validation.py:71: FutureWarning: Pass l1_ratio=0 as keyword args. From version 0.25 passing these as positional arguments will result in an error
FutureWarning)
C:\Users\Allen\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:531: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 8504.175821644329, tolerance: 4.811267459974919
positive)
ElasticNet(alpha=0.1, l1_ratio=0)
predictions = lr.predict(X_test)
Below I create a dataframe with the coefficient data. I create a column ‘sign’ that is meant to record whether or not the variable is positive or negative. I also create a ‘coefficient_abs’ column which transforms the variable into its absolute value. These columns will be useful for visualizing and comparing all of the features based on their impact and usefulness to the model.
coefficient_data = pd.DataFrame({'feature' : X_train.columns, 'coefficient' : lr.coef_})
coefficient_data = coefficient_data.sort_values('coefficient', ascending = False)
coefficient_data['sign'] = coefficient_data['coefficient'].apply(lambda n: 'positive' if n > 0 else 'negative')
coefficient_data['coefficient_abs'] = coefficient_data['coefficient'].apply(lambda n: np.abs(n))
coefficient_data
feature | coefficient | sign | coefficient_abs | |
---|---|---|---|---|
34 | default_1.0 | 0.362657 | positive | 0.362657 |
14 | log_carvalue | 0.204723 | positive | 0.204723 |
0 | debtinc | 0.202457 | positive | 0.202457 |
15 | othdebt | 0.126545 | positive | 0.126545 |
20 | inccat_3.0 | 0.123089 | positive | 0.123089 |
19 | inccat_2.0 | 0.094148 | positive | 0.094148 |
10 | log_employ | 0.083809 | positive | 0.083809 |
51 | jobsat_5.0 | 0.076830 | positive | 0.076830 |
54 | empcat_3.0 | 0.076712 | positive | 0.076712 |
30 | agecat_6.0 | 0.067036 | positive | 0.067036 |
25 | carcatvalue_3.0 | 0.062675 | positive | 0.062675 |
21 | inccat_4.0 | 0.052980 | positive | 0.052980 |
45 | card2tenurecat_4.0 | 0.049983 | positive | 0.049983 |
55 | empcat_4.0 | 0.049716 | positive | 0.049716 |
2 | income | 0.049407 | positive | 0.049407 |
13 | log_wireten | 0.027120 | positive | 0.027120 |
50 | jobsat_4.0 | 0.026380 | positive | 0.026380 |
24 | carcatvalue_2.0 | 0.025407 | positive | 0.025407 |
37 | cardtenurecat_1.0 | 0.018748 | positive | 0.018748 |
28 | agecat_4.0 | 0.018213 | positive | 0.018213 |
35 | retire_0.0 | 0.016799 | positive | 0.016799 |
42 | card2tenurecat_1.0 | 0.016073 | positive | 0.016073 |
41 | cardtenurecat_5.0 | 0.015285 | positive | 0.015285 |
32 | carown_1.0 | 0.011392 | positive | 0.011392 |
38 | cardtenurecat_2.0 | 0.009653 | positive | 0.009653 |
16 | cardspent | 0.000938 | positive | 0.000938 |
5 | cardtenure | 0.000441 | positive | 0.000441 |
1 | card2spent | 0.000240 | positive | 0.000240 |
8 | tollten | 0.000107 | positive | 0.000107 |
6 | wireten | -0.000143 | negative | 0.000143 |
12 | employ | -0.004097 | negative | 0.004097 |
39 | cardtenurecat_3.0 | -0.010948 | negative | 0.010948 |
31 | carown_0.0 | -0.011392 | negative | 0.011392 |
47 | jobsat_1.0 | -0.013124 | negative | 0.013124 |
36 | retire_1.0 | -0.016799 | negative | 0.016799 |
43 | card2tenurecat_2.0 | -0.016905 | negative | 0.016905 |
44 | card2tenurecat_3.0 | -0.020578 | negative | 0.020578 |
26 | agecat_2.0 | -0.020901 | negative | 0.020901 |
27 | agecat_3.0 | -0.022236 | negative | 0.022236 |
3 | log_card2spent | -0.026547 | negative | 0.026547 |
9 | log_tollten | -0.027404 | negative | 0.027404 |
56 | empcat_5.0 | -0.027905 | negative | 0.027905 |
46 | card2tenurecat_5.0 | -0.028574 | negative | 0.028574 |
7 | carvalue | -0.031053 | negative | 0.031053 |
40 | cardtenurecat_4.0 | -0.032739 | negative | 0.032739 |
48 | jobsat_2.0 | -0.040003 | negative | 0.040003 |
52 | empcat_1.0 | -0.041940 | negative | 0.041940 |
29 | agecat_5.0 | -0.042112 | negative | 0.042112 |
49 | jobsat_3.0 | -0.050082 | negative | 0.050082 |
53 | empcat_2.0 | -0.056583 | negative | 0.056583 |
18 | inccat_1.0 | -0.069737 | negative | 0.069737 |
23 | carcatvalue_1.0 | -0.088082 | negative | 0.088082 |
4 | log_income | -0.145739 | negative | 0.145739 |
22 | inccat_5.0 | -0.200480 | negative | 0.200480 |
17 | log_cardspent | -0.205127 | negative | 0.205127 |
33 | default_0.0 | -0.362657 | negative | 0.362657 |
11 | log_othdebt | -1.202853 | negative | 1.202853 |
coefficient_data.describe()
coefficient | coefficient_abs | |
---|---|---|
count | 57.000000 | 57.000000 |
mean | -0.016617 | 0.082216 |
std | 0.188638 | 0.170250 |
min | -1.202853 | 0.000107 |
25% | -0.028574 | 0.016799 |
50% | 0.000107 | 0.031053 |
75% | 0.049407 | 0.076712 |
max | 0.362657 | 1.202853 |
Looking at the coefficient data I decide that using the 75% marker on the magnitude makes the most sense and filter at coefficients that don’t have a beta value larger than .07. I then visualize these ‘impactful coefficients’ below.
impactful_coefs = coefficient_data[coefficient_data['coefficient_abs'] > .07]
non_impactful_coefs = coefficient_data[coefficient_data['coefficient_abs'] < .07]
impactful_coefs
feature | coefficient | sign | coefficient_abs | |
---|---|---|---|---|
34 | default_1.0 | 0.362657 | positive | 0.362657 |
14 | log_carvalue | 0.204723 | positive | 0.204723 |
0 | debtinc | 0.202457 | positive | 0.202457 |
15 | othdebt | 0.126545 | positive | 0.126545 |
20 | inccat_3.0 | 0.123089 | positive | 0.123089 |
19 | inccat_2.0 | 0.094148 | positive | 0.094148 |
10 | log_employ | 0.083809 | positive | 0.083809 |
51 | jobsat_5.0 | 0.076830 | positive | 0.076830 |
54 | empcat_3.0 | 0.076712 | positive | 0.076712 |
23 | carcatvalue_1.0 | -0.088082 | negative | 0.088082 |
4 | log_income | -0.145739 | negative | 0.145739 |
22 | inccat_5.0 | -0.200480 | negative | 0.200480 |
17 | log_cardspent | -0.205127 | negative | 0.205127 |
33 | default_0.0 | -0.362657 | negative | 0.362657 |
11 | log_othdebt | -1.202853 | negative | 1.202853 |
impactful_coefs = impactful_coefs.sort_values('coefficient_abs', ascending = False)
fig, ax = plt.subplots(figsize = (10,5))
plot = sns.barplot(data = impactful_coefs.head(15), x = 'feature', y = 'coefficient_abs', hue = 'sign', hue_order = ['positive', 'negative'])
plt.title('Feature Weights')
ticks = plot.set_xticklabels(plot.get_xticklabels(),rotation = 90)
non_impactful_coefs
feature | coefficient | sign | coefficient_abs | |
---|---|---|---|---|
30 | agecat_6.0 | 0.067036 | positive | 0.067036 |
25 | carcatvalue_3.0 | 0.062675 | positive | 0.062675 |
21 | inccat_4.0 | 0.052980 | positive | 0.052980 |
45 | card2tenurecat_4.0 | 0.049983 | positive | 0.049983 |
55 | empcat_4.0 | 0.049716 | positive | 0.049716 |
2 | income | 0.049407 | positive | 0.049407 |
13 | log_wireten | 0.027120 | positive | 0.027120 |
50 | jobsat_4.0 | 0.026380 | positive | 0.026380 |
24 | carcatvalue_2.0 | 0.025407 | positive | 0.025407 |
37 | cardtenurecat_1.0 | 0.018748 | positive | 0.018748 |
28 | agecat_4.0 | 0.018213 | positive | 0.018213 |
35 | retire_0.0 | 0.016799 | positive | 0.016799 |
42 | card2tenurecat_1.0 | 0.016073 | positive | 0.016073 |
41 | cardtenurecat_5.0 | 0.015285 | positive | 0.015285 |
32 | carown_1.0 | 0.011392 | positive | 0.011392 |
38 | cardtenurecat_2.0 | 0.009653 | positive | 0.009653 |
16 | cardspent | 0.000938 | positive | 0.000938 |
5 | cardtenure | 0.000441 | positive | 0.000441 |
1 | card2spent | 0.000240 | positive | 0.000240 |
8 | tollten | 0.000107 | positive | 0.000107 |
6 | wireten | -0.000143 | negative | 0.000143 |
12 | employ | -0.004097 | negative | 0.004097 |
39 | cardtenurecat_3.0 | -0.010948 | negative | 0.010948 |
31 | carown_0.0 | -0.011392 | negative | 0.011392 |
47 | jobsat_1.0 | -0.013124 | negative | 0.013124 |
36 | retire_1.0 | -0.016799 | negative | 0.016799 |
43 | card2tenurecat_2.0 | -0.016905 | negative | 0.016905 |
44 | card2tenurecat_3.0 | -0.020578 | negative | 0.020578 |
26 | agecat_2.0 | -0.020901 | negative | 0.020901 |
27 | agecat_3.0 | -0.022236 | negative | 0.022236 |
3 | log_card2spent | -0.026547 | negative | 0.026547 |
9 | log_tollten | -0.027404 | negative | 0.027404 |
56 | empcat_5.0 | -0.027905 | negative | 0.027905 |
46 | card2tenurecat_5.0 | -0.028574 | negative | 0.028574 |
7 | carvalue | -0.031053 | negative | 0.031053 |
40 | cardtenurecat_4.0 | -0.032739 | negative | 0.032739 |
48 | jobsat_2.0 | -0.040003 | negative | 0.040003 |
52 | empcat_1.0 | -0.041940 | negative | 0.041940 |
29 | agecat_5.0 | -0.042112 | negative | 0.042112 |
49 | jobsat_3.0 | -0.050082 | negative | 0.050082 |
53 | empcat_2.0 | -0.056583 | negative | 0.056583 |
18 | inccat_1.0 | -0.069737 | negative | 0.069737 |
metrics
model | rmse | mse | |
---|---|---|---|
0 | num_only | 2.275009 | 5.175667 |
1 | nums_and_cats | 2.215407 | 4.908026 |
2 | all_vars_cv_1 | 2.215407 | 4.908026 |
3 | num_only_log | 2.335687 | 5.455432 |
4 | logs_and_cats | 2.252368 | 5.073160 |
5 | nums_logs_and_cats | 2.085990 | 4.351353 |
6 | original_logs_cats_hyper_tuned | 2.101103 | 4.414633 |
sc.stop()
# metrics.to_csv('metrics.csv')
Analysis and Conclusion
The model results mostly seem intuitive. Other debt, whether or not you’ve ever defaulted, and amount spent on card last month are sensible indicators for credit card debt. On the lower end, it seems to be interesting that variables related to employment and job security indicate more credit debt.
There’s likely alot more hyper parameter tuning that can be done to develop a model with less error. Along with that features excluded in my analysis could be included and features determined to be useless could be excluded in a new model. A gradient boosting model could also be run with regression trees to see if they are generally more efffective.