Coupon Acceptance Classification and Anomaly Detection

Introduction

https://archive.ics.uci.edu/ml/datasets/in-vehicle+coupon+recommendation#

This data set was taken from the UCI Machine Learning Repository at the above link. It was gathered from a survey on Amazon Mechanical Turk and donated on 9/15/2020. The researchers who uploaded it wanted a real-world data set for predicting whether or not a driver would accept a coupon in different situations. Some of the attributes describe the respondent's life circumstances and preferences, and some describe the hypothetical coupons or driving situations. There are 25 attributes plus one yes-or-no label attribute, and they are all categorical. There are 12684 instances, and there are missing data points. Each instance records whether or not the respondent would accept the coupon.
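As a minimal sketch of how the data can be loaded and inspected with pandas (the filename is an assumption based on the UCI download; adjust the path as needed):

    import pandas as pd

    # Load the UCI in-vehicle coupon survey data (filename assumed from the UCI download).
    df = pd.read_csv("in-vehicle-coupon-recommendation.csv")

    print(df.shape)   # expect (12684, 26): 25 attributes plus the yes/no label column Y
    print(df.isnull().sum().sort_values(ascending=False).head())   # missing values per column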

The data was originally used to develop a "Bayesian framework for learning rule sets for interpretable classification" that predicts whether or not the respondents would accept the coupon. The work was published in the Journal of Machine Learning Research, covered many other data sets and techniques, and focused primarily on the mathematics.

For this project, I primarily wanted to compare different classification techniques and metrics, but also to see how removing anomalies affects classification. I also wanted to see whether different classifiers were better under different accuracy measures.

Preprocessing

Exploration

Removing Nulls

Classifications Pre-Anomaly Removal

In researching classification techniques, I found this example code: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

I added these classifiers to the ones I had already been using and changed the classification flow a little.
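The flow follows the pattern from that example: build a set of named classifiers, then fit and score each on the same train/test split. A minimal sketch, with an abbreviated classifier list and default parameters rather than the exact settings used in this project:

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import CategoricalNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    # X, y are assumed to be the label-encoded attributes and the yes/no labels
    # produced during preprocessing (see the encoding sketch later in this report).
    classifiers = {
        "CategoricalNB": CategoricalNB(),
        "Nearest Neighbors": KNeighborsClassifier(),
        "RBF SVM": SVC(kernel="rbf", probability=True),
        "Decision Tree": DecisionTreeClassifier(),
        "Random Forest": RandomForestClassifier(),
    }

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        print(name, clf.score(X_test, y_test))   # mean accuracy on the held-out split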

Sklearn has a ton of information, and I felt like years could be spent fully learning even a few modules. This would definitely be an area of further research, especially to understand the classifiers described below in more depth.

CategoricalNB: naive Bayes for categorical features, like we learned in class.

BernoulliNB: designed for binary/boolean features.

MultinomialNB: for classification with discrete features. Can work with tf-idf.

ComplementNB: corrects the "severe assumptions" made by MultinomialNB. Suited for imbalanced data.

GaussianNB: like we learned in class. Uses the mean and standard deviation to calculate probabilities.

Nearest Neighbors: k-nearest neighbors, like we learned in class.

Linear SVM: Support Vector Classification with a linear kernel. (Best with low-dimensional data.)

RBF SVM: Support Vector Classification with a Radial Basis Function kernel. (Best with low-dimensional data.)

Gaussian Process: probabilistic classification based on the Laplace approximation.

Decision Tree: Can be finicky and requires preprocessing, but it is statistically verifiable and lets you see the rules.

Random Forest: Combines many Decision Trees.

Neural Net: multi-layer perceptron; the default hidden_layer_sizes is (100,), a single hidden layer of 100 neurons.

AdaBoost: Adaptive Boosting. Can be used in conjunction with other methods; many weak learners are combined into a stronger one. Uses a Decision Tree as its base classifier.

QDA: Quadratic Discriminant Analysis; assumes each class is normally distributed.

Accuracy: Proportion of Correct Predictions.

Balanced_accuracy: Average of recall for each class.

Average_precision: mean of precisions at each threshold, weighted by the increase in recall. Uses probability estimates.

Neg_brier: negated Brier score, the MSE of the predicted probabilities. A smaller Brier score is better, so it is negated to make larger better. Uses probability estimates.

Jaccard: Size of intersection divided by the size of the union.

F1_score: harmonic mean of precision and recall.

Classification_Report: gives a text table of precision, recall, f1-score, and support for each class.

ROC_AUC: Area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate. (Better with probability estimates.)
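A minimal sketch of how the metrics above can be computed with sklearn.metrics for one fitted classifier (clf, X_test, and y_test are assumed from a loop like the one above; the probability-based metrics use predict_proba, which not every classifier provides):

    from sklearn import metrics

    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]   # estimated probability of the positive class

    print(metrics.accuracy_score(y_test, y_pred))
    print(metrics.balanced_accuracy_score(y_test, y_pred))
    print(metrics.average_precision_score(y_test, y_prob))
    print(metrics.brier_score_loss(y_test, y_prob))   # negate for neg_brier
    print(metrics.jaccard_score(y_test, y_pred))
    print(metrics.f1_score(y_test, y_pred))
    print(metrics.classification_report(y_test, y_pred))
    print(metrics.roc_auc_score(y_test, y_prob))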

Plots Pre-Anomaly Removal

Anomaly Detection and Removal

Removing Anomalies

Final Splitting

Reclassification Post-Anomaly Removal

Final Plots Post-Anomaly Removal

Graphs Against Pre-Anomaly Removal

Summary

The main objective of this project was to compare classification techniques, classification metrics, and the impact of anomaly detection on classification.

Data

https://archive.ics.uci.edu/ml/datasets/in-vehicle+coupon+recommendation# This data set was taken from the UCI Machine Learning Repository at the above link. It was gathered from a survey on Amazon Mechanical Turk and donated on 9/15/2020. The researchers who uploaded it wanted a real-world data set for predicting whether or not a driver would accept a coupon in different situations. Some of the attributes describe the respondent's life circumstances and preferences, and some describe the hypothetical coupons or driving situations. There are 25 attributes plus one yes-or-no label attribute, and they are all categorical. There are 12684 instances, and there are missing data points. After removing instances with null data, 12079 instances remain.

The car attribute was removed because it was almost entirely empty. The opposite-direction attribute was the exact complement of the same-direction attribute, so it was removed. The toCoupon_GEQ5min attribute was also removed because it contained only 1's, leaving 22 attributes plus the yes-or-no label attribute.
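A minimal sketch of these cleaning steps (df is the loaded DataFrame; column names follow the UCI data dictionary):

    # Drop the empty, redundant, and constant columns described above.
    df = df.drop(columns=["car", "direction_opp", "toCoupon_GEQ5min"])

    # Drop any instance with a null value.
    df = df.dropna()
    print(len(df))   # 12079 per the counts above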

Encoding

Most of the attributes lent themselves especially well to label encoding, so I label encoded them all. The naive Bayes classifiers probably would not have worked as well with one-hot encoded data, though some of the other classifiers would have benefited from one-hot encoding. An a priori argument could be made that, due to the naivety of the naive Bayes classifiers, the finicky nature of the Decision Tree and Random Forest classifiers, and the SVMs' and QDA's disdain for high dimensionality, one-hot encoding could have negatively impacted classification. One-hot encoding would probably have helped the Nearest Neighbors, Gaussian Process, Neural Net, and AdaBoost classifiers. Care could also have been taken to make sure each label-encoded attribute was encoded in the order that made the most sense; that was not done here and might prove useful for improving future analyses.
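A minimal sketch of the label-encoding step, using sklearn's OrdinalEncoder to map each category to an integer. Note that this assigns codes in lexicographic order rather than a hand-chosen order, which is exactly the ordering concern raised above:

    from sklearn.preprocessing import OrdinalEncoder

    # Label encode every attribute; Y is the yes/no label column.
    encoder = OrdinalEncoder()
    X = encoder.fit_transform(df.drop(columns=["Y"]))
    y = df["Y"].to_numpy()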

Classifiers and Metrics

A brief description of the classifiers and metrics used is included above. The intent was not necessarily to choose the best classifier and metric, but to choose all that might be applicable, to see how they behave on this data and with the anomaly detection used.

Summary of Analyses and Results

Pre-Anomaly Removal

The Gaussian Process took a long time to train, but it also turned out to be the best classifier. I could have reduced the maximum iterations to save time, but that might have made it a worse classifier. I would not have thought that the Random Forest would do so poorly relative to the Decision Tree, which ended up being one of the best classifiers. It makes sense that the Decision Tree did so well, as the original research on this data set used Bayesian rule sets, which are quite similar. The differences between the PCA-transformed data and the regular data were also quite interesting: on the PCA-transformed data, the RBF SVM was the top performer, followed by the Neural Net. The Gaussian Process did worse there, and it took less training time; perhaps if it had trained for as long as it did on the regular data, it would have done better.

Comparing the metrics between the PCA-transformed data and the regular data, the Neural Net did just as well, and the RBF SVM bested the Gaussian Process on quite a few metrics. Nearly every other classifier except QDA did worse with the PCA-transformed data.
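For reference, a sketch of how a PCA-transformed copy of the data can be produced. The number of components here is illustrative only; the setting actually used in this project is not recorded above:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)                      # illustrative, not the tuned value
    X_pca = pca.fit_transform(X)
    print(pca.explained_variance_ratio_.sum())     # the preserved variance discussed in the conclusions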

Anomaly Removal

Local Outlier Factor (LOF) was used to detect anomalies. A k-nearest-neighbors approach might have found anomalies better, but I was concerned that it would be biased toward one class or the other. I decided to remove 10% of the data, since the best accuracy score was 0.7 and the goal was to see how much removing anomalies helps classification. I computed the anomalies on the regular data and on the PCA-transformed data and removed them from each respectively. I also removed the anomalies detected on the PCA-transformed data from a copy of the regular data, on the theory that distances might be measured better in the reduced dimensionality. Looking at the graph, there is a noticeable though slight difference.
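A minimal sketch of this step, where contamination=0.1 corresponds to removing 10% of the data as described above:

    from sklearn.neighbors import LocalOutlierFactor

    lof = LocalOutlierFactor(contamination=0.1)
    labels = lof.fit_predict(X)        # -1 marks an anomaly, 1 marks an inlier
    X_clean, y_clean = X[labels == 1], y[labels == 1]

    # Repeat on the PCA-transformed data; the same mask can also be applied to a
    # copy of the regular data, as described above.
    labels_pca = lof.fit_predict(X_pca)
    X_pca_clean, y_pca_clean = X_pca[labels_pca == 1], y[labels_pca == 1]
    X_regular_pca_mask = X[labels_pca == 1]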

Post-Anomaly Removal

For the anomaly-removed data, the Neural Net proved the best on the data with the regularly detected anomalies removed. Removing the PCA-detected anomalies from the regular data helped every classifier more than the regular anomaly detection did, except for the Neural Net and QDA. The RBF SVM proved to be the best classifier overall on the PCA-transformed, anomaly-removed data.

Removing the PCA-detected anomalies helped every classifier except the Decision Tree and QDA. Removing the regularly detected anomalies helped the Neural Net the most, which saw the largest increase in accuracy, and actually hurt most of the other classifiers. Removing the anomalies from the PCA-transformed data also helped, or at least did not hurt, the accuracy of every classifier except the Decision Tree.

Conclusions

I learned a lot about each of the classifiers and metrics in my analyses. I deepened my understanding of the classifiers and metrics covered in class and how they work together, and I learned about many classifiers I had never heard of before. It was also interesting to see how anomaly detection affected classification and how it could be used in this setting.

I was really surprised that the RBF SVM did the best overall on the PCA'd data. Despite the lower amount of preserved variance, it achieved a better classification than any other method. Naive Bayes performed well, but I thought it would have performed even better than it did; then again, many of the attributes are not independent, and many were not evenly distributed. It was interesting to see how much anomaly removal helped some classifiers over others, how it even hurt some, and how it helped some metrics while not helping others.

The visualizations and comparisons helped me understand the methods and problems much better than I did before. I was exposed to a lot of new material that will prove useful in the future; if I'm ever building a coupon recommendation system, I will know where to go.

If I had more time, there are a lot of parameters to change. I would try to encode the data more uniformly and try one-hot encoding too. Every classifier had many parameters to tune. I would check whether reducing the maximum iterations on the Gaussian Process would decrease accuracy very much; it would be nice if it didn't. Many of the metrics are more informative with confidence levels of predictions, which some of the classifiers can provide, and it would be interesting to see how to optimize those metrics. The RBF SVM result was quite unexpected and could be further explored. More anomaly detection techniques could be tried, such as a k-nearest-neighbors approach, and more or fewer anomalies could be removed to see how that affects classification. It would also be very intriguing to see which attributes anomaly detection removed the most, or which ones affect classification the most. It is also possible that some attributes hurt classification rather than helped it.