Enron Corporation was one of the world’s major electricity, natural gas, communications, and pulp and paper companies, with approximately 20,000 staff before its bankruptcy at the end of 2001 [1]. Accounting fraud perpetrated by top executives resulted in one of the largest bankruptcies in U.S. history.
Enron is also unique in that over 600,000 typically confidential emails from 158 employees were released after the bankruptcy [2] [3]. Detailed financial records of many executives were also released during the fraud trials [4].
For this project, predictive models were built using the scikit-learn [5], NumPy [6], and pandas [7] modules in Python. The prediction targets were persons-of-interest (POIs), who were ‘individuals who were indicted, reached a settlement, or plea deal with the government, or testified in exchange for prosecution immunity.’ Financial compensation data and aggregate email statistics from the Enron Corpus were used as features for prediction.
Summarize for us the goal of this project and how machine learning is useful in trying to accomplish it. As part of your answer, give some background on the dataset and how it can be used to answer the project question. Were there any outliers in the data when you got it, and how did you handle those?
The goal of this project was to build a prediction model to identify persons-of-interest (POIs). There were 146 total records and 18 POIs in the original dataset. I tried to perform as little data snooping [8] as possible when filtering obvious outliers and problem records.
TOTAL was removed as it was simply a record totaling all of the financial statistics from the financial data.
Eugene E. Lockhart was removed during data processing since this row had no entries for any feature.
Note: The Travel Agency in the Park was found after the fact but was not removed, since data snooping might have played a role in that decision.
I had to be careful to not go looking deep into the characteristics of each feature since there was no explicit hold-out testing set, and any record could be included in both training and testing depending on how each split was made in cross-validation. After these two records were removed, there were 144 remaining data points to use for prediction.
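The two removals described above can be sketched on a toy dictionary. This is only an illustration: the record keys, feature names, and values are made up, and representing missing values with the string "NaN" is an assumption about the data format.

```python
# Toy stand-in for the dataset: {person_name: {feature: value}}.
# All names and numbers here are hypothetical.
data = {
    "PERSON A": {"salary": 200000, "bonus": 50000, "poi": False},
    "PERSON B": {"salary": "NaN", "bonus": "NaN", "poi": False},
    "TOTAL": {"salary": 200000, "bonus": 50000, "poi": False},
}


def is_empty(record, label="poi"):
    """Return True when every feature other than the label is missing."""
    return all(value == "NaN" for key, value in record.items() if key != label)


# Drop the spreadsheet's aggregate row, then any record with no feature values.
data.pop("TOTAL", None)
cleaned = {name: rec for name, rec in data.items() if not is_empty(rec)}

print(sorted(cleaned))
```

Filtering on "no feature values at all" rather than on a specific name keeps the rule mechanical, which fits the goal of avoiding data snooping.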
Records for Robert Belfer and Sanjay Bhatnagar were identified as out of sync when they introduced erroneous data during feature creation.
These two records were corrected to match the PDF [9] file of financial data. All other records were also validated against the PDF spreadsheet that the data came from, including checking that the totals added up correctly.
What features did you end up using in your POI identifier, and what selection process did you use to pick them? Did you have to do any scaling? Why or why not? As part of the assignment, you should attempt to engineer your own feature that doesn’t come ready-made in the dataset–explain what feature you tried to make, and the rationale behind it. (You do not necessarily have to use it in the final analysis, only engineer and test it.) If you used an algorithm like a decision tree, please also give the feature importances of the features that you use.
Totals, ratios, and exponential values of financial and email data were added to the data set. Added features were continually removed as long as the cross-validated scores of selected models went up.
For the final optimal model, only totals were kept since these were the only added features which consistently provided an increase in the evaluation metrics. This is most likely because introducing so many other new features into such a small dataset added a lot of multicollinearity, which negatively affected model selection.
During the GridSearchCV [10] pipeline [11] search, all features were first scaled to lie between 0 and 1 using a MinMaxScaler [12], since PCA and models such as Logistic Regression perform best with scaled features. Scaling was needed because the features were on vastly different scales, ranging from hundreds of e-mails to millions of dollars.
SelectKBest [13] and Principal Components Analysis (PCA [14]) dimension reduction were then used as part of the GridSearchCV pipeline when searching the estimator parameter space. These two steps were run during each of the cross-validation loops used in the grid search for optimal parameters. The K-best features were selected using the ANOVA F-value classification scoring function. The resulting K-best features were then fed into PCA dimensionality reduction. Finally, the resultant N principal components were fed into a classification algorithm [15].
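The scale, select, reduce, classify chain described above can be sketched as a scikit-learn `Pipeline`. This is a minimal sketch, not the project's own code: the data is synthetic (generated with `make_classification` to mimic 144 records with a small positive class), the step names are arbitrary, and `class_weight="balanced"` is used as the closest current scikit-learn option to the older `'auto'` setting.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data: 144 rows, roughly 1 in 8 positive.
X, y = make_classification(
    n_samples=144, n_features=20, n_informative=5,
    weights=[0.875], random_state=0,
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),                           # rescale each feature to [0, 1]
    ("kbest", SelectKBest(score_func=f_classif, k=17)),  # ANOVA F-value selection
    ("pca", PCA(n_components=0.5)),                      # keep enough components for 50% of variance
    ("clf", LogisticRegression(C=1e-3, class_weight="balanced")),
])
pipe.fit(X, y)

print(pipe.named_steps["kbest"].get_support().sum())  # number of features kept
print(pipe.named_steps["pca"].n_components_)          # number of components kept
```

Because every step lives inside the pipeline, the scaler, the feature selection, and the PCA projection are all refit whenever the pipeline is fit on a new training split.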
Classification algorithms were tested since this is a classic binary classification task.
Logistic Regression [16] ended up performing the best. Linear Support Vector Classifier (Linear-SVC [17]), which is similar to a Support Vector Machine with a ‘linear’ kernel, gave similar scores. Support Vector Machines Classifier (SVM-C [18]) with an ‘rbf’ kernel was the third best performing model tested, and gave fairly good scores as well. Stochastic Gradient Descent [19], K-Means Classifiers [20], ExtraTrees [21] and Random Forests [22] were also tested with moderate success.
What does it mean to tune the parameters of an algorithm, and what can happen if you don’t do this well? How did you tune the parameters of your particular algorithm? (Some algorithms don’t have parameters that you need to tune–if this is the case for the one you picked, identify and briefly explain how you would have done it if you used, say, a decision tree classifier).
Algorithms may perform differently with different parameters depending on the structure of your data. If you don’t tune them well, you can over-tune an algorithm so that it predicts your training data extremely well but fails miserably on unseen data. Each algorithm was tuned using an exhaustive grid search over its major tunable parameters, evaluated on 1000 randomized stratified cross-validation splits. The results were scored in each split on the hold-out testing portion, and the score was averaged over all 1000 splits. The parameters which gave the highest average score were selected for the final model.
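A grid search of this kind can be sketched with `GridSearchCV` and `StratifiedShuffleSplit`. The sketch below makes several assumptions: the data is synthetic, the grid values are illustrative rather than the grid actually searched, plain recall is used as the score, and only 20 splits are run to keep it fast (the write-up used 1000).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data.
X, y = make_classification(
    n_samples=144, n_features=20, n_informative=5,
    weights=[0.875], random_state=0,
)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("kbest", SelectKBest(score_func=f_classif)),
    ("clf", LogisticRegression(class_weight="balanced")),
])

# Illustrative grid: step name, double underscore, parameter name.
param_grid = {
    "kbest__k": [5, 10, 17],
    "clf__C": [1e-3, 1e-1, 1.0],
}

# Randomized stratified 90%/10% splits; 20 here instead of 1000.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.1, random_state=0)

search = GridSearchCV(pipe, param_grid, scoring="recall", cv=cv)
search.fit(X, y)

print(search.best_params_)           # combination with the highest mean score
print(round(search.best_score_, 3))  # that mean score across the splits
```

`best_score_` is the mean of the per-split test scores, which matches the "average over all splits" selection rule described above.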
The major models, the parameters tuned, and the final values found were:
Parameters | Logistic Regression [23] | Linear Support Vector Classifier [24] | Support Vector Machines - Classifier [25] | PCA [26] | SelectKBest [27] |
---|---|---|---|---|---|
C: Value of the regularization constraint | 1e-3 | 1e-5 | 1 | | |
class_weight: Over-/undersamples samples of each class (inversely proportional to class frequencies) | ‘auto’ | ‘auto’ | ‘auto’ | | |
tol: Tolerance for stopping criteria | 1e-64 | 1e-32 | 1e-3 | | |
n_components: # of components to explain % of variance | | | | 0.5 | |
whiten: decorrelation transformation | | | | False | |
selection: Number of top features to select | | | | | 17 |
gamma: Kernel coefficient | | | 0.0 | | |
kernel | | ‘linear’ | ‘rbf’ | | |
What is validation, and what’s a classic mistake you can make if you do it wrong? How did you validate your analysis?
Validation is the process of checking how your model performs on unseen data. A classic mistake is tuning your model to predict your training data very well, but then having it perform poorly on unseen, out-of-sample testing data. This is called overfitting. One of the major goals in validation is to avoid overfitting, which can be accomplished through a process called cross-validation.
Cross-validation is the process of randomly splitting the data into training and testing data. Then the model can train on the training data, and be validated on the testing data.
The model selection process was validated using 1000 randomized stratified cross-validation splits [28] and selecting the parameters which performed best on average over the 1000 splits. A similar validation procedure was used in tester.py to evaluate the resulting final models that were selected.
Since the dataset for this project is so small, a hold-out set was not used. For final model assessment, precision, recall, F1-score and F2-score were averaged over 1000 randomized 90%-training/10%-testing splits to measure out-of-sample accuracy in tester.py.
An explicit hold-out set was not used because, with only 144 data points and 18 POIs, a stratified hold-out set of 20% would leave only around 3 POI points for a one-time final test. Such a small hold-out set would not give much confidence in the precision of the performance metrics, and it would also reduce the data available for building the model.
This is further addressed in the context of data-starved predictive modeling by Kuhn and Johnson (2013) and Hawkins et al. (2003).
“[…] when the number of samples is not large, a strong case can be made that a test set should be avoided because every sample may be needed for model building. […] Additionally, the size of the test set may not have sufficient power or precision to make reasonable judgements.” [29]
Hawkins et al. (2003) concisely summarize this point: “holdout samples of tolerable size […] do not match the cross-validation itself for reliability in assessing model fit and are hard to motivate.” [30]
Furthermore, parameters were tuned with a grid search (GridSearchCV [31]) over 1000 stratified shuffled cross-validation splits (90% training, 10% testing) to emulate the testing procedure used in tester.py. The K-best selection and PCA reduction were performed inside each of the 1000 cross-validation splits rather than once outside the selection process. Keeping these two steps in the inner loop gives a less biased estimate of performance on any new unseen data that this model might be used for.
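Keeping selection and PCA inside the cross-validation loop, and averaging several metrics over the splits, can be sketched with `cross_validate`. This is an illustration rather than a copy of tester.py: the data is synthetic, 50 splits stand in for 1000, and F2 is built with `make_scorer(fbeta_score, beta=2)`.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real data.
X, y = make_classification(
    n_samples=144, n_features=20, n_informative=5,
    weights=[0.875], random_state=0,
)

# Selection and PCA are pipeline steps, so each split refits them on its
# own training portion and the test portion never influences them.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("kbest", SelectKBest(score_func=f_classif, k=17)),
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(class_weight="balanced")),
])

scoring = {
    "precision": "precision",
    "recall": "recall",
    "f1": "f1",
    "f2": make_scorer(fbeta_score, beta=2),  # weights recall above precision
}

cv = StratifiedShuffleSplit(n_splits=50, test_size=0.1, random_state=0)
results = cross_validate(pipe, X, y, scoring=scoring, cv=cv)

# Average each metric over all splits.
means = {name: results["test_" + name].mean() for name in scoring}
print(means)
```

Splits in which the model predicts no positives may trigger an undefined-precision warning; scikit-learn scores those splits as 0 by default.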
When the final model was selected, it was fit to the training data and showed that 17 features were selected, then reduced into 2 principal components to be used in the final logistic regression classification model.
These features might change slightly each time the model is fit again in tester.py, since the final model is a pipeline which selects the k-best features inside of the pipeline. Below are the final 17 features chosen when the entire dataset was fit to the final chosen model pipeline:
Top 17 Features | | | |
---|---|---|---|
salary | to_messages | total_payments | exercised_stock_options |
bonus | restricted_stock | shared_receipt_with_poi | total_stock_value |
expenses | loan_advances | other | from_this_person_to_poi |
deferred_income | long_term_incentive | from_poi_to_this_person | total_compensation |
director_fees | | | |
The scoring function for picking the best models/parameters was a mix of maximizing the recall of the models searched over, while keeping the precision at or above 0.3.
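One possible reading of such a scoring rule is "recall, but only when precision clears a floor". The function below is a hypothetical sketch of that idea, not the scoring function actually used in the project; the name `recall_with_precision_floor` and its behaviour of returning 0 below the floor are assumptions.

```python
from sklearn.metrics import make_scorer, precision_score, recall_score


def recall_with_precision_floor(y_true, y_pred, min_precision=0.3):
    """Return recall, or 0.0 when precision falls below the floor."""
    precision = precision_score(y_true, y_pred, zero_division=0)
    if precision < min_precision:
        return 0.0
    return recall_score(y_true, y_pred, zero_division=0)


# Wrap the function so it can be passed as `scoring=` to a grid search.
scorer = make_scorer(recall_with_precision_floor)

# Precision 2/3 clears the floor, so the score is the recall.
print(recall_with_precision_floor([1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0]))
# Flagging everyone gives precision 1/8, below the floor.
print(recall_with_precision_floor([1, 0, 0, 0, 0, 0, 0, 0], [1] * 8))
```

A hard cutoff like this is simple, but it makes the score jump abruptly near the threshold; a smoother penalty would be another option.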
For our model, accuracy would be a sub-optimal evaluation metric due to the sparsity of the POIs being predicted. If we just guessed ‘Not a POI’ for everyone, we would attain 87.5% accuracy while not finding any perpetrators of fraud. For evaluating our model, we will be using both precision and recall.
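The 87.5% figure follows directly from the record counts given earlier (144 records, 18 POIs), as this short calculation shows:

```python
n_records = 144
n_poi = 18

# Always predicting "not a POI" is right for every non-POI and wrong for every POI.
baseline_accuracy = (n_records - n_poi) / n_records
baseline_recall = 0 / n_poi  # no POI is ever found

print(f"{baseline_accuracy:.1%}")  # 87.5%
print(baseline_recall)             # 0.0
```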
Precision can be thought of as the ratio of how often your model is actually correct in identifying a positive label to the total times it guesses a positive label. A higher precision score would mean fewer false positives.
Precision might be important in giving true customers discounts with frequent buyer status at checkout time. We would want to make sure all true customers get discounts smoothly to ensure customer loyalty, even if we get some false positives and give discounts to some people who don’t have frequent buyer status.
In our case, if we were using the model to judge whether or not to investigate someone as a possible person of interest, it would be how often the people we chose to investigate turned out to really be persons of interest.
Recall can be thought of as the ratio of how often your model correctly identifies a label as positive to how many total positive labels there actually are. A higher recall score would mean fewer false negatives.
Recall might be more important if used in security credential scanning; we would like to make sure no unauthorized people get into a secured facility even if we have to reject authorized credentials a few extra times and have people re-scan their credentials.
In our case, if we were to use the model to decide whether or not to investigate someone as a possible person of interest, recall would be how many persons of interest we identified out of the total number of persons of interest that there were.
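Both definitions reduce to simple ratios of confusion-matrix counts. The counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def precision(tp, fp):
    """Share of flagged people who really are POIs."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual POIs that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical outcome: 30 people flagged, 9 of them real POIs, 1 POI missed.
tp, fp, fn = 9, 21, 1

print(precision(tp, fp))  # 9 / 30
print(recall(tp, fn))     # 9 / 10
```

In this made-up case most POIs are found (high recall), but most of the people investigated turn out to be innocent (low precision).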
For this project, I would argue that in the context of searching for perpetrators of fraud in one of the largest cases of corporate fraud, recall is the more important criterion of the two. We would like to find all people who were involved, even if it means we have to investigate and clear more innocent people.
With the previous thoughts about the importance of recall in mind, the models were optimized toward higher recall, while maintaining a 0.3 precision threshold.
Model | GridSearch Recall Est. | Recall | Precision | KBest Features | PCA components | Class Weight |
---|---|---|---|---|---|---|
Logistic Regression | 0.9215 | 0.92700 | 0.30640 | 17 | 2 | ‘auto’ |
Linear SVC | 0.8935 | 0.88750 | 0.29657 | 17 | 2 | ‘auto’ |
SVM-C | 0.8255 | 0.83050 | 0.29269 | 17 | 2 | ‘auto’ |
For our models, we attained high recall (0.83 to 0.93), so for the most part we were able to identify the POIs that existed. Our precision was about 0.3, meaning that roughly seven out of ten people flagged were false positives: we investigated a good number of innocent people in our quest to find POIs.
An interesting observation was that tuning class weights played the biggest role in being able to tune the model toward better recall, precision, or F1 (or ROC-AUC), depending on which is more desirable. For the final models, ‘auto’ was used, which ‘automatically adjust weights inversely proportional to class frequencies.’ [34]
This made sense: since there were so few POIs, they would need to be given more weight to help prevent the model from missing them (false negatives).
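To make "inversely proportional to class frequencies" concrete, the sketch below applies the formula scikit-learn documents for `class_weight='balanced'`, namely `n_samples / (n_classes * class_count)`, to this dataset's class counts. Treat it as an approximation of the idea: the older `'auto'` option quoted above may have computed its weights slightly differently.

```python
n_samples = 144
n_classes = 2
class_counts = {"non-POI": 126, "POI": 18}

# Weight each class by n_samples / (n_classes * count): rarer class, larger weight.
weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}

print(weights)
```

Under this formula each POI counts seven times as much as each non-POI during training, so a missed POI costs the model far more than a false alarm.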
I would also like to have found a more intelligent way to add new features and prune them back than the univariate K-best selection process, perhaps by removing features that are highly correlated with other features in the dataset.
Some custom scoring functions were tried with minimal success (RandomForest as a custom KBest scorer) while increasing run-time. This might need to be explored further as well to get better performance.
This dataset presented interesting challenges for dealing with smaller, but still complex, datasets. The data still needed to be cleaned and intelligently approached to create informative models. Cross-validation techniques and workflow pipelines were paramount in creating reliable predictive models.
I thoroughly enjoyed exploring some of the machine learning techniques and data workflows required for data analysis in Python!
Source Code | |
---|---|
poi_id.py | Train POI classification models using helper modules. |
poi_add_features.py | Library for creating features to be used in creating the fraud person-of-interest (POI) prediction model. |
poi_data.py | Library for shaping the data to be used in creating the fraud person-of-interest (POI) prediction model. |
poi_model.py | Library for returning sk-learn pipelines and parameters for use in predictive model building. |
tools/tester.py | Basic script for importing student’s POI identifier, and checking the results that they get from it. |
tools/feature_format.py | A general tool for converting data from the dictionary format to an (n x k) python list that’s ready for training an sklearn algorithm. |