Miscellaneous Explorations Applying Machine Learning to Heavily Processed Tabular Datasets - Anomaly Detection
Tabular datasets are an important format to consider. Anomaly detection could be a highly interesting topic, but the dataset this post uses is so heavily curated that it is hard to build intuition about anomaly detection from it. Instead, this post is a proof of concept around fastai v2. Perhaps the most interesting result is that an out-of-the-box decision tree (the sklearn CART implementation) worked almost flawlessly, while the fastai `tabular_learner` struggled. Closing that performance gap is an area for future work.
- Introduction
- Code
- Dataset configuration
- Training / Testing Splits
- Keeping Track of Meaning with Categorical Targets
- First model - Simple and Easy to Introspect
- Success Metrics / Loss / Error Estimation
- Train a full model (single CART tree)
- Train a Random Forest with 4 parameters
- Feature Importance from trees in the Random Forest
- Examining Row-Level Feature Contributions
- Examining Performance of Neural Networks
- Conclusions
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
%matplotlib inline
from random import shuffle
from fastbook import *
from kaggle import api
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG
import pandas as pd
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
Introduction
Objective
The objective is to model the KDD'99 dataset, an anomaly detection dataset.
Dataset
KDD'99 is a typical old-school dataset. One surprising aspect is that there are no timestamps or system identifiers, which makes it difficult to associate the rows with each other in any meaningful way. The task therefore treats each observation as independent and identically distributed.
The dataset file contains 41 features plus the target (`attack`) and a `last_flag` column; the configuration below treats 31 features as numerical and 9 (the boolean and string fields) as categorical.
This study leaves for future work an evaluation of whether the Kaggle KDD'99 dataset is the same as the original KDD'99 dataset.
Benchmarks on KDD'99 (current state of the art marked with *)
Papers With Code lists the expected outcome metrics for a variety of learning methods.
Findings
- Random forests and ensembles perform best according to the paper.
- sklearn's CART implementation beats all of the benchmarks in the paper.
Citations
This notebook is a Frankenstein creation in which I adapt the FastAI Tabular Data Toolset, 2020 to a different dataset than was used in the tutorial. The significant differences between my notebook and the FastAI tutorial are:
- Adaptation of FastAI to KDD'99
- Example of using FastAI 2 for a canonical classification problem
- The deep dive into contribution profiling found here. This will be incorporated into the prcvd.ai core library.
- Removal of toy/tutorial content from the FastAI tutorial, such as the look into OOB errors and the discussion of neural networks for continuous features.
Code
Dataset configuration
We sourced the dataset by downloading it manually from Kaggle. The configuration parameters below were also extracted manually from the accompanying Kaggle description. Of particular interest was verifying the categorical and numerical features separately.
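As an aside, the kaggle API imported above could fetch the files directly rather than downloading them by hand. A minimal sketch follows; the dataset slug is an assumption (the NSL-KDD upload commonly mirrored on Kaggle), so swap in the slug of the upload you actually use.
# Sketch only: 'hassan06/nslkdd' is an assumed Kaggle dataset slug.
# Files are unzipped into the same archive folder used below.
api.dataset_download_files('hassan06/nslkdd',
                           path='/Users/home0/data/anomaly-detect/archive/',
                           unzip=True)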
data_dir = Path('/Users/home0/data/anomaly-detect/archive/')
train_ds = data_dir/'Train.txt'
bin_names = ['land', 'logged_in', 'root_shell', 'su_attempted', 'is_host_login','is_guest_login']
cat_str_names = ['protocol_type', 'service', 'flag']
cat_names = bin_names + cat_str_names
num_names = ['duration','src_bytes','dst_bytes','wrong_fragment',
'urgent','hot','num_failed_logins','num_compromised',
'num_root','num_file_creations','num_shells',
'num_access_files','num_outbound_cmds','count',
'srv_count','serror_rate','srv_serror_rate',
'rerror_rate','srv_rerror_rate','same_srv_rate',
'diff_srv_rate','srv_diff_host_rate','dst_host_count',
'dst_host_srv_count','dst_host_same_srv_rate',
'dst_host_diff_srv_rate','dst_host_same_src_port_rate',
'dst_host_serror_rate','dst_host_srv_serror_rate',
'dst_host_rerror_rate','dst_host_srv_rerror_rate']
procs = [Categorify, FillMissing, Normalize]
dep_var = 'attack'
header = [
"duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land",
"wrong_fragment","urgent","hot","num_failed_logins","logged_in",
"num_compromised","root_shell","su_attempted","num_root",
"num_file_creations",
"num_shells","num_access_files","num_outbound_cmds",
"is_host_login",
"is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate",
"rerror_rate","srv_rerror_rate","same_srv_rate", "diff_srv_rate",
"srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate",
"dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","attack",
"last_flag"]
df = pd.read_csv(train_ds, header=None)
df.columns = header
Looking at unique values in each column
Even though the dataset specification provided by the creators of KDD'99 labels some features as numerical, those features do not show the variation one would expect from truly continuous measurements. From here on, we treat the features with a numerical type as continuous, while the boolean and string fields are treated as categoricals. A truly continuously distributed random variable will, in theory, essentially never repeat a value. Some common reasons why numerical features are not truly continuous are:
- the device sampling has low sensitivity
- the ETL/ELT processes between the measurement device and us are losing significant digits.
This is helpful to understand because it stands to reason that some types of attacks will only be visible if the data is highly expressive.
print('Dataset Shape', df.shape)
[(col,len(df[col].drop_duplicates()))for col in df.columns]
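A quick way to quantify the point above is to count the distinct values per numerical column and flag the low-cardinality ones. The threshold of 10 below is arbitrary and only for illustration; it is not used downstream.
# Flag "numerical" columns that behave more like categoricals (illustrative threshold).
card = {col: df[col].nunique() for col in num_names}
low_card = sorted([col for col, n in card.items() if n <= 10])
print('numerical columns with <= 10 distinct values:', low_card)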
Training / Testing Splits
Unlike many tabular datasets that have primary keys to distinguish rows from each other, this dataset has none. That is almost certainly because the dataset has been anonymized and scrubbed. There is nothing we care to do about that, assuming the anonymization was done correctly. So, for the training/testing split, we simply take a random selection of rows. Like most of my experiments, `train_perc` (the percentage of the dataset used in training, where 1 - `train_perc` is the percentage used for held-out testing) is a hyperparameter in the sense that we may choose to vary it and see how that affects the modeling outcomes (a sketch of such a sweep follows the split code below).
from random import shuffle
shuf = list(df.index)
shuffle(shuf)
train_perc = 0.90
train_ct = int(len(shuf)*train_perc)
splits = (shuf[:train_ct], shuf[train_ct:])
cont = num_names
cat = cat_names
procs = [Categorify, FillMissing]
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)
xs, y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
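To treat `train_perc` as a hyperparameter in practice, one could re-split at a few values and compare validation accuracy. A minimal sketch with a plain CART tree follows; these runs are illustrative and are not part of the results reported below.
# Sketch: sensitivity of a plain CART tree to the size of the training split.
for tp in (0.5, 0.7, 0.9):
    idxs = list(df.index)
    shuffle(idxs)
    cut = int(len(idxs) * tp)
    to_tmp = TabularPandas(df, procs, cat, cont, y_names=dep_var,
                           splits=(idxs[:cut], idxs[cut:]))
    tree = DecisionTreeClassifier().fit(to_tmp.train.xs, to_tmp.train.y)
    acc = (tree.predict(to_tmp.valid.xs) == to_tmp.valid.y.values).mean()
    print(f'train_perc={tp}: valid accuracy={acc:.4f}')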
Keeping Track of Meaning with Categorical Targets
The data block API converts all categorical variables to integers, but those meanings get lost when we later try to understand and explain how the model works. The object `name_to_id` keeps track of the mapping between the way the models see the data and the way we understand it. Perhaps the most important detail I learned in this exercise was how the FastAI data block API keeps track of the mapping between categorical target names (the strings that describe your classes) and the numeric (`int`) representation used by the models. This becomes important later when we start trying to understand how each column contributes to the prediction. It is not covered in the FastAI tutorial on tabular data because that tutorial examines a regression model with a continuous target; here, we have a categorical target.
name_to_id = {name: i for i,name in enumerate(to.categorize.vocab)}
name_to_id
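The inverse mapping is just as useful when turning integer predictions back into attack names; the TargetProfiler class later in this notebook builds the same dictionary internally.
# Inverse of name_to_id: integer class id -> attack name.
id_to_name = {i: name for name, i in name_to_id.items()}
# Example usage once a model is fitted below:
# [id_to_name[p] for p in m.predict(valid_xs[:5])]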
Target Names
It is not clear to me, at the time of writing, that this is the correct way to map the target class names to the ids created by the Data Block API; however, it is my best guess. Link to forum topic here
First model - Simple and Easy to Introspect
m = DecisionTreeClassifier(max_leaf_nodes=4)
m.fit(xs, y);
draw_tree(m, xs, size=14, leaves_parallel=True, precision=2)
Success Metrics / Loss / Error Estimation
Here, we're looking at the fraction of observations in the test set that are classified correctly. Note that, despite their names, these functions do not compute root mean squared error; I kept the function names from the fastai tutorial to make copying and pasting code between notebooks easy.
def r_mse(pred, y):
    # Despite the name (kept from the fastai tutorial), this is classification
    # accuracy: the fraction of predictions that exactly match the labels.
    out = sum(np.equal(pred, y)) / len(y)
    return out
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
Train a full model (single CART tree)
m = DecisionTreeClassifier()
m.fit(xs, y);
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)
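As a sanity check on the naming, the same quantity can be computed with scikit-learn's standard accuracy metric; the two should agree.
from sklearn.metrics import accuracy_score
# m_rmse is accuracy in disguise, so sklearn's accuracy_score should match it.
assert abs(m_rmse(m, valid_xs, valid_y)
           - accuracy_score(valid_y, m.predict(valid_xs))) < 1e-12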
This is extremely high performance for a single tree. It is in the same ballpark as, and already exceeds, the state of the art reported in the Papers With Code article.
Train a Random Forest with 4 parameters
The model performs really well, even when we constrain it to have no singleton leaves. The reason we might want to increase `min_samples_leaf` is that we associate the stability, or generalizability, of the solution with the number of training instances per leaf: if every leaf contains multiple training instances, we expect those splits to be repeatable out of the training sample. Even when each leaf is constrained to contain at least 50 instances, the model accurately classifies >99% of the testing instances.
N_ESTIMATORS=40
MAX_SAMPLES=100_000
MAX_FEATURES=10
MIN_SAMPLES_LEAF=150
def rf(xs, y,
n_estimators=N_ESTIMATORS,
max_samples=MAX_SAMPLES,
max_features=MAX_FEATURES,
min_samples_leaf=MIN_SAMPLES_LEAF,
**kwargs):
model = RandomForestClassifier(
n_jobs=-1,
n_estimators=n_estimators,
max_samples=max_samples,
max_features=max_features,
min_samples_leaf=min_samples_leaf,
oob_score=True)
return model.fit(xs, y)
MIN_SAMPLES_LEAFS=[1, 10, 100, 200, 400, 800, 1000]
for ms in MIN_SAMPLES_LEAFS:
m = rf(xs, y, min_samples_leaf=ms);
print(
        'min_samples_leaf={}, train accuracy={}, valid accuracy={}'
.format(ms,m_rmse(m, xs, y),m_rmse(m, valid_xs, valid_y))
)
# Per-tree predictions on the validation set. Averaging integer class labels is a
# rough proxy carried over from the regression tutorial; it only agrees with the
# ensemble vote where the trees are unanimous.
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds_std = preds.std(0)
plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(len(m.estimators_))]);
Feature Importance from trees in the Random Forest
Feature importance measures the effectiveness of the split chosen by the CART algorithm based on the data.
Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated by the number of samples that reach the node, divided by the total number of samples. The higher the value the more important the feature.
(Stacey Ronaghan, Medium, paywalled)
Below, we examine feature importance using the approach proposed in the FastAI course. The helper below extracts the per-feature importances from a fitted model and sorts them in descending order.
def rf_feat_importance(m, df):
return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
).sort_values('imp', ascending=False)
fi = rf_feat_importance(m, xs)
fi[:10]
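To make the impurity-weighting idea above concrete, the same number can be reproduced by hand from a single fitted tree using the attributes sklearn exposes on tree_. This is a sketch for illustration; the forest-level importances are simply the average of this quantity over the trees.
def manual_importance(tree_model):
    # Impurity-based importance: for each internal node, credit the split feature
    # with (weighted samples * impurity) of the parent minus that of its children,
    # then normalize by the total sample weight and rescale to sum to one.
    t = tree_model.tree_
    imp = np.zeros(t.n_features)
    w = t.weighted_n_node_samples
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                       # leaf node, nothing to credit
            continue
        decrease = (w[node] * t.impurity[node]
                    - w[left] * t.impurity[left]
                    - w[right] * t.impurity[right])
        imp[t.feature[node]] += decrease
    imp /= w[0]
    return imp / imp.sum()

one_tree = m.estimators_[0]
assert np.allclose(manual_importance(one_tree), one_tree.feature_importances_)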
def plot_fi(fi):
return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)
plot_fi(fi[:30]);
Removing Features
Using feature importance, it is possible to discover which features can be removed without significant loss of model effectiveness. Using feature importance to assign priority to each feature, we can selectively remove the low importance features individually and retrain the model to estimate the impact.
to_keep = fi[fi.imp>0.005].cols
len(to_keep)
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]
m = rf(xs_imp, y)
m1 = rf(xs, y, min_samples_leaf=10)
m2 = rf(xs_imp, y, min_samples_leaf=10)
print(
    'Original feature set\nfeature_count={}, train accuracy={:.4f}, valid accuracy={:.4f}'
.format(len(xs.columns),
m_rmse(m1, xs, y),
m_rmse(m1, valid_xs, valid_y)
)
)
print(
    'Reduced feature set\nfeature_count={}, train accuracy={:.4f}, valid accuracy={:.4f}'
.format(len(xs_imp.columns),
m_rmse(m2, xs_imp, y),
m_rmse(m2, valid_xs_imp, valid_y)
)
)
print('Impact={:.8f}({:.3f}%)'.format(
m_rmse(m1, valid_xs, valid_y)-m_rmse(m2, valid_xs_imp, valid_y),
100*(m_rmse(m2, valid_xs_imp, valid_y) / m_rmse(m1, valid_xs, valid_y) -1)
)
)
plot_fi(rf_feat_importance(m, xs_imp));
cluster_columns(xs_imp)
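The dendrogram above is used to pick redundant columns to drop. The specific columns chosen are not recorded in this notebook, so the sketch below mirrors the fastai tutorial's approach of comparing OOB scores with and without each candidate; candidate_cols is a placeholder to be filled in from the plot.
# Sketch: judge candidate columns by the OOB score of a forest trained without them.
def get_oob(x):
    return rf(x, y).oob_score_

baseline_oob = get_oob(xs_imp)
candidate_cols = []   # fill in from the cluster_columns dendrogram (not recorded here)
oob_without = {c: get_oob(xs_imp.drop(c, axis=1)) for c in candidate_cols}
# Drop a column only if removing it barely moves the OOB score.
to_drop = [c for c, s in oob_without.items() if s >= baseline_oob - 0.001]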
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)
m = rf(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)
p = valid_xs_final['count'].hist(density=True)
p = valid_xs_final['src_bytes'].hist(density=True)
from sklearn.inspection import plot_partial_dependence
for target, target_id in name_to_id.items():
fig,ax = plt.subplots(figsize=(12, 4))
plt.title(label='Target name: {}'.format(target))
plot_partial_dependence(m, valid_xs_final, ['count','src_bytes'],
target=target_id, grid_resolution=20, ax=ax);
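Note that plot_partial_dependence was removed from scikit-learn in version 1.2; on newer releases the equivalent call is PartialDependenceDisplay.from_estimator, a drop-in sketch of which is shown here.
# Equivalent partial-dependence plots on scikit-learn >= 1.2.
from sklearn.inspection import PartialDependenceDisplay
for target, target_id in name_to_id.items():
    fig, ax = plt.subplots(figsize=(12, 4))
    plt.title(label='Target name: {}'.format(target))
    PartialDependenceDisplay.from_estimator(
        m, valid_xs_final, ['count', 'src_bytes'],
        target=target_id, grid_resolution=20, ax=ax);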
Examining Row-Level Feature Contributions
Examining the contribution measures reveals how the random forest arrives at its prediction for each instance in the validation set. I had used `treeinterpreter` in the past, but not `waterfall_chart`. These charts are nice, especially when shown side by side: we can, for example, compare the waterfall chart of a correctly predicted row to that of an incorrectly predicted row to reveal the differences in that row that caused the false positive. That comparison is shown below.
import warnings
warnings.simplefilter('ignore', FutureWarning)
from collections import OrderedDict
from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall
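Before wrapping it in a class, it helps to see what treeinterpreter returns for a classifier: per-row class probabilities that decompose exactly into a bias term (the training-set prior) plus one additive contribution per feature and class. A small sketch on a few validation rows:
# Sketch: inspect the shapes and verify the additive decomposition on three rows.
rows = valid_xs_final.iloc[:3]
pred, bias, contribs = treeinterpreter.predict(m, rows.values)
print(pred.shape, bias.shape, contribs.shape)  # (3, n_classes), (3, n_classes), (3, n_features, n_classes)
assert np.allclose(pred, bias + contribs.sum(axis=1))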
class TargetProfiler:
"""Profiler for a single target value of a multiple class dep_var"""
def __init__(self, est, valid_X, valid_y, dep_var,
target, name_to_id, n_components):
self.n_components = n_components
self.est = est
self.valid_X = valid_X
self.valid_y = valid_y
self.dep_var = dep_var
self.target = target
self.name_to_id = name_to_id
self.id_to_name = {idx: name for name, idx in self.name_to_id.items()}
self.subset_mask = (self.valid_y == self.name_to_id[self.target])
self.subset_xs = self.valid_X[self.subset_mask]
self.subset_ys = self.valid_y[self.subset_mask]
prediction, bias, contributions = treeinterpreter.predict(
model=self.est, X=self.subset_xs.values
)
self.data = {
i: {
'preds': prediction[i],
'bias': bias[i],
'contribs': contributions[i]}
for i in range(prediction.shape[0])
}
self.pred_ans = [(row == row.max()).nonzero()[0][0] for row in prediction]
        self.confusion_arr = [self.id_to_name[ans] for ans in self.pred_ans]
self.hits = []
self.misses = []
for i, elem in enumerate(self.confusion_arr):
if elem==self.target:
self.hits.append(i)
else:
self.misses.append(i)
self.hit_contribs_profiles = []
self.miss_contribs_profiles = []
self.contrib_profile_summary = {
'hits': {},
'misses': {}
}
self._create_contrib_profiles()
def _create_contrib_profiles(self):
"""Creates the contribution objects and summary"""
for idx in self.hits:
self.hit_contribs_profiles.append(
self.get_top_n_contributions(idx=idx)
)
for idx in self.misses:
self.miss_contribs_profiles.append(
self.get_top_n_contributions(idx=idx)
)
for profile in self.hit_contribs_profiles:
if profile not in self.contrib_profile_summary['hits']:
self.contrib_profile_summary['hits'][profile] = 0
self.contrib_profile_summary['hits'][profile] += 1
self.contrib_profile_summary['hits'] = OrderedDict(
{k: v
for k, v
in sorted(
self.contrib_profile_summary['hits'].items(),
key=lambda item: item[1],
reverse=True)
}
)
for profile in self.miss_contribs_profiles:
if profile not in self.contrib_profile_summary['misses']:
self.contrib_profile_summary['misses'][profile] = 0
self.contrib_profile_summary['misses'][profile] += 1
self.contrib_profile_summary['misses'] = OrderedDict(
{k: v
for k, v
in sorted(
self.contrib_profile_summary['misses'].items(),
key=lambda item: item[1],
reverse=True)
}
)
def print_summary(self):
print('There are {} TP predictions on which contributions are measured'.format(len(self.hit_contribs_profiles)))
print('There are {} unique TP contribution profiles considering the top {} feature contributions'.format(len(set(self.hit_contribs_profiles)), self.n_components))
print('There are {} FP predictions on which contributions are measured'.format(len(self.miss_contribs_profiles)))
print('There are {} unique FP contribution profiles considering the top {} feature contributions'.format(len(set(self.miss_contribs_profiles)), self.n_components))
def get_top_n_contributions(self, idx, n=None):
if not n:
n = self.n_components
data = self.get_instance_contributions(idx=idx).tolist()
df = pd.DataFrame(data, index=self.subset_xs.columns)
df['abs'] = df[0].apply(abs)
df.sort_values(by='abs', inplace=True, ascending=False)
return tuple([(k,v) for k,v in df.iloc[:n][0].to_dict().items()])
def get_instance_contributions(self, idx):
"""
Finds the contribution array for the max probability
class for an instance row
"""
        choice = self.name_to_id[self.confusion_arr[idx]]
choice_contributions = self.data[idx]['contribs'][:, choice]
return choice_contributions
def get_chart_config(self, idx, num_stds=0):
"""
Produces the variables necessary for the waterfall chart
"""
index=self.subset_xs.columns
data = self.get_instance_contributions(idx=idx)
threshold = self.get_instance_contributions(idx=idx).mean()+num_stds*self.get_instance_contributions(idx=idx).std()
Title =' '.join([
'instance id = {} in target="{}" predicted as "{}", (Pr={:.2f})',
'\nFiltering contributions < mu(contributions)+{}*sd(contributions)'
]) \
.format(idx, self.target, self.confusion_arr[idx], self.data[idx]['preds'].max(), num_stds)
sorted_value = True
net_label = 'Total'
rotation_value = 90
formatting = '{:,.3f}'
return index,data,rotation_value,threshold,sorted_value,net_label,formatting,Title
profiler = TargetProfiler(
est=m, dep_var=dep_var, name_to_id=name_to_id,
valid_X=valid_xs_final, valid_y=valid_y, target='ipsweep',
n_components=1
)
index,data,rotation_value,threshold,sorted_value,net_label,formatting,Title = \
profiler.get_chart_config(idx=0, num_stds=0)
waterfall(
index=index,
data=data,
rotation_value=rotation_value,
threshold=threshold,
sorted_value=sorted_value,
net_label=net_label,
formatting=formatting,
Title=Title
);
profiler.print_summary()
profiler.contrib_profile_summary
Summarizing the Contribution Analysis
Note: values may differ slightly in reproductions
The illustration above, including the code, enables several useful lines of questioning that help explain why the model predicts what it predicts. In particular, it lets us examine the contributions for a specific target value (one label of a multi-class problem). We introduce the concept of a "contribution profile": the array of contribution values across every feature in the model, which serves as an estimate of the latent space that best describes the relationship between the independent and dependent variables in the random forest model.
In the case above, the profile describes the predictions on attack='back'. We show a simplified case where the contribution profile is defined as the single largest-contributing feature and its contribution value. In every true positive, the feature with the largest contribution was `src_bytes`, with 11 distinct contribution values spanning roughly 0.45 to 0.76. Looking at `profiler.contrib_profile_summary`, we note that 70 of the hits shared the same `src_bytes` contribution of 0.7610. Additionally, we can visualize the contributions with a waterfall plot for examples of TPs and FPs with various characteristics.
Below, we look at the waterfall chart of the first of the 100 hits (out of 103 total) in the dataset where attack=='back', selected by specifying `idx=profiler.hits[0]`. The contribution of `src_bytes` is 0.761, which places this row among the 70 instances sharing the most common contribution profile. Below that, we look at the first of the 3 misses, selected with `idx=profiler.misses[0]`. The miss shows quite a different contribution profile: the model confidence reported in the chart title is lower (0.80 vs 0.97 for the hit), and the same `num_stds` filter yields a much more complicated profile containing 14 features compared to the 4-feature profile of the hit. We conclude that a confidence-based filter could eliminate quite a few of the false positives; a sketch of such a filter follows the charts below.
index,data,rotation_value,threshold,sorted_value,net_label,formatting,Title = \
profiler.get_chart_config(idx=profiler.hits[0], num_stds=0)
waterfall(
index=index,
data=data,
rotation_value=rotation_value,
threshold=threshold,
sorted_value=sorted_value,
net_label=net_label,
formatting=formatting,
Title=Title
);
index,data,rotation_value,threshold,sorted_value,net_label,formatting,Title = \
profiler.get_chart_config(idx=profiler.misses[0], num_stds=0)
waterfall(
index=index,
data=data,
rotation_value=rotation_value,
threshold=threshold,
sorted_value=sorted_value,
net_label=net_label,
formatting=formatting,
Title=Title
);
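Finally, a minimal sketch of the confidence-based filter suggested above: keep only predictions whose maximum class probability clears a threshold and count how many hits and misses survive. The 0.90 cutoff is illustrative, not tuned.
# Sketch: filter predictions by model confidence and compare survival of hits vs misses.
conf_threshold = 0.90
kept_hits = [i for i in profiler.hits
             if profiler.data[i]['preds'].max() >= conf_threshold]
kept_misses = [i for i in profiler.misses
               if profiler.data[i]['preds'].max() >= conf_threshold]
print(f'hits kept:   {len(kept_hits)}/{len(profiler.hits)}')
print(f'misses kept: {len(kept_misses)}/{len(profiler.misses)}')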
Examining Performance of Neural Networks
from fastai.tabular.all import *
dls = to.dataloaders(1024)
learn = tabular_learner(dls, layers=[500,250], n_out=len(y.unique()))
learn.lr_find()
learn.fit_one_cycle(20, 10E-1)
preds,targs = learn.get_preds()
chk = np.equal(preds.argmax(dim=1).unsqueeze(1), targs).squeeze(-1)
print('Accuracy:', chk.sum() / (chk.shape[0]*1.0))
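For reference, the first adjustments one would likely try are an explicit accuracy metric and a learning rate taken from lr_find rather than 1.0. This is a sketch of that setup only; it is an assumption about what to try next, not what was run above.
# Sketch: a more conventional training configuration for the tabular_learner.
learn2 = tabular_learner(dls, layers=[500, 250], metrics=accuracy,
                         n_out=len(y.unique()))
suggested = learn2.lr_find()               # returns suggested learning rates
learn2.fit_one_cycle(10, suggested.valley) # .valley on recent fastai versions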
Conclusions
The KDD'99 dataset is particularly well suited to more traditional modeling techniques like decision trees and random forests. The basic attempt at fitting a neural network failed to produce an effective model, let alone a standout model like the one easily produced with the random forest. Most likely, the KDD'99 dataset has been heavily cleaned and processed, which is why the decision trees work so well.
Future Work
It would be interesting to examine the extent to which neural networks can be coaxed into performing better. Future work will also likely involve finding a dataset that is a little more realistic.