Pre-processing the NSL_KDD dataset

I want to load the NSL_KDD dataset from this link using Python.


In this dataset, the attack labels of the training and test data are supposed to be grouped into 5 separate classes (Normal, DOS, U2R, R2L, Probe).
But when I run this line of code, y_test = pd.get_dummies(y_test), instead of being categorized into 5 classes it shows me 22 columns, while doing the same thing on the training data (target = pd.get_dummies(target)) gives the correct result.
import numpy as np
import pandas as pd

with open('G:/RUN_PYTHON/kddcup.names.txt', 'r') as infile:
    kdd_names = infile.readlines()
kdd_cols = [x.split(':')[0] for x in kdd_names[1:]]

The Train+/Test+ datasets include a sample difficulty rating and the attack class

kdd_cols += ['class', 'difficulty']

kdd = pd.read_csv('G:/RUN_PYTHON/KDDTrain+.txt', names=kdd_cols)
kdd_t = pd.read_csv('G:/RUN_PYTHON/KDDTest+.txt', names=kdd_cols)
#kdd = pd.read_csv('G:/RUN_PYTHON/kddcup.txt.data_10_percent_corrected', names=kdd_cols)
#kdd_t = pd.read_csv('G:/RUN_PYTHON/kddcup.testdata.unlabeled_10_percent', names=kdd_cols)
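A quick sanity check of the raw labels can help with debugging at this point; this is just a sketch, not part of the original script:

```python
# Quick check (sketch): shapes and the raw attack labels in each split
print(kdd.shape, kdd_t.shape)
print(kdd['class'].unique())
print(kdd_t['class'].unique())
```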

Consult the linked references for attack categories:

https://www.researchgate.net/post/What_are_the_attack_types_in_the_NSL-KDD_TEST_set_For_example_processtable_is_a_attack_type_in_test_set_Im_wondering_is_it_prob_DoS_R2L_U2R

The traffic can be grouped into 5 categories: Normal, DOS, U2R, R2L, Probe

or more coarsely into Normal vs Anomalous for the binary classification task
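For the binary variant, a minimal sketch could look like the following; the column name binary_class is only an illustration and is not used in the rest of the script:

```python
# Sketch: collapse the raw labels into Normal (0) vs Anomalous (1)
kdd['binary_class'] = (kdd['class'] != 'normal').astype(int)
kdd_t['binary_class'] = (kdd_t['class'] != 'normal').astype(int)
```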

kdd_cols = [kdd.columns[0]] + sorted(list(set(kdd.protocol_type.values))) + sorted(list(set(kdd.service.values))) + sorted(list(set(kdd.flag.values))) + kdd.columns[4:].tolist()
attack_map = [x.strip().split() for x in open('G:/RUN_PYTHON/training_attack_types.txt', 'r')]
attack_map = {x[0]: x[1] for x in attack_map if x}
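It can also be worth checking which raw test labels are not covered by attack_map; a small debugging sketch:

```python
# Which raw test labels have no entry in attack_map (and are not 'normal')?
missing = set(kdd_t['class'].unique()) - set(attack_map) - {'normal'}
print(sorted(missing))
```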

Here we opt for the 5-class problem

kdd['class'] = kdd['class'].replace(attack_map)
kdd_t['class'] = kdd_t['class'].replace(attack_map)

def cat_encode(df, col):
    return pd.concat([df.drop(col, axis=1), pd.get_dummies(df[col].values)], axis=1)

def log_trns(df, col):
    return df[col].apply(np.log1p)

cat_lst = ['protocol_type', 'service', 'flag']
for col in cat_lst:
    kdd = cat_encode(kdd, col)
    kdd_t = cat_encode(kdd_t, col)

log_lst = ['duration', 'src_bytes', 'dst_bytes']
for col in log_lst:
    kdd[col] = log_trns(kdd, col)
    kdd_t[col] = log_trns(kdd_t, col)

kdd = kdd[kdd_cols]
for col in kdd_cols:
    if col not in kdd_t.columns:
        kdd_t[col] = 0
kdd_t = kdd_t[kdd_cols]
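The same column alignment can be written more compactly with reindex; a minimal sketch with the same effect as the loop above:

```python
# Align the test frame to the training columns;
# columns missing from the test set are created and filled with 0
kdd_t = kdd_t.reindex(columns=kdd_cols, fill_value=0)
```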

Now we have applied one-hot encoding and log scaling

difficulty = kdd.pop('difficulty')
target = kdd.pop('class')
y_diff = kdd_t.pop('difficulty')
y_test = kdd_t.pop('class')

target = pd.get_dummies(target)
print(target)
y_test = pd.get_dummies(y_test)
print(y_test)

the output of target:

Out[27]:
        dos  normal  probe  r2l  u2r
0         0       1      0    0    0
1         0       1      0    0    0
2         1       0      0    0    0
3         0       1      0    0    0
4         0       1      0    0    0
...     ...     ...    ...  ...  ...
125968    1       0      0    0    0
125969    0       1      0    0    0
125970    0       1      0    0    0
125971    1       0      0    0    0
125972    0       1      0    0    0

[125973 rows x 5 columns]

the output of y_test:

       apache2  dos  httptunnel  mailbomb  mscan  named  normal  probe  \
0            0    1           0         0      0      0       0      0
1            0    1           0         0      0      0       0      0
2            0    0           0         0      0      0       1      0
3            0    0           0         0      0      0       0      0
4            0    0           0         0      1      0       0      0
...        ...  ...         ...       ...    ...    ...     ...    ...
22541        0    1           0         0      0      0       0      0
22542        0    0           0         0      0      0       1      0
22543        0    0           0         0      1      0       0      0

       processtable  ps  ...  sendmail  snmpgetattack  snmpguess  sqlattack  \
0                 0   0  ...         0              0          0          0
1                 0   0  ...         0              0          0          0
...             ...  ..  ...       ...            ...        ...        ...
22542             0   0  ...         0              0          0          0
22543             0   0  ...         0              0          0          0

       u2r  udpstorm  worm  xlock  xsnoop  xterm
0        0         0     0      0      0      0
1        0         0     0      0      0      0
...    ...       ...   ...    ...    ...    ...
22542    0         0     0      0      0      0
22543    0         0     0      0      0      0

[22544 rows x 22 columns]
best regards

It seems your attack_map doesn’t contain all necessary classes to replace all targets from the test set.
The attack_map contains the replacement map for:

{'back': 'dos',
 'buffer_overflow': 'u2r',
 'ftp_write': 'r2l',
 'guess_passwd': 'r2l',
 'imap': 'r2l',
 'ipsweep': 'probe',
 'land': 'dos',
 'loadmodule': 'u2r',
 'multihop': 'r2l',
 'neptune': 'dos',
 'nmap': 'probe',
 'perl': 'u2r',
 'phf': 'r2l',
 'pod': 'dos',
 'portsweep': 'probe',
 'rootkit': 'u2r',
 'satan': 'probe',
 'smurf': 'dos',
 'spy': 'r2l',
 'teardrop': 'dos',
 'warezclient': 'r2l',
 'warezmaster': 'r2l'}

which should work for the training set, which contains kdd['class'].unique():

 array(['normal', 'neptune', 'warezclient', 'ipsweep', 'portsweep',
       'teardrop', 'nmap', 'satan', 'smurf', 'pod', 'back',
       'guess_passwd', 'ftp_write', 'multihop', 'rootkit',
       'buffer_overflow', 'imap', 'warezmaster', 'phf', 'land',
       'loadmodule', 'spy', 'perl'], dtype=object)

such that after calling kdd['class'] = kdd['class'].replace(attack_map) it will contain:

array(['normal', 'dos', 'r2l', 'probe', 'u2r'], dtype=object)

However, the test set seems to contain more unique classes:

array(['neptune', 'normal', 'saint', 'mscan', 'guess_passwd', 'smurf',
       'apache2', 'satan', 'buffer_overflow', 'back', 'warezmaster',
       'snmpgetattack', 'processtable', 'pod', 'httptunnel', 'nmap', 'ps',
       'snmpguess', 'ipsweep', 'mailbomb', 'portsweep', 'multihop',
       'named', 'sendmail', 'loadmodule', 'xterm', 'worm', 'teardrop',
       'rootkit', 'xlock', 'perl', 'land', 'xsnoop', 'sqlattack',
       'ftp_write', 'imap', 'udpstorm', 'phf'], dtype=object)

so you end up with:

kdd_t['class'] = kdd_t['class'].replace(attack_map)
print(kdd_t['class'].unique())
> ['dos' 'normal' 'saint' 'mscan' 'r2l' 'apache2' 'probe' 'u2r'
 'snmpgetattack' 'processtable' 'httptunnel' 'ps' 'snmpguess' 'mailbomb'
 'named' 'sendmail' 'xterm' 'worm' 'xlock' 'xsnoop' 'sqlattack' 'udpstorm']
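Once every test label has been mapped to one of the five categories, you could also pin the dummy columns explicitly so that train and test always produce the same columns. A minimal sketch, assuming a pandas version that provides CategoricalDtype:

```python
classes = ['dos', 'normal', 'probe', 'r2l', 'u2r']
# get_dummies on a categorical series emits one column per category,
# even for categories that never occur in the data;
# labels outside `classes` become NaN and end up as all-zero rows
target = pd.get_dummies(target.astype(pd.CategoricalDtype(categories=classes)))
y_test = pd.get_dummies(y_test.astype(pd.CategoricalDtype(categories=classes)))
```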

PS: It’s easier to debug if you wrap the code in three backticks ``` :wink:

Hi @ptrblck, yes, you’re right. Thanks for your response.

I am a newbie in Python programming. I want to load the data according to the table in the article, but I don’t know how to categorize the training and test data of the NSL_KDD dataset into ('normal', 'dos', 'r2l', 'probe', 'u2r').

[image: the table from the article]

I changed this line of code, kdd_t['class'] = kdd_t['class'].replace(attack_map), to kdd_t['class'] = kdd['class'].replace(attack_map), but I don’t know whether this change is the right thing to do for my target, which is shown in the picture above.

Oh no, you shouldn’t use this replacement, since you are assigning the training set to the test set and will see major data leakage in your experiments.

I would suggest extending the attack_map table with all missing mappings.
I.e. since the test set contains more unique classes, you would have to add e.g. the mapping {'worm': 'U2R'}. Note that I don’t know how the additional classes should be mapped, so this is just an example, since I’m not familiar with this domain. :wink:
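A minimal sketch of what that could look like; the categories for the new attack types are placeholders that you have to verify yourself:

```python
# Extend the mapping with the attack types that only appear in the test set.
# 'worm': 'u2r' is just the example from above -- check every category yourself.
attack_map.update({
    'worm': 'u2r',
    # 'apache2': ..., 'mailbomb': ..., 'mscan': ..., 'saint': ..., etc.
})
kdd_t['class'] = kdd_t['class'].replace(attack_map)
```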

Although you’re right, the table above only uses 22 attack types for training and testing.
I don’t know how to do the classification according to the article (A Deep Learning Approach to Network Intrusion Detection).
Could anyone help me? I desperately need help.

As already said, I’m not familiar with the dataset and paper, but since the test set contains more attack types, which are neither in the attack_map nor in the table, you could just drop these types.

Thanks for your response, but I still can’t solve the problem described in my question:
(When I run this line of code, y_test = pd.get_dummies(y_test), instead of being categorized into 5 classes, it shows me 22 columns, while the same thing on the training data (target = pd.get_dummies(target)) gives the correct result.)
I replaced that line with kdd_t['class'] = kdd['class'].replace(attack_map), but you said this is incorrect.

As explained, your test data frame contains more unique classes, which are not in the attack map and won’t be replaced.

Don’t do this, as you are reusing the training dataset and all your experiments will be invalid.

If you want to remove the additional classes from the test DataFrame, this code should work:

kdd['class'] = kdd['class'].replace(attack_map)
unique_classes = kdd['class'].unique()
kdd_t['class'] = kdd_t['class'].replace(attack_map)

# Drop unwanted classes
unique_classes_test = kdd_t['class'].unique()
drop_class_indices = (~np.in1d(unique_classes_test, unique_classes)).nonzero()[0]
for drop_class_idx in drop_class_indices:
    kdd_t = kdd_t[kdd_t['class'] != unique_classes_test[drop_class_idx]]
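Equivalently, the drop can be written in one line with isin; just a sketch that gives the same result as the loop above:

```python
# Keep only the rows whose class also occurs in the training set
kdd_t = kdd_t[kdd_t['class'].isin(unique_classes)]
```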

Thank you very much for your reply, but when I insert these commands and then execute the following lines,

We rescale features to [0, 1]

from sklearn.preprocessing import MinMaxScaler

min_max_scaler = MinMaxScaler()
train = min_max_scaler.fit_transform(train)
test = min_max_scaler.transform(test)

I get this error:

File "C:\Python\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 43, in _assert_all_finite
    "or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

which points back to line 43, which contains these statements:

kdd['class'] = kdd['class'].replace(attack_map)
unique_classes = kdd['class'].unique()
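A quick check right before the scaler might show where the NaN values come from; just a debugging sketch, assuming train and test are still pandas DataFrames at that point:

```python
# List the columns that still contain NaN values before calling the scaler
print(train.columns[train.isna().any()].tolist())
print(test.columns[test.isna().any()].tolist())
```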