-
Using Word2Vec+CNN to detect the Malicious URL and it's a really exquisite structure!
-
Finially result about 96.2% precision
-
High scalability supporting for Distributed System
-
Supporting for Online Learning
- Tensorflow 1.1.0
- Numpy
- Gensim 2.0.0
python train.py --help
usage: train.py [-h] [--data_file DATA_FILE] [--num_labels NUM_LABELS]
[--embedding_dim EMBEDDING_DIM] [--filter_sizes FILTER_SIZES]
[--num_filters NUM_FILTERS]
[--dropout_keep_prob DROPOUT_KEEP_PROB]
[--l2_reg_lambda L2_REG_LAMBDA] [--batch_size BATCH_SIZE]
[--num_epochs NUM_EPOCHS] [--evaluate_every EVALUATE_EVERY]
[--checkpoint_every CHECKPOINT_EVERY]
[--num_checkpoints NUM_CHECKPOINTS]
[--allow_soft_placement [ALLOW_SOFT_PLACEMENT]]
[--noallow_soft_placement]
[--log_device_placement [LOG_DEVICE_PLACEMENT]]
[--nolog_device_placement]
[--noreplicas] [--is_sync [IS_SYNC]] [--nois_sync]
[--ps_hosts PS_HOSTS] [--worker_hosts WORKER_HOSTS]
[--job_name JOB_NAME] [--task_index TASK_INDEX]
[--log_dir LOG_DIR]
optional arguments:
-h, --help show this help message and exit
--data_file DATA_FILE
Data source
--num_labels NUM_LABELS
Number of labels for data. (default: 2)
--embedding_dim EMBEDDING_DIM
Dimensionality of character embedding (default: 128)
--filter_sizes FILTER_SIZES
Comma-spearated filter sizes (default: '3,4,5')
--num_filters NUM_FILTERS
Number of filters per filter size (default: 128)
--dropout_keep_prob DROPOUT_KEEP_PROB
Dropout keep probability (default: 0.5)
--l2_reg_lambda L2_REG_LAMBDA
L2 regularization lambda (default: 0.0)
--batch_size BATCH_SIZE
Batch Size (default: 64)
--num_epochs NUM_EPOCHS
Number of training epochs (default: 200)
--evaluate_every EVALUATE_EVERY
Evalue model on dev set after this many steps
(default: 100)
--checkpoint_every CHECKPOINT_EVERY
Save model after this many steps (defult: 100)
--num_checkpoints NUM_CHECKPOINTS
Number of checkpoints to store (default: 5)
--allow_soft_placement [ALLOW_SOFT_PLACEMENT]
Allow device soft device placement
--noallow_soft_placement
--log_device_placement [LOG_DEVICE_PLACEMENT]
Log placement of ops on devices
--nolog_device_placement
--replicas [REPLICAS]
Use the dirstribution mode
--noreplicas
--is_sync [IS_SYNC] Use the async or sync mode
--nois_sync
--ps_hosts PS_HOSTS comma-separated lst of hostname:port pairs
--worker_hosts WORKER_HOSTS
comma-separated lst of hostname:port pairs
--job_name JOB_NAME job name:worker or ps
--task_index TASK_INDEX
Worker task index,should be >=0, task=0 is the master
worker task the performs the variable initialization
--log_dir LOG_DIR parameter and log info
Let's take 192.168.0.107 as ps server , 10.211.55.13 and 10.211.55.14 as training server.
Make every machine has a copy of the code.
On 192.168.0.107:
python train.py --replicas=True --job_name=ps --task_index=0 --ps_hosts=192.168.0.107:2222\
--worker_hosts=10.211.55.13:2222,10.211.55.14:2222
On 10.211.55.13:
python train.py --replicas=True --job_name=worker --task_index=0 --ps_hosts=192.168.0.107:2222\
--worker_hosts=10.211.55.13:2222,10.211.55.14:2222
On 10.211.55.14:
python train.py --replicas=True --job_name=worker --task_index=1 --ps_hosts=192.168.0.107:2222\
--worker_hosts=10.211.55.13:2222,10.211.55.14:2222
On 192.168.0.107:
python train.py --replicas=True --is_sync=True --job_name=ps --task_index=0 --ps_hosts=192.168.0.107:2222\
--worker_hosts=10.211.55.13:2222,10.211.55.14:2222
On 10.211.55.13:
python train.py --replicas=True --is_sync=True --job_name=worker --task_index=0 --ps_hosts=192.168.0.107:2222\
--worker_hosts=10.211.55.13:2222,10.211.55.14:2222
On 10.211.55.14:
python train.py --replicas=True --is_sync=True --job_name=worker --task_index=1 --ps_hosts=192.168.0.107:2222\
--worker_hosts=10.211.55.13:2222,10.211.55.14:2222
python eval.py --help
usage: eval.py [-h] [--input_text_file INPUT_TEXT_FILE][--single_url SINGLE_URL]
[--input_label_file INPUT_LABEL_FILE] [--batch_size BATCH_SIZE]
[--checkpoint_dir CHECKPOINT_DIR] [--eval_train [EVAL_TRAIN]]
[--noeval_train]
[--allow_soft_placement [ALLOW_SOFT_PLACEMENT]]
[--noallow_soft_placement]
[--log_device_placement [LOG_DEVICE_PLACEMENT]]
[--nolog_device_placement]
python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints}
python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints} --single_url=hottraveljobs.com/forum/docs/info.php
Here I use the defualt checkpoint_dir to detection single_url
python eval.py --single_url=hottraveljobs.com/forum/docs/info.php
python eval.py --checkpoint_dir ./runs/{TIME_DIR}/checkpoints} --input_text_file="../data/data2.csv"
This is the HTTP service to load TensorFlow model and inference to predict malicious url.
Run HTTP server with [Django] and use HTTP client under /server
./manage.py runserver 0.0.0.0:8000
Use url as your GET parameter
127.0.0.1:8000/detection/predict/?url=appst0re.net/upload.aspx
And you will get
Success to predict appst0re.net/upload.aspx, result: bad
django-admin startproject server
python manage.py startapp detection
#Add customized urls and views.
[1] Cnn-Text-Classification-TF
[2] Convolutional Neural Networks for Sentence Classification
[3] Using Word2Vec+ CNN to Detect Malicious URL