Search advertisement platform
- Designed and developed a web crawler that crawled data for 500,000 products from Amazon (Java, Jsoup, proxies)
- Developed the search ads workflow: query understanding, ad selection from an inverted index (Memcached), ad ranking, ad filtering, ad pricing, and ad allocation
- Used Memcached as the ads inverted index and built the ads forward index in a MySQL database that contains basic ad information
- Built an Ads Index Server that uses gRPC to send ad candidates to the Ads Web Server
- Predicted click probability with features generated from a simulated search log
Used Jsoup to crawl product information from Amazon.
- Finished
- extract price, product detail URL, product image URL, and category from the web page
- convert each product to an ad
- store ads to a file, each ad in JSON format
- support paging
- log all exceptions
- avoid bot detection: use proxy IPs and rotate browsers
- performance: use a message queue to implement a distributed web crawler
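The "convert each product to an ad" and "store ads to a file" steps above can be sketched in a few lines. This is only an illustration in Python (the crawler itself is written in Java), and it assumes one JSON object per line; the function names and the exact set of fields are hypothetical, chosen to loosely follow the Ad schema in this README.

```python
import json

def product_to_ad(ad_id, product):
    """Convert one crawled product (a dict) into an ad record.

    The field names are assumptions that mirror the Ad table columns.
    """
    return {
        "adId": ad_id,
        "title": product["title"],
        "price": product["price"],
        "detail_url": product["detail_url"],
        "thumbnail": product["thumbnail"],
        "category": product["category"],
    }

def write_ads(products, path):
    """Write one JSON object per line so the file can be read back line by line."""
    with open(path, "w", encoding="utf-8") as out:
        for ad_id, product in enumerate(products, start=1):
            out.write(json.dumps(product_to_ad(ad_id, product)) + "\n")
```

Writing one record per line keeps memory use flat when the file is later deduplicated or loaded into the index.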
Search advertising places online advertisements on the result pages that a search engine shows in response to user queries. This search ads server takes thousands of products as ad candidates and selects, filters, ranks, allocates, and prices the ads when a search query comes in. The selection and ranking of search ads are based on ad quality and the bid price offered by advertisers.
- Forward index for Ad detail information (MySQL)
Ad:
AdId CampaignID Keywords Bid Description Detail_URL Category Title Brand Thumbnail
Campaign:
CampaignID Budget
- Inverted index for Ad keywords (Memcached)
- clean the text with Lucene
- train a word2vec model on the ads keywords corpus and use synonyms to rewrite the query
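The inverted index for ad keywords listed above can be pictured as a map from each keyword to the IDs of the ads that contain it. The sketch below uses a plain Python dict as a stand-in for Memcached; the function names and the shape of the ad records are assumptions made for illustration, not the project's actual code.

```python
from collections import defaultdict

def build_inverted_index(ads):
    """Map each keyword to the list of ad IDs whose keyword list contains it."""
    index = defaultdict(list)
    for ad in ads:
        # set() ensures an ad is listed once per keyword even if a word repeats
        for word in set(ad["keyWords"]):
            index[word].append(ad["adId"])
    return dict(index)

def select_candidates(index, query_terms):
    """Return the IDs of ads that share at least one keyword with the query."""
    candidates = set()
    for term in query_terms:
        candidates.update(index.get(term, []))
    return candidates
```

With this layout, ad selection costs one lookup per query term, and the full ad details are fetched from the forward index only for the selected IDs.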
Ad candidates are first evaluated and filtered by relevance score, which measures how relevant the query is to an ad's keywords. The initial version is: relevance score = number of keywords that match the query / total number of keywords in the ad. For quick retrieval of ad information, an inverted index of ad keywords is built and stored in the cache.
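As a minimal sketch of the initial relevance score described above, the function below counts how many of an ad's keywords appear in the query and divides by the total number of keywords. The filter threshold `min_relevance` is a hypothetical parameter; the README does not state the cutoff actually used.

```python
def relevance_score(query_terms, ad_keywords):
    """Fraction of the ad's keywords that also appear in the query."""
    if not ad_keywords:
        return 0.0
    query = set(query_terms)
    matched = sum(1 for word in ad_keywords if word in query)
    return matched / len(ad_keywords)

def filter_by_relevance(query_terms, ads, min_relevance=0.1):
    """Keep ads whose relevance reaches min_relevance (an assumed threshold)."""
    return [
        ad for ad in ads
        if relevance_score(query_terms, ad["keyWords"]) >= min_relevance
    ]
```

For example, the query "home theater system" against an ad with keywords home, theater, speaker, system matches 3 of 4 keywords, so the score is 0.75.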
Improvement:
use TF-IDF as the relevance score
The probability of user click (p-click) plays an important role in ads ranking.
Spark ML is used to process simulated user click log data and train a prediction model.
- Click log
log:
Device IP, Device id, Session id, Query, AdId, CampaignId, Ad_category_Query_category(0/1), clicked(0/1)
- Feature space
pClick features are extracted from the search log and stored in a key-value store
- Model
Logistic Regression
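One plausible way to derive pClick features from the click log is to count impressions and clicks per feature value and keep the counts in a key-value store. The sketch below assumes comma-separated log lines in the field order listed above and uses a dict in place of the real store; the key format, the function names, and the choice of which fields to aggregate are all assumptions, not taken from the project's scripts.

```python
from collections import defaultdict

# Field order follows the click log description in this README.
FIELDS = ["device_ip", "device_id", "session_id", "query",
          "ad_id", "campaign_id", "category_match", "clicked"]

def parse_log_line(line):
    """Split one comma-separated log line into a dict keyed by field name."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(FIELDS, values))

def count_features(lines, key_fields=("device_ip", "ad_id", "campaign_id")):
    """Count impressions and clicks for each value of the chosen fields."""
    store = defaultdict(int)  # stand-in for the key-value store
    for line in lines:
        record = parse_log_line(line)
        for field in key_fields:
            prefix = f"{field}:{record[field]}"
            store[prefix + ":impressions"] += 1
            store[prefix + ":clicks"] += int(record["clicked"])
    return dict(store)
```

At serving time, counts like these can be looked up by key and passed to the model as numeric features.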
Quality Score = 0.25 * Relevance Score + 0.75 * pClick
Rank Score = Quality Score * Bid
Price(Cost Per Click) = next rank score / current quality score + 0.01
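The three formulas above can be traced with a short worked sketch. The function names and the input record shape are hypothetical. Two choices are assumptions because the README does not specify them: the last-ranked ad has no next ad, so its price is left as None, and an ad with a quality score of 0 is also left unpriced to avoid dividing by zero.

```python
def quality_score(relevance, p_click):
    """Quality Score = 0.25 * Relevance Score + 0.75 * pClick."""
    return 0.25 * relevance + 0.75 * p_click

def rank_and_price(ads):
    """Rank ads by Quality Score * Bid and price each from the next ad's rank score."""
    scored = []
    for ad in ads:
        quality = quality_score(ad["relevance"], ad["pClick"])
        scored.append({**ad, "quality": quality, "rank_score": quality * ad["bid"]})
    scored.sort(key=lambda ad: ad["rank_score"], reverse=True)

    for position, ad in enumerate(scored):
        has_next = position + 1 < len(scored)
        if has_next and ad["quality"] > 0:
            next_rank = scored[position + 1]["rank_score"]
            ad["cpc"] = next_rank / ad["quality"] + 0.01
        else:
            ad["cpc"] = None  # assumption: no price defined for this case
    return scored
```

For instance, an ad with relevance 1.0, pClick 0.2, and bid 2.0 has quality 0.4 and rank score 0.8. If the next ad has relevance 0.5, pClick 0.1, and bid 3.0 (quality 0.2, rank score 0.6), the first ad pays about 0.6 / 0.4 + 0.01 = 1.51 per click, which is below its bid of 2.0.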
python ../python/spark-warehouse/generate_budget.py budget.json
python -m json.tool ads.txt
python ../../python/spark-warehouse/dedupe_ads.py ../crawled/ads.txt clean_ads1.txt
python generate_user.py user_small.txt
python generate_budget.py budget.txt
python ../python/spark-warehouse/generate_query_ad.py ../data/deduped/clean_ads.txt query_camp_ad_file.json campaign_weight_file.json ad_weight_file.json query_group_id_query_file.json campaignId_category_file.json campaignId_adId_file.json
python ../python/spark-warehouse/generate_click_log.py ../data/deduped/clean_ads.txt ../data/log/user_small.txt query_camp_ad_file.json campaign_weight_file.json ad_weight_file.json campaignId_category_file.json campaignId_adId_file.json click_log_small.txt
install j2ee eclipse
install mysql
install mysql-connector for java
install mysql-workbench
python ../../python/spark-warehouse/generate_word2vec_training_data.py ../deduped/clean_ads.txt word2vec_training_cleaned.txt
python ../../python/spark-warehouse/word2vec.py word2vec_training_cleaned.txt word2vec_training.txt
memcached -p 11219 -l 127.0.0.1 -d
python generate_synonmy.py ../../data/log/word2vec_training.txt ../../data/deduped/clean_ads.txt
python select_feature.py ../../simpleads/click_log_small.txt
python store_ctr_feature.py
python prepare_ctr_training_data.py /home/chengwei/Projects/SearchAds/simpleads/click_log_small.txt
python ctr_logistic.py
python ctr_gbdt.py
use http://localhost:9090/SearchAds?q=home%20theater%20sysmtem&did=87843&dip=32772&qclass=Electronics
CREATE TABLE `ad` (
`adId` int(11) NOT NULL,
`campaignId` int(11) DEFAULT NULL,
`keyWords` varchar(1024) DEFAULT NULL,
`bidPrice` double DEFAULT NULL,
`price` double DEFAULT NULL,
`thumbnail` mediumtext,
`description` mediumtext,
`brand` varchar(100) DEFAULT NULL,
`detail_url` mediumtext,
`category` varchar(1024) DEFAULT NULL,
`title` varchar(2048) DEFAULT NULL,
PRIMARY KEY (`adId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
CREATE TABLE `campaign` (
`campaignId` int(11) NOT NULL,
`budget` double DEFAULT NULL,
PRIMARY KEY(`campaignId`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;