# Linkedin-sildeshare-scraper

A LinkedIn SlideShare web crawler that downloads files from SlideShare.

## Installation

- Python 2.7.*
- Beautiful Soup 4

  ```shell
  $ pip install bs4
  ```

- Selenium WebDriver

  ```shell
  $ pip install selenium
  ```

- Replace or update `chromedriver` to the latest version for your OS (see the ChromeDriver downloads page).

## Usage

1. Open `sharesilde_crawler.py` in a text editor.
2. Set the parameters (see the "scraper settings" screenshot):
   - `output_path`: the directory where downloaded files are saved. Use an ABSOLUTE path.
   - `start_point`: the page where scraping starts. A default is set for you, but you can change it.
   - `username`: your LinkedIn account. It is safer to register a separate test account in case LinkedIn blocks it (with Selenium that seems unlikely, but just in case).
   - `password`: your LinkedIn password.
   - `search_depth`: how deep to search. The crawl uses DFS and stops at this depth; you can also stop the program manually.
3. Run the program:

   ```shell
   $ python sharesilde_crawler.py
   ```
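For reference, the parameters from step 2 are plain variables near the top of the script. A minimal sketch of what the settings block might look like (variable names taken from the list above; every value here is a placeholder to replace with your own):

```python
# Scraper settings -- all values below are placeholders.
output_path = "/home/you/slideshare_downloads"  # ABSOLUTE path for saved files
start_point = "https://www.slideshare.net/"     # page where the crawl starts
username = "test-account@example.com"           # a throwaway LinkedIn account
password = "your-password"
search_depth = 3                                # DFS stops past this depth
```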

LinkedIn limits the number of downloads per account in each 24-hour window, so you may need several test accounts.
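The depth-limited DFS behind `search_depth` can be sketched as below. `get_links` and `download` are hypothetical stand-ins for the Selenium page-parsing and file-saving steps in the real script:

```python
def crawl(url, depth, search_depth, visited, get_links, download):
    """Depth-first crawl: download a page, then recurse into its links
    until search_depth is exceeded or the page was already visited."""
    if depth > search_depth or url in visited:
        return
    visited.add(url)
    download(url)
    for link in get_links(url):
        crawl(link, depth + 1, search_depth, visited, get_links, download)

# Tiny in-memory demo: a -> b -> c -> d, crawled to depth 2.
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
downloaded = []
crawl("a", 0, search_depth=2, visited=set(),
      get_links=graph.__getitem__, download=downloaded.append)
print(downloaded)  # → ['a', 'b', 'c']
```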

## Results

*(screenshot: results)*

The scraper automatically downloads files (résumés) into your output directory.

## ToDoList

- Headless mode doesn't work yet.
- Maybe add multiprocessing.
- Detect duplicate downloaded files.
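For the duplicate-detection item, one straightforward approach (not part of the current script) is to group files in `output_path` by a content hash; `find_duplicates` below is a hypothetical helper, not an existing function in this repo:

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=65536):
    """SHA-256 of a file, read in chunks so large decks stay out of memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(output_path):
    """Return groups of file names under output_path with identical content."""
    by_hash = {}
    for p in sorted(Path(output_path).iterdir()):
        if p.is_file():
            by_hash.setdefault(file_digest(p), []).append(p.name)
    return [names for names in by_hash.values() if len(names) > 1]
```

Hashing by content rather than comparing file names catches the common case where the same deck is downloaded twice under different names.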