Twitter_spider for China
🏠 Homepage
👤 h4m5t
- Website: www.h4m5t.top
- Github: @h4m5t
This project introduces several ways for students in China to crawl Tweets for scientific research or course projects.
Implemented and planned features:

✅ Simulate browser actions to crawl user information
✅ Simulate browser actions to crawl tweets
✅ Generate URLs from user IDs
❌ Data import and export (CSV, SQL)
❌ Multi-threaded crawling
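The CSV export marked above as unimplemented could be sketched as follows; the field names and the tweet-dict shape are assumptions, not the project's actual schema:

```python
import csv

def save_tweets_csv(tweets, path="tweets.csv"):
    """Write a list of tweet dicts to a CSV file.

    The field names below are assumptions; adapt them to whatever
    the crawler actually collects.
    """
    fields = ["user_id", "date", "text"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(tweets)
```

The same dict-per-row shape would also map cleanly onto an SQL insert later.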
1. Using a third-party Twitter scraping library

https://github.com/jonbakerfish/TweetScraper

However, from mainland China these libraries all fail with errors like:
WARNING:root:Error retrieving https://twitter.com/: Timeout(ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x0000023F9CCDFF10>, 'Connection to twitter.com timed out. (connect timeout=10)')), retrying
Workarounds:

- Use a commercial VPN such as ExpressVPN or NordVPN (drawback: relatively expensive).
- SOCKS-based proxy tools (SR, SSR, v2ray, etc.) tend to trigger this error when used with twint. SOCKS sits between the application layer and the transport layer of the five-layer TCP/IP model; because it works that high in the stack, connections that take a lower-level path can fail. A VPN, by contrast, works at the network layer (layer 3); with ExpressVPN, twint ran without problems. Also, do not confuse SSR with a VPN: both can bypass the firewall, but they work on different principles, which is why SSR lets you browse normally yet still fails in some scenarios. That is a limitation of how SOCKS works.
Other approaches tried:

- Setting a system-wide global proxy (tried; not workable)
- Setting a proxy in the IDE (tried; not workable)
- Setting twint's proxy config (tried; not workable):
```python
import twint

config = twint.Config()
config.Proxy_host = "127.0.0.1"  # address of the local SOCKS proxy
config.Proxy_port = 7890
config.Proxy_type = "socks5"
```
- Running the script on an overseas VPS (works), for example on:
  - DigitalOcean
  - Vultr
  - Hostwinds
  - Linode
  - BandwagonHost (搬瓦工)
2. Using the Twitter developer API

The Twitter API enables programmatic access to Twitter in unique and advanced ways. Use it to analyze, learn from, and interact with Tweets, Direct Messages, users, and other key Twitter resources.
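With developer access, every request to the API must be authenticated. A minimal sketch using only the standard library to attach a bearer token (the token value is a placeholder; you obtain the real one from your app in the developer portal):

```python
import urllib.request

def build_api_request(url, bearer_token):
    """Return an authenticated urllib Request for a Twitter API endpoint.

    `bearer_token` is the OAuth 2.0 bearer token from the developer portal.
    """
    req = urllib.request.Request(url)
    req.add_header("Authorization", f"Bearer {bearer_token}")
    return req

# Not executed here: pass the returned request to urllib.request.urlopen
# to actually fetch, e.g., the standard search endpoint.
```

The request object can then be sent with `urllib.request.urlopen`, or the same header can be set on a `requests.get` call.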
3. Using general-purpose crawler libraries
- Scrapy
- requests
- urllib
- BeautifulSoup
4. Using an off-the-shelf data collection tool
5. Using Selenium to simulate browser actions

Notes:

- Open Chrome and enter chrome://version in the address bar to check the browser version, then install the matching version of chromedriver.
- When simulating scrolling, pagination, and similar actions, set appropriate delays.
- Twitter applies different restrictions to different IPs: in some regions you must log in to see tweets, in others you do not.
- If you need to import browser data, close Chrome before starting the webdriver so that user_data is not locked by another process.
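The scroll-with-delay note above can be sketched as a helper that keeps scrolling until the page height stops growing; the pause length is a guess to tune for your connection:

```python
import time

def scroll_to_bottom(driver, pause=2.0, max_scrolls=20):
    """Scroll a lazily-loaded timeline until no new content appears.

    `driver` is any Selenium WebDriver. Each scroll is followed by a
    delay so newly loaded tweets have time to render; without the pause,
    the height check runs before the next batch arrives.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the next batch of tweets to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # page stopped growing: assume we reached the end
        last_height = new_height
```

Pass it any `webdriver.Chrome` instance already pointed at a profile or search page.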
Dependencies:

- requests
- selenium
- twint
- csv
- time
- datetime
- urllib
Project files:

- generate_url.py: generates a URL for each user ID and saves them to url.txt
- test*.py: scripts for testing the crawler libraries, proxy settings, and browser automation
- Twitter.csv: information on 100 China-related individuals
- user_info.py: crawls follower and following counts
- user_tweets.py: crawls the tweets of the corresponding users
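A sketch of the URL-generation step that generate_url.py performs, assuming the plain twitter.com/&lt;user_id&gt; profile pattern:

```python
def generate_urls(user_ids):
    """Map Twitter user IDs (screen names) to profile URLs."""
    return [f"https://twitter.com/{uid}" for uid in user_ids]

def save_urls(user_ids, path="url.txt"):
    """Write one URL per line, matching the url.txt format described above."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(generate_urls(user_ids)))
```

The saved file can then be read back line by line by the crawler scripts.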
The best way to build a standard query and test if it’s valid and will return matched Tweets is to first try it at twitter.com/search. As you get a satisfactory result set, the URL loaded in the browser will contain the proper query syntax that can be reused in the standard search API endpoint. Here’s an example:
- We want to search for Tweets referencing @TwitterDev account. First, we run the search on twitter.com/search
- Check and copy the URL loaded. In this case, we got: https://twitter.com/search?q=%40twitterdev
- Replace https://twitter.com/search with https://api.twitter.com/1.1/search/tweets.json and you will get: https://api.twitter.com/1.1/search/tweets.json?q=%40twitterdev
- Run a Twurl command to execute the search.
Please note that the API requires the request to be authenticated (check the Authentication & Authorization documentation for more details). Note that the standard search API only serves data from the last week. If you need historical data older than seven days, check out the premium and enterprise search APIs.
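The URL rewrite described in the steps above is a simple string replacement:

```python
def search_url_to_api(search_url):
    """Turn a twitter.com/search URL into the equivalent standard
    search API endpoint, as described in the steps above."""
    return search_url.replace(
        "https://twitter.com/search",
        "https://api.twitter.com/1.1/search/tweets.json",
        1,
    )
```

For example, the query from the steps above converts to the API URL shown there.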
Operator | Finds Tweets... |
---|---|
watching now | containing both “watching” and “now”. This is the default operator. |
“happy hour” | containing the exact phrase “happy hour”. |
love OR hate | containing either “love” or “hate” (or both). |
beer -root | containing “beer” but not “root”. |
#haiku | containing the hashtag “haiku”. |
from:interior | sent from Twitter account “interior”. |
list:NASA/astronauts-in-space-now | sent from a Twitter account in the NASA list astronauts-in-space-now |
to:NASA | a Tweet authored in reply to Twitter account “NASA”. |
@NASA | mentioning Twitter account “NASA”. |
politics filter:safe | containing “politics” with Tweets marked as potentially sensitive removed. |
puppy filter:media | containing “puppy” and an image or video. |
puppy -filter:retweets | containing “puppy”, filtering out retweets |
puppy filter:native_video | containing “puppy” and an uploaded video, Amplify video, Periscope, or Vine. |
puppy filter:periscope | containing “puppy” and a Periscope video URL. |
puppy filter:vine | containing “puppy” and a Vine. |
puppy filter:images | containing “puppy” and links identified as photos, including third parties such as Instagram. |
puppy filter:twimg | containing “puppy” and a pic.twitter.com link representing one or more photos. |
hilarious filter:links | containing “hilarious” and linking to URL. |
puppy url:amazon | containing “puppy” and a URL with the word “amazon” anywhere within it. |
superhero since:2015-12-21 | containing “superhero” and sent since date “2015-12-21” (year-month-day). |
puppy until:2015-12-21 | containing “puppy” and sent before the date “2015-12-21”. |
movie -scary :) | containing “movie”, but not “scary”, and with a positive attitude. |
flight :( | containing “flight” and with a negative attitude. |
traffic ? | containing “traffic” and asking a question. |
Note: the search query must be URL-encoded (online converters are available). See:
- https://en.wikipedia.org/wiki/Percent-encoding
- https://www.w3schools.com/tags/ref_urlencode.ASP
- https://tool.chinaz.com/tools/urlencode.aspx
- https://meyerweb.com/eric/tools/dencoder/
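As an alternative to an online converter, Python's urllib.parse can do the percent-encoding; a small sketch:

```python
from urllib.parse import urlencode

def build_search_api_url(query):
    """Percent-encode a raw search query and append it to the
    standard search API endpoint."""
    base = "https://api.twitter.com/1.1/search/tweets.json"
    return base + "?" + urlencode({"q": query})
```

`urlencode` handles the `@`, `:`, quotes, and spaces in the operators from the table above.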
Related projects and articles:

- https://github.com/xs71/TwitterSpider
- https://github.com/ALL-AC/tweet-analysis
- https://github.com/ZekangZhouKGR/twitter_bot
- https://github.com/ChangxingJiang/CxSpider
- https://www.4008140202.com/pp/20191117124856_4774_3977640913/news
- How to mine intelligence from Twitter: an introduction to a popular tool - iyouport (@iyouport)
- CxSpider/Twitter_Account_Post.py at master · ChangxingJiang/CxSpider
- Keeping Chrome logged in with Python Selenium (CSDN blog)
- Two ways to keep Selenium's Chrome logged in: options and cookies (Wandouip blog)
- User Data Directory Is Already In Use - Katalon Studio / Web Testing - Katalon Community
- Essie0715/Twitter_Data_Collection: notes on learning Twitter crawling
- masonsxu/Selenium_Crawler: a configurable Selenium crawler for Twitter and The New York Times
Give a ⭐️ if this project helped you!
This README was generated with ❤️ by readme-md-generator