
Tried running it with Python 3.7 and then Python 2.7; neither works #5

Open
CoderRobin1992 opened this issue Dec 26, 2018 · 4 comments

Comments

@CoderRobin1992

Has the site updated its anti-scraping measures again?
The error output is as follows:
2018-12-26 17:13:54 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: wenshu)
2018-12-26 17:13:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0f 25 May 2017), cryptography 2.1, Platform Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.2.1511-Core
2018-12-26 17:13:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wenshu.spiders', 'ROBOTSTXT_OBEY': True, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['wenshu.spiders'], 'BOT_NAME': 'wenshu', 'DOWNLOAD_DELAY': 3}
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled item pipelines: ['wenshu.pipelines.WenshuPipeline']
2018-12-26 17:13:54 [scrapy.core.engine] INFO: Spider opened
2018-12-26 17:13:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-26 17:13:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-26 17:13:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://wenshu.court.gov.cn/robots.txt> (referer: None)
2018-12-26 17:14:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://wenshu.court.gov.cn/List/List?sorttype=1&conditions=searchWord+1+AJLX++%E6%A1%88%E4%BB%B6%E7%B1%BB%E5%9E%8B:%E5%88%91%E4%BA%8B%E6%A1%88%E4%BB%B6> (referer: None)
2018-12-26 17:14:08 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/List?sorttype=1&conditions=searchWord+1+AJLX++%E6%A1%88%E4%BB%B6%E7%B1%BB%E5%9E%8B:%E5%88%91%E4%BA%8B%E6%A1%88%E4%BB%B6)
2018-12-26 17:14:08 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:15 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:17 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:20 [scrapy.core.scraper] ERROR: Spider error processing <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/www/wenshu/wenshu/spiders/doc.py", line 95, in get_doc_list
    key = getkey(format_key_str).encode('utf-8')
  File "/www/wenshu/wenshu/utils/docid_v27.py", line 105, in getkey
    c = execjs.compile(js_str)
  File "/usr/lib/python2.7/site-packages/execjs/__init__.py", line 61, in compile
    return get().compile(source, cwd)
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 21, in get
    return get_from_environment() or _find_available_runtime()
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
    raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
RuntimeUnavailableError: Could not find an available JavaScript runtime.
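
The final RuntimeUnavailableError comes from PyExecJS rather than from the spider itself: execjs found no JavaScript interpreter on the machine. A minimal sanity check, assuming Node.js (or another runtime PyExecJS supports) has been installed and is on PATH:

```python
# Minimal check that PyExecJS can locate a JavaScript runtime.
# Assumes Node.js is installed, e.g. via the distro package manager on this CentOS 7 host.
import execjs

print(execjs.get().name)  # e.g. "Node.js (V8)" once a runtime is present

ctx = execjs.compile("function add(a, b) { return a + b; }")
print(ctx.call("add", 1, 2))  # prints 3 if the runtime works end to end
```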

@zc3945 (Owner) commented Dec 26, 2018

Please first try running docid.py and vl5x.py from https://github.com/zc3945/caipanwenshu/tree/master/wenshu/wenshu/utils. If they raise errors, fix them according to the error messages.

@CoderRobin1992 (Author)

vl5x.py runs fine; docid.py fails with the following error:
Traceback (most recent call last):
  File "docid.py", line 115, in <module>
    key = getkey(RunEval).encode('utf-8')
  File "docid.py", line 104, in getkey
    js_str = unzip(str1).replace('_[_][_](', 'return ')[:-4]
  File "docid.py", line 100, in unzip
    return btou(get_js(fromBase64(str1)))
  File "docid.py", line 94, in get_js
    eval_js = execjs.compile(js_data)
  File "/usr/lib/python2.7/site-packages/execjs/__init__.py", line 61, in compile
    return get().compile(source, cwd)
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 21, in get
    return get_from_environment() or _find_available_runtime()
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
    raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
execjs._exceptions.RuntimeUnavailableError: Could not find an available JavaScript runtime.
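
The same root cause as before: the frame get_from_environment() or _find_available_runtime() shows that execjs resolves its runtime either from the EXECJS_RUNTIME environment variable or by scanning PATH. A sketch of pinning the runtime explicitly, assuming Node.js is installed:

```python
# Pin PyExecJS to a specific runtime via EXECJS_RUNTIME; set it before
# execjs resolves a runtime. "Node" is one of PyExecJS's known runtime names.
import os
os.environ["EXECJS_RUNTIME"] = "Node"

import execjs
print(execjs.get().name)  # fails loudly if Node.js itself is missing from PATH
```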

@CoderRobin1992 (Author)

Scraping with Selenium instead, going in through the list page, is somewhat simpler, but the difference in efficiency is enormous...
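
A minimal sketch of that Selenium approach, assuming chromedriver is on PATH; the CSS selector is a placeholder, not the site's real markup:

```python
# Hedged sketch of driving the list page with Selenium (2018-era API).
# ".case-item a" is illustrative, not wenshu.court.gov.cn's actual DOM.
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("http://wenshu.court.gov.cn/List/List?sorttype=1")
for link in driver.find_elements_by_css_selector(".case-item a"):
    print(link.get_attribute("href"))  # collect detail-page URLs from the list
driver.quit()
```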

@CoderRobin1992 (Author)

May I ask how to get past the site's IP detection? The scraping part is finished, but by now I have triggered the detection mechanism and a CAPTCHA pops up. I tried one provider's proxy IPs and that didn't work either; the site even warned that my MAC address had been recorded and flagged me as an invalid user.
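
For reference, the usual first step against IP-based blocking in Scrapy is per-request proxy rotation through a downloader middleware; a sketch with placeholder proxy entries (whether it defeats this site's CAPTCHA is another matter):

```python
# Hedged sketch: rotate outbound proxies per request in Scrapy.
# PROXY_POOL entries are placeholders; register this class in settings.py
# under DOWNLOADER_MIDDLEWARES to enable it.
import random

class RotatingProxyMiddleware(object):
    PROXY_POOL = [
        "http://user:pass@proxy1.example.com:8080",  # placeholder
        "http://user:pass@proxy2.example.com:8080",  # placeholder
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honours request.meta["proxy"]
        request.meta["proxy"] = random.choice(self.PROXY_POOL)
```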
