Tried running it with Python 3.7 and then Python 2.7; neither works #5
Please first try running docid.py and vl5x.py from https://github.com/zc3945/caipanwenshu/tree/master/wenshu/wenshu/utils. If either one raises an error, work from the error message to resolve it.
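A minimal driver for that suggestion (not part of the repo; it assumes both scripts can be executed directly from the utils directory) that runs the two scripts one at a time so each traceback shows up on its own:

```python
# Sketch: run both utility scripts in sequence and let any traceback
# print to the console. Assumes you are in wenshu/wenshu/utils.
import subprocess
import sys

for script in ("docid.py", "vl5x.py"):
    print("== running %s ==" % script)
    subprocess.call([sys.executable, script])  # non-zero exit means it failed
```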
vl5x.py runs normally, but running docid.py raises the following error:
I crawl with Selenium instead, entering from the list page. It's simpler, but the difference in throughput is enormous... A rough sketch of that approach is below.
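A rough sketch of the Selenium route, under heavy assumptions: the URL is the one from the log later in this thread, but the CSS selector is hypothetical since the site's markup isn't shown here. The idea is to let a real browser execute the site's JavaScript instead of reimplementing docid/vl5x in Python:

```python
# Sketch: drive a real browser through the list page and pull document
# links out of the rendered DOM. "a.caseName" is a placeholder selector.
from selenium import webdriver

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # wait for the JS-rendered result list
driver.get("http://wenshu.court.gov.cn/List/List?sorttype=1")
for link in driver.find_elements_by_css_selector("a.caseName")[:10]:
    print(link.get_attribute("href"))
driver.quit()
```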
Could I ask how to get past the site's IP detection? The crawling part is finished, but by now I've triggered the detection mechanism and it throws up a captcha. I tried one IP proxy provider and that didn't work either; the site even warned that my MAC address had been recorded as an invalid user.
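One common answer, sketched under the assumption that you have a pool of working proxies (the PROXIES entries below are placeholders): rotate the proxy per request through a downloader middleware. Scrapy's built-in HttpProxyMiddleware, visible in the enabled-middleware list in the log below, honors request.meta['proxy']:

```python
# Sketch: per-request proxy rotation. Register the class in the project's
# DOWNLOADER_MIDDLEWARES setting so it runs before HttpProxyMiddleware.
import random

PROXIES = [
    "http://user:pass@1.2.3.4:8080",  # hypothetical pool entries
    "http://user:pass@5.6.7.8:8080",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # HttpProxyMiddleware picks this value up downstream
        request.meta["proxy"] = random.choice(PROXIES)
```

This spreads requests across addresses; it doesn't by itself defeat captchas already being served to a flagged IP.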
Has the site updated its anti-crawling strategy again? The error is as follows:
```
2018-12-26 17:13:54 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: wenshu)
2018-12-26 17:13:54 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0f 25 May 2017), cryptography 2.1, Platform Linux-3.10.0-514.26.2.el7.x86_64-x86_64-with-centos-7.2.1511-Core
2018-12-26 17:13:54 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'wenshu.spiders', 'ROBOTSTXT_OBEY': True, 'CONCURRENT_REQUESTS': 8, 'SPIDER_MODULES': ['wenshu.spiders'], 'BOT_NAME': 'wenshu', 'DOWNLOAD_DELAY': 3}
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-12-26 17:13:54 [scrapy.middleware] INFO: Enabled item pipelines: ['wenshu.pipelines.WenshuPipeline']
2018-12-26 17:13:54 [scrapy.core.engine] INFO: Spider opened
2018-12-26 17:13:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-12-26 17:13:54 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-26 17:13:58 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://wenshu.court.gov.cn/robots.txt> (referer: None)
2018-12-26 17:14:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://wenshu.court.gov.cn/List/List?sorttype=1&conditions=searchWord+1+AJLX++%E6%A1%88%E4%BB%B6%E7%B1%BB%E5%9E%8B:%E5%88%91%E4%BA%8B%E6%A1%88%E4%BB%B6> (referer: None)
2018-12-26 17:14:08 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/List?sorttype=1&conditions=searchWord+1+AJLX++%E6%A1%88%E4%BB%B6%E7%B1%BB%E5%9E%8B:%E5%88%91%E4%BA%8B%E6%A1%88%E4%BB%B6)
2018-12-26 17:14:08 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:15 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:17 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/TreeContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:20 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
2018-12-26 17:14:20 [scrapy.core.scraper] ERROR: Spider error processing <POST http://wenshu.court.gov.cn/List/ListContent> (referer: http://wenshu.court.gov.cn/List/TreeContent)
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 30, in process_spider_output
    for x in result:
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib64/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/www/wenshu/wenshu/spiders/doc.py", line 95, in get_doc_list
    key = getkey(format_key_str).encode('utf-8')
  File "/www/wenshu/wenshu/utils/docid_v27.py", line 105, in getkey
    c = execjs.compile(js_str)
  File "/usr/lib/python2.7/site-packages/execjs/__init__.py", line 61, in compile
    return get().compile(source, cwd)
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 21, in get
    return get_from_environment() or _find_available_runtime()
  File "/usr/lib/python2.7/site-packages/execjs/_runtimes.py", line 49, in _find_available_runtime
    raise exceptions.RuntimeUnavailableError("Could not find an available JavaScript runtime.")
RuntimeUnavailableError: Could not find an available JavaScript runtime.
```
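This failure is not in the spider itself: PyExecJS cannot find any JavaScript interpreter on the machine (the log shows a bare CentOS 7 host). Installing Node.js, or pointing the EXECJS_RUNTIME environment variable at a runtime you have installed, is the usual fix; the traceback's get_from_environment() call shows that variable being consulted. A small check, assuming nothing beyond PyExecJS itself:

```python
# Sketch: verify PyExecJS can see a JS runtime. RuntimeUnavailableError
# here reproduces the crash in the log above.
import execjs

try:
    runtime = execjs.get()       # the same lookup that fails in the traceback
    print("Using runtime:", runtime.name)
    print(execjs.eval("1 + 2"))  # should print 3
except execjs.RuntimeUnavailableError:
    print("No JS runtime found: install Node.js or set EXECJS_RUNTIME")
```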