Common Crawl: a web crawl dataset / corpus

http://commoncrawl.org/big-picture/what-we-do/
A petabyte-scale web crawl dataset, commonly used for training word embeddings.

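A URL lookup system built on big data: given a domain name, it returns the URLs under that domain that have been indexed by search engines. Responds in milliseconds -> https://url.fht.im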

The code is here -> https://github.com/imfht/super-Django-CC (fewer than ten lines of effective code).
Data source -> http://commoncrawl.org
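
As a rough illustration of the domain-to-URL lookup described above, here is a minimal sketch that queries Common Crawl's public CDX index server rather than the linked project's own backend. It is not the code from the repository. The collection name `CC-MAIN-2023-50` and the helper names `build_query_url`, `parse_cdx_lines`, and `fetch_urls` are assumptions chosen for this example; the index server is assumed to return one JSON object per line when `output=json` is requested.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Common Crawl's public CDX index endpoint, one per crawl collection.
CDX_ENDPOINT = "https://index.commoncrawl.org/{collection}-index"


def build_query_url(domain, collection="CC-MAIN-2023-50"):
    """Build an index query for every captured URL under `domain`.

    The default collection name is only an example; substitute a current crawl.
    """
    params = {"url": f"{domain}/*", "output": "json"}
    return CDX_ENDPOINT.format(collection=collection) + "?" + urlencode(params)


def parse_cdx_lines(body):
    """Extract unique URLs, in first-seen order, from a JSON-lines response body."""
    seen = set()
    urls = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        url = json.loads(line).get("url")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def fetch_urls(domain, collection="CC-MAIN-2023-50"):
    """Query the index over the network and return the captured URLs."""
    with urlopen(build_query_url(domain, collection), timeout=30) as resp:
        return parse_cdx_lines(resp.read().decode("utf-8"))
```

Calling `fetch_urls("example.com")` performs a live HTTP request, so it will be far slower than the millisecond responses mentioned above; a service with that latency presumably keeps its own local copy of the index.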