I am trying to crawl some URLs with Nutch 2.1 as follows. The crawl reports no errors, but it produces no results:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
I followed http://wiki.apache.org/nutch/NutchTutorial. There are no errors, but the following folders are never created:

crawl/crawldb
crawl/linkdb
crawl/segments
Can anyone help? I have not been able to solve this problem for two days. Thanks.
The output is as follows:
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
-finishing thread FetcherThread1, activeThreads=0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread3, activeThreads=6
-finishing thread FetcherThread4, activeThreads=5
-finishing thread FetcherThread5, activeThreads=4
-finishing thread FetcherThread6, activeThreads=3
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread0, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
-finishing thread FetcherThread9, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : -1
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 0 records. Hit by time limit :0
Fetcher: throughput threshold: -1
Fetcher: throughput threshold sequence: 5
-finishing thread FetcherThread9, activeThreads=9
-finishing thread FetcherThread0, activeThreads=8
-finishing thread FetcherThread1, activeThreads=7
-finishing thread FetcherThread2, activeThreads=6
-finishing thread FetcherThread3, activeThreads=5
-finishing thread FetcherThread4, activeThreads=4
-finishing thread FetcherThread5, activeThreads=3
-finishing thread FetcherThread6, activeThreads=2
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: parsing all
runtime/local/conf/nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.robots.agents</name>
<value>My Nutch Spider</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.content.limit</name>
<value>262144</value>
</property>
</configuration>
runtime/local/conf/regex-urlfilter.txt:
# accept anything else
+.
runtime/local/urls/seed.txt:
http://nutch.apache.org/
Thanks for the reply. I figured it out! –
Thank you. I have been trying to find a Nutch 2.X tutorial for a long time. All I could find is http://nlp.solutions.asia/?p=180, and it is not on the Nutch wiki site. Your reply helps. I also never realized that bin/crawl should be used instead of bin/nutch. I had used Nutch 1.X before and recently started with 2.X, which has given me a lot of trouble. One obvious mistake I recognized from your post is the use of bin/nutch. Where can I find more information on using the bin/crawl script? – sunskin
@user1830069 It's a simple script. You can understand it just by reading through it. –
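For anyone hitting the same issue, a minimal sketch of the bin/crawl invocation discussed above. The argument order follows the usage message printed by the 2.x script (in some 2.x versions the Solr URL argument is required rather than optional); the seed directory and crawl ID names here are illustrative, not from the original post:

```shell
# Nutch 2.x: bin/crawl drives the full inject -> generate -> fetch -> parse -> updatedb
# cycle, which bin/nutch alone does not.
# Usage (from the script's help text): bin/crawl <seedDir> <crawlID> [<solrURL>] <numberOfRounds>
bin/crawl urls myCrawl 3    # seed directory "urls", crawl ID "myCrawl", 3 rounds
```

If "QueueFeeder finished: total 0 records" still appears, it usually means no URLs were injected, so it is worth checking that the seed directory path passed as the first argument actually contains seed.txt.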