2013-04-14

I am trying to crawl some URLs with Nutch 2.1 as follows. The crawl reports no errors, but it produces no results at all:

bin/nutch crawl urls -dir crawl -depth 3 -topN 5 

http://wiki.apache.org/nutch/NutchTutorial

There are no errors, but the following folders are never created:

crawl/crawldb 
crawl/linkdb 
crawl/segments 

Can anyone help? I have been stuck on this for two days. Thanks.

The output is as follows:

FetcherJob: threads: 10 
FetcherJob: parsing: false 
FetcherJob: resuming: false 
FetcherJob : timelimit set for : -1 
Using queue mode : byHost 
Fetcher: threads: 10 
QueueFeeder finished: total 0 records. Hit by time limit :0 
-finishing thread FetcherThread1, activeThreads=0 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold sequence: 5 
-finishing thread FetcherThread2, activeThreads=7 
-finishing thread FetcherThread3, activeThreads=6 
-finishing thread FetcherThread4, activeThreads=5 
-finishing thread FetcherThread5, activeThreads=4 
-finishing thread FetcherThread6, activeThreads=3 
-finishing thread FetcherThread7, activeThreads=2 
-finishing thread FetcherThread0, activeThreads=1 
-finishing thread FetcherThread8, activeThreads=0 
-finishing thread FetcherThread9, activeThreads=0 
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues 
-activeThreads=0 
ParserJob: resuming: false 
ParserJob: forced reparse: false 
ParserJob: parsing all 
FetcherJob: threads: 10 
FetcherJob: parsing: false 
FetcherJob: resuming: false 
FetcherJob : timelimit set for : -1 
Using queue mode : byHost 
Fetcher: threads: 10 
QueueFeeder finished: total 0 records. Hit by time limit :0 
-finishing thread FetcherThread1, activeThreads=0 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold sequence: 5 
-finishing thread FetcherThread2, activeThreads=7 
-finishing thread FetcherThread3, activeThreads=6 
-finishing thread FetcherThread4, activeThreads=5 
-finishing thread FetcherThread5, activeThreads=4 
-finishing thread FetcherThread6, activeThreads=3 
-finishing thread FetcherThread7, activeThreads=2 
-finishing thread FetcherThread0, activeThreads=1 
-finishing thread FetcherThread8, activeThreads=0 
-finishing thread FetcherThread9, activeThreads=0 
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues 
-activeThreads=0 
ParserJob: resuming: false 
ParserJob: forced reparse: false 
ParserJob: parsing all 
FetcherJob: threads: 10 
FetcherJob: parsing: false 
FetcherJob: resuming: false 
FetcherJob : timelimit set for : -1 
Using queue mode : byHost 
Fetcher: threads: 10 
QueueFeeder finished: total 0 records. Hit by time limit :0 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold sequence: 5 
-finishing thread FetcherThread9, activeThreads=9 
-finishing thread FetcherThread0, activeThreads=8 
-finishing thread FetcherThread1, activeThreads=7 
-finishing thread FetcherThread2, activeThreads=6 
-finishing thread FetcherThread3, activeThreads=5 
-finishing thread FetcherThread4, activeThreads=4 
-finishing thread FetcherThread5, activeThreads=3 
-finishing thread FetcherThread6, activeThreads=2 
-finishing thread FetcherThread7, activeThreads=1 
-finishing thread FetcherThread8, activeThreads=0 
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0.0 pages/s, 0 0 kb/s, 0 URLs in 0 queues 
-activeThreads=0 
ParserJob: resuming: false 
ParserJob: forced reparse: false 
ParserJob: parsing all 

runtime/local/conf/nutch-site.xml

<?xml version="1.0"?> 
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 

<!-- Put site-specific property overrides in this file. --> 


<configuration> 
<property> 
<name>http.agent.name</name> 
<value>My Nutch Spider</value> 
</property> 

<property> 
<name>storage.data.store.class</name> 
<value>org.apache.gora.hbase.store.HBaseStore</value> 
<description>Default class for storing data</description> 
</property> 
<property> 
    <name>http.robots.agents</name> 
    <value>My Nutch Spider</value> 
    <description>The agent strings we'll look for in robots.txt files, 
    comma-separated, in decreasing order of precedence. You should 
    put the value of http.agent.name as the first agent name, and keep the 
    default * at the end of the list. E.g.: BlurflDev,Blurfl,* 
    </description> 
</property> 
<property> 
    <name>http.content.limit</name> 
    <value>262144</value> 
</property> 
</configuration> 

runtime/local/conf/regex-urlfilter.txt

# accept anything else 
+. 

runtime/local/urls/seed.txt

http://nutch.apache.org/ 

Answer


Since you are using Nutch 2.X, follow the tutorial for that version; the one you linked is for Nutch 1.x. Nutch 2.X uses an external storage backend such as HBase or Cassandra, so the crawldb, linkdb, and segments directories are never created.
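Because the crawl data lives in the storage backend rather than under crawl/, you can verify the crawl there instead. A rough sketch, assuming the default HBaseStore table name (commonly `webpage`; verify with `list` in the HBase shell, since the name can be prefixed by a crawl ID):

```shell
# List a few fetched rows directly in HBase (table name is an assumption;
# run `list` in the hbase shell first to confirm it).
echo "scan 'webpage', {LIMIT => 3}" | hbase shell

# Alternatively, use Nutch's own web-table reader; exact flags vary
# across 2.x releases, so check `bin/nutch readdb` usage output.
bin/nutch readdb -stats
```

If these show zero rows, the injection step itself failed, which points at the seed list or URL filters rather than the fetcher.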

Also, use the bin/crawl script instead of the bin/nutch command.
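As a minimal sketch of what that looks like (the argument list differs between Nutch 2.x releases, and some variants also take a Solr URL, so run the script with no arguments to see its usage message; the crawl ID `testCrawl` here is just an example name):

```shell
# Nutch 2.x: run the full inject/generate/fetch/parse/update cycle
# over the seeds in urls/, tagged with crawl ID "testCrawl", for 3 rounds.
bin/crawl urls testCrawl 3
```

The script simply chains the individual `bin/nutch inject/generate/fetch/parse/updatedb` steps, so reading it is also a good way to learn the underlying commands.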


Thanks for the reply. I figured it out! –


Thank you. I have been trying to find a Nutch 2.X tutorial for a long time. The only one I could find is http://nlp.solutions.asia/?p=180, and there is nothing on the Nutch wiki. Your reply helps. I also never realized that bin/crawl should be used instead of bin/nutch. I had been using Nutch 1.X before and recently started with 2.X, which has given me a lot of trouble; one clear mistake I see from your post is my use of bin/nutch. Where can I find more information on using the bin/crawl script? – sunskin


@user1830069 It is a simple script; you can understand it just by reading through it. –
