2013-12-11

Can't extract data from an embedded PDF (Ruby)

I'm trying to extract text from a PDF embedded in a web page. I tried the pdf-reader gem, but it fails with a parse error:

`find_first_xref_offset': PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:99:in `load_offsets' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/xref.rb:60:in `initialize' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `new' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader/object_hash.rb:44:in `initialize' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `new' 
from /opt/boxen/rbenv/versions/2.0.0-p247/lib/ruby/gems/2.0.0/gems/pdf-reader-1.3.3/lib/pdf/reader.rb:117:in `initialize' 
from role.rb:5:in `new' 
from role.rb:5:in `<main>' 

this is the file

Does anyone know how I can solve this? Is there a better gem for this purpose?

Thanks.
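Since the error complains about a missing EOF marker, one way to narrow this down is to look at the raw bytes before handing them to the parser. The following is a minimal sketch using only the standard library. The helper name `pdf_sanity_report` and the 1024-byte tail window are assumptions made for illustration and are not part of pdf-reader. The header check is deliberately strict, so a file with leading junk before `%PDF-` would be flagged too.

```ruby
# Hypothetical helper (not part of pdf-reader): inspects raw bytes and
# reports whether they look like a complete PDF or like an HTML page.
TAIL_WINDOW = 1024 # assumed size of the trailing window searched for %%EOF

def pdf_sanity_report(bytes)
  # Work on a binary copy so multi-byte characters cannot confuse the checks.
  data = bytes.dup.force_encoding(Encoding::BINARY)
  tail_start = [data.bytesize - TAIL_WINDOW, 0].max
  tail = data.byteslice(tail_start, TAIL_WINDOW) || ''
  head = data.byteslice(0, 512) || ''

  {
    size: data.bytesize,
    has_header: data.start_with?('%PDF-'),         # strict: header at byte 0
    has_eof_marker: tail.include?('%%EOF'),        # marker near end of data
    looks_like_html: !!(head =~ /<html|<!doctype/i) # server sent a web page?
  }
end
```

If `looks_like_html` comes back true, or `has_eof_marker` is false, the bytes that reached the script are probably not the same bytes a browser saves, for example because the URL returned a wrapper page or the download was cut short. To try it, the body fetched with open-uri can be read into a string and passed to the helper.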

Answers


I found this while searching Google for the problem. Is there anything in it you can use to solve your problem?

#################################################################
# Extract text from a PDF file
# This scraper takes about 2 minutes to run and no output
# appears until the end.
#################################################################
# This scraper uses the pdf-reader gem.
# Documentation is at https://github.com/yob/pdf-reader#readme
# If you have problems you can ask for help at http://groups.google.com/group/pdf-reader
require 'pdf-reader'
require 'open-uri'

########## This section contains the callback code that processes the PDF file contents ######
class PageTextReceiver
  attr_accessor :content, :page_counter

  def initialize
    @content = []
    @page_counter = 0
  end

  # Called when page parsing starts
  def begin_page(arg = nil)
    @page_counter += 1
    @content << ""
  end

  # record text that is drawn on the page
  def show_text(string, *params)
    @content.last << string
  end

  # there's a few text callbacks, so make sure we process them all
  alias :super_show_text :show_text
  alias :move_to_next_line_and_show_text :show_text
  alias :set_spacing_next_line_show_text :show_text

  # this final text callback takes slightly different arguments
  def show_text_with_positioning(*params)
    params = params.first
    params.each { |str| show_text(str) if str.kind_of?(String) }
  end
end
################ End of TextReceiver #############################

# If you don't have two minutes to wait you might prefer this
# smaller pdf
# pdf = open('http://www.hmrc.gov.uk/factsheets/import-export.pdf')
# pdf = open('http://www.madingley.org/uploaded/Hansard_08.07.2010.pdf')
pdf = open('http://dl.dropbox.com/u/6928078/CLEI_2008_002.pdf')

####### Instantiate the receiver and the reader
receiver = PageTextReceiver.new
pdf_reader = PDF::Reader.new
####### Now you just need to make the call to parse...
pdf_reader.parse(pdf, receiver)
####### ...and do whatever you want with the text.
####### This just outputs it.
receiver.content.each { |r| puts r.strip }

I still have the same problem. I tried accessing the file directly from the URL. I was able to download the PDF and read it locally. [This is the file](http://www.tesoreria.cl/portal/portlets/imprimirAR/printAR.do?rutrol=32807514010&t=C&formulario=30&folio=3287514413&vcto=2013-11-30) – felipecamposclarke
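Because the comment above reports that a downloaded copy reads fine locally, one possible workaround is to reproduce that step in the script: copy the fetched stream into a binary-mode temporary file and give the parser a local path. This is only a sketch under that assumption. `save_to_tempfile` is a hypothetical helper, and the pdf-reader call appears only in a comment because it depends on the gem version installed.

```ruby
require 'tempfile'

# Hypothetical helper: copies any readable IO (for example the object
# returned by open-uri) into a binary-mode Tempfile and returns it.
def save_to_tempfile(io, basename = 'download')
  file = Tempfile.new([basename, '.pdf'])
  file.binmode         # avoid newline or encoding translation of PDF bytes
  file.write(io.read)  # reads the whole body into memory, fine for small files
  file.flush           # make sure the bytes are on disk before reopening
  file.rewind
  file
end

# Assumed usage with pdf-reader (not executed here):
#   local = save_to_tempfile(remote_io)
#   reader = PDF::Reader.new(local.path)
```

If the temporary file still fails to parse while the browser-saved copy works, comparing the two files byte for byte should show whether the server sends the script something different from what it sends the browser.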
