How do I split a scanned page into individual words with Python?

Is there a way to cut a page of text into multiple images, one word per image? That is, if I scan a page with 'n' words, the script should produce 'n' separate images.

(Using Python)

You should look into blob detection.


2011-03-09 Arackna

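This assumes you can't use OCR (because your text isn't machine-readable or something), and I'm not very familiar with this area, but I would (perhaps naively) try something like this:

1. Load the image data into memory.
2. Split the pixel data into the rows of the image and find each "row" that has only white pixels all the way across.
3. For each band between two "white rows", try to find the white gaps in the columns.
4. Take all your new x, y coordinates and crop the image.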
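The steps above can be sketched on a tiny in-memory grid before any real image is involved. This is only an illustration under stated assumptions: the page is already binarized into a list of lists where 0 is white and 1 is ink, and the names `ink_runs`, `word_boxes` and `min_gap` are hypothetical, not taken from the answer.

```python
def ink_runs(is_blank, min_gap=1):
    """Return (start, stop) pairs (stop exclusive) for runs of non-blank entries.

    Blank stretches shorter than min_gap do not end a run, so the narrow
    gaps between letters do not split a word.
    """
    runs = []
    start = None  # index where the current run began, or None
    gap = 0       # number of blank entries seen since the last ink entry
    for i, blank in enumerate(is_blank):
        if not blank:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(is_blank) - gap))
    return runs


def word_boxes(grid, min_gap=2):
    """Return (top, bottom, left, right) boxes with exclusive bottom/right."""
    boxes = []
    blank_rows = [not any(row) for row in grid]
    for top, bottom in ink_runs(blank_rows):
        band = grid[top:bottom]  # one line of text
        blank_cols = [not any(row[x] for row in band)
                      for x in range(len(grid[0]))]
        for left, right in ink_runs(blank_cols, min_gap):
            boxes.append((top, bottom, left, right))
    return boxes


page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0, 0, 0],
]
boxes = word_boxes(page)
```

Each box can then be used to crop one word out of the original image. A suitable `min_gap` depends on the font and scan resolution, so treat the value here as a placeholder.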

Actually, this sounded like a fun exercise, so I gave it a go with the pyPNG module :)

import sys

import png  # pyPNG

# Minimum width (in pixels) a white gap must have to count as a word break.
KERNING = 3


def find_rows(pixels, width, height):
    """Return the y coordinate where each run of purely white rows starts."""
    white_rows = []
    is_white = False
    for y in range(height):
        # Sum the R, G and B values of every pixel in the row (alpha is skipped).
        row_sum = sum(
            sum(pixels[(y * 4 * width) + x * 4 + p] for p in range(3))
            for x in range(width)
        )
        if row_sum >= width * 3 * 254:
            if not is_white:
                white_rows.append(y)
            is_white = True
        else:
            is_white = False
    return white_rows


def find_words_in_image(blob, tolerance=30):
    """Write each word found in the PNG data to wordN.png; return the count.

    Note: `tolerance` is not used by the current implementation.
    """
    n = 0
    r = png.Reader(bytes=blob)
    width, height, pixel_rows, meta = r.asRGBA8()
    # Flatten the rows into one list of R, G, B, A values.
    pixels = []
    for row in pixel_rows:
        pixels.extend(row)
    # find each horizontal line
    white_rows = find_rows(pixels, width, height)
    # for each line try to find a white vertical gap
    for i, y in enumerate(white_rows):
        if i >= len(white_rows) - 1:
            continue  # no following white row to close this band
        y2 = white_rows[i + 1]
        height_of_row = y2 - y
        is_white = False
        white_cols = []
        last_black = -100
        for x in range(width - 4):
            s = y * 4 * width + x * 4
            col_sum = sum(
                pixels[s + y3 * 4 * width]
                + pixels[s + y3 * 4 * width + 1]
                + pixels[s + y3 * 4 * width + 2]
                for y3 in range(height_of_row)
            )
            if col_sum >= height_of_row * 3 * 240:
                if not is_white:
                    # Skip white gaps narrower than KERNING columns
                    # (the spacing between letters).
                    if len(white_cols) > 0 and x - last_black < KERNING:
                        continue
                    white_cols.append(x)
                is_white = True
            else:
                is_white = False
                last_black = x
        # now we have a list of x, y coordinates for all the words on this row
        for j, x in enumerate(white_cols):
            if j >= len(white_cols) - 1:
                continue
            wordpx = []
            new_width = white_cols[j + 1] - x
            new_height = y2 - y
            x_offset = x * 4
            for h in range(new_height):
                y_offset = (y + h) * 4 * width
                start = x_offset + y_offset
                wordpx.append(pixels[start:start + (new_width * 4)])
            n += 1
            # PNG data is binary, so the file must be opened in 'wb' mode.
            with open('word%s.png' % n, 'wb') as f:
                w = png.Writer(
                    width=new_width,
                    height=new_height,
                    greyscale=False,
                    alpha=True,
                )
                w.write(f, wordpx)
    return n


if __name__ == "__main__":
    #
    # USAGE: python png2words.py yourpic.png
    #
    # OUTPUT: [word1.png...word2.png...wordN.png]
    #
    with open(sys.argv[1], 'rb') as image_file:
        n = find_words_in_image(image_file.read())
    print("found %s words" % n)
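The offset arithmetic in the cropping loop is easier to check in isolation. The sketch below assumes the same layout as the code above (one flat list holding R, G, B, A values row by row); `crop_rgba` is a hypothetical helper name used only for illustration.

```python
def crop_rgba(pixels, width, left, top, new_width, new_height):
    """Cut a rectangle out of a flat RGBA buffer; returns a list of rows."""
    rows = []
    for h in range(new_height):
        # Pixel (x, y) starts at index (y * width + x) * 4 in the flat list.
        start = ((top + h) * width + left) * 4
        rows.append(pixels[start:start + new_width * 4])
    return rows


# A 3x2 test image in which pixel i has the colour (i, i, i, 255).
pixels = []
for i in range(6):
    pixels += [i, i, i, 255]

cropped = crop_rgba(pixels, width=3, left=1, top=0, new_width=2, new_height=2)
```

Here `cropped` holds the right-hand two columns of both rows, in the row-of-flat-values shape that `png.Writer.write` expects.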


2011-03-09 22:31:16

Wow, neat and readable. Well-commented code, thank you very much. This is *exactly* the code I was looking for. I need to learn more and try to write code like this. Thanks again!! :) – Arackna

Infinity: please mark the answer as accepted (the check mark below the up/down arrows). It will give you 2 rep, and it lets others know that it worked. – Wilduck

Wilduck: Done!! :) – Arackna

Blob detection is an image processing technique. Also, this question is not really specific to Python, but searching for Python blob detection libraries may help.
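To give a rough idea of what blob detection does, here is a minimal pure-Python sketch of connected-component labelling, one common way to find blobs. It assumes a binarized grid where 1 is ink and 0 is background; `find_blobs` is a hypothetical name, and a real project would more likely rely on an image processing library.

```python
from collections import deque


def find_blobs(grid):
    """Return one (top, bottom, left, right) box per 8-connected blob.

    bottom and right are exclusive, so a box can be used directly as slices.
    """
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not grid[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, bottom, left, right = y, y, x, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, bottom + 1, left, right + 1))
    return boxes


grid = [
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
blobs = find_blobs(grid)
```

On scanned text, each blob is usually a single letter or part of one, so the boxes would still need to be merged with their close horizontal neighbours to get whole words.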


2011-03-09 19:29:17 cmaynard

Thanks! – Arackna
