2014-01-06 5 views
0

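Parsing a file without BeautifulSoup

I've been working on this small piece for hours and can't find a solution. It should be simple. This time I'm posting my real code rather than a simplified version, because somehow I can never get the examples to work with my real code.

I'm trying to do this with built-in modules. If you have an answer that uses bs4, I'd like to see that too. It should be a simple thing.

I have an HTML file that goes like this: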

<b>Match #139</b></font></td></tr><tr bgcolor="#EEEEEE"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap="">&nbsp;<font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3822pb01">3822pb01</a>&nbsp;</font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Left with 'POLICE' Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>:&nbsp;<a href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv">&nbsp;</font></td></tr><tr bgcolor="#FFFFFF"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap="">&nbsp;<font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3821pb01">3821pb01</a>&nbsp;</font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Right with 'POLICE' Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>:&nbsp;<a href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv">&nbsp;</font></td></tr><tr bgcolor="#5E5A80"><td colspan="4"><font face="Tahoma,Arial" size="2" color="#FFFFFF">&nbsp;<b>Match #140</b></font></td></tr><tr bgcolor="#EEEEEE"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap="">&nbsp;<font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3822pb02">3822pb02</a>&nbsp;</font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Left with Classic Fire Logo Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a 
href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>:&nbsp;<a href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv">&nbsp;</font></td></tr><tr bgcolor="#FFFFFF"><td align="CENTER" width="10%"><font color="Green" face="Tahoma,Arial" size="2"><b>Yes</b></font></td><td nowrap="">&nbsp;<font face="Tahoma,Arial" size="2"><a href="http://www.bricklink.com/catalogItem.asp?P=3821pb02">3821pb02</a>&nbsp;</font></td><td><font face="Tahoma,Arial" size="2"><b>Door 1 x 3 x 1 Right with Classic Fire Logo Pattern</b></font><font class="fv"><br><a href="http://www.bricklink.com/catalog.asp">Catalog</a>: <a href="http://www.bricklink.com/catalogTree.asp?itemType=P">Parts</a>:&nbsp;<a href="http://www.bricklink.com/catalogList.asp?catType=P&catID=642">Door, Decorated</a></font></td><td nowrap=""><font class="fv">&nbsp;</font></td></tr><tr bgcolor="#5E5A80"><td colspan="4"><font face="Tahoma,Arial" size="2" color="#FFFFFF">&nbsp;<b> 

Please don't kill me. Yes, it's all on one line. If you paste it into a code editor you can view it as multiple lines. The file continues with more "matches".

I want to do two things.

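First, I want to create a dictionary that uses the match number as the key. So, for example:

matches = {'139' : 'etc', '140' : 'etc'} 

Then, if you look in the HTML at the first link after each match, you'll see the part number; in the example the first one is 3822pb01. There are usually 2 part numbers per match, and I want to store those 2 part numbers together inside the dict: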

matches = {'139': ['3822pb01', '3821pb01'], '140': ['3822pb02', '3821pb02']}

So far I've been able to pull out the part numbers, or the match numbers, but I haven't been able to correlate the part numbers with the match numbers.

Can someone help me with an approach to this? It's a little beyond my current knowledge.

Here is the HTML of the full file, with the best formatting I could get: http://pastebin.com/raw.php?i=eWWh4XfM

+1

Why don't you want to do this with BeautifulSoup? It looks like the ideal tool for this.

+0

Can you share some more context here? It looks like you've ripped this part out of a table.

+0

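BeautifulSoup has so much in it, and I'm trying to learn a few simple ways to get my everyday tasks done quickly. But I'll have to learn it some day, and I'd like to know a bit of BeautifulSoup. If there's a way to do it with that, I'd be glad to hear it. I just don't want to dig into its documentation yet, and I don't mean to sound like I'm asking someone to do the work for me (though I guess that's what I'm doing anyway).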

Answers

2

With BeautifulSoup:

import re
from bs4 import BeautifulSoup

matches = {}
_catalog_link = re.compile(r'^http://www\.bricklink\.com/catalogItem\.asp\?P=')

# `htmlpage` is assumed to already hold the page's HTML source as a string
soup = BeautifulSoup(htmlpage)

for match in soup.find_all(text=re.compile(r'Match #\d+')):
    # 'Match #139' -> '139'
    match_number = match.string.split('#', 1)[-1]
    matches[match_number] = matched_links = []
    # Find the parent table row
    row = next(p for p in match.parents if p.name == 'tr')
    # The following rows hold the links; stop at the first row without one
    for sibling in row.next_siblings:
        if sibling.name != 'tr':
            continue
        links = sibling.find_all('a', href=_catalog_link)
        if not links:
            break
        matched_links.extend(l.string for l in links)

This produces:

{u'139': [u'3822pb01', u'3821pb01'], 
u'140': [u'3822pb02', u'3821pb02'], 
u'141': [u'3822pb06', u'3821pb06'], 
u'142': [u'3822p03', u'3821p03'], 
u'143': [u'3822p24', u'3821p24'], 
u'144': [u'3822pb05', u'3821pb05'], 
u'145': [u'3822pb04', u'3821pb04'], 
u'146': [u'3822px1', u'3821px1'], 
u'147': [u'3822', u'3821'], 
u'148': [u'3189', u'3188'], 
u'149': [u'801a', u'802a'], 
u'150': [u'801', u'802'], 
u'151': [u'445', u'446'], 
u'152': [u'825', u'826'], 
u'153': [u'825p01', u'826p01'], 
u'154': [u'825p02', u'826p02'], 
u'155': [u'3195', u'3194'], 
u'156': [u'30231pb02', u'30231pb01'], 
u'158': [u'30230px1', u'30230px2'], 
u'159': [u'3936', u'3935'], 
u'160': [u'30355', u'30356'], 
u'161': [u'3586', u'3585'], 
u'162': [u'3933', u'3934'], 
u'164': [u'981', u'982'], 
u'165': [u'43369', u'43368'], 
u'166': [u'972', u'971'], 
u'167': [u'972pa2', u'971pa2'], 
u'168': [u'972p4f', u'971p4f'], 
u'169': [u'972p63', u'971p63'], 
u'170': [u'30073', u'30074'], 
u'171': [u'6128', u'6127'], 
u'172': [u'4466', u'4467'], 
u'173': [u'fabah1', u'fabah2'], 
u'174': [u'x46', u'x48'], 
u'175': [u'4181', u'4182'], 
u'176': [u'4181p05', u'4182p05'], 
u'177': [u'4181pb01', u'4182pb01'], 
u'178': [u'4181p02', u'4182p02'], 
u'179': [u'4181p06', u'4182p06'], 
u'180': [u'4181p04', u'4182p04'], 
u'181': [u'4181px1', u'4182px1'], 
u'182': [u'4181p03', u'4182p03'], 
u'183': [u'4181p01', u'4182p01'], 
u'184': [u'4181p07', u'4182p07'], 
u'185': [u'3195px1', u'3194px1'], 
u'186': [u'32190', u'32191'], 
u'187': [u'32188', u'32189'], 
u'188': [u'32527', u'32528'], 
u'189': [u'32534', u'32535'], 
u'190': [u'44350', u'44351'], 
u'191': [u'44352', u'44353'], 
u'192': [u'47712', u'47713'], 
u'193': [u'42061', u'42060'], 
u'194': [u'43710', u'43711'], 
u'195': [u'41765', u'41764'], 
u'196': [u'41748', u'41747'], 
u'197': [u'41750', u'41749'], 
u'198': [u'6565', u'6564'], 
u'199': [u'41770', u'41769'], 
u'200': [u'43723', u'43722'], 
u'201': [u'43721', u'43720'], 
u'202': [u'41768', u'41767'], 
u'203': [u'3069bps5', u'3069bps4'], 
u'204': [u'42061pb03', u'42060pb03'], 
u'205': [u'42061pb05', u'42060pb05'], 
u'206': [u'3005pb001', u'3005pb002'], 
u'207': [u'48288pb02', u'48288pb01'], 
u'208': [u'2582pb03', u'2582pb04'], 
u'209': [u'712', u'713'], 
u'211': [u'3039px17', u'3039px18'], 
u'212': [u'3037px5', u'3037px6'], 
u'213': [u'3037px3', u'3037px4'], 
u'214': [u'30249pb02', u'30249pb01'], 
u'215': [u'42022pb09', u'42022pb08'], 
u'216': [u'42022pb05', u'42022pb06'], 
u'217': [u'30647pb05', u'30647pb04'], 
u'218': [u'30647pb01', u'30647pb02'], 
u'219': [u'30647pb07', u'30647pb06'], 
u'220': [u'30647px1', u'30647px2'], 
u'221': [u'2744pb02', u'2744pb01'], 
u'222': [u'42061px5', u'42060px5'], 
u'223': [u'42061pb01', u'42060pb01'], 
u'224': [u'42061px1', u'42060px1'], 
u'225': [u'41748pb05', u'41747pb05'], 
u'226': [u'41748pb16', u'41747pb16'], 
u'227': [u'41748pb12', u'41747pb12'], 
u'228': [u'41748pb15', u'41747pb15'], 
u'229': [u'41748pb07', u'41747pb07'], 
u'230': [u'41748px1', u'41747px1'], 
u'231': [u'41748pb06', u'41747pb06'], 
u'232': [u'41748pb14', u'41747pb14'], 
u'233': [u'41748pb02', u'41747pb02'], 
u'234': [u'41748pb04', u'41747pb04'], 
u'235': [u'41748pb09', u'41747pb09'], 
u'236': [u'41748pb08', u'41747pb08'], 
u'237': [u'41748pb11', u'41747pb11'], 
u'238': [u'41748pb03', u'41747pb03'], 
u'239': [u'41748pb13', u'41747pb13'], 
u'240': [u'41748pb10', u'41747pb10'], 
u'241': [u'41750px2', u'41749px2'], 
u'242': [u'41750pb01', u'41749pb01'], 
u'243': [u'6565pb01', u'6564pb01'], 
u'244': [u'4864bp10', u'4864bp11'], 
u'245': [u'4864pb006L', u'4864pb006R'], 
u'246': [u'2362pb04', u'2362pb05'], 
u'247': [u'4215ap06', u'4215ap04'], 
u'248': [u'4215ap24', u'4215ap25'], 
u'249': [u'4215pb021', u'4215pb022'], 
u'250': [u'4215ap07', u'4215ap05'], 
u'251': [u'30117pb02L', u'30117pb02R'], 
u'252': [u'30117pb03L', u'30117pb03R'], 
u'253': [u'30117pb04L', u'30117pb04R'], 
u'254': [u'30117pb01', u'30117pb05'], 
u'255': [u'30116pb01', u'30116pb02'], 
u'256': [u'2468pb02', u'2468pb03'], 
u'257': [u'3245apx2', u'3245apx1'], 
u'258': [u'4070pb02', u'4070pb01'], 
u'259': [u'41855pb09', u'41855pb10'], 
u'401': [u'47847pb001L', u'47847pb001R'], 
u'418': [u'4460pb01', u'4460pb02'], 
u'419': [u'3010pb027', u'3010pb026'], 
u'420': [u'3010pb025', u'3010pb024'], 
u'421': [u'2341pb02', u'2341pb01'], 
u'439': [u'4286pb03', u'4286pb02'], 
u'440': [u'41748pb17', u'41747pb17'], 
u'472': [u'43710pb01', u'43711pb01'], 
u'473': [u'30363pb08', u'30363pb09'], 
u'474': [u'50305', u'50304'], 
u'475': [u'50955', u'50956'], 
u'512': [u'4286pb04', u'4286pb01'], 
u'546': [u'47397', u'47398'], 
u'572': [u'3193', u'3192'], 
u'598': [u'3933a', u'3934a'], 
u'606': [u'3822pb07', u'3821pb07'], 
u'620': [u'3939px1', u'3939px2'], 
u'621': [u'2431px18', u'2431px19'], 
u'622': [u'3069bpx57', u'3069bpx56'], 
u'643': [u'4215pb015', u'4215pb016'], 
u'678': [u'54384', u'54383'], 
u'680': [u'42061pb06', u'42060pb06'], 
u'681': [u'42061pb02', u'42060pb02'], 
u'682': [u'41748pb18', u'41747pb18'], 
u'683': [u'41768pb01', u'41767pb01'], 
u'684': [u'42061pb07', u'42060pb07'], 
u'685': [u'48933pb02', u'48933pb03'], 
u'686': [u'3622pb011', u'3622pb012'], 
u'687': [u'3010pb055L', u'3010pb055R'], 
u'688': [u'3008pb038', u'3008pb039'], 
u'689': [u'3822pb08', u'3821pb08'], 
u'690': [u'3822pb09', u'3821pb09'], 
u'691': [u'3822pb10', u'3821pb10'], 
u'692': [u'3189pb01', u'3188pb01'], 
u'693': [u'3193pb01', u'3192pb01'], 
u'694': [u'3193pb02', u'3192pb02'], 
u'695': [u'3195pb01', u'3194pb01'], 
u'696': [u'4864apx10', u'4864apx11'], 
u'697': [u'4215pb029', u'4215pb030'], 
u'700': [u'2362pb10', u'2362pb11'], 
u'701': [u'4286pb06', u'4286pb05'], 
u'702': [u'3678apb05', u'3678apb06'], 
u'703': [u'3678apb07', u'3678apb08'], 
u'704': [u'4460pb04', u'4460pb03'], 
u'705': [u'2340pb17L', u'2340pb17R'], 
u'706': [u'2340pb21L', u'2340pb21R'], 
u'707': [u'2340pb03', u'2340pb02'], 
u'708': [u'2340pb11', u'2340pb10'], 
u'709': [u'2340pb04', u'2340pb05'], 
u'710': [u'2340pb16', u'2340pb15'], 
u'711': [u'2340pb07', u'2340pb06'], 
u'712': [u'2340pb09', u'2340pb08'], 
u'714': [u'2431pb039', u'2431pb040'], 
u'727': [u'2431pb025', u'2431pb026'], 
u'728': [u'791pb01L', u'791pb01R'], 
u'766': [u'3004pb031L', u'3004pb031R'], 
u'768': [u'3010pb057L', u'3010pb057R'], 
u'769': [u'3009pb071L', u'3009pb071R'], 
u'770': [u'3009pb072L', u'3009pb072R'], 
u'771': [u'2873pb08L', u'2873pb08R'], 
u'772': [u'4286pb07L', u'4286pb07R'], 
u'773': [u'4286pb08L', u'4286pb08R'], 
u'774': [u'2340pb25L', u'2340pb25R'], 
u'775': [u'2340pb23L', u'2340pb23R'], 
u'776': [u'3004pb021L', u'3004pb021R'], 
u'777': [u'3004pb017L', u'3004pb017R']} 
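
Since the question asked for built-in modules, the same idea can also be sketched with only the standard library. This is a rough sketch rather than part of the answer above: it assumes Python 3, and `MatchParser` and `parse_matches` are hypothetical names chosen for illustration. It relies on the same two observations as the BeautifulSoup version: a `Match #N` text starts a new group, and every link whose `href` contains `catalogItem.asp?P=` belongs to the most recent group.

```python
import re
from html.parser import HTMLParser

MATCH_RE = re.compile(r'Match #(\d+)')


class MatchParser(HTMLParser):
    """Collect part numbers under the most recent 'Match #N' heading."""

    def __init__(self):
        super().__init__()
        self.matches = {}
        self._current = None        # list of parts for the match being read
        self._in_part_link = False  # True while inside a catalogItem link

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href') or ''
            self._in_part_link = 'catalogItem.asp?P=' in href

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_part_link = False

    def handle_data(self, data):
        heading = MATCH_RE.search(data)
        if heading:
            # A new heading starts a new (initially empty) list of parts.
            self._current = self.matches.setdefault(heading.group(1), [])
        elif self._in_part_link and self._current is not None:
            self._current.append(data.strip())


def parse_matches(html):
    """Return {match_number: [part_number, ...]} for an HTML string."""
    parser = MatchParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.matches
```

Calling `parse_matches(htmlpage)` on the page source should give a plain `dict` keyed by the match number as a string. Unlike the BeautifulSoup version, it does not check that the links sit in the table rows directly after the heading, so it may behave differently on pages laid out another way.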
+0

*baffled* Don't I need to load the page into the program first? Also, it's giving me a 'soup not defined' error. Do I need to import something?

+0

@BrickTop: You didn't share how you're loading the page. I'll add a line that defines `soup`.
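
For readers with the same 'soup not defined' error: the name has to be assigned before the loop uses it, which means the HTML must be read into a string first. A minimal sketch, assuming the page was saved to a local file; `load_html` and the file name `matches.html` are hypothetical, not something from the question.

```python
from pathlib import Path


def load_html(path, encoding='utf-8'):
    """Return the contents of a saved HTML file as one string."""
    # errors='replace' keeps going if a few bytes do not decode cleanly
    return Path(path).read_text(encoding=encoding, errors='replace')


# Hypothetical usage with the answer's code (needs `from bs4 import BeautifulSoup`):
# htmlpage = load_html('matches.html')
# soup = BeautifulSoup(htmlpage)
```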

+0

How can I write the output to a text file?
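
One possible way to do that, sketched under the assumption that `matches` is the dictionary built above (string keys holding digits, lists of part numbers as values). `save_matches` and the one-line-per-match layout are illustrative choices, not something given in the thread.

```python
def save_matches(matches, path):
    """Write one 'number: part, part' line per match, in numeric order."""
    with open(path, 'w', encoding='utf-8') as f:
        # Keys are strings such as '139', so sort them by their integer value.
        for number in sorted(matches, key=int):
            f.write('{}: {}\n'.format(number, ', '.join(matches[number])))


# Hypothetical usage:
# save_matches(matches, 'matches.txt')
```

If the file needs to be read back by another program, `json.dump(matches, f)` from the standard `json` module would be an alternative to this hand-rolled format.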