Python and BeautifulSoup: 'a' tags not found

Here is a fragment of HTML (from Delicious). I am trying to find all the links where class="inlinesave action":

<h4> 
<a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anonymous Referers &amp; Anti-Bot Protection</a> 
<span class="saverem"> 
    <em class="bookmark-actions"> 
    <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Generate%20Secure%20Links%20with%20Anonymous%20Referers%20%26%20Anti-Bot%20Protection&amp;jump=%2Fdux&amp;key=fFS4QzJW2lBf4gAtcrbuekRQfTY-&amp;original_user=dux&amp;copyuser=dux&amp;copytags=web+apps+url+security+generator+shortener+anonymous+links">SAVE</a></strong> 
    </em> 
</span> 
</h4>

Here is the code:

import urllib2
from BeautifulSoup import BeautifulSoup

sock = urllib2.urlopen('http://delicious.com/theuser')
html = sock.read()
soup = BeautifulSoup(html)
tags = soup.findAll('a', attrs={'class':'inlinesave action'})
print len(tags)

But it does not find anything.

Any ideas?

Thanks

Source

2009-11-25 pns

What happens if you use `tags = soup.findAll('a', attrs={'class': 'inlinesave'})` instead?

Hmm... that works! Is there a reasonable explanation why? – pns

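Multiple class names in a class attribute are separated by whitespace. The anchor in question has both the "inlinesave" and "action" classes assigned. Searching for a single class name seems to work. – Haes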
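To illustrate the comment above, here is a minimal sketch of matching on the individual class names rather than on the whole attribute string. It uses Python 3's standard-library `html.parser` instead of BeautifulSoup, so it is not how BeautifulSoup itself does the lookup. The names `ClassLinkFinder` and `snippet` are made up for this example, and the HTML is a shortened stand-in for the fragment in the question.

```python
from html.parser import HTMLParser


class ClassLinkFinder(HTMLParser):
    """Collect href values of <a> tags whose class list contains all wanted names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute is a whitespace-separated list of names,
        # so split it and compare as a set (order and extra spaces do not matter).
        classes = set((attr_map.get("class") or "").split())
        if self.wanted <= classes:
            self.hrefs.append(attr_map.get("href"))


# Shortened, hypothetical stand-in for the HTML in the question.
snippet = (
    '<h4><a rel="nofollow" class="taggedlink " href="http://imfy.us/">Title</a>'
    '<strong><a class="inlinesave action" href="/save?url=x">SAVE</a></strong></h4>'
)

finder = ClassLinkFinder(["inlinesave", "action"])
finder.feed(snippet)
print(finder.hrefs)
```

Because the comparison is done on a set of names, an anchor with class="action inlinesave" or with additional classes would be collected as well.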

If you want to find anchors with exactly these two classes, I think you have to use a regular expression:

tags = soup.findAll('a', attrs={'class': re.compile(r'\binlinesave\b.*\baction\b')})

Keep in mind that this regular expression won't work if the order of the class names is reversed.

The following statement should work in all cases (even though it looks ugly, IMO):

soup.findAll('a',
    attrs={'class':
        re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')
    })
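As a quick check of the two patterns above, the sketch below runs them against a few class attribute values using only the `re` module (Python 3). The sample strings are made up for illustration, and using `search` here is an assumption about how the pattern gets applied to the attribute value.

```python
import re

# The two patterns from the answer above.
one_way = re.compile(r'\binlinesave\b.*\baction\b')
either_way = re.compile(r'\baction\b.*\binlinesave\b|\binlinesave\b.*\baction\b')

# Hypothetical class attribute values to test against.
samples = [
    "inlinesave action",          # order as in the question
    "action inlinesave",          # reversed order
    "inlinesave  other  action",  # extra whitespace and a class in between
    "inlinesave",                 # only one of the two classes
]

# Map each sample to (matched by one_way, matched by either_way).
results = {
    value: (bool(one_way.search(value)), bool(either_way.search(value)))
    for value in samples
}

for value, (first, second) in results.items():
    print(repr(value), first, second)
```

Only the second pattern matches the reversed order, and neither matches when one of the two class names is missing.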

Source

2009-11-25 13:09:58 Haes

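Personally I wouldn't go with the regular expression, since it only works when the match is exact (what if there is extra whitespace between the classes? What if there is another class in between them?). That said, it could be tightened up a bit to match all the likely cases.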

You're right, I have edited my answer accordingly. – Haes

This is described as a bug at https://bugs.launchpad.net/beautifulsoup/+bug/410304 so perhaps it will be fixed in the future? – GmonC

Using Python string methods:

html = open("file").read()
for item in html.split("<strong>"):
    if "class" in item and "inlinesave action" in item:
        url_with_junk = item.split('href="')[1]
        m = url_with_junk.index('">')
        print url_with_junk[:m]

Source

2009-11-25 13:46:10 ghostdog74

The problem may have been fixed in version 3.1.0; I was able to parse it:

>>> html="""<h4> 
... <a rel="nofollow" class="taggedlink " href="http://imfy.us/" >Generate Secure Links with Anony 
... <span class="saverem"> 
... <em class="bookmark-actions"> 
...  <strong><a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Gen 
... </em> 
... </span> 
... </h4>""" 
>>> 
>>> from BeautifulSoup import BeautifulSoup 
>>> soup = BeautifulSoup(html) 
>>> tags = soup.findAll('a', attrs={'class':'inlinesave action'}) 
>>> print len(tags) 
1 
>>> tags 
[<a class="inlinesave action" href="/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Generate%20Secure% 
>>>

I also tried it with BeautifulSoup 2.1.1, and it does not work there at all.

Source

2009-11-25 14:05:43 YOU

You may be able to make some progress using pyparsing:

from pyparsing import makeHTMLTags, withAttribute 

htmlsrc="""<h4>... etc.""" 

atag = makeHTMLTags("a")[0] 
atag.setParseAction(withAttribute(("class","inlinesave action"))) 

for result in atag.searchString(htmlsrc): 
    print result.href

This gives (long result output snipped with '...'):

/save?url=http%3A%2F%2Fimfy.us%2F&amp;title=Genera...+anonymous+links

Source

2009-11-25 17:12:49 PaulMcG
