Automatically extracting feed links (atom, rss, etc.) from web pages

I have a huge list of URLs, and I need to feed them to a Python script that should spit out the feed URL for each page that has feed tags. Is there an API, library, or code out there that can help?

Source

2011-10-25 Max

I second waffle paradox. The code I usually use is this:

from bs4 import BeautifulSoup

def detect_feeds_in_HTML(input_stream):
    """ examines an open text stream with HTML for referenced feeds.

    This is achieved by detecting all ``link`` tags that reference a feed in HTML.

    :param input_stream: an arbitrary opened input stream that has a :func:`read` method.
    :type input_stream: an input stream (e.g. open file or URL)
    :return: a list of tuples ``(url, feed_type)``
    :rtype: ``list(tuple(str, str))``
    """
    # check if really an input stream
    if not hasattr(input_stream, "read"):
        raise TypeError("An opened input *stream* should be given, was %s instead!" % type(input_stream))
    result = []
    # parse the textual data (the HTML) from the input stream
    html = BeautifulSoup(input_stream.read(), "html.parser")
    # find all link tags whose rel attribute is "alternate"
    feed_urls = html.find_all("link", rel="alternate")
    # extract URL and type
    for feed_link in feed_urls:
        url = feed_link.get("href", None)
        # if a valid URL is there
        if url:
            # keep the type too, as the docstring promises (None if the tag has no type)
            result.append((url, feed_link.get("type", None)))
    return result

Source

2011-10-25 07:20:14 PhilS

I don't know of an existing library, but Atom or RSS feeds are usually indicated by <link> tags in the <head> section, like this:

<link rel="alternate" type="application/rss+xml" href="http://link.to/feed">
<link rel="alternate" type="application/atom+xml" href="http://link.to/feed">

A simple approach is to download each of these URLs, parse it with an HTML parser such as lxml.html, and read the href attribute of the relevant <link> tags.
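As a rough sketch of that idea, here is a version that uses the standard library's html.parser instead of lxml.html, only to avoid a third-party dependency. The names FeedLinkParser, find_feed_links, and FEED_TYPES are made up for this example, and the list of MIME types is an assumption that is probably not exhaustive.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types commonly used to advertise feeds (assumed, non-exhaustive).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects (url, feed_type) pairs from <link rel="alternate"> tags."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, e.g. "alternate home".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        feed_type = (attributes.get("type") or "").strip().lower()
        href = attributes.get("href")
        if "alternate" in rel_tokens and feed_type in FEED_TYPES and href:
            # Resolve relative hrefs against the URL of the page.
            self.feeds.append((urljoin(self.base_url, href), feed_type))


def find_feed_links(html_text, base_url=""):
    """Return the feed links advertised in an HTML document."""
    parser = FeedLinkParser(base_url)
    parser.feed(html_text)
    return parser.feeds


page = '<html><head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head></html>'
print(find_feed_links(page, "http://example.com/blog/"))
```

Downloading the page is left out on purpose. Whatever you use to fetch it, pass the decoded HTML text and the page URL, so that relative href values come back as absolute URLs.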

Source

2011-10-25 03:23:49 Avaris

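What kind of links are in the feeds? Depending on how well formed the information in these feeds is (e.g. are all the links in the form http://.../? Do you know whether they will all be in href or link tags? Will all the links in the feeds point to other feeds? etc.), I would recommend anything from a simple regular expression to a straight-up parsing module to extract the links from the feeds.

As far as parsing modules go, I can only recommend Beautiful Soup. Even the best parser will only get you so far, though, especially in the case mentioned above: if you cannot guarantee that every link in the data points to another feed, you will have to do some additional crawling and probing yourself. I recommend Beautiful Soup for parsing the HTML and getting the <link rel="alternate"> tags that reference the feeds.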
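For the "probing" part, one possible check is to look at the root element of a document you have already downloaded and see whether it looks like a feed at all. This is only a sketch: sniff_feed_type is a hypothetical helper name, it relies on the standard library's xml.etree.ElementTree, and it assumes that a feed is well-formed XML whose root element is rss, feed (Atom), or RDF (RSS 1.0). For untrusted input, a hardened XML parser would be a safer choice.

```python
import xml.etree.ElementTree as ET


def sniff_feed_type(document):
    """Return "rss", "atom", "rdf", or None for the text/bytes of a fetched document."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # Not well-formed XML, e.g. an ordinary HTML page.
        return None
    # Drop a namespace prefix such as "{http://www.w3.org/2005/Atom}feed".
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    return {"rss": "rss", "feed": "atom", "rdf": "rdf"}.get(local_name)


print(sniff_feed_type('<rss version="2.0"><channel></channel></rss>'))
```

A link that passes this check is very likely a feed. A link that fails it may still be a malformed feed, which a dedicated feed parser might tolerate.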

Source

2011-10-25 03:27:53

There is feedfinder:

>>> import feedfinder 
>>> 
>>> feedfinder.feed('scripting.com') 
'http://scripting.com/rss.xml' 
>>> 
>>> feedfinder.feeds('delong.typepad.com/sdj/')
['http://delong.typepad.com/sdj/atom.xml', 
'http://delong.typepad.com/sdj/index.rdf', 
'http://delong.typepad.com/sdj/rss.xml'] 
>>>

Source

2013-03-22 08:46:08

feedfinder is no longer maintained, but there is now [feedfinder2](https://pypi.python.org/pypi/feedfinder2). – Scarabee
