2014-12-14 1 views
0

Getting a dict between the tags of an HTML request

I'm wondering how to get the data that is available between the tags of an HTML request. Suppose this is in some HTML; how would I go about extracting it?

<script type="text/javascript">window._sharedData = {"static_root":"\/\/d36xtkk24g8jdx.cloudfront.net\/bluebar\/a1968ef","platform":{"is_touch":false,"app_platform":"web"},"hostname":"instagram.com","entry_data":{"DesktopPPage":[{"canSeePrerelease":false,"viewer":null,"media":{"caption_is_edited":false,"code":"vF25LwCnL8","date":1415348305.0,"video_url":"http:\/\/videos-h-12.ak.instagram.com\/hphotos-ak-xap1\/10753251_876245142395032_328159772_n.mp4","caption":"2014 season teaser! Just a taste of some of the \ud83d\udd28\ud83d\udd28\ud83d\udd28 that got fumbled on \ud83d\udcf9 this season. Edit dropping fall 2017 @m.wilkie @sturhyssmith #snowboarding #springshred #bdpproteam #turoaparks #turoa #mtruapehu #seasonedit #wouldyouratherfightagoatwithahumanheadorahumanwithagoathead?","secure_video_url":"https:\/\/igcdn-videos-h-12-a.akamaihd.net\/hphotos-ak-xap1\/10753251_876245142395032_328159772_n.mp4","usertags":{"nodes":[]},"comments":{"nodes":[{"text":"Where do I buy tickets to the London premiere? #fanboy","viewer_can_delete":false,"id":"848487057151652334","user":{"username":"jamesbutchernz","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_1052126311_75sq_1391324963.jpg"}},{"text":"It's invites only @jamesbutchernz @m.wilkie is choosing too so chances are slim unless your smoking hot with low self esteem.","viewer_can_delete":false,"id":"849353938720944684","user":{"username":"bobeykrebner","profile_pic_url":"https:\/\/igcdn-photos-g-a.akamaihd.net\/hphotos-ak-xpf1\/10584664_742398385822158_510451676_a.jpg"}},{"text":"It's lucky we all know I'm both of those. #easy","viewer_can_delete":false,"id":"849403857951420829","user":{"username":"jamesbutchernz","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_1052126311_75sq_1391324963.jpg"}},{"text":"Last I heard you were smoking hot and had the self esteem of Kanye West @jamesbutchernz what changed?","viewer_can_delete":false,"id":"849671858500038887","user":{"username":"bobeykrebner","profile_pic_url":"https:\/\/igcdn-photos-g-a.akamaihd.net\/hphotos-ak-xpf1\/10584664_742398385822158_510451676_a.jpg"}},{"text":"You know what they say @bobeykrebner. Treat yourself like Kayne treats Kayne.","viewer_can_delete":false,"id":"849966794608898266","user":{"username":"jamesbutchernz","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_1052126311_75sq_1391324963.jpg"}}]},"shared_by_author":true,"likes":{"count":41,"viewer_has_liked":false,"nodes":[{"user":{"username":"claytonbenson","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_52633025_75sq_1359351765.jpg"}},{"user":{"username":"snowrev","profile_pic_url":"https:\/\/igcdn-photos-d-a.akamaihd.net\/hphotos-ak-xaf1\/10735284_1474262932842435_1018554144_a.jpg"}},{"user":{"username":"shayning_","profile_pic_url":"https:\/\/igcdn-photos-f-a.akamaihd.net\/hphotos-ak-xaf1\/10817775_319647074907693_836092401_a.jpg"}},{"user":{"username":"paused_future","profile_pic_url":"https:\/\/igcdn-photos-f-a.akamaihd.net\/hphotos-ak-xpa1\/10809941_1580815445475533_469492417_a.jpg"}},{"user":{"username":"kris_tayl0r","profile_pic_url":"https:\/\/igcdn-photos-e-a.akamaihd.net\/hphotos-ak-xaf1\/10802916_384369668395220_1244229274_a.jpg"}},{"user":{"username":"crazyshuz","profile_pic_url":"https:\/\/igcdn-photos-h-a.akamaihd.net\/hphotos-ak-xfp1\/10787707_905860216092359_425635869_a.jpg"}},{"user":{"username":"titstatertots","profile_pic_url":"https:\/\/igcdn-photos-b-a.akamaihd.net\/hphotos-ak-xpf1\/10554089_855164584513369_706239607_a.jpg"}}]},"owner":{"username":"bobeykrebner","requested_by_viewer":false,"followed_by_viewer":false,"profile_pic_url":"https:\/\/igcdn-photos-g-a.akamaihd.net\/hphotos-ak-xpf1\/10584664_742398385822158_510451676_a.jpg","has_blocked_viewer":false,"id":"1459690667","is_private":false},"is_video":true,"id":"848325528968131324","display_src":"http:\/\/photos-e.ak.instagram.com\/hphotos-ak-xfp1\/10748245_307748359428196_942078105_n.jpg"},"__get_params":{},"staticRoot":"\/\/d36xtkk24g8jdx.cloudfront.net\/bluebar\/a1968ef","__query_string":"?","prerelease":false,"__path":"\/p\/vF25LwCnL8\/","shortcode":"vF25LwCnL8"}]},"country_code":"AU","config":{"viewer":null,"csrf_token":"0bfa16595bdacb5bcfcb94441d0fb7ab"}};</script>

Basically, I want to know how to get the data that sits inside the script tag, but only the part after "window._sharedData =" up to the end of the line.

Answers

1

Use a combination of HTML parsing and text manipulation.

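BeautifulSoup can help with the parsing; after that you can extract the text content of the <script> tag and split out the JavaScript object definition:

from bs4 import BeautifulSoup
import re

# html_page_source is assumed to hold the page's HTML as a string
soup = BeautifulSoup(html_page_source, 'html.parser')
# find the <script> tag whose text mentions window._sharedData
script_tag = soup.find('script', text=re.compile(r'window\._sharedData'))
# keep what follows the first '=', minus surrounding spaces and semicolons
shared_data = script_tag.string.partition('=')[-1].strip(' ;')

The last line takes the string content of the tag, splits off everything up to and including the first =, and removes any leading and trailing spaces and semicolons.

Demo, including loading the resulting string as JSON: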

>>> from bs4 import BeautifulSoup 
>>> import re 
>>> soup = BeautifulSoup('''\ 
... <script type="text/javascript">window._sharedData = {"static_root":"\/\/d36xtkk24g8jdx.cloudfront.net\/bluebar\/a1968ef","platform":{"is_touch":false,"app_platform":"web"},"hostname":"instagram.com","entry_data":{"DesktopPPage":[{"canSeePrerelease":false,"viewer":null,"media":{"caption_is_edited":false,"code":"vF25LwCnL8","date":1415348305.0,"video_url":"http:\/\/videos-h-12.ak.instagram.com\/hphotos-ak-xap1\/10753251_876245142395032_328159772_n.mp4","caption":"2014 season teaser! Just a taste of some of the \ud83d\udd28\ud83d\udd28\ud83d\udd28 that got fumbled on \ud83d\udcf9 this season. Edit dropping fall 2017 @m.wilkie @sturhyssmith #snowboarding #springshred #bdpproteam #turoaparks #turoa #mtruapehu #seasonedit #wouldyouratherfightagoatwithahumanheadorahumanwithagoathead?","secure_video_url":"https:\/\/igcdn-videos-h-12-a.akamaihd.net\/hphotos-ak-xap1\/10753251_876245142395032_328159772_n.mp4","usertags":{"nodes":[]},"comments":{"nodes":[{"text":"Where do I buy tickets to the London premiere? #fanboy","viewer_can_delete":false,"id":"848487057151652334","user":{"username":"jamesbutchernz","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_1052126311_75sq_1391324963.jpg"}},{"text":"It's invites only @jamesbutchernz @m.wilkie is choosing too so chances are slim unless your smoking hot with low self esteem.","viewer_can_delete":false,"id":"849353938720944684","user":{"username":"bobeykrebner","profile_pic_url":"https:\/\/igcdn-photos-g-a.akamaihd.net\/hphotos-ak-xpf1\/10584664_742398385822158_510451676_a.jpg"}},{"text":"It's lucky we all know I'm both of those. #easy","viewer_can_delete":false,"id":"849403857951420829","user":{"username":"jamesbutchernz","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_1052126311_75sq_1391324963.jpg"}},{"text":"Last I heard you were smoking hot and had the self esteem of Kanye West @jamesbutchernz what changed?","viewer_can_delete":false,"id":"849671858500038887","user":{"username":"bobeykrebner","profile_pic_url":"https:\/\/igcdn-photos-g-a.akamaihd.net\/hphotos-ak-xpf1\/10584664_742398385822158_510451676_a.jpg"}},{"text":"You know what they say @bobeykrebner. Treat yourself like Kayne treats Kayne.","viewer_can_delete":false,"id":"849966794608898266","user":{"username":"jamesbutchernz","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_1052126311_75sq_1391324963.jpg"}}]},"shared_by_author":true,"likes":{"count":41,"viewer_has_liked":false,"nodes":[{"user":{"username":"claytonbenson","profile_pic_url":"https:\/\/instagramimages-a.akamaihd.net\/profiles\/profile_52633025_75sq_1359351765.jpg"}},{"user":{"username":"snowrev","profile_pic_url":"https:\/\/igcdn-photos-d-a.akamaihd.net\/hphotos-ak-xaf1\/10735284_1474262932842435_1018554144_a.jpg"}},{"user":{"username":"shayning_","profile_pic_url":"https:\/\/igcdn-photos-f-a.akamaihd.net\/hphotos-ak-xaf1\/10817775_319647074907693_836092401_a.jpg"}},{"user":{"username":"paused_future","profile_pic_url":"https:\/\/igcdn-photos-f-a.akamaihd.net\/hphotos-ak-xpa1\/10809941_1580815445475533_469492417_a.jpg"}},{"user":{"username":"kris_tayl0r","profile_pic_url":"https:\/\/igcdn-photos-e-a.akamaihd.net\/hphotos-ak-xaf1\/10802916_384369668395220_1244229274_a.jpg"}},{"user":{"username":"crazyshuz","profile_pic_url":"https:\/\/igcdn-photos-h-a.akamaihd.net\/hphotos-ak-xfp1\/10787707_905860216092359_425635869_a.jpg"}},{"user":{"username":"titstatertots","profile_pic_url":"https:\/\/igcdn-photos-b-a.akamaihd.net\/hphotos-ak-xpf1\/10554089_855164584513369_706239607_a.jpg"}}]},"owner":{"username":"bobeykrebner","requested_by_viewer":false,"followed_by_viewer":false,"profile_pic_url":"https:\/\/igcdn-photos-g-a.akamaihd.net\/hphotos-ak-xpf1\/10584664_742398385822158_510451676_a.jpg","has_blocked_viewer":false,"id":"1459690667","is_private":false},"is_video":true,"id":"848325528968131324","display_src":"http:\/\/photos-e.ak.instagram.com\/hphotos-ak-xfp1\/10748245_307748359428196_942078105_n.jpg"},"__get_params":{},"staticRoot":"\/\/d36xtkk24g8jdx.cloudfront.net\/bluebar\/a1968ef","__query_string":"?","prerelease":false,"__path":"\/p\/vF25LwCnL8\/","shortcode":"vF25LwCnL8"}]},"country_code":"AU","config":{"viewer":null,"csrf_token":"0bfa16595bdacb5bcfcb94441d0fb7ab"}};</script>
... ''') 
>>> script_tag = soup.find('script', text=re.compile('window\._sharedData')) 
>>> shared_data = script_tag.string.partition('=')[-1].strip(' ;') 
>>> import json 
>>> result = json.loads(shared_data) 
>>> from pprint import pprint 
>>> pprint(result) 
{u'config': {u'csrf_token': u'0bfa16595bdacb5bcfcb94441d0fb7ab', 
      u'viewer': None}, 
u'country_code': u'AU', 
u'entry_data': {u'DesktopPPage': [{u'__get_params': {}, 
            u'__path': u'/p/vF25LwCnL8/', 
            u'__query_string': u'?', 
            u'canSeePrerelease': False, 
            u'media': {u'caption': u'2014 season teaser! Just a taste of some of the \U0001f528\U0001f528\U0001f528 that got fumbled on \U0001f4f9 this season. Edit dropping fall 2017 @m.wilkie @sturhyssmith #snowboarding #springshred #bdpproteam #turoaparks #turoa #mtruapehu #seasonedit #wouldyouratherfightagoatwithahumanheadorahumanwithagoathead?', 
               u'caption_is_edited': False, 
               u'code': u'vF25LwCnL8', 
               u'comments': {u'nodes': [{u'id': u'848487057151652334', 
                     u'text': u'Where do I buy tickets to the London premiere? #fanboy', 
                     u'user': {u'profile_pic_url': u'https://instagramimages-a.akamaihd.net/profiles/profile_1052126311_75sq_1391324963.jpg', 
                        u'username': u'jamesbutchernz'}, 
                     u'viewer_can_delete': False}, 
                     {u'id': u'849353938720944684', 
                     u'text': u"It's invites only @jamesbutchernz @m.wilkie is choosing too so chances are slim unless your smoking hot with low self esteem.", 
                     u'user': {u'profile_pic_url': u'https://igcdn-photos-g-a.akamaihd.net/hphotos-ak-xpf1/10584664_742398385822158_510451676_a.jpg', 
                        u'username': u'bobeykrebner'}, 
                     u'viewer_can_delete': False}, 
                     {u'id': u'849403857951420829', 
                     u'text': u"It's lucky we all know I'm both of those. #easy", 
                     u'user': {u'profile_pic_url': u'https://instagramimages-a.akamaihd.net/profiles/profile_1052126311_75sq_1391324963.jpg', 
                        u'username': u'jamesbutchernz'}, 
                     u'viewer_can_delete': False}, 
                     {u'id': u'849671858500038887', 
                     u'text': u'Last I heard you were smoking hot and had the self esteem of Kanye West @jamesbutchernz what changed?', 
                     u'user': {u'profile_pic_url': u'https://igcdn-photos-g-a.akamaihd.net/hphotos-ak-xpf1/10584664_742398385822158_510451676_a.jpg', 
                        u'username': u'bobeykrebner'}, 
                     u'viewer_can_delete': False}, 
                     {u'id': u'849966794608898266', 
                     u'text': u'You know what they say @bobeykrebner. Treat yourself like Kayne treats Kayne.', 
                     u'user': {u'profile_pic_url': u'https://instagramimages-a.akamaihd.net/profiles/profile_1052126311_75sq_1391324963.jpg', 
                        u'username': u'jamesbutchernz'}, 
                     u'viewer_can_delete': False}]}, 
               u'date': 1415348305.0, 
               u'display_src': u'http://photos-e.ak.instagram.com/hphotos-ak-xfp1/10748245_307748359428196_942078105_n.jpg', 
               u'id': u'848325528968131324', 
               u'is_video': True, 
               u'likes': {u'count': 41, 
                  u'nodes': [{u'user': {u'profile_pic_url': u'https://instagramimages-a.akamaihd.net/profiles/profile_52633025_75sq_1359351765.jpg', 
                       u'username': u'claytonbenson'}}, 
                    {u'user': {u'profile_pic_url': u'https://igcdn-photos-d-a.akamaihd.net/hphotos-ak-xaf1/10735284_1474262932842435_1018554144_a.jpg', 
                       u'username': u'snowrev'}}, 
                    {u'user': {u'profile_pic_url': u'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xaf1/10817775_319647074907693_836092401_a.jpg', 
                       u'username': u'shayning_'}}, 
                    {u'user': {u'profile_pic_url': u'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/10809941_1580815445475533_469492417_a.jpg', 
                       u'username': u'paused_future'}}, 
                    {u'user': {u'profile_pic_url': u'https://igcdn-photos-e-a.akamaihd.net/hphotos-ak-xaf1/10802916_384369668395220_1244229274_a.jpg', 
                       u'username': u'kris_tayl0r'}}, 
                    {u'user': {u'profile_pic_url': u'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xfp1/10787707_905860216092359_425635869_a.jpg', 
                       u'username': u'crazyshuz'}}, 
                    {u'user': {u'profile_pic_url': u'https://igcdn-photos-b-a.akamaihd.net/hphotos-ak-xpf1/10554089_855164584513369_706239607_a.jpg', 
                       u'username': u'titstatertots'}}], 
                  u'viewer_has_liked': False}, 
               u'owner': {u'followed_by_viewer': False, 
                  u'has_blocked_viewer': False, 
                  u'id': u'1459690667', 
                  u'is_private': False, 
                  u'profile_pic_url': u'https://igcdn-photos-g-a.akamaihd.net/hphotos-ak-xpf1/10584664_742398385822158_510451676_a.jpg', 
                  u'requested_by_viewer': False, 
                  u'username': u'bobeykrebner'}, 
               u'secure_video_url': u'https://igcdn-videos-h-12-a.akamaihd.net/hphotos-ak-xap1/10753251_876245142395032_328159772_n.mp4', 
               u'shared_by_author': True, 
               u'usertags': {u'nodes': []}, 
               u'video_url': u'http://videos-h-12.ak.instagram.com/hphotos-ak-xap1/10753251_876245142395032_328159772_n.mp4'}, 
            u'prerelease': False, 
            u'shortcode': u'vF25LwCnL8', 
            u'staticRoot': u'//d36xtkk24g8jdx.cloudfront.net/bluebar/a1968ef', 
            u'viewer': None}]}, 
u'hostname': u'instagram.com', 
u'platform': {u'app_platform': u'web', u'is_touch': False}, 
u'static_root': u'//d36xtkk24g8jdx.cloudfront.net/bluebar/a1968ef'} 
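
To try the split-and-strip step on its own, here is a minimal sketch that applies the same `partition('=')` and `strip` calls to a shortened, made-up script body and then reads a few nested values. The `script_text` value is a trimmed stand-in for the real page data, and `extract_shared_data` is a hypothetical helper name, not part of any library. It uses only the standard library, so it should run without BeautifulSoup installed.

```python
import json


def extract_shared_data(script_text):
    """Return the object assigned in a 'name = {...};' script body as a dict."""
    # partition('=') splits on the FIRST '=' only, so '=' characters inside
    # the JSON payload (for example in URLs) are left alone.
    _, _, payload = script_text.partition('=')
    # Drop surrounding spaces, newlines and the trailing semicolon.
    payload = payload.strip(' ;\n')
    return json.loads(payload)


# Shortened stand-in for the real script contents (assumption for illustration).
script_text = (
    'window._sharedData = {"hostname":"instagram.com",'
    '"entry_data":{"DesktopPPage":[{"media":{"code":"vF25LwCnL8",'
    '"is_video":true,"likes":{"count":41}}}]},"country_code":"AU"};'
)

result = extract_shared_data(script_text)
# Once parsed, the data is an ordinary nested dict/list structure.
media = result['entry_data']['DesktopPPage'][0]['media']
print(media['code'], media['is_video'], media['likes']['count'])
```

Splitting on the first `=` rather than every `=` matters because the payload itself can contain that character inside string values.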
+0

Legend, that's exactly what I was looking to do! Waiting 3 minutes to accept the answer! Thanks, mate – Johnny

+0

How would I go about finding that tag from this URL? http://instagram.com/p/vF25LwCnL8/ – Johnny

+1

@Johnny: the script tag you are looking for is the 7th one; `soup.find_all('script')[6]` would do it.
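
Picking the tag by position stops working if the page adds or removes a script, so selecting it by content may be safer. The following is a hedged sketch of that idea using the standard-library `html.parser` module instead of BeautifulSoup (a swapped-in technique, not what the comment above proposes). `ScriptCollector` and `find_script` are hypothetical names, and the sample `page` is made up for illustration.

```python
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element, in document order."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            self._in_script = True
            self.scripts.append('')  # start a new, initially empty body

    def handle_endtag(self, tag):
        if tag == 'script':
            self._in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so append to the current body.
        if self._in_script:
            self.scripts[-1] += data


def find_script(html, marker):
    """Return the first script body containing marker, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return next((s for s in collector.scripts if marker in s), None)


# Made-up page with three scripts; only the last one holds the data.
page = (
    '<html><head><script src="a.js"></script>'
    '<script>var x = 1;</script></head>'
    '<body><script type="text/javascript">'
    'window._sharedData = {"country_code":"AU"};</script></body></html>'
)

print(find_script(page, 'window._sharedData'))
```

The returned string can then go through the same `partition('=')` and `strip(' ;')` steps as in the answer before being passed to `json.loads`.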