Web scraping patent data from the Indian patent website

I am trying to write a web scraper for the Indian patent search website to get patent-related data. This is the code I have so far.

#import the necessary modules 
import urllib2 
#import the beautifulsoup functions to parse the data 
from bs4 import BeautifulSoup 

#mention the website that you are trying to scrape 
patentsite="http://ipindiaservices.gov.in/publicsearch/" 

#Query the website and return the html to the variable 'page' 
page = urllib2.urlopen(patentsite) 

#Parse the html in the 'page' variable, and store it in Beautiful Soup format 
soup = BeautifulSoup(page) 

print soup

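Unfortunately, either the Indian patent website is not robust or I am not sure how to proceed from here.

This is the output of the code above.

What I want is this: given a company name, the scraper should fetch all patents for that company. If I can get this part right, that is, supplying the input the scraper uses to find patents, I would like to go on to other things. But I am stuck and cannot get any further.

Any pointers on how to get this data would be greatly appreciated.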

Source

2016-09-06 Annapoornima Koppad

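Well, you did get the HTML file you requested. However, this page appears to be built as a web app where everything is handled through JavaScript (from 'app.js'), so your approach is unlikely to work. You may want to check whether the website provides an API you can use. – UnholySheep

Yes, I looked for that kind of information. It does not seem to be there. I also tried a few online web scrapers. Is there no way I can scrape this website?

As I said, it is more of a web application (driven entirely by JavaScript) than a website. You might be able to do something with Selenium, but I have never used it. – UnholySheep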

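You can do this with just requests. The POST goes to http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php with one parameter, a timestamp that we create with time.time (it is passed as _dc in the code below). Each value in "field[]" must match up with the corresponding value in "fieldvalue[]" and, in turn, in "operator[]", whether you choose *AND*, *OR* or *NOT*. The [] after each key specifies that we are passing an array of values; without it nothing will work: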

data = { 
    "publication_type_published": "on", 
    "publication_type_granted": "on", 
    "fieldDate": "APD", 
    "datefieldfrom": "19120101", 
    "datefieldto": "20160906", 
    "operatordate": " AND ", 
    "field[]": ["PA"], # claims, description, patent-number codes go here
    "fieldvalue[]": ["chris*"], # matching values for ^^ go here 
    "operator[]": [" AND "], # matching sql logic for ^^ goes here 
    "page": "1", # gives you next page results 
    "start": "0", # not sure what effect this actually has. 
    "limit": "25"} # not sure how this relates as len(r.json()[u'record']) stays 25 regardless 

import requests 
from time import time 

post = "http://ipindiaservices.gov.in/publicsearch/resources/webservices/search.php?_dc={}".format(
    str(time()).replace(".", "")) 

with requests.Session() as s: 
    s.get("http://ipindiaservices.gov.in/publicsearch/") 
    s.headers.update({"X-Requested-With": "XMLHttpRequest"}) 
    r = s.post(post, data=data) 
    print(r.json())
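
The rule that "field[]", "fieldvalue[]" and "operator[]" must line up can be enforced with a small helper. This is only a sketch: build_payload is a hypothetical name, not something the site or requests provides, and every key and default is copied from the data dict above rather than from any documentation.

```python
def build_payload(criteria, date_from="19120101", date_to="20160906", page=1):
    """Build the form data for search.php from (field, value, operator) triples.

    Hypothetical helper: keys and defaults mirror the data dict in the answer.
    """
    fields, values, operators = [], [], []
    for field, value, operator in criteria:
        if operator not in ("AND", "OR", "NOT"):
            raise ValueError("operator must be AND, OR or NOT")
        fields.append(field)
        values.append(value)
        # The answer pads the operator with spaces (" AND "), so do the same.
        operators.append(" {} ".format(operator))
    return {
        "publication_type_published": "on",
        "publication_type_granted": "on",
        "fieldDate": "APD",
        "datefieldfrom": date_from,
        "datefieldto": date_to,
        "operatordate": " AND ",
        "field[]": fields,        # the three lists always have equal length
        "fieldvalue[]": values,
        "operator[]": operators,
        "page": str(page),
        "start": "0",
        "limit": "25",
    }


# Same search as in the answer: field code "PA" with the pattern "chris*".
data = build_payload([("PA", "chris*", "AND")])
```

Building the three lists from one sequence of triples means they cannot drift out of step when more criteria are added.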

The output will look like the following; there is too much data to add all of it to the post.

If you use the record key, you get a list of dicts like this:

{u'Publication_Status': u'Published', u'appDate': u'2015/01/27', u'pubDate': u'2015/06/26', u'title': u'CORRUGATED PALLET', u'sourceID': u'inpat', u'abstract': u'\n A corrugated paperboard pallet is produced from two flat blanks which comprise a pallet top and a pallet bottom. The two blanks are each folded to produce only two parallel vertically extending double thickness ribs&nbsp;three horizontal panels&nbsp;two vertical side walls and two horizontal flaps. The ribs of the pallet top and pallet bottom lock each other from opening in the center of the pallet by intersecting perpendicularly with notches in the ribs. The horizontal flaps lock the ribs from opening at the edges of the pallet by intersecting perpendicularly with notches&nbsp;and the vertical sidewalls include vertical flaps that open inward defining fork passages whereby the vertical flaps lock said horizontal flaps from opening.\n ', u'Assignee': u'OLVEY Douglas A., SKETO James L., GUMBERT Sean G., DANKO Joseph J., GABRYS Christopher W., ', u'field_of_invention': u'FI10', u'publication_no': u'26/2015', u'patent_no': u'', u'application_no': u'642/DELNP/2015', u'UCID': u'WVJ4NVVIYzFLcUQvVnJsZGczcVRmSS96Vkh3NWsrS1h3Qk43S2xHczJ2WT0%3D', u'Publication_Type': u'A'}

That is your patent information.
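
Once you have that list, pulling out the few fields you care about is plain dict access. A minimal sketch, assuming every record has the same keys as the sample above; summarize is a hypothetical helper name, and splitting 'Assignee' on commas is a guess based on that one sample (it would break a name that itself contains a comma).

```python
def summarize(records):
    """Return (application_no, title, applicants) tuples for a list of record dicts."""
    rows = []
    for rec in records:
        # In the sample output 'Assignee' is one comma-separated string
        # with a trailing ", ", so drop the empty pieces after splitting.
        applicants = [
            name.strip()
            for name in rec.get("Assignee", "").split(",")
            if name.strip()
        ]
        rows.append((rec.get("application_no", ""), rec.get("title", ""), applicants))
    return rows


# Abbreviated copy of the sample record shown above.
sample = [{
    "title": "CORRUGATED PALLET",
    "application_no": "642/DELNP/2015",
    "Assignee": "OLVEY Douglas A., SKETO James L., ",
}]
print(summarize(sample))
```

With the real response you would pass r.json()["record"] instead of sample.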

If we select a few values in the browser, you can see that the values in fieldvalue, field and operator all line up. AND is the default, so you see it for every option.

So work out the codes, pick what you want, and post.
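
Since "page" gives you the next page of results, collecting everything for one search comes down to a loop over page numbers. The sketch below is an assumption-laden outline, not tested against the site: iter_records and make_fetcher are hypothetical names, and treating an empty or missing 'record' list as the end of the results is a guess, so max_pages caps the loop.

```python
def iter_records(fetch_page, max_pages=100):
    """Yield record dicts page by page until a page comes back empty.

    fetch_page(page_number) must return the decoded JSON dict for that page.
    Assumption: an empty or missing 'record' list means there are no more results.
    """
    for page in range(1, max_pages + 1):
        result = fetch_page(page)
        records = result.get("record") or []
        if not records:
            break
        for rec in records:
            yield rec


def make_fetcher(session, url, data):
    """Wrap a requests session so each call posts the same search for another page."""
    def fetch_page(page):
        payload = dict(data, page=str(page))  # copy, overriding only "page"
        return session.post(url, data=payload).json()
    return fetch_page
```

Inside the with block from the code above, this would be used as: for rec in iter_records(make_fetcher(s, post, data)). Keeping the network call behind fetch_page also lets you try the loop against canned responses first.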

Source

2016-09-06 23:22:29

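This is awesome! Thanks. I will work out the codes and then write the code. Thanks a ton.

No worries. Pick the values you want, line them up in the lists, and post to the URL, and you will get what you want in JSON format.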
