2016-11-16 4 views

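Reading specific data from a text file in Python

I am struggling to figure this out. I need to create a pandas DataFrame object with the following entries for each review:

  • Product ID
  • Number of people who voted this review helpful
  • Total number of people who rated this review
  • Rating of the product
  • Text of the review

If someone could even just help me get started on how to print all of the product/productId lines, that would be appreciated.

Here is a sample of my text file: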

product/productId: B001E4KFG0 
review/userId: A3SGXH7AUHU8GW 
review/profileName: delmartian 
review/helpfulness: 1/1 
review/score: 5.0 
review/time: 1303862400 
review/summary: Good Quality Dog Food 
review/text: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most. 

product/productId: B00813GRG4 
review/userId: A1D87F6ZCVE5NK 
review/profileName: dll pa 
review/helpfulness: 0/0 
review/score: 1.0 
review/time: 1346976000 
review/summary: Not as Advertised 
review/text: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo". 

product/productId: B000LQOCH0 
review/userId: ABXLMWJIXXAIN 
review/profileName: Natalia Corres "Natalia Corres" 
review/helpfulness: 1/1 
review/score: 4.0 
review/time: 1219017600 
review/summary: "Delight" says it all 
review/text: This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch. 
Which fields are "number of people who voted this review helpful" and "total number of people who rated this review"? – wwii

Answers


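(Sorry, I don't know how to format this properly on this site.) If I understand the question, you want to read from a file with the structure you wrote. The code further down builds a list in which every review is a dictionary; for your sample it prints something like this: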

[ 
{'review/text': 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.', 'review/profileName': 'delmartian', 'review/summary': 'Good Quality Dog Food', 'product/productId': 'B001E4KFG0', 'review/score': '5.0', 'review/time': '1303862400', 'review/helpfulness': '1/1', 'review/userId': 'A3SGXH7AUHU8GW'}, 
{'review/text': 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".', 'review/profileName': 'dll pa', 'review/summary': 'Not as Advertised', 'product/productId': 'B00813GRG4', 'review/score': '1.0', 'review/time': '1346976000', 'review/helpfulness': '0/0', 'review/userId': 'A1D87F6ZCVE5NK'}, 
{'review/text': 'bla blas', 'review/profileName': 'Natalia Corres "Natalia Corres"', 'review/summary': '"Delight" says it all', 'product/productId': 'B000LQOCH0', 'review/score': '4.0', 'review/time': '1219017600', 'review/helpfulness': '1/1', 'review/userId': 'ABXLMWJIXXAIN'} 
] 

The code that generates this list:

# Open the file; the with-block closes it automatically
with open('file.txt') as your_file:
    # Read every line
    reviews = your_file.readlines()

reviews_array = []
dictionary = {}

# Go through every line; a blank line marks the end of a review
for review in reviews:
    # Split on the first ":" only, so colons inside the review text are kept
    this_line = review.split(":", 1)
    if len(this_line) > 1:
        # The part before the first ":" is the key, the rest is the content
        dictionary[this_line[0].strip()] = this_line[1].strip()
    elif dictionary:
        # No ":" on this line (e.g. a blank line): save the review
        # and reset the dictionary for the next one
        reviews_array.append(dictionary)
        dictionary = {}

# Append the last review if the file does not end with a blank line
if dictionary:
    reviews_array.append(dictionary)

print(reviews_array)

With the list built, you can access every review like this:

for r in reviews_array: 
    print(r['review/userId']) 

And you will get this result:

A3SGXH7AUHU8GW 
A1D87F6ZCVE5NK 
ABXLMWJIXXAIN 
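
For the narrower request in the question (printing only the product/productId lines), a plain prefix check per line may be enough. The following is a minimal sketch and not part of the original answer: `sample`, `prefix` and `product_ids` are hypothetical names, and `io.StringIO` stands in for the real file so that the snippet runs on its own.

```python
import io

# Stand-in for open('file.txt'): two abbreviated records from the question's sample.
sample = io.StringIO(
    "product/productId: B001E4KFG0 \n"
    "review/score: 5.0 \n"
    "\n"
    "product/productId: B00813GRG4 \n"
    "review/score: 1.0 \n"
)

prefix = "product/productId:"
product_ids = []
for line in sample:
    if line.startswith(prefix):
        # Keep everything after the prefix, without surrounding whitespace.
        product_ids.append(line[len(prefix):].strip())

for product_id in product_ids:
    print(product_id)
```

With a real file, replacing `sample` with the object returned by `open('file.txt')` should behave the same way, since both can be iterated line by line.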

`df = pd.DataFrame(reviews_array)` – wwii
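
To illustrate the comment above, here is a small sketch of turning a list of dictionaries into a DataFrame and keeping only some of the fields. It is an assumption about how the pieces could fit together, not code from either answer: the records are truncated copies of the question's sample, and the column names in `columns` are hypothetical.

```python
import pandas as pd

# Two records in the shape produced by the parsing code above
# (review texts truncated for brevity).
reviews_array = [
    {"product/productId": "B001E4KFG0", "review/helpfulness": "1/1",
     "review/score": "5.0",
     "review/text": "I have bought several of the Vitality canned dog food products"},
    {"product/productId": "B00813GRG4", "review/helpfulness": "0/0",
     "review/score": "1.0",
     "review/text": "Product arrived labeled as Jumbo Salted Peanuts"},
]

# Hypothetical mapping from file keys to column names, chosen for illustration.
columns = {
    "product/productId": "product_id",
    "review/helpfulness": "helpfulness",
    "review/score": "rating",
    "review/text": "text",
}

# One row per dictionary; the dictionary keys become the columns.
df = pd.DataFrame(reviews_array)

# Keep only the wanted fields, rename them, and store the rating as a number.
df = df[list(columns)].rename(columns=columns)
df["rating"] = df["rating"].astype(float)

print(df)
```

Fields that are missing from a record would show up as NaN in the corresponding row, so a file with incomplete reviews may need extra handling.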


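Here is a start. I couldn't work out a couple of the fields/columns, since they need more logic and text. As in the other answer, parse the text into dictionary key:value pairs; here a regular expression is used to find the pairs.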

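import collections
import re

import pandas as pd

# Map the field names used in the file to DataFrame column names
fields = {'productId': 'Product ID', 'score': 'Rating',
          'helpfulness': 'Number Voting', 'text': 'Review'}

# Capture the field name after the first "/" and the value after ":"
pattern = r'/([^:]*):\s?(.*)'
kv = re.compile(pattern)

data = collections.defaultdict(list)
with open('file.txt') as f:
    reviews = f.read()

for match in kv.finditer(reviews):
    key, value = match.groups()
    if key in fields:
        # Strip the trailing whitespace left at the end of each line
        data[fields[key]].append(value.strip())

df = pd.DataFrame.from_dict(data)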
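
One of the columns that still needs more logic is 'Number Voting', which holds strings such as "1/1". The sketch below splits it into two integer columns. It assumes, without confirmation from the question, that the value means "helpful votes / total votes"; the 'Helpful Votes' and 'Total Votes' column names are hypothetical, and the small DataFrame is a stand-in for the one built above.

```python
import pandas as pd

# Stand-in for the DataFrame built above, using values from the question's sample.
df = pd.DataFrame({
    "Product ID": ["B001E4KFG0", "B00813GRG4", "B000LQOCH0"],
    "Number Voting": ["1/1", "0/0", "1/1"],
})

# Split "x/y" into two columns; expand=True returns a DataFrame with columns 0 and 1.
votes = df["Number Voting"].str.split("/", expand=True).astype(int)

# Assumption: x is the number of helpful votes, y the total number of votes.
df["Helpful Votes"] = votes[0]
df["Total Votes"] = votes[1]
df = df.drop(columns="Number Voting")

print(df)
```

The conversion with `astype(int)` raises an error if a value does not have the "x/y" shape, which may be preferable to silently keeping malformed rows.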