I am trying to scrape this website and get all the words, but using a generator gives me more words than using a list. Also, the discrepancy is inconsistent: sometimes it's one extra word, sometimes none, sometimes more than 30. I have read about generators in the Python documentation and looked at several questions about generators. What I understand is that there should be no difference. I don't understand what is happening under the hood. I am using Python 3.6. I have also read Generator Comprehension different output from list comprehension? but I can't make sense of my situation: the same code produces different output depending on whether it uses a list comprehension or a generator.
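A minimal sketch (not from the original post) of the semantic difference that matters here: a generator expression is evaluated lazily and can be consumed only once, while a list comprehension is built immediately and can be reused any number of times.

```python
# Generator expression: lazy, single-use.
gen = (w.strip() for w in [" a ", " b "])
# List comprehension: eager, reusable.
lst = [w.strip() for w in [" a ", " b "]]

print(list(gen))  # ['a', 'b']
print(list(gen))  # []  -- the generator is now exhausted
print(lst)        # ['a', 'b']
print(lst)        # ['a', 'b']  -- the list is unchanged
```

If a generator in a pipeline is (perhaps accidentally) iterated more than once, every pass after the first sees nothing, which can silently change downstream results.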
This is the first function, using generator expressions:
import re

import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords


def text_cleaner1(website):
    '''
    This function just cleans up the raw HTML so that I can look at it.
    Inputs: a URL to investigate
    Outputs: cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except requests.RequestException:
        return  # Needed in case the website isn't there anymore, or some other connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Parse the HTML from the site
    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this
    lines = (line.strip() for line in text.splitlines())  # break into lines
    print(type(lines))
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))  # break multi-headlines into a line each
    print(type(chunks))

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the Unicode junk
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Needed as some websites aren't formatted
    except UnicodeDecodeError:                                          # in a way that this works; can throw an exception
        return

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Get rid of any terms that aren't words
                                              # (keep 3 for d3.js and + for C++)
    text = text.lower().split()  # Lower-case and split into words

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if w not in stop_words]

    text = set(text)  # Last, just keep the set of words; ignore counts (we only care
                      # whether a term existed on the website or not)
    return text
This is the second function, using list comprehensions:
def text_cleaner2(website):
    '''
    This function just cleans up the raw HTML so that I can look at it.
    Inputs: a URL to investigate
    Outputs: cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except requests.RequestException:
        return  # Needed in case the website isn't there anymore, or some other connection problem

    soup_obj = BeautifulSoup(site, "lxml")  # Parse the HTML from the site
    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object

    text = soup_obj.get_text()  # Get the text from this
    lines = [line.strip() for line in text.splitlines()]  # break into lines
    chunks = [phrase.strip() for line in lines for phrase in line.split(" ")]  # break multi-headlines into a line each

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the Unicode junk
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Needed as some websites aren't formatted
    except UnicodeDecodeError:                                          # in a way that this works; can throw an exception
        return

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Get rid of any terms that aren't words
                                              # (keep 3 for d3.js and + for C++)
    text = text.lower().split()  # Lower-case and split into words

    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if w not in stop_words]

    text = set(text)  # Last, just keep the set of words; ignore counts (we only care
                      # whether a term existed on the website or not)
    return text
And this code randomly gives me different results. As I understand it, a generator doesn't run the code immediately; it runs it later, when the results are needed:
text_cleaner1("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae") - text_cleaner2("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae")
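As a hypothetical illustration of that lazy evaluation (the sample data below is made up, not from the question): a generator does its work only when it is consumed, so anything that changes between its creation and its consumption can show up in its output, while a list comprehension takes a snapshot immediately.

```python
words = ["python", "developer"]

gen = (w.upper() for w in words)  # nothing has run yet
lst = [w.upper() for w in words]  # already fully evaluated

words.append("senior")  # the source changes before the generator is consumed

print(list(gen))  # ['PYTHON', 'DEVELOPER', 'SENIOR'] -- the generator sees the change
print(lst)        # ['PYTHON', 'DEVELOPER']           -- the list was a snapshot
```

In the scraping code both pipelines are fully consumed inside one call, so this alone doesn't prove the generator is at fault; a job page that returns slightly different HTML on each request would also explain results that differ between two separate calls.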
I ran the code a few times and I always get the same results for both functions. – furas
Instead of using 'chunk_space(chunk)' inside ''.join()', you could use a space instead of an empty string as the join separator, i.e. ' '.join(). – furas
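A short sketch of furas's suggestion (the sample chunks are made up): passing a space as the separator to join() makes the chunk_space() helper unnecessary. The only behavioral difference is that the old version leaves a trailing space.

```python
chunks = ["Senior", "", "Python", "Developer"]

# Original approach: append a space to every chunk, then join with ''.
old = ''.join(chunk + ' ' for chunk in chunks if chunk)
# Suggested approach: let join() insert the spaces between chunks.
new = ' '.join(chunk for chunk in chunks if chunk)

print(repr(old))  # 'Senior Python Developer '
print(repr(new))  # 'Senior Python Developer'
```

Since the final step of both functions is splitting on whitespace anyway, the trailing space makes no difference to the resulting word set.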