2016-12-25 1 views
2

Error when parsing HTML in a Spark DataFrame

I have a Spark DataFrame. I am applying a map function to each row of the DataFrame to parse a text column with an HTML parsing library, and then storing the parsed HTML together with two other columns in a new RDD.

Eventually I want to save the RDD as a new Spark DataFrame. Here is the code for it.

def htmlParsing(x): 
    """ This function takes the input text and cleans the HTML tags from it 

    """ 

    from bs4 import BeautifulSoup 
    row=x.asDict() 
    textcleaned='' 
    souptext=BeautifulSoup(row['desc']) 
    #souptext=BeautifulSoup(text) 
    p_tags=souptext.find_all('p') 
    for p in p_tags:
        if p.string:
            textcleaned += p.string
    ret_list= (int(row['id']),row['title'],textcleaned) 
    return ret_list 


ret_list=sdf_rss.map(htmlParsing) 

sdf_cleaned=sqlContext.createDataFrame(ret_list,['id','title','desc']) 
sdf_cleaned.count() 

When I run ret_list.take(2), the mapping result is displayed correctly. The same holds for sdf_cleaned.show().

The mapping function seems to work, since I get a correct RDD back. Here is the RDD returned by the mapping function:

[(-33753621, 
    u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', 
    u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."), 
(-761323061, 
    u'Teen sexting is prompting an overhaul in child pornography laws', 
    u"Rampant teen sexting has left politicians and law enforcement authorities around the country struggling to find some kind of legal middle ground between prosecuting students for child porn and letting them off the hook.Most states consider sexually explicit images of minors to be child pornography, meaning even teenagers who share nude selfies among themselves can, in theory at least, be hit with felony charges that can carry heavy prison sentences and require lifetime registration as a sex offender.Many authorities consider that overkill, however, and at least 20 states have adopted sexting laws with less-serious penalties, mostly within the past five years. Eleven states have made sexting between teens a misdemeanor; in some of those places, prosecutors can require youngsters to take courses on the dangers of social media instead of charging them with a crime.Hawaii passed a 2012 law saying youths can escape conviction if they take steps to delete explicit photos. Arkansas adopted a 2013 law sentencing first-time youth sexters to eight hours of community service. New Mexico last month removed criminal penalties altogether in such cases.At least 12 other states are considering sexting laws this year, many to create new a category of crime that would apply to young people.But one such proposal in Colorado has revealed deep divisions about how to treat the phenomenon. Though prosecutors and researchers agree that felony sex crimes shouldn't apply to a pair of 16-year-olds sending each other selfies, they disagree about whether sexting should be a crime at all.Colorado's bill was prompted by a scandal last year at a Canon City high school where more than 100 students were found with explicit images of other teens. The news sent shockwaves through the city of 16,000. 
Dozens of students were suspended, and the football team forfeited the final game of the season.Fremont County prosecutors ultimately decided against filing any criminal charges, saying Colorado law doesn't properly distinguish between adult sexual predators and misbehaving teenagers.In a similar case last year out Fayetteville, North Carolina, two dating teens who exchanged nude selfies at age 16 were charged as adults with a felony \u2014 sexual exploitation of a minor. After an uproar, the cha"), 

However, in both cases I get an error when I call count():

ret_list.count() 
/Users/i854319/spark/python/pyspark/rdd.pyc in count(self) 
    1002   3 
    1003   """ 
-> 1004   return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 
    1005 
    1006  def stats(self): 

/Users/i854319/spark/python/pyspark/rdd.pyc in sum(self) 
    993   6.0 
    994   """ 
--> 995   return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) 
    996 
    997  def count(self): 

/Users/i854319/spark/python/pyspark/rdd.pyc in fold(self, zeroValue, op) 
    867   # zeroValue provided to each partition is unique from the one provided 
    868   # to the final reduce call 
--> 869   vals = self.mapPartitions(func).collect() 
    870   return reduce(op, vals, zeroValue) 
    871 

/Users/i854319/spark/python/pyspark/rdd.pyc in collect(self) 
    769   """ 
    770   with SCCallSiteSync(self.context) as css: 
--> 771    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 
    772   return list(_load_from_socket(port, self._jrdd_deserializer)) 
    773 

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args) 
    811   answer = self.gateway_client.send_command(command) 
    812   return_value = get_return_value(
--> 813    answer, self.gateway_client, self.target_id, self.name) 
    814 
    815   for temp_arg in temp_args: 

/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw) 
    43  def deco(*a, **kw): 
    44   try: 
---> 45    return f(*a, **kw) 
    46   except py4j.protocol.Py4JJavaError as e: 
    47    s = e.java_exception.toString() 

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name) 
    306     raise Py4JJavaError(
    307      "An error occurred while calling {0}{1}{2}.\n". 
--> 308      format(target_id, ".", name), value) 
    309    else: 
    310     raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 129.0 failed 1 times, most recent failure: Lost task 2.0 in stage 129.0 (TID 189, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/Users/i854319/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/Users/i854319/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/Users/i854319/spark/python/pyspark/rdd.py", line 2346, in pipeline_func 
    return func(split, prev_func(split, iterator)) 
    File "/Users/i854319/spark/python/pyspark/rdd.py", line 2346, in pipeline_func 
    return func(split, prev_func(split, iterator)) 
    File "/Users/i854319/spark/python/pyspark/rdd.py", line 2346, in pipeline_func 
    return func(split, prev_func(split, iterator)) 
    File "/Users/i854319/spark/python/pyspark/rdd.py", line 317, in func 
    return f(iterator) 
    File "/Users/i854319/spark/python/pyspark/rdd.py", line 1004, in <lambda> 
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 
    File "/Users/i854319/spark/python/pyspark/rdd.py", line 1004, in <genexpr> 
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() 
    File "<ipython-input-173-694d23c67c86>", line 10, in htmlParsing 
    File "/Users/i854319/anaconda/lib/python2.7/site-packages/bs4/__init__.py", line 176, in __init__ 
    elif len(markup) <= 256: 
TypeError: object of type 'NoneType' has no len() 

Answer

3

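This has nothing to do with count. It happens because you do not account for NULL (SQL) / None (Python) values. As you can see, the parser fails with an exception when it gets None instead of text:

BeautifulSoup(None, "lxml")

TypeError
...
TypeError: object of type 'NoneType' has no len()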
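
As for why take(2) and show() succeeded while count() failed: Spark transformations are lazy, and take(n) only evaluates as many rows as it needs, whereas count() has to run the mapping function on every row, including the ones where desc is NULL. The following is a minimal plain-Python sketch of that behaviour, not Spark code. The `clean` function and the `rows` list are hypothetical stand-ins for the parser and the data, and the sketch assumes the first rows happen to be non-null.

```python
from itertools import islice


def clean(desc):
    # Hypothetical stand-in for the HTML parser: like bs4, it calls len()
    # on its input, so None raises TypeError.
    if len(desc) == 0:
        return ""
    return desc.strip()


# Made-up data: the third "row" has a missing description.
rows = ["<p>a</p>", "<p>b</p>", None, "<p>c</p>"]


def mapped():
    # A generator is lazy, much like an RDD transformation:
    # clean() is not called until the result is consumed.
    return (clean(d) for d in rows)


# Analogous to take(2): only the first two rows are ever evaluated.
first_two = list(islice(mapped(), 2))

# Analogous to count(): every row is evaluated, so the None row is hit.
try:
    total = sum(1 for _ in mapped())
    error = None
except TypeError as exc:
    total = None
    error = str(exc)

print(first_two)
print(total, error)
```

So a successful take(2) says nothing about rows further down the dataset.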

Depending on your requirements, you can drop the NULL values before mapping:

sdf_rss.na.drop(subset=["desc"]).rdd.map(...) 

or fill them:

sdf_rss.na.fill({"desc": ""}).rdd.map(...) 

You can also add exception handling explicitly:

try: 
    souptext = BeautifulSoup(row['desc']) 
    ... 
except TypeError: 
    ... 

check for None before parsing:

if row['desc'] is not None: 
    souptext = BeautifulSoup(row['desc']) 
    ... 
else: 
    ... 

or default to an empty string:

souptext = BeautifulSoup(row['desc'] or '') 
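
These variants are not equivalent for downstream code: the explicit check can keep a missing description as None (NULL in the resulting DataFrame), while `or ''` turns it into an empty string. Below is a rough sketch of the difference. To keep it runnable without bs4 it uses the standard library's `html.parser` as a stand-in for BeautifulSoup (Python 3); the class and function names are hypothetical and not part of the original code.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collects the text found inside <p> elements (stdlib stand-in for bs4)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <p> elements we are currently inside
        self.chunks = []  # text pieces collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside a <p> element.
        if self.depth > 0:
            self.chunks.append(data)


def clean_keep_null(desc):
    """Explicit check: a missing description stays None."""
    if desc is None:
        return None
    parser = ParagraphText()
    parser.feed(desc)
    parser.close()
    return "".join(parser.chunks)


def clean_default_empty(desc):
    """Default value: a missing description becomes an empty string."""
    return clean_keep_null(desc or "")


print(clean_keep_null("<div><p>foo</p> <p>bar</p></div>"))
print(clean_keep_null(None))
print(repr(clean_default_empty(None)))
```

Which one to prefer depends on whether later steps need to tell "no description" apart from "description without any paragraph text".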

You may also consider using a udf, which can simplify the process:

from bs4 import BeautifulSoup
from pyspark.sql.functions import udf
from pyspark.sql import Column
from typing import Union

def parse_html(col: str) -> Column:
    def parse_html_(desc: Union[None, str]) -> Union[None, str]:
        # A NULL input falls through and returns None (NULL in the result)
        if desc is not None:
            ps = BeautifulSoup(desc, "lxml").find_all('p')
            # p.string is None when a <p> has several child nodes, so skip those
            return " ".join(p.string for p in ps if p.string)
    return udf(parse_html_)(col)

(sc
    .parallelize([
        (1, "foo", "<div><p>foo</p> <p>bar</p></div>",),
        (2, "bar", None,)])
    .toDF(["id", "title", "desc"])
    .select("id", "title", parse_html("desc").alias("desc"))
    .show())
+---+-----+-------+
| id|title|   desc|
+---+-----+-------+
|  1|  foo|foo bar|
|  2|  bar|   null|
+---+-----+-------+

If you have well-formed XML and Hive support enabled, you can use the xpath* UDFs, but they are much less powerful than BeautifulSoup.
