2013-03-11

I am using Pig to process data and need help with regular expression matching. The data looks like this:

<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="1" gen="" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Vacation"/><S uid="2" gen="" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Mother earth"/><S uid="3" gen="" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="4" gen="" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Vocalise"/><S uid="5" gen="" art="Kitschi cupid" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung BeatDJ" ttl="Hard beat floor"/><S uid="6" gen="" yr="2011" art="David Kater" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Nothing left to say"/><S uid="7" gen="" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Morning Dew"/><S uid="12" gen="" art="&lt;unknown&gt;" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/download" alb="download" ttl="mirzaghalib6_www.songs.pk_"/><S uid="13" gen="" art="&lt;unknown&gt;" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/download" alb="download" ttl="mirzaghalib7_www.songs.pk_"/><S uid="4555" gen="" yr="2012" art="Javed Ali &amp; Shakthisree Gopalan" cmp="Music: A.R. Rahman | Lyrics: Gulzar" fld="/mnt/sdcard/WhatsApp/Media/WhatsApp Audio" alb="Jab Tak Hai Jaan" ttl="Jab Tak Hai Jaan - www.Songs.PK"/></SC><PC/></MC>

My goal is to parse out the entries between art=" and the closing " and store them in HDFS.

I am using the following Pig commands:

A= load 'smalltestdata' USING TextLoader() AS (line:chararray); 
data_split=FILTER C BY (line matches '.*art=.*'); 

I am getting nothing back. What am I doing wrong?

Answers

To get only the text after art=" and before the next ", use the following regex:

(?<=art\=")(.*?)(?=") 

Here is what is happening:

1. (?<=art\=") - This is a lookbehind. It only lets a match start immediately after `art="`.
2. (.*?)  - This is the part that is returned. The `?` makes it non-greedy, so it grabs as few characters as possible.
3. (?=")  - This is a lookahead. It only lets the match end immediately before a `"`.

Lookbehinds and lookaheads are not returned, so the result should be all of the text between art=" and the closing ".
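
To see how the lookbehind, the non-greedy group, and the lookahead behave together, here is a minimal Java sketch. Java is used because Pig regular expressions follow Java regex syntax. The class name `ArtExtractor`, the method `extractArtists`, and the shortened sample line are hypothetical and only serve as illustration. This is not a Pig script.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public class ArtExtractor {
    // The regex from the answer: lookbehind, non-greedy group, lookahead.
    private static final Pattern ART = Pattern.compile("(?<=art\\=\")(.*?)(?=\")");

    // Collects every value found between art=" and the next double quote.
    public static List<String> extractArtists(String line) {
        List<String> artists = new ArrayList<>();
        Matcher m = ART.matcher(line);
        while (m.find()) {
            // group() is the whole match; the lookarounds consume no
            // characters, so art=" and the closing quote are not included.
            artists.add(m.group());
        }
        return artists;
    }

    public static void main(String[] args) {
        // Shortened sample built from two records of the question's data.
        String line = "<S uid=\"1\" gen=\"\" art=\"Samsung\" alb=\"Samsung\" ttl=\"Vacation\"/>"
                + "<S uid=\"5\" gen=\"\" art=\"Kitschi cupid\" alb=\"Samsung BeatDJ\" ttl=\"Hard beat floor\"/>";
        System.out.println(extractArtists(line));
    }
}
```

Because the group is non-greedy, each match stops at the first closing quote instead of running on to the last quote in the line, so every `art` value comes back as a separate result.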