2014-11-14 2 views

How to validate that input data is correct according to the DML in Pig

Input data:

Jorge Posada|Yankees|{(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7]
Landon Powell|Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7]
Martin Prado|Atlanta|{(Second_baseman,2002),(Infielder,2003),(Left_fielder)}|[games#258,hit_by_pitch#3]

See the bold part (the (Left_fielder) tuple in the last row): I have left out the year field there.

bfile = LOAD 'basketball1.txt' USING PigStorage('|') AS (name:chararray, team:chararray, pos:bag{t:tuple(point:chararray, year:int)}, bat:map[]);

dump bfile;
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)})
(Martin Prado,Atlanta,[games#258,hit_by_pitch#3])

Regards, Sanjeeb


Can you add more samples to check the input against, both valid and invalid ones?

Answers


Here is a regex-based script for your schema. It validates almost all of the fields. Please go through the annotations below and let me know if you need any other validation.

Pig script with the regex:

A = LOAD 'input.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*([A-Za-z]+)\\s*\\|\\s*(\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])$')) AS (name:chararray,team:chararray,pos:bag{t:(p:chararray)},bat:map[]);
DUMP B;
The regex, broken out by field:

'^ 
    ([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s* 
    ([A-Za-z]+)\\s*\\|\\s* 
    (\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s* 
    (\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\]) 
$' 
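If it helps to know which field broke the schema, the same sub-patterns can be applied one field at a time. The following is only a sketch and not part of the original answer. It assumes input.txt holds just the pipe-delimited records (no trailing annotations) and that TRIM is available as a built-in. The relation names raw, missing_pos and malformed_pos are made up for illustration.

```pig
-- Assumption: every field is loaded as plain text so it can be pattern-checked.
raw = LOAD 'input.txt' USING PigStorage('|')
      AS (name:chararray, team:chararray, pos:chararray, bat:chararray);

-- Rows where the position field is absent (e.g. a line with fewer than three fields).
missing_pos = FILTER raw BY pos IS NULL;

-- Rows where the position field is present but is not a
-- comma-separated list of (position,year) tuples inside braces.
malformed_pos = FILTER raw BY NOT (TRIM(pos) MATCHES
    '\\{\\([A-Za-z_]+,[0-9]+\\)(?:,\\([A-Za-z_]+,[0-9]+\\))*\\}');

DUMP malformed_pos;
```

The null check is kept in a separate FILTER because a null field makes the MATCHES condition null, and FILTER drops such rows without reporting them as malformed.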


input.txt (I have marked each input line below as valid or invalid):

Jorge Posada |Yankees| {(Catcher,2000),(Designated_hitter,2001)}|[games#1594,hit_by_pitch#65,grand_slams#7] --> Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001)}|[on_base_percentage#0.297,games#26,home_runs#7] --> Valid
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003),(Left_fielder)}|[games#258,hit_by_pitch#3] --> Invalid: year is missing
Martin Prado |Atlanta| {(Second_baseman,2002)(Infielder,2003)}|[games#258,hit_by_pitch#3] --> Invalid: no comma between two tuples
Martin Prado |Atlanta| {,(Second_baseman,2002),(Infielder,2003)}|[games#258,hit_by_pitch#3] --> Invalid: comma before the first tuple
Martin Prado |Atlanta| {(Second_baseman,2002),(,2003)}|[games#258,hit_by_pitch#3] --> Invalid: position is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3] --> Invalid: delimiter | is missing
Martin Prado || {(Second_baseman,2002),(Infielder,2003)}[games#258,hit_by_pitch#3] --> Invalid: team name is missing
Martin Prado |Atlanta| {(Second_baseman,2002),(Infielder,2003)}[games#,hit_by_pitch#3] --> Invalid: value is missing for the key games
Landon Powell |Oakland|{(Catcher,2000)}|[on_base_percentage#0.297] --> Valid
Landon Powell |Oakland|{(Catcher,2000),(First_baseman,2001),(test,3000)}|[on_base_percentage#0.297,games#26,home_runs#7,test#1.2] --> Valid

Output, with each line marked as valid or invalid. If an input line does not match the schema, it is printed as null, shown here as an empty tuple ():

(Jorge Posada,Yankees,{(Catcher,2000),(Designated_hitter,2001)},[games#1594,hit_by_pitch#65,grand_slams#7]) --> Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001)},[on_base_percentage#0.297,games#26,home_runs#7]) --> Valid
() --> Invalid: year is missing
() --> Invalid: no comma between two tuples
() --> Invalid: comma before the first tuple
() --> Invalid: position is missing
() --> Invalid: delimiter | is missing
() --> Invalid: team name is missing
() --> Invalid: value is missing for the key games
(Landon Powell,Oakland,{(Catcher,2000)},[on_base_percentage#0.297]) --> Valid
(Landon Powell,Oakland,{(Catcher,2000),(First_baseman,2001),(test,3000)},[on_base_percentage#0.297,games#26,home_runs#7,test#1.2]) --> Valid
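Since a non-matching line only shows up as an empty tuple, it can be handy to keep the rejected lines themselves for inspection. A minimal sketch of one way to do that, assuming the same input.txt and the same regex as above. The relation names flagged, valid and invalid and the ok column are hypothetical and not part of the original answer.

```pig
-- Load each record as one chararray so the whole line can be matched.
A = LOAD 'input.txt' AS (line:chararray);

-- Flag each line: 1 if the entire line fits the schema regex, 0 otherwise.
flagged = FOREACH A GENERATE line,
    ((line MATCHES '^([A-Za-z]+\\s+[A-Za-z]+)\\s*\\|\\s*([A-Za-z]+)\\s*\\|\\s*(\\{(?:\\([A-Za-z_]+,[0-9]+\\))(?:,\\([A-Za-z_]+,[0-9]+\\))*\\})\\s*\\|\\s*(\\[(?:[A-Za-z_]+#[0-9\\.]+)(?:,[A-Za-z_]+#[0-9\\.]+)*\\])$') ? 1 : 0) AS ok;

-- Route the lines into two relations based on the flag.
SPLIT flagged INTO valid IF ok == 1, invalid IF ok == 0;

-- Inspect the raw text of every rejected line.
DUMP invalid;
```

Computing the flag once avoids repeating the long pattern in both SPLIT conditions. A null line yields a null flag and lands in neither relation, so such lines would need a separate IS NULL check.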