R: Vectorize findInterval for large data sets

I have two data frames that I am using findInterval on. The Wellbore data is x, y, z data for oil-producing wells (VSS = vertical subsea depth, md = measured depth, a.k.a. the actual distance the drill bit travelled along the well). The Perfs data describes where the well was perforated to allow flow (top_perf = md, bot_perf = md).

Perfs:

Well_ID top_perf bot_perf well_name surface ID x y VSS 
056-W  2808  2958  056-W  Ranger 2 0 0 0 
056-W  3150  3250  056-W  Ranger 1 0 0 0 
056-W  3150  3250  056-W  Ranger 2 0 0 0 
056-W  3559  3664  056-W  UT 1 1 0 0 0 
056-W  3559  3664  056-W  UT 2 2 0 0 0 
057-W  2471  2952  057-W  Tar 1 0 0 0 
057-W  2471  2952  057-W  Tar 2 0 0 0 
058-W  2615  2896  058-W  Ranger 1 0 0 0 
058-W  2615  2896  058-W  Ranger 2 0 0 0

Wellbore:

well_name well_id  md  vss   x  y   
056-W  056-W  3260 -3251.46 4221436 4030454 
056-W  056-W  3280 -3271.45 4221436 4030454 
056-W  056-W  3300 -3291.45 4221435 4030453 
056-W  056-W  3320 -3311.44 4221435 4030453 
056-W  056-W  3340 -3331.44 4221434 4030453 
056-W  056-W  3360 -3351.43 4221434 4030453 
056-W  056-W  3380 -3371.43 4221433 4030453 
056-W  056-W  3400 -3391.42 4221433 4030453

The goal is to find the Wellbore$md values, where Perfs$Well_ID == Wellbore$well_id, that are closest to Perfs$top_perf and Perfs$bot_perf, then extract vss, x, and y from Wellbore and append them to Perfs. (If a perf falls between two md values I do not care about interpolation; I just need something close.)

Here is my code to do this:

for(i in 1:dim(Perfs)[1]){
    if(Perfs$ID[i] == 1){
        Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
        interval <- findInterval(Perfs[i,"top_perf"], Wellbore_temp$md)
        Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
    }else{
        Wellbore_temp <- Wellbore[which(Wellbore$well_id == Perfs[i,"Well_ID"]),]
        interval <- findInterval(Perfs[i,"bot_perf"], Wellbore_temp$md)
        Perfs[i,c("x","y","VSS")] <- Wellbore_temp[interval, c("x","y","vss")]
    }
}

This code works, but it is too slow for the application it will be used in. How can I remove the loop and do this in a more vectorized way to speed things up? I am also open to suggestions outside of findInterval.

Source

2016-11-29 Andrew Pruet

I found the answer to my question here: Join R data.tables where key values are not exactly equal--combine rows with closest times

Based on the data.table idea provided by @ds440, here is the code I am using, and it runs very fast:

library(data.table)

Perf.Data <- Perfs

Wellbore.Perfs <- data.table(Wellbore[,c("well_id","md","vss")])
Spotfire.Top.Perf <- data.table(Perf.Data[,c("Well_ID","top_perf", "bot_perf")])
Spotfire.Bot.Perf <- data.table(Perf.Data[,c("Well_ID","bot_perf", "top_perf")])

# Rename the columns to match Wellbore.Perfs.
# bot_perf is carried along in the top table (and top_perf in the bottom table)
# so that rows stay unique and everything in the perfs table is captured.
colnames(Spotfire.Top.Perf) <- c("well_id","md", "bot_perf")
colnames(Spotfire.Bot.Perf) <- c("well_id","md","top_perf")

# Set the key to join on
setkey(Wellbore.Perfs, "well_id","md")

# roll = "nearest" matches each md in the top/bottom perf table to the
# nearest md in Wellbore.Perfs within the same well_id
Perfs.Wellbore.Top <- Wellbore.Perfs[Spotfire.Top.Perf, roll = "nearest"]
Perfs.Wellbore.Bot <- Wellbore.Perfs[Spotfire.Bot.Perf, roll = "nearest"]
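
One thing a rolling join hides is how far away the "nearest" row actually was, because the join column in the result holds the looked-up depth rather than the matched one. The following is a minimal sketch on made-up toy data (the table names `wellbore` and `perfs`, the extra column `md_matched`, and all values are assumptions for illustration, not part of the original code) showing one way to keep the matched md and compute the distance:

```r
library(data.table)

# Toy wellbore survey on a 20-unit md grid (hypothetical values)
wellbore <- data.table(
  well_id = "056-W",
  md      = c(3240, 3260, 3280),
  vss     = c(-3231.5, -3251.5, -3271.5)
)

# Keep a copy of md: after the join, the md column holds the depth that
# was looked up, not the md of the wellbore row that was matched
wellbore[, md_matched := md]
setkey(wellbore, well_id, md)

# Toy perforation depths to look up (hypothetical values)
perfs <- data.table(well_id = "056-W", md = c(3252, 3559))

# Rolling join: for each perf row, take the wellbore row with the nearest md
matched <- wellbore[perfs, roll = "nearest"]

# Distance between the requested depth and the matched survey point
matched[, dist := abs(md - md_matched)]
print(matched)
```

With these toy values the first lookup lands close to a survey point, while the second is far beyond the last surveyed md and is still matched to it, which only becomes visible through the `dist` column.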

Source

2016-11-29 20:44:38

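If you run this on the sample data shown, some of the 'nearest' matches are quite bad, but you will not see that unless you calculate the distance. – ds440

I should have mentioned this in the original post, but in the Wellbore table md runs from 0 to the bottom of the well in increments of 20 for each well. Good point. –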

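Below I present a data.table solution. I have only tested it on the small subset of data you showed, and on that small data set it runs slower than your solution, but I think it will scale better. If it does not, consider parallelizing.

If you have not used data.table before: I find it pretty fast, but the syntax can be a bit convoluted. .SD refers to the subset of the Wellbore data that joins with row i of the Perfs data (the iteration comes from by = .EACHI). This avoids a huge join of everything against everything. Instead of using the findInterval function, it calculates the error (top_perf - md or bot_perf - md) and minimizes the absolute error. The advantage of this approach over a rolling join (roll = 'nearest') is that you can see what the error is and filter on it if necessary.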

library(data.table) 

Perfs <- fread(input = 'Well_ID top_perf bot_perf well_name surface ID x y VSS 
056-W  2808  2958  056-W  Ranger 2 0 0 0 
056-W  3150  3250  056-W  Ranger 1 0 0 0 
056-W  3150  3250  056-W  Ranger 2 0 0 0 
056-W  3559  3664  056-W  UT_1 1 0 0 0 
056-W  3559  3664  056-W  UT_2 2 0 0 0 
057-W  2471  2952  057-W  Tar 1 0 0 0 
057-W  2471  2952  057-W  Tar 2 0 0 0 
058-W  2615  2896  058-W  Ranger 1 0 0 0 
058-W  2615  2896  058-W  Ranger 2 0 0 0') 

Wellbore <- fread(input = 'well_name well_id  md  vss   x  y   
056-W  056-W  3260 -3251.46 4221436 4030454 
056-W  056-W  3280 -3271.45 4221436 4030454 
056-W  056-W  3300 -3291.45 4221435 4030453 
056-W  056-W  3320 -3311.44 4221435 4030453 
056-W  056-W  3340 -3331.44 4221434 4030453 
056-W  056-W  3360 -3351.43 4221434 4030453 
056-W  056-W  3380 -3371.43 4221433 4030453 
056-W  056-W  3400 -3391.42 4221433 4030453') 


#top 
setkey(Wellbore, 'well_id') 
setkey(Perfs, 'Well_ID', 'top_perf') 
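# unique() compares all columns by default in data.table >= 1.9.8, so deduplicate on the key explicitly
top_matched <- Wellbore[unique(Perfs, by = key(Perfs)), .SD[which.min(abs(top_perf-md)),.(md, top_perf, err=top_perf-md, x,y,vss)],nomatch=0, by=.EACHI] 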
setkey(top_matched, 'well_id', 'top_perf') 
top_joined <- top_matched[Perfs] 
top_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)] 
setnames(top_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('top_', c('err', 'x', 'y', 'vss'))) 

#bottom 
setkey(Perfs, 'Well_ID', 'bot_perf') 
# deduplicate on the key explicitly (see the note on unique() in the top block)
bot_matched <- Wellbore[unique(Perfs, by = key(Perfs)), .SD[which.min(abs(bot_perf-md)),.(md, bot_perf, err=bot_perf-md, x,y,vss)],nomatch=0, by=.EACHI] 
setkey(bot_matched, 'well_id', 'bot_perf') 
bot_joined <- bot_matched[Perfs] 
bot_joined[,`:=`(i.x=NULL, i.y=NULL,VSS=NULL)] 
setnames(bot_joined, old=c('err', 'x', 'y', 'vss'), new=paste0('bot_', c('err', 'x', 'y', 'vss'))) 


answer <- cbind(top_joined[,c(1:2,9:11,3:7), with=FALSE], bot_joined[,3:7,with=FALSE]) 

# well_id md well_name surface ID top_perf top_err top_x top_y top_vss bot_perf bot_err 
# 1: 056-W 3260  056-W Ranger 2  2808 -452 4221436 4030454 -3251.46  2958 -302 
# 2: 056-W 3260  056-W Ranger 1  3150 -110 4221436 4030454 -3251.46  3250  -10 
# 3: 056-W 3260  056-W Ranger 2  3150 -110 4221436 4030454 -3251.46  3250  -10 
# 4: 056-W 3400  056-W UT_1 1  3559  159 4221433 4030453 -3391.42  3664  264 
# 5: 056-W 3400  056-W UT_2 2  3559  159 4221433 4030453 -3391.42  3664  264 
# 6: 057-W NA  057-W  Tar 1  2471  NA  NA  NA  NA  2952  NA 
# 7: 057-W NA  057-W  Tar 2  2471  NA  NA  NA  NA  2952  NA 
# 8: 058-W NA  058-W Ranger 1  2615  NA  NA  NA  NA  2896  NA 
# 9: 058-W NA  058-W Ranger 2  2615  NA  NA  NA  NA  2896  NA 
# bot_x bot_y bot_vss 
# 1: 4221436 4030454 -3251.46 
# 2: 4221436 4030454 -3251.46 
# 3: 4221436 4030454 -3251.46 
# 4: 4221433 4030453 -3391.42 
# 5: 4221433 4030453 -3391.42 
# 6:  NA  NA  NA 
# 7:  NA  NA  NA 
# 8:  NA  NA  NA 
# 9:  NA  NA  NA
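
Since the error columns are kept, bad matches can be filtered out afterwards. The sketch below is only an illustration: `bot_demo` is a hypothetical stand-in holding a few rows copied from the printed output above, and the tolerance `tol` is an assumed value (half of a 20-unit md spacing), not something prescribed by the answer.

```r
library(data.table)

# Hypothetical stand-in for the matched table; values are taken from
# rows 1, 2, 4 and 6 of the printed output above
bot_demo <- data.table(
  well_id  = c("056-W", "056-W", "056-W", "057-W"),
  bot_perf = c(2958, 3250, 3664, 2952),
  md       = c(3260, 3260, 3400, NA),
  bot_err  = c(-302, -10, 264, NA)
)

# Assumed tolerance: half of a 20-unit md spacing
tol <- 10

# Flag matches within tolerance; NA errors (wells with no wellbore rows)
# are treated as not acceptable
bot_demo[, ok := !is.na(bot_err) & abs(bot_err) <= tol]

# Keep only the acceptable matches
kept <- bot_demo[ok == TRUE]
print(kept)
```

On these rows only the perf at 3250 (matched to md 3260) passes; the others are either far from any survey point in the sample or have no wellbore data at all.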

Source

2016-11-29 04:03:03 ds440

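Thank you. Using the midpoint with the error function causes some problems, for example when top_perf - bot_perf = 300. The midpoint is not where the perf actually sits along the wellbore, and there is no longer any distinction between top_perf and bot_perf. This is for mapping purposes, so I cannot afford to show a point that is 150 feet away from the real one. –

You could create 'top_matched' and 'bottom_matched' using 'top_perf' and 'bot_perf' respectively (instead of 'midpt'). Calculating the error lets you filter out irrelevant matches. – ds440

I have edited to remove the midpoint; md is now compared directly against the top or the bottom. – ds440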
