Keeping track of word proximity

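I am working on a small project that involves dictionary-based text searching within a collection of documents. My dictionary contains positive signal words (a.k.a. good words), but finding such a word in a document does not by itself guarantee a positive result, because negative words (for example, "not significant") may appear close to the positive ones. I want to build a matrix that holds the document number, the positive word, and its proximity to the negative words.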

Can anyone suggest a way to do this? My project is at a very early stage, so I am giving a basic example of my text.

No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.

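This is my example document, in which candesartan cilexetil, glyburide, nifedipine, digoxin, warfarin and hydrochlorothiazide are my positive words and "no significant" is my negative word. I want to do a proximity (word-based) mapping between my positive and negative words.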

Can anyone give me some helpful pointers?

Source

2010-06-21 Shreyas Karnik

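First of all, I would suggest not using R for this task. R is great for many things, but text manipulation is not one of them. Python could be a good alternative.

That said, if I were to implement this in R, I would probably do something like this (very, very rough):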

# You will probably read these from an external file or a database 
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide") 
badWords <- c("no significant", "other drugs") 

mytext <- "no significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide." 
mytext <- tolower(mytext) # Let's make life a little bit easier... 

goodPos <- NULL 
badPos <- NULL 

# First we find the good words 
for (w in goodWords) 
    { 
    pos <- regexpr(w, mytext, fixed = TRUE) # match the dictionary entry literally, not as a regex
    if (pos != -1) 
     { 
     cat(paste(w, "found at position", pos, "\n")) 
     } 
    else  
     { 
     pos <- NA 
     cat(paste(w, "not found\n")) 
     } 

    goodPos <- c(goodPos, pos) 
    } 

# And then the bad words 
for (w in badWords) 
    { 
    pos <- regexpr(w, mytext, fixed = TRUE) # match the dictionary entry literally, not as a regex
    if (pos != -1) 
     { 
     cat(paste(w, "found at position", pos, "\n")) 
     } 
    else  
     { 
     pos <- NA 
     cat(paste(w, "not found\n")) 
     } 

    badPos <- c(badPos, pos) 
    } 

# Note that we use -badPos so that we can calculate the distance with rowSums
# (positions come from regexpr, so the distance is in characters, not words)
comb <- expand.grid(goodPos, -badPos) 
wordcomb <- expand.grid(goodWords, badWords) 
dst <- cbind(wordcomb, abs(rowSums(comb))) 

mn <- which.min(dst[,3]) 
cat(paste("The closest good-bad word pair is: ", dst[mn, 1],"-", dst[mn, 2],"\n"))
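
The code above measures distance in characters within a single string, whereas the question asks for word-based proximity across documents. A minimal sketch of that variant follows, in base R only. The function name `word_proximity`, its helper `find_phrase` and the column names are made up for illustration, and it assumes the dictionary entries are lower-case, free of punctuation and separated by single spaces. Only the first occurrence of each phrase is considered, and the distance is taken between the first tokens of the two phrases.

```r
# Hypothetical helper (not part of the code above): one row per
# (document, good word, bad word) with the distance measured in words.
word_proximity <- function(docs, goodWords, badWords) {
  # Index of the token where a (possibly multi-word) phrase first starts, or NA
  find_phrase <- function(tokens, phrase) {
    p <- strsplit(phrase, " ", fixed = TRUE)[[1]]
    n <- length(tokens) - length(p) + 1
    if (n < 1) return(NA_integer_)
    for (i in seq_len(n)) {
      if (all(tokens[i:(i + length(p) - 1)] == p)) return(i)
    }
    NA_integer_
  }

  rows <- lapply(seq_along(docs), function(d) {
    # Lower-case, strip punctuation, then split on whitespace
    clean <- gsub("[[:punct:]]", "", tolower(docs[d]))
    tokens <- strsplit(clean, "[[:space:]]+")[[1]]
    tokens <- tokens[tokens != ""]

    # Every good/bad combination for this document
    pairs <- expand.grid(good = goodWords, bad = badWords,
                         stringsAsFactors = FALSE)
    goodPos <- vapply(pairs$good, find_phrase, integer(1),
                      tokens = tokens, USE.NAMES = FALSE)
    badPos <- vapply(pairs$bad, find_phrase, integer(1),
                     tokens = tokens, USE.NAMES = FALSE)

    # distance is NA when either phrase is missing from the document
    data.frame(doc = d, pairs, distance = abs(goodPos - badPos))
  })
  do.call(rbind, rows)
}

docs <- c("No significant drug interactions have been reported in studies of candesartan cilexetil given with other drugs such as glyburide, nifedipine, digoxin, warfarin, hydrochlorothiazide.")
goodWords <- c("candesartan cilexetil", "glyburide", "nifedipine", "digoxin", "blabla", "warfarin", "hydrochlorothiazide")
badWords <- c("no significant", "other drugs")

res <- word_proximity(docs, goodWords, badWords)
print(res[which.min(res$distance), ])
```

Passing more documents in `docs` adds more rows with increasing `doc` numbers, which gives the document / positive word / negative word / proximity table described in the question.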

Source

2010-06-21 15:15:53 nico

I got almost exactly what I was looking for. Thanks, nico!

Have you looked at either the Natural Language Processing task view on CRAN or the text mining package tm on CRAN?

Source

2010-06-21 15:18:08

Nice packages, I didn't know about them! Still, I don't think R is the best tool for this kind of analysis. – nico

Yes, I use the tm package often!
