Splitting ngrams in a document-feature matrix (quanteda)

I was wondering whether it is possible to split up ngram features in a document-feature matrix (dfm) so that, for example, a bigram results in two separate unigrams?

head(dfm, n = 3, nfeature = 4) 

docs       in_the great plenary emission_reduction
  10752099      3     1       1                  3
  10165509      8     0       0                  3
  10479890      4     0       0                  1

So the dfm above would end up looking something like this:

head(dfm, n = 3, nfeature = 4) 

docs       in great plenary emission the reduction
  10752099  3     1       1        3   3         3
  10165509  8     0       0        3   8         3
  10479890  4     0       0        1   4         1

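For better understanding: I got the ngrams in the dfm from translating the features from German to English. Compounds ("Emissionsminderung") are quite common in German but not in English ("emission reduction").

Thank you in advance.

Edit: The following can be used as a reproducible example.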

library(quanteda) 

eg.txt <- c('increase in_the great plenary', 
      'great plenary emission_reduction', 
      'increase in_the emission_reduction emission_increase') 
eg.corp <- corpus(eg.txt) 
eg.dfm <- dfm(eg.corp) 

head(eg.dfm)

Source

2017-05-24 uyanik

If you have two bigrams containing the same word (e.g. emission_reduction and emission_increase), should the column counts be summed for the common word ("emission" in the example)? Disclaimer: I'm not an expert here, so maybe I'm saying something that makes no sense... – digEmAll

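Yes. If a document contains the bigram "emission_reduction" twice and "emission_increase" once, the result should be a total of 3 "emission", 2 "reduction" and 1 "increase". In the example, "increase" is also included as a unigram feature, so there the total for "increase" should be 2. – uyanik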
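As a rough illustration of the counting rule described in the comment above, here is a minimal base-R sketch for a single document. The variable names (`counts`, `parts`, `expanded`, `totals`) are made up for this example and are not part of quanteda.

```r
# Counts of two bigram features in one hypothetical document
counts <- c(emission_reduction = 2, emission_increase = 1)

# Split every feature name into its components
parts <- strsplit(names(counts), "_", fixed = TRUE)

# Repeat each count once per component, then label it with that component
expanded <- rep(counts, lengths(parts))
names(expanded) <- unlist(parts)

# Add up the counts that now share a name
totals <- vapply(split(unname(expanded), names(expanded)), sum, numeric(1))
print(totals)
```

The same idea, applied to every document at once, is what any solution operating on the whole dfm has to do.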

Unfortunately I don't know the dfm format, or whether it behaves like a data.frame... (could you, for example, post the output of dput(head(dfm))?) – digEmAll

I don't know if this is the best way (it converts the sparse dfm into a data.frame/matrix, so it can use a lot of RAM), but it should work:

# turn the dfm into a data.frame, then into a transposed matrix
# (one row per feature, one column per document)
DF <- as.data.frame(eg.dfm)
MX <- t(DF)
# split the current column names by '_'
colsSplit <- strsplit(colnames(DF), '_')
# replicate the rows of the matrix and give them the new split row names
MX <- MX[rep(seq_along(colsSplit), lengths(colsSplit)), ]
rownames(MX) <- unlist(colsSplit)
# aggregate the matrix rows having the same name and transpose again
MX2 <- t(do.call(rbind, by(MX, rownames(MX), colSums)))
# turn the matrix into a dfm
eg.dfm.res <- as.dfm(MX2)

Result:

> eg.dfm.res 
Document-feature matrix of: 3 documents, 7 features (33.3% sparse).
3 x 7 sparse Matrix of class "dfmSparse"
       features
docs    emission great in increase plenary reduction the
  text1        0     1  1        1       1         0   1
  text2        1     1  0        0       1         1   0
  text3        2     0  1        2       0         1   1
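Regarding the RAM caveat mentioned in this answer, one possible way to avoid the dense conversion is to express the split as a multiplication by a sparse mapping matrix. The sketch below uses only the Matrix package, with a plain sparse matrix standing in for the dfm. `split_features` is a hypothetical helper, not a quanteda function, and whether a dfm can be passed to it directly (and the result wrapped in as.dfm()) is an assumption that has not been checked against a particular quanteda version.

```r
library(Matrix)

# Hypothetical helper (not part of quanteda): split column names on `sep`
# and add up the counts of columns that share a component.
split_features <- function(x, sep = "_") {
  parts <- strsplit(colnames(x), sep, fixed = TRUE)
  new_names <- sort(unique(unlist(parts)))
  # Mapping matrix: one row per old feature, one column per new feature.
  # Repeated (i, j) pairs are summed, so "a_a" would map to "a" twice.
  map <- sparseMatrix(
    i = rep(seq_along(parts), lengths(parts)),
    j = match(unlist(parts), new_names),
    x = 1,
    dims = c(length(parts), length(new_names)),
    dimnames = list(colnames(x), new_names)
  )
  x %*% map
}

# Stand-in for eg.dfm, built by hand from the three example texts
m <- rbind(text1 = c(1, 1, 1, 1, 0, 0),
           text2 = c(0, 0, 1, 1, 1, 0),
           text3 = c(1, 1, 0, 0, 1, 1))
colnames(m) <- c("increase", "in_the", "great", "plenary",
                 "emission_reduction", "emission_increase")
eg <- Matrix(m, sparse = TRUE)

res <- split_features(eg)
print(res)
```

Because both factors are sparse, the product never materialises a dense documents-by-features table, which is the part that gets expensive for large corpora.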

Source

2017-05-24 13:45:13 digEmAll

It seems to work perfectly fine if you add 'DF <- as.data.frame(eg.dfm)' at the beginning? – uyanik

Fixed. Sorry for the inconvenience caused by the typo. – digEmAll

That's a nice approach that also works with data frames. – uyanik
