phrases을 자동 완성해야합니다. 예를 들어, "알츠의 치매"을 검색하면 의 알츠하이머 치매가 발생합니다..문구가 일치하는 가장자리 NGram

이 경우 Edge NGram tokenizer으로 설정했습니다. 나는 쿼리 본문의 분석기로 edge_ngram_analyzer과 standard을 시도했다. 그럼에도 불구하고 구문을 일치 시키려고 할 때 결과를 얻을 수 없습니다.

내가 뭘 잘못하고 있니?

내 쿼리 :

{ 
    "query":{ 
    "multi_match":{ 
     "query":"dementia in alz", 
     "type":"phrase", 
     "analyzer":"edge_ngram_analyzer", 
     "fields":["_all"] 
    } 
    } 
}

내 매핑 :

... 
"type" : { 
    "_all" : { 
    "analyzer" : "edge_ngram_analyzer", 
    "search_analyzer" : "standard" 
    }, 
    "properties" : { 
    "field" : { 
     "type" : "string", 
     "analyzer" : "edge_ngram_analyzer", 
     "search_analyzer" : "standard" 
    }, 
... 
"settings" : { 
    ... 
    "analysis" : { 
    "filter" : { 
     "stem_possessive_filter" : { 
     "name" : "possessive_english", 
     "type" : "stemmer" 
     } 
    }, 
    "analyzer" : { 
     "edge_ngram_analyzer" : { 
     "filter" : [ "lowercase" ], 
     "tokenizer" : "edge_ngram_tokenizer" 
     } 
    }, 
    "tokenizer" : { 
     "edge_ngram_tokenizer" : { 
     "token_chars" : [ "letter", "digit", "whitespace" ], 
     "min_gram" : "2", 
     "type" : "edgeNGram", 
     "max_gram" : "25" 
     } 
    } 
    } 
    ...

내 문서 다음 "알츠하이머 치매"의

{ "_score": 1.1152233, "_type": "Diagnosis", "_id": "AVZLfHfBE5CzEm8aJ3Xp", "_source": { "@timestamp": "2016-08-02T13:40:48.665Z", "type": "Diagnosis", "Document_ID": "Diagnosis_1400541", "Diagnosis": "F00.0 - Dementia in Alzheimer's disease with early onset", "@version": "1", }, "_index": "carenotes" }, { "_score": 1.1152233, "_type": "Diagnosis", "_id": "AVZLfICrE5CzEm8aJ4Dc", "_source": { "@timestamp": "2016-08-02T13:40:51.240Z", "type": "Diagnosis", "Document_ID": "Diagnosis_1424351", "Diagnosis": "F00.1 - Dementia in Alzheimer's disease with late onset", "@version": "1", }, "_index": "carenotes" }

분석 문구 :

,536,913,632 10

{ 
    "tokens": [ 
    { 
     "end_offset": 2, 
     "token": "de", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 3, 
     "token": "dem", 
     "type": "word", 
     "start_offset": 0, 
     "position": 1 
    }, 
    { 
     "end_offset": 4, 
     "token": "deme", 
     "type": "word", 
     "start_offset": 0, 
     "position": 2 
    }, 
    { 
     "end_offset": 5, 
     "token": "demen", 
     "type": "word", 
     "start_offset": 0, 
     "position": 3 
    }, 
    { 
     "end_offset": 6, 
     "token": "dement", 
     "type": "word", 
     "start_offset": 0, 
     "position": 4 
    }, 
    { 
     "end_offset": 7, 
     "token": "dementi", 
     "type": "word", 
     "start_offset": 0, 
     "position": 5 
    }, 
    { 
     "end_offset": 8, 
     "token": "dementia", 
     "type": "word", 
     "start_offset": 0, 
     "position": 6 
    }, 
    { 
     "end_offset": 9, 
     "token": "dementia ", 
     "type": "word", 
     "start_offset": 0, 
     "position": 7 
    }, 
    { 
     "end_offset": 10, 
     "token": "dementia i", 
     "type": "word", 
     "start_offset": 0, 
     "position": 8 
    }, 
    { 
     "end_offset": 11, 
     "token": "dementia in", 
     "type": "word", 
     "start_offset": 0, 
     "position": 9 
    }, 
    { 
     "end_offset": 12, 
     "token": "dementia in ", 
     "type": "word", 
     "start_offset": 0, 
     "position": 10 
    }, 
    { 
     "end_offset": 13, 
     "token": "dementia in a", 
     "type": "word", 
     "start_offset": 0, 
     "position": 11 
    }, 
    { 
     "end_offset": 14, 
     "token": "dementia in al", 
     "type": "word", 
     "start_offset": 0, 
     "position": 12 
    }, 
    { 
     "end_offset": 15, 
     "token": "dementia in alz", 
     "type": "word", 
     "start_offset": 0, 
     "position": 13 
    }, 
    { 
     "end_offset": 16, 
     "token": "dementia in alzh", 
     "type": "word", 
     "start_offset": 0, 
     "position": 14 
    }, 
    { 
     "end_offset": 17, 
     "token": "dementia in alzhe", 
     "type": "word", 
     "start_offset": 0, 
     "position": 15 
    }, 
    { 
     "end_offset": 18, 
     "token": "dementia in alzhei", 
     "type": "word", 
     "start_offset": 0, 
     "position": 16 
    }, 
    { 
     "end_offset": 19, 
     "token": "dementia in alzheim", 
     "type": "word", 
     "start_offset": 0, 
     "position": 17 
    }, 
    { 
     "end_offset": 20, 
     "token": "dementia in alzheime", 
     "type": "word", 
     "start_offset": 0, 
     "position": 18 
    }, 
    { 
     "end_offset": 21, 
     "token": "dementia in alzheimer", 
     "type": "word", 
     "start_offset": 0, 
     "position": 19 
    } 
    ] 
}

출처

2016-08-09 trex

multi_match 대신 query_string을 사용하려고 했습니까? 문제가 해결되면 알려주세요. –

'query_string'은 기본적으로'_all' 필드를 검색합니다. 그래서,'multi_match'와''fields ": ["_all "]'로 여기에서했던 것과 같습니다. 그럼에도 불구하고, 나는 그것을 시도했다. 성공하지 못했다. 나는 다음의 쿼리'{ 'query': { 'query_string': { 'query': 'alzh', 'phrase_slop': 0}}}의' – trex

많은 감사 사람들을 제거하기 위해 당신은 trim token_filter 필요 !

Andrei Stefan의 솔루션이 최적이 아닙니다.

왜? 첫째, 검색 분석기에 소문자 필터가 없으면 검색이 불편 해집니다. 사건은 엄격하게 일치해야합니다. "analyzer": "keyword" 대신 lowercase 필터가있는 사용자 정의 분석기가 필요합니다.

둘째, 분석 부분이 잘못되었습니다.! 색인 시간 동안 "F00.0 - 조기 개시 알츠하이머 병 치매"은 edge_ngram_analyzer으로 분석됩니다. 이 분석기, 우리는 분석 된 문자열로 사전의 다음과 같은 배열을 가지고

{ 
    "tokens": [ 
    { 
     "end_offset": 2, 
     "token": "f0", 
     "type": "word", 
     "start_offset": 0, 
     "position": 0 
    }, 
    { 
     "end_offset": 3, 
     "token": "f00", 
     "type": "word", 
     "start_offset": 0, 
     "position": 1 
    }, 
    { 
     "end_offset": 6, 
     "token": "0 ", 
     "type": "word", 
     "start_offset": 4, 
     "position": 2 
    }, 
    { 
     "end_offset": 9, 
     "token": " ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 3 
    }, 
    { 
     "end_offset": 10, 
     "token": " d", 
     "type": "word", 
     "start_offset": 7, 
     "position": 4 
    }, 
    { 
     "end_offset": 11, 
     "token": " de", 
     "type": "word", 
     "start_offset": 7, 
     "position": 5 
    }, 
    { 
     "end_offset": 12, 
     "token": " dem", 
     "type": "word", 
     "start_offset": 7, 
     "position": 6 
    }, 
    { 
     "end_offset": 13, 
     "token": " deme", 
     "type": "word", 
     "start_offset": 7, 
     "position": 7 
    }, 
    { 
     "end_offset": 14, 
     "token": " demen", 
     "type": "word", 
     "start_offset": 7, 
     "position": 8 
    }, 
    { 
     "end_offset": 15, 
     "token": " dement", 
     "type": "word", 
     "start_offset": 7, 
     "position": 9 
    }, 
    { 
     "end_offset": 16, 
     "token": " dementi", 
     "type": "word", 
     "start_offset": 7, 
     "position": 10 
    }, 
    { 
     "end_offset": 17, 
     "token": " dementia", 
     "type": "word", 
     "start_offset": 7, 
     "position": 11 
    }, 
    { 
     "end_offset": 18, 
     "token": " dementia ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 12 
    }, 
    { 
     "end_offset": 19, 
     "token": " dementia i", 
     "type": "word", 
     "start_offset": 7, 
     "position": 13 
    }, 
    { 
     "end_offset": 20, 
     "token": " dementia in", 
     "type": "word", 
     "start_offset": 7, 
     "position": 14 
    }, 
    { 
     "end_offset": 21, 
     "token": " dementia in ", 
     "type": "word", 
     "start_offset": 7, 
     "position": 15 
    }, 
    { 
     "end_offset": 22, 
     "token": " dementia in a", 
     "type": "word", 
     "start_offset": 7, 
     "position": 16 
    }, 
    { 
     "end_offset": 23, 
     "token": " dementia in al", 
     "type": "word", 
     "start_offset": 7, 
     "position": 17 
    }, 
    { 
     "end_offset": 24, 
     "token": " dementia in alz", 
     "type": "word", 
     "start_offset": 7, 
     "position": 18 
    }, 
    { 
     "end_offset": 25, 
     "token": " dementia in alzh", 
     "type": "word", 
     "start_offset": 7, 
     "position": 19 
    }, 
    { 
     "end_offset": 26, 
     "token": " dementia in alzhe", 
     "type": "word", 
     "start_offset": 7, 
     "position": 20 
    }, 
    { 
     "end_offset": 27, 
     "token": " dementia in alzhei", 
     "type": "word", 
     "start_offset": 7, 
     "position": 21 
    }, 
    { 
     "end_offset": 28, 
     "token": " dementia in alzheim", 
     "type": "word", 
     "start_offset": 7, 
     "position": 22 
    }, 
    { 
     "end_offset": 29, 
     "token": " dementia in alzheime", 
     "type": "word", 
     "start_offset": 7, 
     "position": 23 
    }, 
    { 
     "end_offset": 30, 
     "token": " dementia in alzheimer", 
     "type": "word", 
     "start_offset": 7, 
     "position": 24 
    }, 
    { 
     "end_offset": 33, 
     "token": "s ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 25 
    }, 
    { 
     "end_offset": 34, 
     "token": "s d", 
     "type": "word", 
     "start_offset": 31, 
     "position": 26 
    }, 
    { 
     "end_offset": 35, 
     "token": "s di", 
     "type": "word", 
     "start_offset": 31, 
     "position": 27 
    }, 
    { 
     "end_offset": 36, 
     "token": "s dis", 
     "type": "word", 
     "start_offset": 31, 
     "position": 28 
    }, 
    { 
     "end_offset": 37, 
     "token": "s dise", 
     "type": "word", 
     "start_offset": 31, 
     "position": 29 
    }, 
    { 
     "end_offset": 38, 
     "token": "s disea", 
     "type": "word", 
     "start_offset": 31, 
     "position": 30 
    }, 
    { 
     "end_offset": 39, 
     "token": "s diseas", 
     "type": "word", 
     "start_offset": 31, 
     "position": 31 
    }, 
    { 
     "end_offset": 40, 
     "token": "s disease", 
     "type": "word", 
     "start_offset": 31, 
     "position": 32 
    }, 
    { 
     "end_offset": 41, 
     "token": "s disease ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 33 
    }, 
    { 
     "end_offset": 42, 
     "token": "s disease w", 
     "type": "word", 
     "start_offset": 31, 
     "position": 34 
    }, 
    { 
     "end_offset": 43, 
     "token": "s disease wi", 
     "type": "word", 
     "start_offset": 31, 
     "position": 35 
    }, 
    { 
     "end_offset": 44, 
     "token": "s disease wit", 
     "type": "word", 
     "start_offset": 31, 
     "position": 36 
    }, 
    { 
     "end_offset": 45, 
     "token": "s disease with", 
     "type": "word", 
     "start_offset": 31, 
     "position": 37 
    }, 
    { 
     "end_offset": 46, 
     "token": "s disease with ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 38 
    }, 
    { 
     "end_offset": 47, 
     "token": "s disease with e", 
     "type": "word", 
     "start_offset": 31, 
     "position": 39 
    }, 
    { 
     "end_offset": 48, 
     "token": "s disease with ea", 
     "type": "word", 
     "start_offset": 31, 
     "position": 40 
    }, 
    { 
     "end_offset": 49, 
     "token": "s disease with ear", 
     "type": "word", 
     "start_offset": 31, 
     "position": 41 
    }, 
    { 
     "end_offset": 50, 
     "token": "s disease with earl", 
     "type": "word", 
     "start_offset": 31, 
     "position": 42 
    }, 
    { 
     "end_offset": 51, 
     "token": "s disease with early", 
     "type": "word", 
     "start_offset": 31, 
     "position": 43 
    }, 
    { 
     "end_offset": 52, 
     "token": "s disease with early ", 
     "type": "word", 
     "start_offset": 31, 
     "position": 44 
    }, 
    { 
     "end_offset": 53, 
     "token": "s disease with early o", 
     "type": "word", 
     "start_offset": 31, 
     "position": 45 
    }, 
    { 
     "end_offset": 54, 
     "token": "s disease with early on", 
     "type": "word", 
     "start_offset": 31, 
     "position": 46 
    }, 
    { 
     "end_offset": 55, 
     "token": "s disease with early ons", 
     "type": "word", 
     "start_offset": 31, 
     "position": 47 
    }, 
    { 
     "end_offset": 56, 
     "token": "s disease with early onse", 
     "type": "word", 
     "start_offset": 31, 
     "position": 48 
    } 
    ] 
}

당신이 볼 수 있듯이, 2 ~ 25 자에서 토큰 크기 토큰 화 된 전체 문자열. 문자열은 모든 새로운 토큰마다 모든 공백 및 위치와 함께 선형 방식으로 토큰 화됩니다.

몇 가지 문제가 함께 있습니다

가 edge_ngram_analyzer 예를 들어, 검색되지 않습니다 unuseful 토큰 제작 : "", "", "D를", " 이 W ","병을 SD "등

또한, 그것은,251,761를 많이 생산하지 않았다 예 : "질병", "조기 발병"등과 같은 유용한 토큰 이러한 단어를 검색하려고하면 0 개의 결과가 표시됩니다.

마지막 토큰은 "s 초기 질병이있는 질병"입니다. 최종 "t"은 어디에 있습니까? "max_gram" : "25" 때문에 우리는 "lost"모든 필드에 일부 텍스트가 있습니다. 토큰이 없어이 텍스트를 더 이상 검색 할 수 없습니다.

trim 필터는 토크 나이저로 수행 할 수있을 때 여분의 공간을 필터링하는 문제 만 난처합니다.

edge_ngram_analyzer은 구문 쿼리와 같은 위치 쿼리에 문제가되는 각 토큰의 위치를 증가시킵니다. edge_ngram_filter 대신에 은 ngram을 생성 할 때 토큰의 위치을 보존해야합니다.

최적의 솔루션입니다.

매핑 설정 사용하기 :
텍스트가 standard 토크 나이 토큰 화 인덱스 시간에
{ "tokens": [ { "end_offset": 5, "token": "f0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 5, "token": "f00.0", "type": "word", "start_offset": 0, "position": 0 }, { "end_offset": 17, "token": "de", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dem", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "deme", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "demen", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dement", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementi", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 17, "token": "dementia", "type": "word", "start_offset": 9, "position": 2 }, { "end_offset": 20, "token": "in", "type": "word", "start_offset": 18, "position": 3 }, { "end_offset": 32, "token": "al", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alz", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzh", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhe", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzhei", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheim", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheime", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 32, "token": "alzheimer", "type": "word", "start_offset": 21, "position": 4 }, { "end_offset": 40, "token": "di", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dis", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "dise", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disea", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "diseas", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 40, "token": "disease", "type": "word", "start_offset": 33, "position": 5 }, { "end_offset": 45, "token": "wi", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "wit", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 45, "token": "with", "type": "word", "start_offset": 41, "position": 6 }, { "end_offset": 51, "token": "ea", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "ear", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "earl", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 51, "token": "early", "type": "word", "start_offset": 46, "position": 7 }, { "end_offset": 57, "token": "on", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "ons", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onse", "type": "word", "start_offset": 52, "position": 8 }, { "end_offset": 57, "token": "onset", "type": "word", "start_offset": 52, "position": 8 } ] }

후 별도의 단어, possessive_englishlowercase에 의해 필터링 : 분석에서

... "mappings": { "Type": { "_all":{ "analyzer": "edge_ngram_analyzer", "search_analyzer": "keyword_analyzer" }, "properties": { "Field": { "search_analyzer": "keyword_analyzer", "type": "string", "analyzer": "edge_ngram_analyzer" }, ... ... "settings": { "analysis": { "filter": { "english_poss_stemmer": { "type": "stemmer", "name": "possessive_english" }, "edge_ngram": { "type": "edgeNGram", "min_gram": "2", "max_gram": "25", "token_chars": ["letter", "digit"] } }, "analyzer": { "edge_ngram_analyzer": { "filter": ["lowercase", "english_poss_stemmer", "edge_ngram"], "tokenizer": "standard" }, "keyword_analyzer": { "filter": ["lowercase", "english_poss_stemmer"], "tokenizer": "standard" } } } } ...

봐 필터는 edge_ngram입니다. 토큰은 단어로만 생성됩니다. 텍스트가 standard 토크 나이저로 토큰 화 된 후 별도의 단어는 lowercase 및 possessive_english으로 필터링됩니다. 검색된 단어는 색인 시간 동안 작성된 토큰과 대조됩니다.

따라서 우리는 증분 검색을 가능하게합니다! 이제

, 우리는 별도의 단어 NGRAM 않기 때문에, 우리는 심지어

{ 'query': { 'multi_match': { 'query': 'dem in alzh', 'type': 'phrase', 'fields': ['_all'] } } }

같은 쿼리를 실행하고 올바른 결과를 얻을 수 있습니다.

텍스트가 손실되지 않고 모든 것을 검색 할 수 있으며 더 이상 필터를 사용하지 않아도됩니다. trim 필터를 사용하십시오.

출처

2016-08-11 10:34:13 trex

그런 정교한 해결책을 얻는 데 시간이 없었 겠지만, 결과를보고 해 주셔서 감사합니다. 적어도 초기 문제를 발견하는 데 도움을주었습니다. 건배! –

많은 감사의 @trex, 나는이 요구 사항을 그냥이 방법을 설정했다. –

매핑이 중괄호에 구문 오류가있어 해결책이 작동하지 않습니다 ??? – tina

나는 쿼리가 잘못되었다고 생각합니다. 인덱싱 할 때 nGram이 필요하지만 검색 할 때 필요하지 않습니다. 검색 할 때 가능한 한 "고정 된"텍스트가 필요합니다. 대신이 쿼리 :

{ 
    "query": { 
    "multi_match": { 
     "query": " dementia in alz", 
     "analyzer": "keyword", 
     "fields": [ 
     "_all" 
     ] 
    } 
    } 
}

당신은 dementia 전에 두 개의 공백을 확인할 수 있습니다. 그것들은 분석가가 텍스트에서 설명합니다.

"edge_ngram_analyzer": { 
     "filter": [ 
     "lowercase","trim" 
     ], 
     "tokenizer": "edge_ngram_tokenizer" 
    }

그리고이 쿼리 (NO 공백 dementia 전에) 작동 : 올바른 해결책을 찾기 위해 나에게 도움이 rendel에

{ 
    "query": { 
    "multi_match": { 
     "query": "dementia in alz", 
     "analyzer": "keyword", 
     "fields": [ 
     "_all" 
     ] 
    } 
    } 
}

출처

2016-08-09 14:16:11

치매를 사용했습니다. – trex

또한 기존의 매핑 인 "keyword_analyzer"에 다음 분석기를 추가하여 소문자 필터를 사용하여 테스트를 반복했습니다. { "filter": [ "lowercase"], "tokenizer": "keyword"}'. 검색어는'{ 'query': { 'multi_match': { '검색어': 'alz'의 치매, 'analyzer': 'keyword_analyzer', 'fields': [ '_all']}}}'입니다. 성공하지 못했습니다 :-( – trex

테스트 할 전체 문서와 인덱스의 ** 완전 ** 매핑을 참조해야합니다. 참고로 사용하십시오. –

문구가 일치하는 가장자리 NGram

답변

Andrei Stefan의 솔루션이 최적이 아닙니다.

최적의 솔루션입니다.

관련 문제