Batching/splitting a PostgreSQL database

I am working on a project that batch-processes data and fills a PostgreSQL database (9.6, but I could upgrade). The processing currently happens in separate steps, and each step adds data to a table that it owns (two steps rarely write to the same table, and when they do, they write to different columns).

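The data tends to become more and more fine-grained with each step. As a simplified example, there is one table that defines the data sources. There are very few of them (tens to low hundreds), but each data source generates batches of data samples (batches and samples are separate tables that store metadata). Each batch typically produces about 50k samples. Each of these data points is then processed step by step, and every data sample generates more data points in the next table.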

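This worked fine until we reached 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now filtering for a batch is getting slow (about 10 ms for each sample we retrieve), and it is becoming a major bottleneck: getting the data for one batch takes 5-10 minutes, while the fetch itself takes milliseconds.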

We have b-tree indexes on all foreign keys involved in these queries.

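Since our computations target individual batches, I normally do not need to query across batches during the computations (which is where the query time hurts most at the moment). For data analysis, however, ad-hoc queries across batches need to remain possible.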

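A very simple solution would be to create a separate database for each batch and query across these databases when needed. With only one batch per database, filtering for a single batch would obviously be instantaneous and my problem would be solved (for now). However, I would end up with thousands of databases, and the data analysis would become painful.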

Within PostgreSQL, is there a way of pretending that I have separate databases for some queries? Ideally I would like to set this up for each batch when I "register" a new batch.

Outside the world of PostgreSQL, is there another database I should try for my use case?

Edit: DDL/schema

In the current implementation, sample_representation is the table that all processing results depend on. A batch is really defined by a tuple of (batch.id, representation.id). The slow query described above is:

SELECT sample_representation.id, sample.sample_pos 
FROM sample_representation 
JOIN sample ON sample.id = sample_representation.id_sample 
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

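We currently have roughly 1.5 million samples, 2 representations, and 460 batches (49 of which have been processed; the others have no samples associated with them), which means each processed batch has about 30k samples on average. Some have around 50k. At 10 ms per sample, 50k samples add up to about 5 minutes.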

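The schema is below. There is some metadata associated with every table, but I am not querying it in this case. The actual sample data is stored separately on disk rather than in the database, in case that makes a difference.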

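-- uuid_generate_v1mc() is provided by the uuid-ossp extension
create extension if not exists "uuid-ossp";
create table batch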
(
    id uuid default uuid_generate_v1mc() not null 
     constraint batch_pk 
      primary key, 
    path text not null 
     constraint unique_batch_path 
      unique, 
    id_data_source uuid 
) 
; 
create table sample 
(
    id uuid default uuid_generate_v1mc() not null 
     constraint sample_pk 
      primary key, 
    sample_pos integer, 
    id_batch uuid 
     constraint batch_fk 
      references batch 
       on update cascade on delete set null 
) 
; 
create index sample_sample_pos_index 
    on sample (sample_pos) 
; 
create index sample_id_batch_sample_pos_index 
    on sample (id_batch, sample_pos) 
; 
create table representation 
(
    id uuid default uuid_generate_v1mc() not null 
     constraint representation_pk 
      primary key, 
    id_data_source uuid 
) 
; 
create table data_source 
(
    id uuid default uuid_generate_v1mc() not null 
     constraint data_source_pk 
      primary key 
) 
; 
alter table batch 
    add constraint data_source_fk 
     foreign key (id_data_source) references data_source 
      on update cascade on delete set null 
; 
alter table representation 
    add constraint data_source_fk 
     foreign key (id_data_source) references data_source 
      on update cascade on delete set null 
; 
create table sample_representation 
(
    id uuid default uuid_generate_v1mc() not null 
     constraint sample_representation_pk 
      primary key, 
    id_sample uuid 
     constraint sample_fk 
      references sample 
       on update cascade on delete set null, 
    id_representation uuid 
     constraint representation_fk 
      references representation 
       on update cascade on delete set null 
) 
; 
create unique index sample_representation_id_sample_id_representation_uindex 
    on sample_representation (id_sample, id_representation) 
; 
create index sample_representation_id_sample_index 
    on sample_representation (id_sample) 
; 
create index sample_representation_id_representation_index 
    on sample_representation (id_representation) 
;

Source

2017-12-21 P.R.

"... 50k samples. Each of these data points then gets processed step by step ..." Does that mean: retrieve one point, then the next, and the next, up to 50k times? Why not retrieve them all at once? – joop

"Within PostgreSQL, is there a way of pretending that I have separate databases for some queries?" You can [shard](https://en.wikipedia.org/wiki/Shard_(database_architecture)) the data across different databases and access them transparently from one server using [foreign tables](https://www.postgresql.org/docs/current/static/sql-alterforeigntable.html). –

@joop Retrieving the 50k records (about 1 hour of data) in one go is what takes 5-10 minutes. (`select tblA.propery_a, tblB.propery_b from tblA JOIN tblB on tblB.id_tblA = tblA.id where tblB.batch_id = 'some-uuid' and tblA.some_fk = 'another-uuid'`) –
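
The foreign-table suggestion in the comment above could look roughly like the following sketch. It assumes one database per batch on the same server and uses the postgres_fdw extension that ships with PostgreSQL. The database name batch_0001, the server name batch_0001_srv, the foreign table name sample_batch_0001, and the credentials are all hypothetical.

```sql
-- Sketch only: expose a per-batch database through postgres_fdw.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- Hypothetical per-batch database "batch_0001" on the same host.
CREATE SERVER batch_0001_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', port '5432', dbname 'batch_0001');

-- Hypothetical credentials used to connect to the remote database.
CREATE USER MAPPING FOR CURRENT_USER
    SERVER batch_0001_srv
    OPTIONS (user 'analyst', password 'secret');

-- Local foreign table that mirrors the remote "sample" table.
CREATE FOREIGN TABLE sample_batch_0001 (
    id         uuid NOT NULL,
    sample_pos integer,
    id_batch   uuid
)
    SERVER batch_0001_srv
    OPTIONS (schema_name 'public', table_name 'sample');

-- The foreign table can now be queried like a local one.
SELECT id_batch, count(*) AS n_samples
FROM sample_batch_0001
GROUP BY id_batch;
```

With thousands of batches this implies thousands of servers and foreign tables to manage. After an upgrade to PostgreSQL 10 or later, declarative partitioning of sample by id_batch inside a single database may be a lighter-weight way to get a similar per-batch separation.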

After fiddling around, I found a solution. But I am still not sure why the original query really takes that much time:

SELECT sample_representation.id, sample.sample_pos 
FROM sample_representation 
JOIN sample ON sample.id = sample_representation.id_sample 
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'

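Everything is indexed, but the tables sample_representation and sample are relatively large at 1.5 million rows each. My guess is that the tables are joined first and only then filtered by the WHERE clause. But even if the join produces a large intermediate result, it should not take that long!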
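
One way to check that guess rather than speculate is to look at the plan PostgreSQL actually chooses. A minimal sketch follows; the two UUID strings are the placeholders from the query above and have to be replaced with real values before it will run.

```sql
-- Refresh planner statistics so row estimates reflect the current data.
ANALYZE sample;
ANALYZE sample_representation;

-- EXPLAIN ANALYZE executes the query and reports actual row counts and
-- timings for every plan node; BUFFERS adds buffer hit/read counts.
EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid'
  AND sample.id_batch = 'batch-uuid';
```

A large gap between the estimated and actual row counts on a plan node is usually the first thing to look for, since it tends to explain a poor choice of join strategy.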

Anyway, instead of joining two "huge" tables, I tried using CTEs. The idea was to filter early and join afterwards.

WITH sel_samplerepresentation AS (
    -- rows of the chosen representation only
    SELECT * 
    FROM sample_representation 
    WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a' 
), sel_samples AS (
    -- rows of the chosen batch only (id_batch in the posted schema)
    SELECT * 
    FROM sample 
    WHERE id_batch='75c04b9c-e4b9-11e7-a93f-132baa27ac91' 
) 
SELECT sel_samples.sample_pos, sel_samplerepresentation.id 
FROM sel_samplerepresentation 
-- join on the sample foreign key, as in the original query
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample

This query also takes forever. Here the reason is clear: sel_samples and sel_samplerepresentation each contain 50k records, and the join happens on un-indexed columns of the CTEs.

Since CTEs cannot have indexes, I reformulated them as materialized views, to which I can add indexes:

CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
    SELECT * 
    FROM sample_representation 
    WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a' 
); 

CREATE MATERIALIZED VIEW sel_samples AS (
    SELECT * 
    FROM sample 
    -- id_batch in the posted schema
    WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91' 
); 

CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample); 
CREATE INDEX sel_samples_id_index ON sel_samples (id); 

SELECT sel_samples.sample_pos, sel_samplerepresentation.id 
FROM sel_samplerepresentation 
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample; 

DROP MATERIALIZED VIEW sel_samplerepresentation; 
DROP MATERIALIZED VIEW sel_samples;

This is more of a hack than a solution, but running these queries takes 1 second! (down from 8 minutes)
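
The same filter, index, then join idea can also be written with temporary tables, which are private to the session and can be dropped automatically at commit, so concurrent jobs would not collide on object names. This is a sketch of an alternative, not what was measured above. The table names tmp_samplerepresentation and tmp_samples are hypothetical, the UUIDs are the ones from the example above, and the batch filter uses the id_batch column from the posted schema.

```sql
BEGIN;

-- Filter each table first; ON COMMIT DROP removes the tables at COMMIT.
CREATE TEMP TABLE tmp_samplerepresentation ON COMMIT DROP AS
    SELECT id, id_sample
    FROM sample_representation
    WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a';

CREATE TEMP TABLE tmp_samples ON COMMIT DROP AS
    SELECT id, sample_pos
    FROM sample
    WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91';

-- Index the join columns of the filtered subsets.
CREATE INDEX ON tmp_samplerepresentation (id_sample);
CREATE INDEX ON tmp_samples (id);

-- Autovacuum does not process temporary tables, so collect statistics explicitly.
ANALYZE tmp_samplerepresentation;
ANALYZE tmp_samples;

SELECT tmp_samples.sample_pos, tmp_samplerepresentation.id
FROM tmp_samplerepresentation
JOIN tmp_samples ON tmp_samples.id = tmp_samplerepresentation.id_sample;

COMMIT;
```

Selecting only the columns needed for the join and the result keeps the intermediate tables small; whether this is faster than the materialized-view version would have to be measured.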

Source

2017-12-29 11:51:51
