Multithreaded random_r is slower than the single-threaded version

The following program is essentially the same as the one described here. When I compile the program and run it with two threads (NTHREADS == 2), I get the following run times:

real  0m14.120s 
user  0m25.570s 
sys   0m0.050s

When I run it with just one thread (NTHREADS == 1), I get much better run times, even though it only uses one core.

real  0m4.705s 
user  0m4.660s 
sys   0m0.010s

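My system is dual core, I know random_r is thread safe, and I am pretty sure it is non-blocking. When the same program is run without random_r, with a calculation of cosines and sines as a replacement, the two-thread version runs in about half the time, as expected. Given that random_r is meant to be used in multithreaded applications, why does the two-thread version perform so much worse than the single-thread version when generating random numbers?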

#include <pthread.h> 
#include <stdlib.h> 
#include <stdio.h> 

#define NTHREADS 2 
#define PRNG_BUFSZ 8 
#define ITERATIONS 1000000000 

void* thread_run(void* arg) {
    int r1, i, totalIterations = ITERATIONS/NTHREADS;
    for (i = 0; i < totalIterations; i++){
        random_r((struct random_data*)arg, &r1);
    }
    printf("%i\n", r1);
    return NULL; /* a void* thread function must return a value */
}

int main(int argc, char** argv) { 
    struct random_data* rand_states = (struct random_data*)calloc(NTHREADS, sizeof(struct random_data)); 
    char* rand_statebufs = (char*)calloc(NTHREADS, PRNG_BUFSZ); 
    pthread_t* thread_ids; 
    int t = 0; 
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t)); 
    /* create threads */ 
    for (t = 0; t < NTHREADS; t++) {
        initstate_r(random(), &rand_statebufs[t], PRNG_BUFSZ, &rand_states[t]);
        pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t]);
    }
    for (t = 0; t < NTHREADS; t++) {
        pthread_join(thread_ids[t], NULL);
    }
    free(thread_ids); 
    free(rand_states); 
    free(rand_statebufs); 
}

I am confused.


2010-06-08 Nixuz

A very simple change to space the data out in memory:

struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data)); 
char* rand_statebufs = (char*)calloc(NTHREADS*64, PRNG_BUFSZ); 
pthread_t* thread_ids; 
int t = 0; 
thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t)); 
/* create threads */ 
for (t = 0; t < NTHREADS; t++) { 
    initstate_r(random(), &rand_statebufs[t*64], PRNG_BUFSZ, &rand_states[t*64]); 
    pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t*64]); 
}

results in a much faster running time on my dual-core machine.

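This is what I had suspected and then tested: you are mutating values on the same cache line from two separate threads, and so you get cache contention. Herb Sutter's talk 'machine architecture - what your programming language never told you' is worth watching if you have the time and don't know about this yet. He demonstrates false sharing starting at around 1:20.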
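To make the diagnosis concrete, here is a minimal sketch of how one might check whether two objects touch the same cache line. The helper names and the 64-byte line size are assumptions made for illustration; neither appears in the original program.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes is common on x86-64 but not universal. */
#define ASSUMED_LINE_SIZE 64

/* Index of the cache line that contains the given address. */
static uintptr_t line_of(uintptr_t addr) {
    return addr / ASSUMED_LINE_SIZE;
}

/* Returns 1 if the byte ranges [a, a+a_len) and [b, b+b_len) touch at least
   one common cache line, 0 otherwise. Both lengths must be non-zero. */
static int shares_cache_line(const void *a, size_t a_len,
                             const void *b, size_t b_len) {
    uintptr_t a_first = line_of((uintptr_t)a);
    uintptr_t a_last  = line_of((uintptr_t)a + a_len - 1);
    uintptr_t b_first = line_of((uintptr_t)b);
    uintptr_t b_last  = line_of((uintptr_t)b + b_len - 1);
    return a_first <= b_last && b_first <= a_last;
}
```

Applied to the question's layout, `shares_cache_line(&rand_states[0], sizeof rand_states[0], &rand_states[1], sizeof rand_states[1])` returns 1 unless the boundary between the two contiguous array elements happens to fall exactly on a line boundary.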

Work out your cache line size, and align each thread's data to it.
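One way to work out the line size at run time, sketched here under the assumption of a glibc system: `sysconf` accepts the glibc-specific name `_SC_LEVEL1_DCACHE_LINESIZE`. The function name and the 64-byte fallback are illustrative choices, not part of the original answer.

```c
#include <stddef.h>
#include <unistd.h>

/* Fallback when the system does not report a line size (an assumption). */
#define FALLBACK_LINE_SIZE 64

/* Returns the L1 data cache line size in bytes, or the fallback if unknown. */
static size_t cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (n > 0)
        return (size_t)n;
#endif
    return FALLBACK_LINE_SIZE;
}
```

Because the value is only known at run time, it suits the alignment argument of `posix_memalign` better than a padding array whose size must be a compile-time constant.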

It's a bit cleaner to plonk all of a thread's data into a struct and align that:

#include <string.h> /* memset; other headers, defines and thread_run() as in the question */

#define CACHE_LINE_SIZE 64

struct thread_data { 
    struct random_data random_data; 
    char statebuf[PRNG_BUFSZ]; 
    char padding[CACHE_LINE_SIZE - sizeof (struct random_data)-PRNG_BUFSZ]; 
}; 

int main (int argc, char** argv) 
{ 
    printf ("%zu\n", sizeof (struct thread_data)); /* size_t is printed with %zu */

    void* apointer; 

    if (posix_memalign (&apointer, sizeof (struct thread_data), NTHREADS * sizeof (struct thread_data)))
        exit (1);

    struct thread_data* thread_states = apointer; 

    memset (apointer, 0, NTHREADS * sizeof (struct thread_data)); 

    pthread_t* thread_ids; 

    int t = 0; 

    thread_ids = (pthread_t*) calloc (NTHREADS, sizeof (pthread_t)); 

    /* create threads */ 
    for (t = 0; t < NTHREADS; t++) {
        initstate_r (random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data);
        pthread_create (&thread_ids[t], NULL, &thread_run, &thread_states[t].random_data);
    }

    for (t = 0; t < NTHREADS; t++) {
        pthread_join (thread_ids[t], NULL);
    }

    free (thread_ids); 
    free (thread_states); 
}

With CACHE_LINE_SIZE set to 64:

refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread 
refugio:$ time bin/nixuz_random_r 
64 
63499495 
944240966 

real 0m1.278s 
user 0m2.540s 
sys 0m0.000s

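Or you can use double the cache line size and plain malloc: the extra padding ensures the mutated memory ends up on separate lines, since malloc returns memory that is 16-byte aligned (IIRC) rather than 64-byte aligned.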
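A minimal sketch of that variant, assuming a 64-byte line and glibc's `struct random_data`. The struct and function names are hypothetical. The mutated data sits at the front of each slot and occupies less than one line, so even with an arbitrary start address it spills into at most the second line of its own 128-byte slot.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* struct random_data is a glibc extension */
#endif
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64   /* assumed line size */
#define PRNG_BUFSZ 8         /* same value as in the question */

/* Hypothetical per-thread slot: mutated data first, then padding so the
   whole slot spans two cache lines. */
struct padded_thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[2 * CACHE_LINE_SIZE - sizeof(struct random_data) - PRNG_BUFSZ];
};

/* Allocates n zeroed slots with plain calloc; returns NULL on failure.
   No particular alignment is requested. */
static struct padded_thread_data *alloc_thread_slots(size_t n) {
    return calloc(n, sizeof(struct padded_thread_data));
}
```

The cost is roughly twice the memory per thread compared with the aligned version, in exchange for not depending on `posix_memalign`.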

I don't know whether this is relevant or not.


2010-06-08 19:30:44

Huh. Does this pretty much rule out small or densely packed structures that many threads try to write to different parts of? –

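Thanks for the help; I would never have figured this out on my own. P.S. I moved rand_states and rand_statebufs into the thread and initialised the random number generator from there. That also solves the cache issue very simply. – Nixuz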
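The thread-local variant described in this comment could look roughly like the sketch below. The argument struct and the function name are hypothetical; the comment itself does not show code. The generator state lives on each thread's own stack, so no two threads mutate neighbouring heap memory.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* random_r and initstate_r are glibc extensions */
#endif
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PRNG_BUFSZ 8

/* Hypothetical argument type: each thread receives only a seed and a count. */
struct thread_arg {
    unsigned int seed;
    int iterations;
    int32_t last;   /* last value generated, written once at the end */
};

void *thread_run_local(void *arg) {
    struct thread_arg *a = arg;
    struct random_data state;      /* lives on this thread's own stack */
    char statebuf[PRNG_BUFSZ];
    int32_t r = 0;

    /* glibc requires random_data to be zeroed before the first initstate_r. */
    memset(&state, 0, sizeof state);
    memset(statebuf, 0, sizeof statebuf);
    initstate_r(a->seed, statebuf, sizeof statebuf, &state);

    for (int i = 0; i < a->iterations; i++)
        random_r(&state, &r);

    a->last = r;
    return NULL;
}
```

Each thread would be started with `pthread_create(&thread_ids[t], NULL, thread_run_local, &args[t])`, where `args` is an array of `struct thread_arg` filled in by `main`. The `last` fields of neighbouring entries may still share a cache line, but each is written only once.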

@Nicholas: Yes. It pays not to be too stingy with memory. Packing each thread's local allocations together will help. Thread localisation can be a huge win, since you avoid so much cache contention and locking. –

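(I reduced the iterations by a factor of 10 rather than having a stupidly fast computer.) But I just saw very similar behaviour (an order of magnitude slower with 2 threads than with one). I basically changed:

srand(seed);
foo = rand();

to

myseed = seed;
foo = rand_r(&myseed);

and that "fixed" it (2 threads is now reliably almost twice as fast, e.g. 19s instead of 35s).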

There may be locking or a cache coherence issue in the internals of rand(). Anyway, there is also a random_r(), so maybe that would be of use to you (a year ago) or to someone else.
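A minimal sketch of the `rand_r` pattern described above, with a hypothetical function name. All generator state is kept in a caller-private seed variable, so threads calling it do not touch the hidden global state that `rand()` uses.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* exposes rand_r in strict ISO C modes */
#endif
#include <stdlib.h>

/* Sums `count` pseudo-random values. rand_r keeps its entire state in the
   unsigned int it is given, here a local copy of the seed. */
unsigned long sum_random(unsigned int seed, int count) {
    unsigned int myseed = seed;   /* local copy: state stays on this stack */
    unsigned long sum = 0;
    for (int i = 0; i < count; i++)
        sum += (unsigned long)rand_r(&myseed);
    return sum;
}
```

Each thread would call `sum_random` with its own seed. Note that `rand_r` carries only an `unsigned int` of state, so its statistical quality is limited compared with `random_r`.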


2012-04-15 01:28:51 Jerry
