No performance gain after using OpenMP on a program optimized for sequential execution

I have optimized my function for sequential execution as much as I could. When I use OpenMP, I see no performance gain. I tried the program on a machine with 1 core and on a machine with 8 cores, and the performance is the same.
With year set to 20, I get:
1 core: 1 second.
8 cores: 1 second.

With year set to 25, I get:
1 core: 40 seconds.
8 cores: 40 seconds.

1-core machine: my laptop, Intel Core 2 Duo 1.8 GHz, Ubuntu Linux
8-core machine: 3.25 GHz, Ubuntu Linux

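My program enumerates all possible paths of a binomial tree and does some work on each path. The loop size therefore grows exponentially, and I would expect the overhead of the OpenMP threads to be negligible. In my loop I only do a reduction on one variable; all other variables are read-only. I only use functions that I wrote myself, and I believe they are thread-safe.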
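As a rough illustration of the enumeration described above, here is a minimal sketch of how a path index can be decoded into a sequence of up and down moves. It mirrors the `(pathindex / chunksize[period]) % tradingdate` expression from the listing below for the two-branch case; the helper names `path_move` and `path_to_string` are hypothetical and not part of the posted program.

```c
/* Hypothetical helper: returns 0 (up move) or 1 (down move) for the step
   taken in `period` on path `pathindex`, in a tree with `years` periods.
   Assumes two branches per node, as in the posted configuration. */
static int path_move(long pathindex, int period, int years)
{
    long chunk = 1L << (years - period - 1);  /* 2^(years - period - 1) */
    return (int)((pathindex / chunk) % 2);
}

/* Writes the moves of one path into buf as 'U'/'D' characters.
   buf must have room for years + 1 bytes. */
static void path_to_string(long pathindex, int years, char *buf)
{
    for (int period = 0; period < years; period++)
        buf[period] = path_move(pathindex, period, years) ? 'D' : 'U';
    buf[years] = '\0';
}
```

Each of the 2^years indices maps to exactly one path, which is why iterations are independent and the loop is a natural candidate for `parallel for`.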

I also ran Valgrind cachegrind on my program. I don't fully understand the output, but there seem to be no cache misses or false sharing.

I compile with

gcc -O3 -g3 -Wall -c -fmessage-length=0 -lm -fopenmp -ffast-math

My complete program is below. Sorry for posting so much code. I am not familiar with OpenMP or C, and I could not reduce the code any further without losing the main task.

How can I improve performance when I use OpenMP?
Are there compiler flags or C tricks that would make the program run faster?

test.c

#include <stdio.h> 
#include <stdlib.h> 
#include <math.h> 
#include <omp.h> 
#include "test.h" 

int main(){ 

    printf("starting\n"); 
    int year=20; 
    int tradingdate0=1; 

    globalinit(year,tradingdate0); 

    int i; 
    float v=0; 
    long n=pow(tradingdate0+1,year); 
    #pragma omp parallel for reduction(+:v) 
    for(i=0;i<n;i++) 
     v+=pathvalue(i); 

    globaldel(); 
    printf("finished\n"); 
    return 0; 
} 

//***function on which openMP is applied 
float pathvalue(long pathindex) { 
    float value = -ctx.firstpremium; 
    float personalaccount = ctx.personalaccountat0; 
    float account = ctx.firstpremium; 
    int i; 
    for (i = 0; i < ctx.year-1; i++) { 
     value *= ctx.accumulationfactor; 
     double index = getindex(i,pathindex); 
     account = account * index; 
     double death = fmaxf(account,ctx.guarantee[i]); 
     value += qx(i) * death; 
     if (haswithdraw(i)){ 
      double withdraw = personalaccount*ctx.allowed; 
      value += px(i) * withdraw; 
      personalaccount = fmaxf(personalaccount-withdraw,0); 
      account = fmaxf(account-withdraw,0); 
     } 
    } 

    //last year 
    double index = getindex(ctx.year-1,pathindex); 
    account = account * index; 
    value+=fmaxf(account,ctx.guarantee[ctx.year-1]); 

    return value * ctx.discountfactor; 
} 



int haswithdraw(int period){ 
    return 1; 
} 

float getindex(int period, long pathindex){ 
    int ndx = (pathindex/ctx.chunksize[period])%ctx.tradingdate; 
    return ctx.stock[ndx]; 
} 

float qx(int period){ 
    return 0; 
} 

float px(int period){ 
    return 1; 
} 

//****global 
struct context ctx; 

void globalinit(int year, int tradingdate0){ 
    ctx.year = year; 
    ctx.tradingdate0 = tradingdate0; 
    ctx.firstpremium = 1; 
    ctx.riskfreerate = 0.06; 
    ctx.volatility=0.25; 
    ctx.personalaccountat0 = 1; 
    ctx.allowed = 0.07; 
    ctx.guaranteerate = 0.03; 
    ctx.alpha=1; 
    ctx.beta = 1; 
    ctx.tradingdate=tradingdate0+1; 
    ctx.discountfactor = exp(-ctx.riskfreerate * ctx.year); 
    ctx.accumulationfactor = exp(ctx.riskfreerate); 
    ctx.guaranteefactor = 1+ctx.guaranteerate; 
    ctx.upmove=exp(ctx.volatility/sqrt(ctx.tradingdate0)); 
    ctx.downmove=1/ctx.upmove; 

    ctx.stock=(float*)malloc(sizeof(float)*ctx.tradingdate); 
    int i; 
    for(i=0;i<ctx.tradingdate;i++) 
     ctx.stock[i]=pow(ctx.upmove,ctx.tradingdate0-i)*pow(ctx.downmove,i); 

    ctx.chunksize=(long*)malloc(sizeof(long)*ctx.year); 
    for(i=0;i<year;i++) 
     ctx.chunksize[i]=pow(ctx.tradingdate,ctx.year-i-1); 

    ctx.guarantee=(float*)malloc(sizeof(float)*ctx.year); 
    for(i=0;i<ctx.year;i++) 
     ctx.guarantee[i]=ctx.beta*pow(ctx.guaranteefactor,i+1); 
} 

void globaldel(){ 
    free(ctx.stock); 
    free(ctx.chunksize); 
    free(ctx.guarantee); 
}

test.h

float pathvalue(long pathindex); 
int haswithdraw(int period); 
float getindex(int period, long pathindex); 
float qx(int period); 
float px(int period); 
//***global 
struct context{ 
    int year; 
    int tradingdate0; 
    float firstpremium; 
    float riskfreerate; 
    float volatility; 
    float personalaccountat0; 
    float allowed; 
    float guaranteerate; 
    float alpha; 
    float beta; 
    int tradingdate; 
    float discountfactor; 
    float accumulationfactor; 
    float guaranteefactor; 
    float upmove; 
    float downmove; 
    float* stock; 
    long* chunksize; 
    float* guarantee; 
}; 
struct context ctx; 
void globalinit(); 
void globaldel();

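Edit: I replaced all the global variables with constants. For 20 years, the program now runs twice as fast. I also tried setting the number of threads, for example with OMP_NUM_THREADS=4 ./test, but it gave me no performance gain.
Could there be a problem with my gcc?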

test.c

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h> 
#include <math.h> 
#include <omp.h> 
#include "test.h" 


int main(){ 

    starttimer(); 
    printf("starting\n"); 
    int i; 
    float v=0; 

    #pragma omp parallel for reduction(+:v) 
    for(i=0;i<numberofpath;i++) 
     v+=pathvalue(i); 

    printf("v:%f\nfinished\n",v); 
    endtimer(); 
    return 0; 
} 

//function on which openMP is applied 
float pathvalue(long pathindex) { 
    float value = -firstpremium; 
    float personalaccount = personalaccountat0; 
    float account = firstpremium; 
    int i; 
    for (i = 0; i < year-1; i++) { 
     value *= accumulationfactor; 
     double index = getindex(i,pathindex); 
     account = account * index; 
     double death = fmaxf(account,guarantee[i]); 
     value += death; 
     double withdraw = personalaccount*allowed; 
     value += withdraw; 
     personalaccount = fmaxf(personalaccount-withdraw,0); 
     account = fmaxf(account-withdraw,0); 
    } 

    //last year 
    double index = getindex(year-1,pathindex); 
    account = account * index; 
    value+=fmaxf(account,guarantee[year-1]); 

    return value * discountfactor; 
} 



float getindex(int period, long pathindex){ 
    int ndx = (pathindex/chunksize[period])%tradingdate; 
    return stock[ndx]; 
} 

//timing 
clock_t begin; 

void starttimer(){ 
    begin = clock(); 
} 

void endtimer(){ 
    clock_t end = clock(); 
    double elapsed = (double)(end - begin)/CLOCKS_PER_SEC; 
    printf("\nelapsed: %f\n",elapsed); 
}

test.h

float pathvalue(long pathindex); 
int haswithdraw(int period); 
float getindex(int period, long pathindex); 
float qx(int period); 
float px(int period); 
//timing 
void starttimer(); 
void endtimer(); 
//***constant 
const int year= 20 ; 
const int tradingdate0= 1 ; 
const float firstpremium= 1 ; 
const float riskfreerate= 0.06 ; 
const float volatility= 0.25 ; 
const float personalaccountat0= 1 ; 
const float allowed= 0.07 ; 
const float guaranteerate= 0.03 ; 
const float alpha= 1 ; 
const float beta= 1 ; 
const int tradingdate= 2 ; 
const int numberofpath= 1048576 ; 
const float discountfactor= 0.301194211912 ; 
const float accumulationfactor= 1.06183654655 ; 
const float guaranteefactor= 1.03 ; 
const float upmove= 1.28402541669 ; 
const float downmove= 0.778800783071 ; 
const float stock[2]={1.2840254166877414, 0.7788007830714049}; 
const long chunksize[20]={524288, 262144, 131072, 65536, 32768, 16384, 8192, 4096, 2048, 1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1}; 
const float guarantee[20]={1.03, 1.0609, 1.092727, 1.1255088100000001, 1.1592740743, 1.1940522965290001, 1.2298738654248702, 1.2667700813876164, 1.304773183829245, 1.3439163793441222, 1.384233870724446, 1.4257608868461793, 1.4685337134515648, 1.512589724855112, 1.557967416600765, 1.6047064390987882, 1.6528476322717518, 1.7024330612399046, 1.7535060530771016, 1.8061112346694148};

Source

2012-05-23 Anonymous

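You have already seen an improvement in the sequential code, and that is where you should always start. A global structure holding the parameters basically removes every opportunity the compiler has to optimize. The rule is simple: make all constants real constants ("enum" for integers, #define for floating point) and pass all run-time parameters as arguments to your functions. The way you are doing it, the compiler cannot be sure that some other part of the program does not change particular values of the 'struct', so it cannot do constant propagation. Cleaning this up will also help the parallel compilation.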
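A minimal sketch of the rule described in this comment might look like the following. The names `YEARS`, `ALLOWED`, `struct params`, `accumulate`, and `withdrawal` are assumptions made for illustration; they are not taken from the posted program.

```c
/* Compile-time constants: an enum for integers, #define for floating point.
   Neither creates a linker symbol, so this can live in a header. */
enum { YEARS = 20 };
#define ALLOWED 0.07f

/* Hypothetical run-time parameters, passed explicitly to each function
   instead of being read from a global struct. */
struct params {
    float first_premium;
    float accumulation_factor;
};

/* Illustrative only: compounds the first premium over YEARS periods.
   The loop bound is a constant, so the compiler is free to unroll it. */
static float accumulate(const struct params *p)
{
    float value = p->first_premium;
    for (int i = 0; i < YEARS; i++)
        value *= p->accumulation_factor;
    return value;
}

/* Illustrative only: the amount withdrawn from a personal account. */
static float withdrawal(float personal_account)
{
    return personal_account * ALLOWED;
}
```

Because the functions depend only on their arguments and on compile-time constants, the compiler can propagate the constants, and the functions are safe to call from several threads at once.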

@JensGustedt Thanks for showing me the right way to handle global variables. It made my code 2 times faster (see the edit to my question). I still see no gain from parallelization, though.

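Nicolas, you did not follow the advice exactly. As soon as you write a program with several .o files, you will run into trouble with multiply defined symbols. We cannot tell whether your gcc has a problem, because you have not told us which version you use. To check whether OpenMP makes a difference, compile your program to assembly (using '-O3 -S') and compare the resulting code with and without '-fopenmp'.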

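Even if your program does benefit from OpenMP, you cannot see it because you are measuring the wrong time.

clock() returns the total CPU time consumed by all threads. If you run with four threads and each one runs for 1/4 of the time, clock() still returns the same value, since 4 * (1/4) = 1. You should measure wall-clock time instead.

Replace the calls to clock() with omp_get_wtime() or gettimeofday(). Both provide high-precision wall-clock timing.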
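The difference between the two measurements can be sketched as follows. This is an illustrative example, not the asker's code: `wall_seconds`, `cpu_seconds`, `busy_sum`, and `report_timing` are hypothetical names, and the workload is a placeholder reduction. The sketch uses omp_get_wtime() when compiled with -fopenmp and falls back to gettimeofday() otherwise.

```c
#include <stdio.h>
#include <time.h>
#include <sys/time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds. */
static double wall_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
#endif
}

/* CPU time in seconds, accumulated over all threads of the process. */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Placeholder workload: a parallel reduction over n iterations. */
static double busy_sum(long n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += (double)(i % 7);
    return sum;
}

/* Times the workload both ways and prints the two results side by side. */
static void report_timing(long n)
{
    double w0 = wall_seconds(), c0 = cpu_seconds();
    double s = busy_sum(n);
    double w1 = wall_seconds(), c1 = cpu_seconds();
    printf("sum=%.0f wall=%.3fs cpu=%.3fs\n", s, w1 - w0, c1 - c0);
}
```

Calling `report_timing` with a large `n` on a multi-core machine should show the wall-clock figure shrinking as threads are added, while the CPU figure stays roughly constant.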

P.S. Why do so many people around here use clock() for timing?

Source

2012-05-24 11:45:20

Very good insight. That was my problem. When I measure the time correctly, I see a 7x speedup between the 1-core and the 8-core machine. Thank you. In my case, I used 'clock()' because I am a newbie.

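I don't see any section where you specify the number of cores that OpenMP should use. By default it is supposed to use the number of CPUs it can see, but for my purposes I have always forced it to use as many as I specify.

Add this line before your parallel for construct: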

#pragma omp parallel num_threads(num_threads) 
{ 
    // Your parallel for follows here 
}

...where num_threads is an integer between 1 and the number of cores on your machine.
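For a complete loop, the same clause can also be attached to the combined `parallel for` construct. The sketch below assumes a hypothetical function `sum_range` and a placeholder reduction; it is not the asker's `pathvalue` loop.

```c
/* Sums 0..n-1 using at most `threads` OpenMP threads.
   Without -fopenmp the pragma is ignored and the loop runs serially. */
static long sum_range(long n, int threads)
{
    long sum = 0;
    #pragma omp parallel for num_threads(threads) reduction(+:sum)
    for (long i = 0; i < n; i++)
        sum += i;
    return sum;
}
```

The result does not depend on the thread count; only the running time does, so `sum_range(1000, 1)` and `sum_range(1000, 4)` should return the same value.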

Edit: here is the makefile I used to build the code. Put it in a text file named Makefile in the same directory.

test: test.c test.h 
	cc -o $@ $< -O3 -g3 -fmessage-length=0 -lm -fopenmp -ffast-math

Source

2012-05-23 21:14:51 Makoto

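Makoto, IMO that is not the reason Nicolas sees no speedup (unless his machine is single-core).

@AaterSuleman: When working with OpenMP you need to specify the number of threads. That can be done through a global variable or through this clause. – Makoto

Unless you say otherwise, it is set to the number of available cores. So on an 8-core system you get 8 threads (or 16 with HT) without specifying anything.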

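It looks like it should work. You probably need to specify the number of threads, which you can do by setting the OMP_NUM_THREADS variable. I just compiled your code and observed a significant speedup when changing the number of threads: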

OMP_NUM_THREADS=4 ./test

Edit: for example, the command above uses 4 threads.
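One way to check that OMP_NUM_THREADS actually takes effect is to print the thread count the runtime intends to use. This is a minimal sketch with hypothetical helper names (`default_thread_count`, `print_thread_count`); it reports 1 when compiled without -fopenmp.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Upper bound on the threads a parallel region would use when no
   num_threads clause is given. Reflects OMP_NUM_THREADS if it is set. */
static int default_thread_count(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;  /* compiled without OpenMP support */
#endif
}

/* Prints the value so that it can be compared across runs. */
static void print_thread_count(void)
{
    printf("OpenMP would use up to %d thread(s)\n", default_thread_count());
}
```

If `print_thread_count` is called at the start of `main`, running the program with and without the environment variable set shows whether the setting is being picked up.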

Source

2012-05-23 21:18:29 betabandido

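I tried your approach, but the performance is still the same on my 1-core and my 8-core machine. Could you post your gcc command?

@NicolasEssis-Breton I used exactly the same command line you posted. The only difference is that I increased year to 22 (the program was too fast to measure any speedup). For year = 22 there was a 2x speedup when going from 1 to 4 threads (my machine has 4 cores). It is not a linear speedup, but it is certainly significant. – betabandido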
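The speedup figures quoted in this thread can be put in terms of parallel efficiency with a small calculation. The helper names `speedup` and `efficiency` below are hypothetical, and the inputs must be wall-clock times measured on the same machine.

```c
/* Speedup of an n-thread run relative to the 1-thread run,
   given the two wall-clock times in seconds. */
static double speedup(double t_one_thread, double t_n_threads)
{
    return t_one_thread / t_n_threads;
}

/* Parallel efficiency: 1.0 means perfectly linear scaling. */
static double efficiency(double speedup_factor, int threads)
{
    return speedup_factor / threads;
}
```

For the 2x speedup on 4 threads reported above, `efficiency(2.0, 4)` gives 0.5, that is, half of the ideal linear scaling.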
