Why is FFTW on Windows faster than on Linux?

I wrote two identical programs on Linux and Windows using the fftw libraries (fftw3.a, fftw3.lib) and measured the duration of the fftwf_execute(m_wfpFFTplan) statement (16-fft).

For 10000 runs:

On Linux: average time is 0.9

On Windows: average time is 0.12

I am confused as to why this is nine times faster on Windows than on Linux.

Processor: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz

Each OS (Windows XP 32-bit and Linux openSUSE 11.4 32-bit) is installed on the same machine.

I downloaded fftw.lib (for Windows) from the internet and do not know how it was configured. On Linux, building FFTW with this configuration:

./configure --enable-float --enable-threads --with-combined-threads --disable-fortran --with-slow-timer --enable-sse --enable-sse2 --enable-avx

results in a lib that is four times faster than the default configuration (0.4 ms).

Source

2011-12-31 Ali Nima

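Are the fftw libraries compiled with the same compiler (and version)? Are they compiled with the same flags? Are they compiled for the same architecture? Perhaps the Windows build makes better use of the CPU's features. It may also have been built with a different compiler and therefore optimized differently.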

Please post details of the compilers used on both platforms, and the compile options used to build your code and the libraries. – Mat

Is the bitness the same (32-bit vs. 64-bit)? Is the same amount of RAM available? What other processes are running in parallel? Is any virtualization enabled? – Yahia

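A 16-point FFT is very small. What you will find is that FFTs smaller than, say, 64 points are hard-coded, loop-free routines written for the best possible performance (in FFTW these are machine-generated straight-line "codelets"). This means they can be highly sensitive to variations in instruction set, compiler optimizations, and even 64-bit versus 32-bit word size.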

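What happens if you run a test of FFT sizes from 16 to 1048576 in powers of two? I ask because a particular hard-coded routine on Linux may not be well optimized for your machine, whereas you may have been lucky with the Windows implementation for that particular size. Comparing all sizes in this range will give you a better picture of Linux versus Windows performance.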
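The size sweep suggested above could be driven by a small harness like the following sketch. The names `power_of_two_sizes`, `SizeResult`, and `sweep` are hypothetical helpers, not part of FFTW. The `bench` callback is an assumption: it is expected to plan and time a transform of length `n` and return its mean execution time in milliseconds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Power-of-two lengths from `first` to `last`, inclusive.
// (Hypothetical helper; `first` must itself be a power of two.)
std::vector<std::size_t> power_of_two_sizes(std::size_t first, std::size_t last) {
    assert(first > 0 && (first & (first - 1)) == 0);
    std::vector<std::size_t> sizes;
    for (std::size_t n = first; n <= last; n *= 2) {
        sizes.push_back(n);
    }
    return sizes;
}

// One row of the comparison table: FFT length and its mean execute time.
struct SizeResult {
    std::size_t n;
    double mean_ms;
};

// Calls the caller-supplied benchmark once per size and collects the results.
// `bench(n)` is assumed to return the mean time in milliseconds for length n.
template <typename Bench>
std::vector<SizeResult> sweep(const std::vector<std::size_t>& sizes, Bench bench) {
    std::vector<SizeResult> results;
    for (std::size_t n : sizes) {
        results.push_back(SizeResult{n, static_cast<double>(bench(n))});
    }
    return results;
}
```

Running the same sweep on both operating systems and comparing the two tables row by row would show whether the gap is specific to the 16-point transform or holds across sizes.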

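Have you calibrated FFTW? When first run, FFTW guesses the fastest implementation for the machine, but special instruction sets, particular cache sizes, or other processor features can have a dramatic effect on execution speed. Performing a calibration tests the speed of the various FFT routines and chooses the fastest one for each size on your specific hardware. Calibration involves repeatedly computing plans and saving the generated FFTW "wisdom" file. The saved calibration data can then be reused (calibration is a lengthy process). I suggest doing it once and then reusing the file every time your software starts. I have seen 4-10x performance improvements for certain sizes after calibrating!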

Below is a code snippet I have used to calibrate FFTW for certain sizes. This code is pasted verbatim from a DSP library I worked on, so some function calls are specific to my library. I hope the FFTW-specific calls are helpful.


// Calibration FFTW 
void DSP::forceCalibration(void) 
{ 
// Try to import FFTw Wisdom for fast plan creation 
FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r"); 

// If wisdom does not exist, ask user to calibrate 
if (fftw_wisdom == 0) 
{ 
    int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine."\ 
     "Would you like to perform a one-time calibration?\n\n"\ 
     "Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\ 
     "\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\ 
     "\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!", 
     MB_YESNO | MB_ICONSTOP, 0); 

    if (iStatus2 == IDYES) 
    { 
     // Perform calibration for all powers of 2 from 8 to 4194304 
     // (most heavily used FFTs - for signal processing) 
     AfxMessageBox("About to perform calibration.\n"\ 
      "Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\ 
      "Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n" 
      "\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n"); 
     startTimer(); 

     // Create a whole load of FFTw Plans (wisdom accumulates automatically) 
     for (int i = 8; i <= 4194304; i *= 2) 
     { 
      // Create new buffers and fill 
      DSP::cFFTin = new fftw_complex[i]; 
      DSP::cFFTout = new fftw_complex[i]; 
      DSP::fconv_FULL_Real_FFT_rdat = new double[i]; 
      DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1]; 
      for(int j = 0; j < i; j++) 
      { 
       DSP::fconv_FULL_Real_FFT_rdat[j] = j; 
       DSP::cFFTin[j][0] = j; 
       DSP::cFFTin[j][1] = j; 
       DSP::cFFTout[j][0] = 0.0; 
       DSP::cFFTout[j][1] = 0.0; 
      } 

      // Create a plan for complex FFT. 
      // Use the measure flag to get the best possible FFT for this size 
      // FFTw "remembers" which FFTs were the fastest during this test. 
      // at the end of the test, the results are saved to disk and re-used 
      // upon every initialisation of the DSP Library 
      DSP::pCF = fftw_plan_dft_1d 
       (i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE); 

      // Destroy the plan 
      fftw_destroy_plan(DSP::pCF); 

      // Create a plan for real forward FFT 
      DSP::pCF = fftw_plan_dft_r2c_1d 
       (i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE); 

      // Destroy the plan 
      fftw_destroy_plan(DSP::pCF); 

      // Create a plan for real inverse FFT 
      DSP::pCF = fftw_plan_dft_c2r_1d 
       (i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE); 

      // Destroy the plan 
      fftw_destroy_plan(DSP::pCF); 

      // Destroy the buffers. Repeat for each size 
      delete [] DSP::cFFTin; 
      delete [] DSP::cFFTout; 
      delete [] DSP::fconv_FULL_Real_FFT_rdat; 
      delete [] DSP::fconv_FULL_Real_FFT_cdat; 
     } 

     double time = stopTimer(); 

     // The message below is longer than 100 characters, so allocate generously 
     char * strOutput; 
     strOutput = (char*) malloc (512); 
     sprintf(strOutput, "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\ 
      "Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\ 
      "to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60); 
     AfxMessageBox(strOutput); 

     isCalibrated = 1; 

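     // Save accumulated wisdom 
     char * strWisdom = fftw_export_wisdom_to_string(); 
     FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w"); 
     if (fftw_wisdomsave != NULL) 
     { 
      fprintf(fftw_wisdomsave, "%s", strWisdom); 
      fclose(fftw_wisdomsave); 
     } 

     // The exported string is owned by the caller and must be released 
     free(strWisdom); 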
     DSP::pCF = NULL; 
     DSP::cFFTin = NULL; 
     DSP::cFFTout = NULL; 
     fconv_FULL_Real_FFT_cdat = NULL; 
     fconv_FULL_Real_FFT_rdat = NULL; 
     free(strOutput); 
    } 
} 
else 
{ 
    // obtain file size. 
    fseek (fftw_wisdom , 0 , SEEK_END); 
    long lSize = ftell (fftw_wisdom); 
    rewind (fftw_wisdom); 

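    // allocate memory to contain the whole file, plus a terminating NUL. 
    char * strWisdom = (char*) malloc (lSize + 1); 

    // copy the file into the buffer and NUL-terminate what was actually read. 
    size_t nRead = fread (strWisdom,1,lSize,fftw_wisdom); 
    strWisdom[nRead] = '\0'; 

    // import the buffer to fftw wisdom (expects a NUL-terminated string) 
    fftw_import_wisdom_from_string(strWisdom); 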

    fclose(fftw_wisdom); 
    free(strWisdom); 

    isCalibrated = 1; 

    return; 
} 
}

The secret sauce is to create the plan with the FFTW_MEASURE flag, which measures hundreds of routines to find the fastest one for your particular type of FFT (real, complex, 1D, 2D) and size:

DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout, 
    FFTW_FORWARD, FFTW_MEASURE);

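Finally, all benchmark tests should be performed with a single FFT planning stage outside the timed execute calls, from code compiled in release mode with optimizations on and detached from the debugger. Benchmarks should run in a loop of many thousands (or even millions) of iterations, and the average run time should be used to compute the result. The planning stage takes a significant amount of time, and the execute step is designed to be performed many times with a single plan.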
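The benchmarking advice above can be sketched as a small timing helper. This is a minimal illustration, not code from the original library: `mean_ms_per_call` and its `warmup` parameter are hypothetical, and the callable passed in is assumed to contain only the execute step, with the plan created beforehand.

```cpp
#include <cassert>
#include <chrono>

// Mean wall-clock time per call of `run`, in milliseconds.
// `run` should wrap only the execute step; plan creation belongs
// outside, before this function is called (assumption of this sketch).
template <typename Run>
double mean_ms_per_call(Run run, int iterations, int warmup = 10) {
    assert(iterations > 0);

    // Untimed warm-up calls so caches and lazy initialisation settle first.
    for (int i = 0; i < warmup; ++i) {
        run();
    }

    // Time the whole loop once and divide, rather than timing each call,
    // so timer overhead does not dominate very short transforms.
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        run();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

For the question's case, a call such as `mean_ms_per_call([&] { fftwf_execute(plan); }, 10000)` would time only the execute step, assuming `plan` is an already created plan. The program should be built in release mode and run outside the debugger, as described above.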

Source

2011-12-31 08:51:06
