Can I use CUDA to parallelize this nested for loop?

I am learning CUDA in order to speed up this nested for loop. How can I use CUDA to parallelize this C++ code?

#include <algorithm>  // std::sort
#include <cmath>      // sin, cos
#include <cstdlib>    // rand, RAND_MAX
#define PI 3.14159265
using namespace std;
int main()
{
    int nbint = 2;
    int hits = 20;
    int nbinp = 2;
    float _theta, _phi, _l, _m, _n, _k = 0, delta = 5;
    float x[20], y[20], z[20], a[20], t[20];
    // Fill the hit coordinates and times with random values in [0, 100].
    for (int i = 0; i < hits; ++i)
    {
        x[i] = rand()/(float)(RAND_MAX/100);
    }
    for (int i = 0; i < hits; ++i)
    {
        y[i] = rand()/(float)(RAND_MAX/100);
    }
    for (int i = 0; i < hits; ++i)
    {
        z[i] = rand()/(float)(RAND_MAX/100);
    }
    for (int i = 0; i < hits; ++i)
    {
        a[i] = rand()/(float)(RAND_MAX/100);
    }
    float maxforall = 1e-6;
    float theta0;
    float phi0;
    for (int i = 0; i < nbint; i++)
    {
        _theta = (0.5 + i)*delta;
        for (int j = 0; j < nbinp; j++)
        {
            _phi = (0.5 + j)*delta/_theta;
            // Direction cosines for the current (theta, phi) bin.
            _l = sin(_theta* PI/180.0)*cos(_phi* PI/180.0);
            _m = sin(_theta* PI/180.0)*sin(_phi* PI/180.0);
            _n = cos(_theta* PI/180.0);
            for (int k = 0; k < hits; k++)
            {
                _k = -(_l*x[k] + _m*y[k] + _n*z[k]);
                t[k] = a[k] - _k;
            }

            // Assumption: std::sort stands in for the asker's own
            // qsort(t, 0, hits - 1) helper, which was not shown.
            sort(t, t + hits);
            float max = t[0];
            for (int k = 0; k < hits; k++)
            {
                if (max < t[k])
                    max = t[k];
            }
            if (max > maxforall)
            {
                maxforall = max;
            }

        }
    }
    return 0;
}

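I want to parallelize the innermost for loop and the sort part (ideally the whole nested loop). After sorting each array, I find the maximum over all arrays. I only use the maximum here to simplify the code. The reason I need the sort is that the quantity the maximum stands in for is continuous time information (every array holds time information). The sort orders those times from lowest to highest. I then compare specific time intervals, not single values. The comparison is much like picking a maximum, except that it works on a contiguous interval rather than on a single value.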

2016-12-28 Alex

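What is being computed here? How large are 'nbint', 'nbinp', and 'hits'? Please post an [mcve] that includes a small numeric sample of the input data as well as the desired output. –

I want to compute the array t[k] first and then sort it. The desired output is nbint * nbinp sorted arrays. – Alex

Do you want '20 * 2 = 40' arrays, or a single array with '40' elements? Why do you perform the sort inside the loop? The algorithm is still unclear to me –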

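The three nested loops compute nbint*nbinp*hits values. Since each of these values is independent of the others, all of them can be computed in parallel.

You mentioned in your comments that there is a commutative and associative "filter condition" that reduces the output to a single scalar value. This can be exploited to avoid sorting and storing the temporary values. Instead, we can compute the values on the fly and apply a parallel reduction to determine the final result.

You could do this in "raw" CUDA; below I implemented the idea using thrust. The main idea is to run grid_op nbint*nbinp*hits times in parallel. To recover the three original "loop indices" from the single scalar index that is passed to grid_op, the algorithm from this SO question is used.

thrust::transform_reduce performs the on-the-fly transformation and the subsequent parallel reduction (here thrust::maximum is used as a stand-in).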
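To make the index recovery concrete, here is a minimal host-side sketch in plain C++. It assumes the flat index is laid out as index = x + dim1 * (y + dim2 * z), which is the layout the arithmetic in grid_op implies. The name flatten_3d_index is a hypothetical helper added only for illustration; it does not appear in the answer.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Hypothetical helper (not part of the answer): builds the flat index
// under the assumed layout index = x + dim1 * (y + dim2 * z).
std::size_t flatten_3d_index(std::size_t x, std::size_t y, std::size_t z,
                             std::size_t dim1, std::size_t dim2)
{
    assert(x < dim1 && y < dim2);
    return x + dim1 * (y + dim2 * z);
}

// Inverse mapping: recovers (x, y, z) from the flat index.
// x runs fastest, z slowest; z needs no bound because it is the last axis.
std::tuple<std::size_t, std::size_t, std::size_t>
unflatten_3d_index(std::size_t index, std::size_t dim1, std::size_t dim2)
{
    assert(dim1 > 0 && dim2 > 0);
    const std::size_t x = index % dim1;
    const std::size_t y = (index / dim1) % dim2;
    const std::size_t z = index / (dim1 * dim2);
    return std::make_tuple(x, y, z);
}
```

With dim1 = nbint and dim2 = nbinp, x plays the role of the theta loop index, y the phi loop index, and z the hit index, so every (i, j, k) triple of the original loops corresponds to exactly one flat index.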
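For readers without a GPU at hand, the same transform-then-reduce pattern can be sketched on the host with C++17 std::transform_reduce. This is only an analogy, not the Thrust code itself: the Hits struct and the function names are hypothetical, and the argument order differs from Thrust (the standard library takes init, reduce, transform; Thrust takes the unary op, init, binary op).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical container for the four input arrays of the question.
struct Hits { std::vector<float> x, y, z, a; };

// One t value for loop indices (i, j, k), following the question's formulas.
float t_value(const Hits& h, std::size_t i, std::size_t j, std::size_t k)
{
    const float deg = 3.14159265f / 180.0f;
    const float delta = 5.0f;
    const float theta = (0.5f + i) * delta;
    const float phi = (0.5f + j) * delta / theta;
    const float l = std::sin(theta * deg) * std::cos(phi * deg);
    const float m = std::sin(theta * deg) * std::sin(phi * deg);
    const float n = std::cos(theta * deg);
    const float k_term = -(l * h.x[k] + m * h.y[k] + n * h.z[k]);
    return h.a[k] - k_term;
}

// Maximum over all nbint*nbinp*hits values via transform + reduce (C++17).
float max_by_transform_reduce(const Hits& h, std::size_t nbint, std::size_t nbinp)
{
    const std::size_t hits = h.a.size();
    assert(h.x.size() == hits && h.y.size() == hits && h.z.size() == hits);
    std::vector<std::size_t> idx(nbint * nbinp * hits);
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // plays the counting iterator
    return std::transform_reduce(
        idx.begin(), idx.end(), 0.0f,
        [](float p, float q) { return std::max(p, q); },   // reduction
        [&](std::size_t index) {                            // transformation
            const std::size_t i = index % nbint;
            const std::size_t j = (index / nbint) % nbinp;
            const std::size_t k = index / (nbint * nbinp);
            return t_value(h, i, j, k);
        });
}

// Reference: the question's triple loop, without the sort.
float max_by_nested_loops(const Hits& h, std::size_t nbint, std::size_t nbinp)
{
    float best = 0.0f;
    for (std::size_t i = 0; i < nbint; ++i)
        for (std::size_t j = 0; j < nbinp; ++j)
            for (std::size_t k = 0; k < h.a.size(); ++k)
                best = std::max(best, t_value(h, i, j, k));
    return best;
}
```

Both functions visit the same set of values, and the maximum does not depend on the order in which they are combined. That order independence is what allows the reduction to run in parallel.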

#include <cmath> 
#include <cstddef> 

#include <thrust/device_vector.h> 
#include <thrust/host_vector.h> 
#include <thrust/functional.h> 
#include <thrust/transform_reduce.h> 
#include <thrust/iterator/counting_iterator.h> 
#include <thrust/iterator/zip_iterator.h> 
#include <thrust/tuple.h> 

// ### BEGIN utility for demo #### 
#include <iostream> 
#include <thrust/random.h> 

thrust::host_vector<float> random_vector(const size_t N) 
{ 
    static thrust::default_random_engine rng; // static, so successive calls do not repeat the same sequence 
    thrust::uniform_real_distribution<float> u01(0.0f, 1.0f); 
    thrust::host_vector<float> temp(N); 
    for(size_t i = 0; i < N; i++) { 
     temp[i] = u01(rng); 
    } 
    return temp; 
} 
// ### END utility for demo #### 

template <typename... Iterators> 
thrust::zip_iterator<thrust::tuple<Iterators...>> zip(Iterators... its) 
{ 
    return thrust::make_zip_iterator(thrust::make_tuple(its...)); 
} 

template <typename ZipIterator> 
class grid_op 
{ 
public: 
    grid_op(ZipIterator zipIt, std::size_t dim1, std::size_t dim2) : zipIt(zipIt), dim1(dim1), dim2(dim2){} 

    __host__ __device__ 
    float operator()(std::size_t index) const 
    { 
     const auto coords = unflatten_3d_index(index, dim1, dim2); 
     const auto values = zipIt[thrust::get<2>(coords)]; 
     const float delta = 5; 
     const float _theta = (0.5f + thrust::get<0>(coords))*delta; 
     const float _phi = (0.5f + thrust::get<1>(coords))*delta/_theta; 
     const float _l = sin(_theta* M_PI/180.0)*cos(_phi* M_PI/180.0); 
     const float _m = sin(_theta* M_PI/180.0)*sin(_phi* M_PI/180.0); 
     const float _n = cos(_theta* M_PI/180.0); 
     const float _k = -(_l*thrust::get<0>(values) + _m*thrust::get<1>(values) + _n*thrust::get<2>(values)); 
     return (thrust::get<3>(values) - _k); 
    } 

private: 
    __host__ __device__ 
    thrust::tuple<std::size_t, std::size_t, std::size_t> 
    unflatten_3d_index(std::size_t index, std::size_t dim1, std::size_t dim2) const 
    { 
     // taken from https://stackoverflow.com/questions/29142417/4d-position-from-1d-index 
     std::size_t x = index % dim1; 
     std::size_t y = ((index - x)/dim1) % dim2; 
     std::size_t z = ((index - y * dim1 - x)/(dim1 * dim2)); 
     return thrust::make_tuple(x,y,z); 
    } 

    ZipIterator zipIt; 
    std::size_t dim1; 
    std::size_t dim2; 
}; 

template <typename ZipIterator> 
grid_op<ZipIterator> make_grid_op(ZipIterator zipIt, std::size_t dim1, std::size_t dim2) 
{ 
    return grid_op<ZipIterator>(zipIt, dim1, dim2); 
} 

int main() 
{ 
    const int nbint = 3; 
    const int nbinp = 4; 
    const int hits = 20; 
    const std::size_t N = nbint * nbinp * hits; 

    thrust::device_vector<float> d_x = random_vector(hits); 
    thrust::device_vector<float> d_y = random_vector(hits); 
    thrust::device_vector<float> d_z = random_vector(hits); 
    thrust::device_vector<float> d_a = random_vector(hits); 

    auto zipIt = zip(d_x.begin(), d_y.begin(), d_z.begin(), d_a.begin()); 
    auto countingIt = thrust::counting_iterator<std::size_t>(0); 
    auto unary_op = make_grid_op(zipIt, nbint, nbinp); 
    auto binary_op = thrust::maximum<float>(); 
    const float init = 0; 

    float max = thrust::transform_reduce(
     countingIt, countingIt+N, 
     unary_op, 
     init, 
     binary_op 
    ); 

    std::cout << "max = " << max << std::endl; 
}

2017-01-02 11:56:29

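Thank you, your answer was very helpful. I will look into the programming guide for more details on thrust, and the indexing part is very useful too. Maybe I misunderstood commutative and associative (sorry), but I really do need the sort part here. Could the sort be included? – Alex

@Alex Sure, you can add the sort, but then you will not be able to do the on-the-fly reduction, so performance will be much lower. You should edit your question to show **why you need to sort**. –

I edited my question to show why I need the sort part. I hope you can understand it, or that I can draw a picture of it. Thanks ~ – Alex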
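If the sort really is required, one possible route is to give up the on-the-fly reduction, store all nbint*nbinp*hits values, and sort each block of hits values separately. The host-side C++ sketch below only illustrates that data layout. The function name and the back-to-back storage are assumptions made for illustration, not something proposed in the answer; on a GPU the per-block std::sort would have to be replaced by some form of segmented sort.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns nbint*nbinp sorted arrays stored back to back.
// The array for direction (i, j) occupies [row*hits, (row+1)*hits)
// with row = i*nbinp + j.
std::vector<float> sorted_times_per_direction(
    const std::vector<float>& x, const std::vector<float>& y,
    const std::vector<float>& z, const std::vector<float>& a,
    std::size_t nbint, std::size_t nbinp)
{
    const std::size_t hits = a.size();
    assert(x.size() == hits && y.size() == hits && z.size() == hits);
    const float deg = 3.14159265f / 180.0f;
    const float delta = 5.0f;
    std::vector<float> t(nbint * nbinp * hits);
    for (std::size_t i = 0; i < nbint; ++i) {
        const float theta = (0.5f + i) * delta;
        for (std::size_t j = 0; j < nbinp; ++j) {
            const float phi = (0.5f + j) * delta / theta;
            const float l = std::sin(theta * deg) * std::cos(phi * deg);
            const float m = std::sin(theta * deg) * std::sin(phi * deg);
            const float n = std::cos(theta * deg);
            float* row = t.data() + (i * nbinp + j) * hits;
            for (std::size_t k = 0; k < hits; ++k) {
                const float k_term = -(l * x[k] + m * y[k] + n * z[k]);
                row[k] = a[k] - k_term;
            }
            std::sort(row, row + hits);  // ascending times for this direction
        }
    }
    return t;
}
```

Each block is independent of the others, so the blocks could be processed in parallel, and an interval comparison could then scan every sorted block on its own.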
