CUB 및 추력을 하나의 CUDA 코드로 사용하는 방법

내 "기존"추력 코드에 CUB를 소개하려고합니다. 따라서 작은 예제로 시작하여 cub::DeviceReduce::ReduceByKey과 thrust::device_vectors에 모두 적용되는 thrust::reduce_by_key을 비교해보십시오.CUB 및 추력을 하나의 CUDA 코드로 사용하는 방법

코드의 추력 부분은 괜찮지 만, thrust :: raw_pointer_cast를 통해 얻은 원시 포인터를 순진하게 사용하는 CUB 부분은 CUB 호출 후 충돌합니다. 이 문제를 해결하기 위해 cudaDeviceSynchronize()을 넣었지만 도움이되지 않았습니다. 코드의 CUB 부분은 CUB 웹 페이지에서 고정되어 있습니다.

OSX에 런타임 오류는 다음과 같습니다

libc++abi.dylib: terminate called throwing an exception 
Abort trap: 6

리눅스 런타임 오류는 다음과 같습니다

terminate called after throwing an instance of 'thrust::system::system_error' 
what(): an illegal memory access was encountered

CUDA-memcheck의 처음 몇 줄은 다음과 같습니다

========= CUDA-MEMCHECK 
========= Invalid __global__ write of size 4 
=========  at 0x00127010 in /home/sdettrick/codes/MCthrust/tests/../cub-1.3.2/cub/device/dispatch/../../block_range/block_range_reduce_by_key.cuh:1017:void cub::ReduceByKeyRegionKernel<cub::DeviceReduceByKeyDispatch<unsigned int*, unsigned int*, float*, float*, int*, cub::Equality, CustomSum, int>::PtxReduceByKeyPolicy, unsigned int*, unsigned int*, float*, float*, int*, cub::ReduceByKeyScanTileState<float, int, bool=1>, cub::Equality, CustomSum, int>(unsigned int*, float*, float*, int*, cub::Equality, CustomSum, int, cub::DeviceReduceByKeyDispatch<unsigned int*, unsigned int*, float*, float*, int*, cub::Equality, CustomSum, int>::PtxReduceByKeyPolicy, unsigned int*, int, cub::GridQueue<int>) 
=========  by thread (0,0,0) in block (0,0,0) 
=========  Address 0x7fff7dbb3e88 is out of bounds 
=========  Saved host backtrace up to driver entry point at kernel launch time

불행하게도 나는 그것에 대해 무엇을해야할지 너무 확신하지 못합니다.

도움을 주시면 감사하겠습니다. 나는 NVIDIA 개발자 존에서 이것을 시도했지만 응답을 얻지 못했습니다. 전체 예제 코드는 아래와 같습니다. 그것은 CUDA 6.5 및 새끼 1.3.2 컴파일해야합니다

이

#include <iostream> 
#include <thrust/sort.h> 
#include <thrust/gather.h> 
#include <thrust/device_vector.h> 
#include <thrust/iterator/zip_iterator.h> 
#include <thrust/iterator/permutation_iterator.h> 
#include <thrust/iterator/discard_iterator.h> 

#include <cub/cub.cuh> // or equivalently <cub/device/device_radix_sort.cuh> 

//======================================== 
// for CUB: 
struct CustomSum 
{ 
    template <typename T> 
    CUB_RUNTIME_FUNCTION __host__ __device__ __forceinline__ 
    //__host__ __device__ __forceinline__ 
    T operator()(const T &a, const T &b) const { 
     return b+a; 
    } 
}; 
//======================================== 

int main() 
{ 
    const int Nkey=20; 
    int Nseg=9; 
    int ikey[Nkey] = {0, 0, 0, 6, 8, 0, 2, 4, 6, 8, 1, 3, 5, 7, 8, 1, 3, 5, 7, 8}; 

    thrust::device_vector<unsigned int> key(ikey,ikey+Nkey); 
    thrust::device_vector<unsigned int> keysout(Nkey); 

    // Let's reduce x, by key: 

    float xval[Nkey]; 
    for (int i=0; i<Nkey; i++) xval[i]=ikey[i]+0.1f; 

    thrust::device_vector<float> x(xval,xval+Nkey); 

    // First, sort x by key: 

    thrust::sort_by_key(key.begin(),key.end(),x.begin()); 

    //--------------------------------------------------------------------- 
    std::cout<<"=================================================================="<<std::endl 
     <<" THRUST reduce_by_key:"<<std::endl 
     <<"=================================================================="<<std::endl; 

    thrust::device_vector<float> output(Nseg,0.0f); 

    thrust::reduce_by_key(key.begin(), 
      key.end(), 
      x.begin(), 
      keysout.begin(), 
      output.begin()); 

    for (int i=0;i<Nkey;i++) std::cout << x[i] <<" "; std::cout<<std::endl; 
    for (int i=0;i<Nkey;i++) std::cout << key[i] <<" "; std::cout<<std::endl; 
    for (int i=0;i<Nseg;i++) std::cout << output[i] <<" "; std::cout<<std::endl; 

    float ototal=thrust::reduce(output.begin(),output.end()); 
    float xtotal=thrust::reduce(x.begin(),x.end()); 
    std::cout << "total="<< ototal <<", should be "<<xtotal<<std::endl; 

    //--------------------------------------------------------------------- 
    std::cout<<"=================================================================="<<std::endl 
     <<" CUB ReduceByKey:"<<std::endl 
     <<"=================================================================="<<std::endl; 


    unsigned int *d_keys_in =thrust::raw_pointer_cast(&key[0]); 
    float  *d_values_in =thrust::raw_pointer_cast(&x[0]); 
    unsigned int *d_keys_out =thrust::raw_pointer_cast(&keysout[0]); 
    float  *d_values_out=thrust::raw_pointer_cast(&output[0]); 
    int   *d_num_segments=&Nseg; 
    CustomSum reduction_op; 

    std::cout << "CUB input" << std::endl; 
    for (int i=0; i<Nkey; ++i) std::cout << key[i] << " "; std::cout<<std::endl; 
    for (int i=0; i<Nkey; ++i) std::cout << x[i] << " "; std::cout<< std::endl; 
    for (int i=0; i<Nkey; ++i) std::cout << keysout[i] << " "; std::cout<< std::endl; 
    for (int i=0; i<Nseg; ++i) std::cout << output[i] << " "; std::cout<< std::endl; 

    // Determine temporary device storage requirements 
    void  *d_temp_storage = NULL; 
    size_t temp_storage_bytes = 0; 
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, reduction_op, Nkey); 

    // Allocate temporary storage 
    cudaMalloc(&d_temp_storage, temp_storage_bytes); 
    std::cout << "temp_storage_bytes = " << temp_storage_bytes << std::endl; 

    // Run reduce-by-key 
    cub::DeviceReduce::ReduceByKey(d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, d_num_segments, reduction_op, Nkey); 
    cudaDeviceSynchronize(); 

    std::cout << "CUB output" << std::endl; 

    std::cout<<Nkey<<" "<<Nseg<<std::endl; 
    std::cout<<key.size() << " "<<x.size() << " "<<keysout.size() << " "<<output.size() << std::endl; 

    // At this point onward it dies: 
    //libc++abi.dylib: terminate called throwing an exception 
    //Abort trap: 6 

    // If the next line is uncommented, it crashes the Mac! 
    for (int i=0; i<Nkey; ++i) std::cout << key[i] << " "; std::cout<<std::endl; 
    // for (int i=0; i<Nkey; ++i) std::cout << x[i] << " "; std::cout<< std::endl; 
    // for (int i=0; i<Nkey; ++i) std::cout << keysout[i] << " "; std::cout<< std::endl; 
    // for (int i=0; i<Nseg; ++i) std::cout << output[i] << " "; std::cout<< std::endl; 
    cudaFree(d_temp_storage); 

    ototal=thrust::reduce(output.begin(),output.end()); 
    xtotal=thrust::reduce(x.begin(),x.end()); 
    std::cout << "total="<< ototal <<", should be "<<xtotal<<std::endl; 
    return 1; 
}

출처

2014-11-07 Sean Dettrick

이 적합하지 않은 :

int   *d_num_segments=&Nseg;

당신은 호스트 변수의 주소를 가지고 장치 포인터로 사용할 수 없습니다.

대신 이렇게 :

는

int *d_num_segments; 
cudaMalloc(&d_num_segments, sizeof(int));

이 데이터의 크기 (새끼가 기입하는 단일 정수)의 장치에 공간을 할당하고 d_num_segments 변수에 그 할당의 주소를 할당합니다. 이것은 유효한 장치 포인터가됩니다.

(* 일반, 비 UM) CUDA에서 장치 코드의 호스트 주소 또는 호스트 코드의 장치 주소를 참조하는 것이 잘못되었습니다.

출처

2014-11-08 02:44:48

고마워요! 분명한 사실을 밝힐 위험이 있으므로 달성하려는 작업을 달성하기 위해'cudaMemcpy (d_num_segments, & Nseg, sizeof (int), cudaMemcpyHostToDevice); '를 추가해야합니다. –

이유가 확실하지 않습니다. CUB는 해당 위치/값에 기록합니다. 그것은 CUB의 입력이 아니며 CUB의 출력입니다. 당신이 쓰는 것이 무엇이든간에 발견 된 세그먼트의 수를 세면서 CUB가 그것을 덮어 씁니다. [해당 함수 및 값에 대한 설명서] (http://nvlabs.github.io/cub/structcub_1_1_device_reduce.html#a4822e04d8701b10ac3f2d28effb454d3) (NumSegmentsIterator) –

흠, 나는이 문장없이 오류가 있다고 생각했지만 지금은 나는 그것을 제거하고 결국 오류가 없습니다. –

CUB 및 추력을 하나의 CUDA 코드로 사용하는 방법

답변

관련 문제