OpenCL: Getting INVALID_COMMAND_QUEUE on device-to-host transfer after calling a kernel multiple times in a loop

I am working on an OpenCL program that calls the same kernel several times in a loop. When I use clEnqueueReadBuffer to transfer the device memory back to the host, it reports that the command queue is invalid.

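Below is the function that is called to start the bitonic sort, shortened for readability. The device, context, command queue and kernel are created outside and passed to this function. list contains the list to be sorted. size is the number of elements in list.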

cl_int OpenCLBitonicSort(cl_device_id device, cl_context context, 
    cl_command_queue commandQueue, cl_kernel bitonicSortKernel, 
    unsigned int * list, unsigned int size){ 

    //create OpenCL specific variables 
    cl_int error = CL_SUCCESS; 
    size_t maximum_local_ws; 
    size_t local_ws; 
    size_t global_ws; 

    //create variables that keep track of bitonic sorting progress 
    unsigned int stage = 0; 
    unsigned int subStage; 
    unsigned int numberOfStages = 0; 
    unsigned int i; //loop counter used below 

    //get maximum work group size 
    clGetKernelWorkGroupInfo(bitonicSortKernel, device, 
     CL_KERNEL_WORK_GROUP_SIZE, sizeof(maximum_local_ws), 
     &maximum_local_ws, NULL); 

    //make local_ws the largest power of two allowed by OpenCL 
    for(i = 1; i <= maximum_local_ws; i *= 2){ 
     local_ws = (size_t) i; 
    } 
    //total number of comparators will be half the items in the list 
    global_ws = (size_t) size/2; 

    //transfer list to the device 
    cl_mem list_d = clCreateBuffer(context, 
     CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, 
     size * sizeof(unsigned int), list, &error); 

    //find the number of stages needed (numberOfStages = log2(size)); 
    //size must be a power of two or this loop never terminates 
    for(numberOfStages = 0; ((1u << numberOfStages) ^ size); numberOfStages++){ 
    } 

    //loop through all stages 
    for(stage = 0; stage < numberOfStages; stage++){ 
     //loop through all substages in each stage 
     for(subStage = stage, i = 0; i <= stage; subStage--, i++){ 
      //add kernel parameters 
      error = clSetKernelArg(bitonicSortKernel, 0, 
       sizeof(cl_mem), &list_d); 
      error = clSetKernelArg(bitonicSortKernel, 1, 
       sizeof(unsigned int), &size); 
      error = clSetKernelArg(bitonicSortKernel, 2, 
       sizeof(unsigned int), &stage); 
      error = clSetKernelArg(bitonicSortKernel, 3, 
       sizeof(unsigned int), &subStage); 

      //call the kernel 
      error = clEnqueueNDRangeKernel(commandQueue, bitonicSortKernel, 1, 
       NULL, &global_ws, &local_ws, 0, NULL, NULL); 

      //make later commands wait for this kernel (does not block the host) 
      error = clEnqueueBarrier(commandQueue); 
     } 
    } 

    //read the result back to the host 
    error = clEnqueueReadBuffer(commandQueue, list_d, CL_TRUE, 0, 
     size * sizeof(unsigned int), list, 0, NULL, NULL); 

    //free the list on the device 
    clReleaseMemObject(list_d); 

    return error; 
}
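
For reference, the work-size and stage-count arithmetic in the function above can be checked on the host without any OpenCL calls. This is only a sketch under my own assumptions: the helper names stage_count, is_power_of_two and pick_local_size are hypothetical and are not part of the original code. It relies on the OpenCL 1.x rule that the global work size must be evenly divisible by the local work size, which the function above does not verify when size/2 is smaller than the maximum work-group size.

```c
#include <stddef.h>

/* Number of bitonic stages: floor(log2(size)), exact for powers of two. */
static unsigned int stage_count(unsigned int size)
{
    unsigned int stages = 0;
    while (size > 1) {
        size >>= 1;
        stages++;
    }
    return stages;
}

/* Returns 1 if size is a nonzero power of two, 0 otherwise. */
static int is_power_of_two(unsigned int size)
{
    return size != 0 && (size & (size - 1)) == 0;
}

/* Largest power of two that is <= max_local and <= global.
   Returns 0 if either limit is 0. */
static size_t pick_local_size(size_t max_local, size_t global)
{
    size_t limit = max_local < global ? max_local : global;
    size_t local = 0;
    size_t p;
    /* p becomes 0 once the shift overflows, which ends the loop. */
    for (p = 1; p != 0 && p <= limit; p <<= 1) {
        local = p;
    }
    return local;
}
```

With a power-of-two size of at least 2, global_ws = size/2 is also a power of two, so the value returned by pick_local_size(maximum_local_ws, global_ws) always divides it.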

In this code, clEnqueueReadBuffer says that commandQueue is invalid. However, the queue was valid when I called clEnqueueNDRangeKernel and clEnqueueBarrier.

When I set numberOfStages to 1 and stage to 0, so that clEnqueueNDRangeKernel was only called once, the code ran without returning an error (although the result was not correct). The problem appears when clEnqueueNDRangeKernel is called more than once (which I really need to do).
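
One thing that may help narrow down which call fails first: the function above assigns every return value to the same error variable, so an earlier failure can be overwritten by a later success. Below is a minimal sketch of a "first error" latch, assuming nothing about the real cause. The names first_error, record, error_name and the SKETCH_ prefix are hypothetical; the numeric values are the ones defined in the Khronos cl.h header, repeated here only so the sketch builds without OpenCL installed.

```c
#include <stdio.h>

/* Values copied from cl.h; the SKETCH_ prefix is hypothetical. */
enum {
    SKETCH_CL_SUCCESS = 0,
    SKETCH_CL_OUT_OF_RESOURCES = -5,
    SKETCH_CL_INVALID_COMMAND_QUEUE = -36,
    SKETCH_CL_INVALID_MEM_OBJECT = -38,
    SKETCH_CL_INVALID_KERNEL_ARGS = -52,
    SKETCH_CL_INVALID_WORK_GROUP_SIZE = -54
};

typedef struct {
    int code;          /* first non-success code seen, or 0 */
    const char *where; /* label of the call that produced it */
} first_error;

/* Keep only the first failure; later results never overwrite it. */
static void record(first_error *fe, int code, const char *where)
{
    if (fe->code == SKETCH_CL_SUCCESS && code != SKETCH_CL_SUCCESS) {
        fe->code = code;
        fe->where = where;
    }
}

/* Translate a few common codes into their cl.h names. */
static const char *error_name(int code)
{
    switch (code) {
    case SKETCH_CL_SUCCESS:                 return "CL_SUCCESS";
    case SKETCH_CL_OUT_OF_RESOURCES:        return "CL_OUT_OF_RESOURCES";
    case SKETCH_CL_INVALID_COMMAND_QUEUE:   return "CL_INVALID_COMMAND_QUEUE";
    case SKETCH_CL_INVALID_MEM_OBJECT:      return "CL_INVALID_MEM_OBJECT";
    case SKETCH_CL_INVALID_KERNEL_ARGS:     return "CL_INVALID_KERNEL_ARGS";
    case SKETCH_CL_INVALID_WORK_GROUP_SIZE: return "CL_INVALID_WORK_GROUP_SIZE";
    default:                                return "(other OpenCL error)";
    }
}

/* Print the first failure, if there was one. */
static void report(const first_error *fe)
{
    if (fe->code != SKETCH_CL_SUCCESS) {
        fprintf(stderr, "%s failed first: %s (%d)\n",
                fe->where, error_name(fe->code), fe->code);
    }
}
```

In the function above this would be used as record(&fe, clEnqueueNDRangeKernel(...), "clEnqueueNDRangeKernel") for each call. While debugging, calling clFinish(commandQueue) after each enqueue and recording its result as well should make a failure surface at the kernel launch that triggers it rather than at the final read.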

I am on Mac OS 10.6 Snow Leopard, using Apple's OpenCL 1.0 platform with an NVidia GeForce 9600m. Is it possible to run kernels inside loops in OpenCL on other platforms? Has anyone had problems like this with OpenCL on OS X? What could be causing the command queue to become invalid?
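
Independent of the platform, the stage/substage loop itself can be replayed on the CPU to check that every index stays inside the list, since an out-of-range access inside a kernel is one possible reason for later calls on the same queue to fail. The kernel is not shown here, so the indexing below is an assumption: it follows a common bitonic-sort layout in which each of the size/2 work-items performs one compare-exchange per launch. The function names are hypothetical, and size is assumed to be a power of two no larger than 2^31.

```c
#include <assert.h>

/* One compare-exchange, as a single (assumed) work-item would do it. */
static void compare_exchange(unsigned int *list, unsigned int size,
                             unsigned int stage, unsigned int subStage,
                             unsigned int id)
{
    unsigned int pairDistance = 1u << subStage;
    unsigned int blockWidth = 2u * pairDistance;
    unsigned int left = (id % pairDistance) + (id / pairDistance) * blockWidth;
    unsigned int right = left + pairDistance;
    /* Sort direction alternates between blocks of 2^stage work-items. */
    int ascending = ((id >> stage) % 2u) == 0;
    unsigned int a, b;
    int out_of_order;

    /* An index past the end here would be a bug worth ruling out. */
    assert(right < size);

    a = list[left];
    b = list[right];
    out_of_order = ascending ? (a > b) : (a < b);
    if (out_of_order) {
        list[left] = b;
        list[right] = a;
    }
}

/* Replays the host loop: one pass over all work-items per kernel launch. */
static void bitonic_sort_reference(unsigned int *list, unsigned int size)
{
    unsigned int numberOfStages = 0;
    unsigned int stage, i, id;

    while (numberOfStages < 31 && (1u << numberOfStages) < size) {
        numberOfStages++;
    }
    for (stage = 0; stage < numberOfStages; stage++) {
        for (i = 0; i <= stage; i++) {
            unsigned int subStage = stage - i; /* counts down to 0 */
            for (id = 0; id < size / 2; id++) {
                compare_exchange(list, size, stage, subStage, id);
            }
        }
    }
}
```

If the real kernel computes its indices differently, the same kind of bounds assertion can be adapted to it. Comparing this reference against the device output for a small list is also a quick way to check the result when only one launch is made.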

Source

2012-10-28 user1509669