OpenCL - Global memory reads performing better than local

I have a kernel running on an NVidia GTX 680 whose execution time increased when I switched from using global memory to local memory.

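My kernel, which is part of a finite element ray tracer, loads each element into local memory before processing it. The data for each element is stored in a struct fastTriangle, which is defined as follows: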

typedef struct fastTriangle { 
    float cx, cy, cz, cw; 
    float nx, ny, nz, nd; 
    float ux, uy, uz, ud; 
    float vx, vy, vz, vd; 
} fastTriangle;

I pass an array of these objects to the kernel, which is written as follows (I have removed the irrelevant code for brevity):

__kernel void testGPU(int n_samples, const int n_objects, global const fastTriangle *objects, __local int *x_res, __global int *hits) { 
    // Get gid, lid, and lsize 

    // Set up random number generator and thread variables 

    // Local storage for the two triangles being processed 
    __local fastTriangle triangles[2]; 

    for(int i = 0; i < n_objects; i++) { // Fire ray from each object 
     event_t evt = async_work_group_copy((local float*)&triangles[0], (global float*)&objects[i],sizeof(fastTriangle)/sizeof(float),0); 

     //Initialise local memory x_res to 0's 

     barrier(CLK_LOCAL_MEM_FENCE); 
     wait_group_events(1, &evt);  


     Vector wsNormal = { triangles[0].cw*triangles[0].nx, triangles[0].cw*triangles[0].ny, triangles[0].cw*triangles[0].nz}; 

     for(int j = 0; j < n_samples; j+= 4) { 
      // generate a float4 of random numbers here (rands)

      for(int v = 0; v < 4; v++) { // For each ray in ray packet 
       // load the first object to be intersected
       evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[0],sizeof(fastTriangle)/sizeof(float),0); 

       // Some initialising code and calculate ray here 
       // Should have ray fully specified at this point; 


       for(int w = 0; w < n_objects; w++) {  // Check the ray for intersection against each object

        wait_group_events(1, &evt); 

        // Check for intersection against object w 


        float det = wsDir.x*triangles[1].nx + wsDir.y*triangles[1].ny + wsDir.z*triangles[1].nz; 
        float dett = triangles[1].nd - (triangles[0].cx*triangles[1].nx + triangles[0].cy*triangles[1].ny + triangles[0].cz*triangles[1].nz); 


        float detpx = det*triangles[0].cx + dett*wsDir.x; 
        float detpy = det*triangles[0].cy + dett*wsDir.y; 
        float detpz = det*triangles[0].cz + dett*wsDir.z; 


        float detu = detpx*triangles[1].ux + detpy*triangles[1].uy + detpz*triangles[1].uz + det*triangles[1].ud; 
        float detv = detpx*triangles[1].vx + detpy*triangles[1].vy + detpz*triangles[1].vz + det*triangles[1].vd; 


        // Interleaving the copy of the next triangle 
        evt = async_work_group_copy((local float*)&triangles[1], (global float*)&objects[w+1],sizeof(fastTriangle)/sizeof(float),0); 

        // Complete intersection calculations 

       } // end for each object intersected 

       if(objectNo != -1) atomic_inc(&x_res[objectNo]); 
      } // end for sub rays 
     } // end for each ray 
     barrier(CLK_LOCAL_MEM_FENCE); 

     // Add all the local x_res to global array hits 


     barrier(CLK_GLOBAL_MEM_FENCE); 
    } // end for each object 
}

I first wrote the kernel without buffering each object in local memory, accessing the objects directly from global memory instead; that is, where the code above uses triangles[0].cx, it used objects[i].cx.

When I set out to optimise, I switched to using local memory as listed above, but then observed an increase in execution time of around 25%.

Why would performance be worse when using local memory to buffer the objects instead of accessing them directly in global memory?

2012-09-17 cubiclewar

Whether local memory makes your code run faster really depends on your program. There are two things to consider when using local memory:

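You have additional work to do when copying the data from global to local memory and back from local to global.
I see that you call barrier(...) three times. These barriers are performance killers. Every work-item in a work-group has to wait at the barrier for all the others in that group. This hinders parallelism, and the work-items no longer run independently.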

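Local memory is great when your computation reads the same data many times. But the faster reads and writes have to gain more performance than the copying and synchronisation cost.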
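
The trade-off described above can be put into a rough back-of-the-envelope model. The sketch below is only an illustration: the function names and every latency passed to them are hypothetical placeholders, not measurements of a GTX 680. It compares reading an element a given number of times straight from global memory with staging it in local memory first, which adds a one-off copy and synchronisation cost.

```c
/* Rough cost model. All latencies are assumed placeholder values
   in arbitrary units, not measured numbers for any real GPU. */

/* Cost of reading one element `reads` times directly from global memory. */
double cost_direct(double reads, double t_global)
{
    return reads * t_global;
}

/* Cost of staging the element in local memory first: one copy from
   global memory, one synchronisation, then `reads` local reads. */
double cost_staged(double reads, double t_copy, double t_sync, double t_local)
{
    return t_copy + t_sync + reads * t_local;
}

/* Number of reads at which staging costs the same as direct access.
   Returns -1 when local reads are not cheaper than global reads,
   in which case staging never pays off. */
double break_even_reads(double t_global, double t_copy, double t_sync,
                        double t_local)
{
    if (t_global <= t_local)
        return -1.0;
    return (t_copy + t_sync) / (t_global - t_local);
}
```

With assumed costs of 100 per global read, 50 per local read, 80 for the copy and 20 for the synchronisation, the break-even point is 2 reads, so data that is read only once is cheaper to fetch directly. On hardware that caches global reads, the effective global read cost for repeated accesses can be lower than the raw latency, which pushes the break-even point further out.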

2012-09-24 06:55:00 Tomas

It's clear to me. He is copying from global to local, using the data once, and then discarding it. For that purpose, a single direct access is better. – DarkZeros
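
A minimal sketch of what such direct access could look like for the first two terms of the intersection test in the question. The helper `plane_terms` is hypothetical and not part of the original kernel; it reads each field it needs once from global memory into a private variable, with no local staging and no barrier. The `#ifndef` guard is an assumption made purely so the same code also compiles as plain C on the host.

```c
/* When compiled as plain C rather than OpenCL C, make the address-space
   qualifier vanish so the helper can be exercised on the host. */
#ifndef __OPENCL_VERSION__
#define __global
#endif

typedef struct fastTriangle {
    float cx, cy, cz, cw;
    float nx, ny, nz, nd;
    float ux, uy, uz, ud;
    float vx, vy, vz, vd;
} fastTriangle;

/* Hypothetical helper: computes the `det` and `dett` terms for a ray
   leaving triangle `src` in direction (dx, dy, dz) towards the plane of
   triangle `dst`, reading both triangles straight from global memory. */
void plane_terms(__global const fastTriangle *src,
                 __global const fastTriangle *dst,
                 float dx, float dy, float dz,
                 float *det, float *dett)
{
    /* Each field is read once into a private variable. */
    const float nx = dst->nx, ny = dst->ny, nz = dst->nz, nd = dst->nd;
    const float cx = src->cx, cy = src->cy, cz = src->cz;

    *det  = dx * nx + dy * ny + dz * nz;
    *dett = nd - (cx * nx + cy * ny + cz * nz);
}
```

Inside the kernel's inner loop this would be called as `plane_terms(&objects[i], &objects[w], wsDir.x, wsDir.y, wsDir.z, &det, &dett);`, replacing the reads from `triangles[0]` and `triangles[1]`.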
