CUBLAS library does not give correct results

I am exploring the CUBLAS library, so I wrote code for matrix multiplication using its API. However, I am getting strange results. I am pasting the code and the output below. Please help.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>

// Thread block size 
#define BLOCK_SIZE 3 

#define WA 3 // Matrix A width 
#define HA 3 // Matrix A height 
#define WB 3 // Matrix B width 
#define HB WA // Matrix B height 
#define WC WB // Matrix C width 
#define HC HA // Matrix C height 
// Allocates a matrix with random float entries. 
void randomInit(float* data, int size) 
{ 
    for (int i = 0; i < size; ++i) 
        data[i] = i; 
} 
///////////////////////////////////////////////////////// 
// Program main 
///////////////////////////////////////////////////////// 

int main(int argc, char** argv) 
{ 

    // 1. allocate host memory for matrices A and B 
    unsigned int size_A = WA * HA; 
    unsigned int mem_size_A = sizeof(float) * size_A; 
    float* h_A = (float*) malloc(mem_size_A); 

    unsigned int size_B = WB * HB; 
    unsigned int mem_size_B = sizeof(float) * size_B; 
    float* h_B = (float*) malloc(mem_size_B); 
    cublasStatus_t status; 
    // 2. initialize host memory 
    randomInit(h_A, size_A); 
    randomInit(h_B, size_B); 

    // 3. print out A and B 
    printf("\n\nMatrix A\n"); 
    for(int i = 0; i < size_A; i++) 
    { 
        printf("%f ", h_A[i]); 
        if(((i + 1) % WA) == 0) 
            printf("\n"); 
    } 

    printf("\n\nMatrix B\n"); 
    for(int i = 0; i < size_B; i++) 
    { 
        printf("%f ", h_B[i]); 
        if(((i + 1) % WB) == 0) 
            printf("\n"); 
    } 

    // 8. allocate device memory 
    float* d_A; 
    float* d_B; 
    cudaMalloc((void**) &d_A, mem_size_A); 
    cudaMalloc((void**) &d_B, mem_size_B); 

    // 9. copy host memory to device 
    status = cublasSetMatrix(BLOCK_SIZE,BLOCK_SIZE,sizeof(float), h_A, BLOCK_SIZE,d_A, BLOCK_SIZE); 
    if (status != CUBLAS_STATUS_SUCCESS) { 
        fprintf (stderr, "!!!! CUBLAS initialization error\n"); 
        return EXIT_FAILURE; 
    } 

    status = cublasSetMatrix(BLOCK_SIZE,BLOCK_SIZE,sizeof(float), h_B, BLOCK_SIZE,d_B, BLOCK_SIZE); 
    if (status != CUBLAS_STATUS_SUCCESS) { 
        fprintf (stderr, "!!!! CUBLAS initialization error\n"); 
        return EXIT_FAILURE; 
    } 

    // 4. allocate host memory for the result C 
    unsigned int size_C = WC * HC; 
    unsigned int mem_size_C = sizeof(float) * size_C; 
    float* h_C = (float*) malloc(mem_size_C); 

    // 10. allocate device memory for the result 
    float* d_C; 
    cudaMalloc((void**) &d_C, mem_size_C); 

    // 5. perform the calculation 
    cublasSgemm('N','N',BLOCK_SIZE,BLOCK_SIZE,BLOCK_SIZE,1.0f,d_A,BLOCK_SIZE,d_B,BLOCK_SIZE,1.0f,d_C,BLOCK_SIZE); 
    status = cublasGetError(); 
    if (status) { 
        fprintf (stderr, "!!!! kernel execution error.\n"); 
        return EXIT_FAILURE; 
    } 

    // 11. copy result from device to host 
    status = cublasGetMatrix(BLOCK_SIZE,BLOCK_SIZE,sizeof(float),d_C, BLOCK_SIZE,h_C,BLOCK_SIZE); 
    if (status != CUBLAS_STATUS_SUCCESS) { 
        fprintf (stderr, "!!!! device access error (read C)\n"); 
        return EXIT_FAILURE; 
    } 

    // 6. print out the results 
    printf("\n\nMatrix C (Results)\n"); 
    for(int i = 0; i < size_C; i++) 
    { 
        printf("%f ", h_C[i]); 
        if(((i + 1) % WC) == 0) 
            printf("\n"); 
    } 
    printf("\n"); 

    // 7. clean up memory 
    free(h_A); 
    free(h_B); 
    free(h_C); 
    cudaFree(d_A); 
    cudaFree(d_B); 
    cudaFree(d_C); 

}

--------- Output -------------

Matrix A

0.000000 1.000000 2.000000
3.000000 4.000000 5.000000
6.000000 7.000000 8.000000

Matrix B

0.000000 1.000000 2.000000
3.000000 4.000000 5.000000
6.000000 7.000000 8.000000

Matrix C (Results)

-1998397155538108416.000000 -1998397155538108416.000000 -1998397155538108416.000000

Source

2012-09-06 user1439690

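Your problem is that you are using uninitialised memory in your sgemm call. cublasSgemm(), like every BLAS GEMM operation, computes

C = alpha * op(A) * op(B) + beta * C

In your code you are passing op(A)=A, op(B)=B, alpha=1. and beta=1. But you never set the values of C, and GPU memory is not initialised when it is allocated, so it can contain arbitrary values. Those values are added into the product, which gives the corrupted result you are seeing.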

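Change the function call to this:

cublasSgemm('N','N',BLOCK_SIZE,BLOCK_SIZE,BLOCK_SIZE,1.0f,d_A, 
      BLOCK_SIZE,d_B,BLOCK_SIZE,0.f,d_C,BLOCK_SIZE);

which computes

C = 1.0 * A * B + 0. * C

and you should get more sensible output. Once you have it producing output, keep in mind that CUBLAS assumes matrices are stored in column-major order, so the correct printed output for your printed inputs should be

15 18 21 
42 54 66 
69 90 111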

Source

2012-09-06 05:20:20 talonmies

Thanks ... it works :) – user1439690

@user1439690: If this answer solved your problem, perhaps you would be kind enough to [accept it](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work). – talonmies
