Reduce matrix rows or columns in CUDA

I'm using CUDA with cuBLAS to perform matrix operations.

I need to sum the rows (or columns) of a matrix. Currently I do it by multiplying the matrix by a vector of ones, but this doesn't seem very efficient.

Is there a better way? I couldn't find anything in cuBLAS.

Thanks.

2013-01-10 Ran

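http://stackoverflow.com/questions/3312228/cuda-add-rows-of-a-matrix may help. However, if you only need this "sometimes", i.e. it does not account for a significant part of your run time, I would say your method is perfectly acceptable, even with the overhead of all the extra multiplications. – us2012

But in any case, this is a very easy kernel to implement yourself. Take a look at the example in the CalState presentation on CUDA: http://www.calstatela.edu/faculty/rpamula/cs370/GPUProgramming.pptx – us2012

"Sometimes" is not the right word. I run it repeatedly, many times, as part of training a neural network. The example code in the ppt doesn't work (the parameter is a pointer, and the code tries to access it like a 2D array). – Ran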
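As a rough illustration of the hand-written kernel discussed in these comments, here is a minimal sketch that sums the columns of a row-major matrix held in a flat array. It is not the code from the presentation; the names `columnSums` and `columnSumsHost` are made up for this example, and error checking is omitted for brevity. The flat index `a[row * ncols + col]` is what replaces the 2D-style access that does not work on a plain pointer.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One thread per column of a row-major matrix stored in a flat array.
// Neighbouring threads read neighbouring addresses, so the loads coalesce.
__global__ void columnSums(const float *a, float *sums, int nrows, int ncols)
{
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= ncols) return;

    float acc = 0.f;
    for (int row = 0; row < nrows; ++row)
        acc += a[row * ncols + col];   // flat index instead of a[row][col]
    sums[col] = acc;
}

// Hypothetical host helper: copy in, launch, copy out (no error checking).
std::vector<float> columnSumsHost(const std::vector<float> &h_a, int nrows, int ncols)
{
    float *d_a = nullptr, *d_sums = nullptr;
    cudaMalloc(&d_a, h_a.size() * sizeof(float));
    cudaMalloc(&d_sums, ncols * sizeof(float));
    cudaMemcpy(d_a, h_a.data(), h_a.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (ncols + threads - 1) / threads;
    columnSums<<<blocks, threads>>>(d_a, d_sums, nrows, ncols);

    std::vector<float> h_sums(ncols);
    // Blocking copy on the default stream: waits for the kernel to finish.
    cudaMemcpy(h_sums.data(), d_sums, ncols * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_sums);
    return h_sums;
}

int main()
{
    const int nrows = 3, ncols = 4;
    std::vector<float> h_a(nrows * ncols);
    for (int i = 0; i < nrows * ncols; ++i) h_a[i] = float(i);   // rows: 0..3, 4..7, 8..11

    const std::vector<float> sums = columnSumsHost(h_a, nrows, ncols);
    for (int j = 0; j < ncols; ++j) printf("column %d sum = %g\n", j, sums[j]);
    return 0;
}
```

Summing rows of a row-major matrix with one thread per row would make neighbouring threads read addresses `ncols` elements apart, so a row-sum kernel usually assigns a warp or a block to each row instead.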

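Actually, multiplying the matrix by a vector of ones using cublas_gemv() is a very efficient approach, unless you are considering writing your own kernel by hand.

You can easily profile the memory bandwidth of cublas_gemv(). It is very close to that of simply reading the whole matrix once, which can be regarded as the theoretical peak performance for matrix row/column summation.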
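One way to check this claim is to time the call with CUDA events and convert the elapsed time into an effective bandwidth. The sketch below assumes single precision (`cublasSgemv`), a 3000 x 3000 matrix and 100 repetitions; the helper names `effectiveBandwidthGBs` and `timeGemvOnesMs` are invented for this example, and error checking is left out. The result would then be compared with the memory bandwidth of the particular GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Effective bandwidth in GB/s for one gemv on an m x n float matrix:
// the matrix and the n-vector x are read, the m-vector y is written.
double effectiveBandwidthGBs(size_t m, size_t n, float elapsedMs)
{
    const double bytes = double(m * n + n + m) * sizeof(float);
    return bytes / (elapsedMs * 1e-3) / 1e9;
}

// Average time in ms of one cublasSgemv (A * ones) over `reps` runs.
float timeGemvOnesMs(int m, int n, int reps)
{
    std::vector<float> h_A(size_t(m) * n, 1.f), h_ones(n, 1.f);
    float *d_A = nullptr, *d_ones = nullptr, *d_y = nullptr;
    cudaMalloc(&d_A, h_A.size() * sizeof(float));
    cudaMalloc(&d_ones, n * sizeof(float));
    cudaMalloc(&d_y, m * sizeof(float));
    cudaMemcpy(d_A, h_A.data(), h_A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ones, h_ones.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.f, beta = 0.f;

    // Warm-up call so one-time initialisation is not timed.
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m, d_ones, 1, &beta, d_y, 1);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, m, d_ones, 1, &beta, d_y, 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_ones);
    cudaFree(d_y);
    return ms / reps;
}

int main()
{
    const int m = 3000, n = 3000;
    const float ms = timeGemvOnesMs(m, n, 100);
    printf("gemv: %.3f ms per call, %.1f GB/s effective\n",
           ms, effectiveBandwidthGBs(m, n, ms));
    return 0;
}
```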

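The extra "× 1.0" operation does not cost much performance, because:

cublas_gemv() is basically a memory-bandwidth-bound operation, so the extra arithmetic instructions are not the bottleneck;
the FMA instruction performs the multiply and the add as a single instruction, which further reduces the instruction count;
the ones vector usually takes far less memory than the matrix and is easily cached by the GPU, which reduces the memory bandwidth it consumes.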

cublas_gemv() also helps you deal with the matrix layout problem: it works with row-major or column-major storage and with arbitrary padding.
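To make the layout point concrete, the following sketch computes both row sums and column sums of a padded, column-major matrix with `cublasSgemv`. The helper name `rowAndColumnSums`, the 2 x 3 test matrix and the padding value are assumptions made for this example, and error checking is omitted. The leading dimension `lda` tells cuBLAS how far apart consecutive columns are, so the padding rows are never read.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// A is column-major with m rows, n columns and leading dimension lda >= m;
// entries m..lda-1 of each column are padding.
void rowAndColumnSums(const std::vector<float> &h_A, int m, int n, int lda,
                      std::vector<float> &rowSums, std::vector<float> &colSums)
{
    std::vector<float> h_ones(std::max(m, n), 1.f);
    float *d_A = nullptr, *d_ones = nullptr, *d_row = nullptr, *d_col = nullptr;
    cudaMalloc(&d_A, h_A.size() * sizeof(float));
    cudaMalloc(&d_ones, h_ones.size() * sizeof(float));
    cudaMalloc(&d_row, m * sizeof(float));
    cudaMalloc(&d_col, n * sizeof(float));
    cudaMemcpy(d_A, h_A.data(), h_A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ones, h_ones.data(), h_ones.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.f, beta = 0.f;

    // Row sums: y = A * ones(n), y has m entries.
    cublasSgemv(handle, CUBLAS_OP_N, m, n, &alpha, d_A, lda, d_ones, 1, &beta, d_row, 1);
    // Column sums: y = A^T * ones(m), y has n entries.
    cublasSgemv(handle, CUBLAS_OP_T, m, n, &alpha, d_A, lda, d_ones, 1, &beta, d_col, 1);

    rowSums.resize(m);
    colSums.resize(n);
    cudaMemcpy(rowSums.data(), d_row, m * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(colSums.data(), d_col, n * sizeof(float), cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_ones);
    cudaFree(d_row);
    cudaFree(d_col);
}

int main()
{
    const int m = 2, n = 3, lda = 4;
    const float pad = -1000.f;   // padding value, should not affect the sums
    // A = [1 2 3; 4 5 6], stored column by column with two padding entries each.
    const std::vector<float> h_A = {1, 4, pad, pad,  2, 5, pad, pad,  3, 6, pad, pad};

    std::vector<float> rowSums, colSums;
    rowAndColumnSums(h_A, m, n, lda, rowSums, colSums);

    for (int i = 0; i < m; ++i) printf("row %d sum = %g\n", i, rowSums[i]);      // 6, 15
    for (int j = 0; j < n; ++j) printf("column %d sum = %g\n", j, colSums[j]);   // 5, 7, 9
    return 0;
}
```

A row-major m x n matrix occupies memory exactly like a column-major n x m matrix, so for row-major data the roles of `CUBLAS_OP_N` and `CUBLAS_OP_T` are swapped and `lda` is the row pitch.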

I also asked a similar question about this. My experiments show that cublas_gemv() is better than a segmented reduction using Thrust::reduce_by_key, which is another approach to matrix row summation.
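For comparison, this is roughly what the segmented-reduction alternative looks like. It is a sketch written for this page, not the code used in the experiments mentioned above; the functor `RowIndex` and the function `rowSumsByKey` are hypothetical names. Each element of a row-major matrix gets its row index as a key, generated on the fly, and `thrust::reduce_by_key` sums the runs of equal keys.

```cuda
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>

// Maps the flat index of a row-major matrix element to its row index.
struct RowIndex
{
    int ncols;
    __host__ __device__ explicit RowIndex(int n) : ncols(n) {}
    __host__ __device__ int operator()(int i) const { return i / ncols; }
};

// Row sums of a row-major nrows x ncols matrix via a segmented reduction.
thrust::device_vector<float> rowSumsByKey(const thrust::device_vector<float> &d_a,
                                          int nrows, int ncols)
{
    thrust::device_vector<float> d_sums(nrows);

    // Keys 0,0,...,0,1,1,...,1,... are computed lazily, never stored.
    auto keys = thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                                RowIndex(ncols));

    thrust::reduce_by_key(keys, keys + nrows * ncols,        // one key per element
                          d_a.begin(),                       // values to sum
                          thrust::make_discard_iterator(),   // output keys not needed
                          d_sums.begin());                   // one sum per row
    return d_sums;
}

int main()
{
    const int nrows = 3, ncols = 4;
    thrust::device_vector<float> d_a(nrows * ncols);
    thrust::sequence(d_a.begin(), d_a.end());   // 0, 1, ..., 11

    thrust::host_vector<float> h_sums = rowSumsByKey(d_a, nrows, ncols);
    for (int i = 0; i < nrows; ++i) printf("row %d sum = %g\n", i, h_sums[i]);   // 6, 22, 38
    return 0;
}
```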

Source

2013-01-10 18:07:27 kangshiyin

Makes sense. For some reason, I was using gemm for it. :) Thanks. – Ran

@Ran, my tests show that 'cublas_gemv' is 2x faster than 'cublas_gemm' for this operation. The test matrix size is 3000 x 3000. – kangshiyin

Useful answers on this same topic can be found in the posts "Reduce matrix rows with CUDA" and "Reduce matrix columns with CUDA".

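Here I just want to point out how the approach of reducing the columns of a matrix by multiplying a row vector by that matrix can be generalized to compute a linear combination of an ensemble of vectors. In other words, suppose one wants to compute the following vector basis expansion

f(x_m) = \sum_{n=1}^{N} c_n \psi_n(x_m), m = 1, ..., M

where the f(x_m) are samples of the function f(x), the \psi_n's are the basis functions and the c_n's are the expansion coefficients. Then the \psi_n's can be arranged in an N x M matrix and the coefficients c_n in a row vector, and the vector x matrix product can be computed with cublas<t>gemv. The code below is a fully worked example.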
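In matrix notation the relation between this expansion and the plain reduction can be written as below. The symbols \mathbf{c}, \boldsymbol{\Psi} and \mathbf{f}, and the 1-based index ranges, are introduced here only for illustration.

```latex
\mathbf{f} = \mathbf{c}\,\boldsymbol{\Psi},
\qquad
\mathbf{c} = (c_1, \dots, c_N), \quad
\Psi_{nm} = \psi_n(x_m), \quad
\mathbf{f} = \bigl(f(x_1), \dots, f(x_M)\bigr)

% Special case: all coefficients equal to one gives the column sums of \Psi
c_n = 1 \;\; \forall n
\quad\Longrightarrow\quad
f(x_m) = \sum_{n=1}^{N} \psi_n(x_m)
```

With every coefficient set to one, which is how the code initializes the coefficient vectors, the linear combination reduces to the column sums of the matrix.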

    #include <cublas_v2.h>

    #include <thrust/device_vector.h>
    #include <thrust/random.h>

    #include <stdio.h>
    #include <iostream>

    #include "Utilities.cuh"

    /********************************************/
    /* LINEAR COMBINATION FUNCTION - FLOAT CASE */
    /********************************************/
    void linearCombination(const float * __restrict__ d_coeff,
                           const float * __restrict__ d_basis_functions_real,
                           float * __restrict__ d_linear_combination,
                           const int N_basis_functions,
                           const int N_sampling_points,
                           const cublasHandle_t handle) {

        float alpha = 1.f;
        float beta  = 0.f;
        cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_N, N_sampling_points, N_basis_functions, &alpha,
                                   d_basis_functions_real, N_sampling_points, d_coeff, 1, &beta,
                                   d_linear_combination, 1));
    }

    void linearCombination(const double * __restrict__ d_coeff,
                           const double * __restrict__ d_basis_functions_real,
                           double * __restrict__ d_linear_combination,
                           const int N_basis_functions,
                           const int N_sampling_points,
                           const cublasHandle_t handle) {

        double alpha = 1.;
        double beta  = 0.;
        cublasSafeCall(cublasDgemv(handle, CUBLAS_OP_N, N_sampling_points, N_basis_functions, &alpha,
                                   d_basis_functions_real, N_sampling_points, d_coeff, 1, &beta,
                                   d_linear_combination, 1));
    }

    /********/
    /* MAIN */
    /********/
    int main()
    {
        const int N_basis_functions = 5;     // --- Number of rows    -> Number of basis functions
        const int N_sampling_points = 8;     // --- Number of columns -> Number of sampling points of the basis functions

        // --- Random uniform integer distribution between 10 and 99
        thrust::default_random_engine rng;
        thrust::uniform_int_distribution<int> dist(10, 99);

        // --- Matrix allocation and initialization
        thrust::device_vector<float> d_basis_functions_real(N_basis_functions * N_sampling_points);
        for (size_t i = 0; i < d_basis_functions_real.size(); i++)
            d_basis_functions_real[i] = (float)dist(rng);

        thrust::device_vector<double> d_basis_functions_double_real(N_basis_functions * N_sampling_points);
        for (size_t i = 0; i < d_basis_functions_double_real.size(); i++)
            d_basis_functions_double_real[i] = (double)dist(rng);

        /************************************/
        /* COMPUTING THE LINEAR COMBINATION */
        /************************************/
        cublasHandle_t handle;
        cublasSafeCall(cublasCreate(&handle));

        thrust::device_vector<float>  d_linear_combination_real(N_sampling_points);
        thrust::device_vector<double> d_linear_combination_double_real(N_sampling_points);
        thrust::device_vector<float>  d_coeff_real(N_basis_functions, 1.f);
        thrust::device_vector<double> d_coeff_double_real(N_basis_functions, 1.);

        linearCombination(thrust::raw_pointer_cast(d_coeff_real.data()),
                          thrust::raw_pointer_cast(d_basis_functions_real.data()),
                          thrust::raw_pointer_cast(d_linear_combination_real.data()),
                          N_basis_functions, N_sampling_points, handle);

        linearCombination(thrust::raw_pointer_cast(d_coeff_double_real.data()),
                          thrust::raw_pointer_cast(d_basis_functions_double_real.data()),
                          thrust::raw_pointer_cast(d_linear_combination_double_real.data()),
                          N_basis_functions, N_sampling_points, handle);

        /*************************/
        /* DISPLAYING THE RESULT */
        /*************************/
        std::cout << "Real case \n\n";
        for (int j = 0; j < N_sampling_points; j++) {
            std::cout << "Column " << j << " - [ ";
            for (int i = 0; i < N_basis_functions; i++)
                std::cout << d_basis_functions_real[i * N_sampling_points + j] << " ";
            std::cout << "] = " << d_linear_combination_real[j] << "\n";
        }

        std::cout << "\n\nDouble real case \n\n";
        for (int j = 0; j < N_sampling_points; j++) {
            std::cout << "Column " << j << " - [ ";
            for (int i = 0; i < N_basis_functions; i++)
                std::cout << d_basis_functions_double_real[i * N_sampling_points + j] << " ";
            std::cout << "] = " << d_linear_combination_double_real[j] << "\n";
        }

        return 0;
    }

Source

2015-09-16 12:44:14 JackOLantern

What is "Utilities.cuh"? Is it included in cublas or thrust? –

@TejusPrasad You can find it in this [github repository](https://github.com/OrangeOwlSolutions/CUDA-Utilities). – JackOLantern
