Code speed does not improve when using Intel intrinsics

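I am using intrinsics to speed up some OpenCV code. After replacing the code with intrinsics, however, its running time is almost the same, or even worse. I can't figure out what is happening or why. I have been investigating this for quite a while without seeing any change. I would appreciate it if someone could help me. Thanks! Here is the code: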

    // If useSSE is true, the code runs with intrinsics and takes 1.45 ms on my computer;
    // if not, the general code runs and takes the same time.
    cv::Mat_<float> results(shape.rows, 2);
    if (useSSE) {
        float* pshape = (float*)shape.data;
        results = shape.clone();
        float* presults = (float*)results.data;
        // use SSE
        __m128 xyxy_center = _mm_set_ps(bbox.center_y, bbox.center_x, bbox.center_y, bbox.center_x);

        float bbox_width = bbox.width / 2;
        float bbox_height = bbox.height / 2;
        __m128 xyxy_size = _mm_set_ps(bbox_height, bbox_width, bbox_height, bbox_width);
        gettimeofday(&start, NULL); // this is for counting time

        int shape_size = shape.rows * shape.cols;
        for (int i = 0; i < shape_size; i += 4) {
            __m128 a = _mm_loadu_ps(pshape + i);
            __m128 result = _mm_div_ps(_mm_sub_ps(a, xyxy_center), xyxy_size);
            _mm_storeu_ps(presults + i, result);
        }
    } else {
        // general (non-SSE) code
        for (int i = 0; i < shape.rows; i++) {
            results(i, 0) = (shape(i, 0) - bbox.center_x) / (bbox.width / 2.0);
            results(i, 1) = (shape(i, 1) - bbox.center_y) / (bbox.height / 2.0);
        }
    }
    gettimeofday(&end, NULL);
    // elapsed time in microseconds
    diff = 1000000 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec;
    std::cout << diff << "-----" << std::endl;
    return results;

Source

2016-06-08 JochimYoung

A complete code example would help you get answers. See how to create a [mcve]. – Miki

You should also explain what the code is doing. – Catree

Do you really need `div_ps` there, or would multiplying by the reciprocal be fine? – harold

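Your SSE optimization will corrupt memory near your data if shape.rows % 2 == 1, because the loop always processes four floats at a time.

Avoid using the variable i in the loop; use the pointers directly. The compiler may or may not optimize away the addition.

Use multiplication instead of division: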

// Precompute the reciprocals so the loop multiplies instead of dividing.
float bbox_width_inv = 2.0f / bbox.width;
float bbox_height_inv = 2.0f / bbox.height;
__m128 xyxy_size_inv = _mm_set_ps(bbox_height_inv, bbox_width_inv, bbox_height_inv, bbox_width_inv);
int shape_size = shape.rows * shape.cols;
float* pshape_end = pshape + shape_size;
// Round the element count down to a multiple of 4 for the vector loop.
float* pshape_end_batch = pshape + (shape_size & ~3);
for (; pshape < pshape_end_batch; pshape += 4, presults += 4) {
    __m128 a = _mm_loadu_ps(pshape);
    __m128 result = _mm_mul_ps(_mm_sub_ps(a, xyxy_center), xyxy_size_inv);
    _mm_storeu_ps(presults, result);
}
// Scalar tail: handles the leftover (x, y) pair when shape.rows is odd.
while (pshape < pshape_end) {
    *presults++ = (*pshape++ - bbox.center_x) * bbox_width_inv;
    *presults++ = (*pshape++ - bbox.center_y) * bbox_height_inv;
}
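For anyone who wants to try this outside the original project, here is a minimal self-contained sketch of the same idea. It is not the asker's actual code: `Box`, `normalize_scalar` and `normalize_sse` are hypothetical names, and plain float arrays laid out as x0, y0, x1, y1, ... stand in for the `cv::Mat` data. It assumes an x86 compiler with SSE available.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical stand-in for the bbox type used in the question.
struct Box {
    float center_x, center_y, width, height;
};

// Scalar reference: pts holds n floats laid out as x0, y0, x1, y1, ...
void normalize_scalar(const float* pts, float* out, std::size_t n, const Box& b) {
    assert(n % 2 == 0);  // interleaved (x, y) pairs
    for (std::size_t i = 0; i < n; i += 2) {
        out[i]     = (pts[i]     - b.center_x) / (b.width  / 2.0f);
        out[i + 1] = (pts[i + 1] - b.center_y) / (b.height / 2.0f);
    }
}

// SSE version: two (x, y) pairs per iteration, then a scalar tail.
void normalize_sse(const float* pts, float* out, std::size_t n, const Box& b) {
    assert(n % 2 == 0);
    const float inv_w = 2.0f / b.width;
    const float inv_h = 2.0f / b.height;
    // _mm_set_ps lists lanes from highest to lowest, so the
    // in-memory order of these vectors is x, y, x, y.
    const __m128 center = _mm_set_ps(b.center_y, b.center_x, b.center_y, b.center_x);
    const __m128 inv    = _mm_set_ps(inv_h, inv_w, inv_h, inv_w);

    const float* end     = pts + n;
    const float* end_vec = pts + (n & ~static_cast<std::size_t>(3));
    for (; pts < end_vec; pts += 4, out += 4) {
        __m128 a = _mm_loadu_ps(pts);
        _mm_storeu_ps(out, _mm_mul_ps(_mm_sub_ps(a, center), inv));
    }
    // Scalar tail: at most one leftover (x, y) pair.
    while (pts < end) {
        *out++ = (*pts++ - b.center_x) * inv_w;
        *out++ = (*pts++ - b.center_y) * inv_h;
    }
}
```

Calling both functions on the same input, including one with an odd number of points, is a quick way to check that the vector loop never touches memory past the end of the buffers. Note that multiplying by a reciprocal can differ from a true division in the last bit for some inputs.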

Try disassembling the code generated from the intrinsics and check that there are enough registers to do the work, so that it does not temporarily store results in RAM.
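One way to make the generated code easy to inspect is to isolate the vector step in a small function of its own. The sketch below is only an illustration: `normalize4` is a hypothetical name, and the function assumes the caller has already built the `center` and `inv` vectors. With GCC, something like `g++ -O2 -S kernel.cpp` writes the assembly to `kernel.s`.

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical kernel: one 4-float step of the normalization.
// extern "C" keeps the symbol name unmangled, so it is easy to
// locate in the assembly listing.
extern "C" void normalize4(const float* in, float* out, __m128 center, __m128 inv) {
    __m128 a = _mm_loadu_ps(in);                                  // unaligned load
    _mm_storeu_ps(out, _mm_mul_ps(_mm_sub_ps(a, center), inv));   // (a - center) * inv
}
```

In an optimized build one would hope to see little more than a load, a `subps`, a `mulps` and a store. If the listing instead shows intermediate values being written to and read back from the stack, it is worth checking whether optimizations are actually enabled, since unoptimized builds typically keep every intermediate in memory.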

Source

2016-06-08 18:47:30 taarraas

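Thanks for all the advice you gave me! First, shape.rows is basically 2; I'm sorry I didn't make that clear. As for the other suggestions, I tried those improvements, but they made no difference to the running time. – JochimYoung

Replacing the division with multiplication didn't help either? – taarraas

Nope... yes, it's very strange. – JochimYoung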
