Fast convolution on iOS

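I am trying to perform a convolution on an image using a generated 16x16 kernel. I used the OpenCV FilterEngine class, but it only runs on the CPU, and I am trying to speed up my app. I know OpenCV also has FilterEngine_GPU, but as I understand it, it does not support iOS. GPUImage lets you perform a convolution with a generated 3x3 filter. Is there any other way to accelerate the convolution? Another library that runs on the GPU?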

Source

2013-10-14 RamBracha

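You can do a 16x16 convolution with GPUImage, but you will need to write your own filter. The 3x3 convolution in the framework samples the pixels in a 3x3 area around each pixel of the input image and applies the matrix of weights you feed in. The GPUImage3x3ConvolutionFilter.m source file within the framework should be reasonably easy to read, but I can provide a little context in case you want to go beyond what is there.

The first thing I do is calculate, in the vertex shader below, the positions from which to sample the pixel colors used in the convolution: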

attribute vec4 position; 
attribute vec4 inputTextureCoordinate; 

uniform float texelWidth; 
uniform float texelHeight; 

varying vec2 textureCoordinate; 
varying vec2 leftTextureCoordinate; 
varying vec2 rightTextureCoordinate; 

varying vec2 topTextureCoordinate; 
varying vec2 topLeftTextureCoordinate; 
varying vec2 topRightTextureCoordinate; 

varying vec2 bottomTextureCoordinate; 
varying vec2 bottomLeftTextureCoordinate; 
varying vec2 bottomRightTextureCoordinate; 

void main() 
{ 
    gl_Position = position; 

    vec2 widthStep = vec2(texelWidth, 0.0); 
    vec2 heightStep = vec2(0.0, texelHeight); 
    vec2 widthHeightStep = vec2(texelWidth, texelHeight); 
    vec2 widthNegativeHeightStep = vec2(texelWidth, -texelHeight); 

    textureCoordinate = inputTextureCoordinate.xy; 
    leftTextureCoordinate = inputTextureCoordinate.xy - widthStep; 
    rightTextureCoordinate = inputTextureCoordinate.xy + widthStep; 

    topTextureCoordinate = inputTextureCoordinate.xy - heightStep; 
    topLeftTextureCoordinate = inputTextureCoordinate.xy - widthHeightStep; 
    topRightTextureCoordinate = inputTextureCoordinate.xy + widthNegativeHeightStep; 

    bottomTextureCoordinate = inputTextureCoordinate.xy + heightStep; 
    bottomLeftTextureCoordinate = inputTextureCoordinate.xy - widthNegativeHeightStep; 
    bottomRightTextureCoordinate = inputTextureCoordinate.xy + widthHeightStep; 
}

This vertex shader is the first step. Because normalized coordinates are used, the X and Y spacing between pixels is 1.0/[image width] and 1.0/[image height], respectively.
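As a rough CPU-side illustration of the same arithmetic, the sketch below computes the nine normalized sample coordinates around one point. The names `OFFSETS` and `sample_coordinates` are hypothetical and chosen only to mirror the varyings in the shader; this is not code from GPUImage.

```python
# Offsets in whole texels, named after the varyings in the vertex shader.
OFFSETS = {
    "topLeft": (-1, -1), "top": (0, -1), "topRight": (1, -1),
    "left": (-1, 0), "center": (0, 0), "right": (1, 0),
    "bottomLeft": (-1, 1), "bottom": (0, 1), "bottomRight": (1, 1),
}


def sample_coordinates(u, v, image_width, image_height):
    """Return the nine normalized coordinates sampled around (u, v)."""
    texel_width = 1.0 / image_width    # X spacing between neighbouring pixels
    texel_height = 1.0 / image_height  # Y spacing between neighbouring pixels
    return {
        name: (u + dx * texel_width, v + dy * texel_height)
        for name, (dx, dy) in OFFSETS.items()
    }
```

For a 4x2 image, for instance, the step is 0.25 in X and 0.5 in Y, so the "right" neighbour of (0.5, 0.5) sits at (0.75, 0.5).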

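The texture coordinates of the pixels to be sampled are calculated in the vertex shader for two reasons. First, it is more efficient to do this calculation once per vertex (there are six of them in the two triangles that make up the image rectangle) than once per fragment (pixel). Second, it avoids dependent texture reads where possible. A dependent texture read is one where the texture coordinate to read from is calculated in the fragment shader rather than simply passed in from the vertex shader, and such reads are much slower on iOS GPUs.

Once the texture positions have been calculated in the vertex shader, I pass them to the fragment shader as varyings and use the following code there: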

uniform sampler2D inputImageTexture; 

uniform mat3 convolutionMatrix; 

varying vec2 textureCoordinate; 
varying vec2 leftTextureCoordinate; 
varying vec2 rightTextureCoordinate; 

varying vec2 topTextureCoordinate; 
varying vec2 topLeftTextureCoordinate; 
varying vec2 topRightTextureCoordinate; 

varying vec2 bottomTextureCoordinate; 
varying vec2 bottomLeftTextureCoordinate; 
varying vec2 bottomRightTextureCoordinate; 

void main() 
{ 
    vec3 bottomColor = texture2D(inputImageTexture, bottomTextureCoordinate).rgb; 
    vec3 bottomLeftColor = texture2D(inputImageTexture, bottomLeftTextureCoordinate).rgb; 
    vec3 bottomRightColor = texture2D(inputImageTexture, bottomRightTextureCoordinate).rgb; 
    vec4 centerColor = texture2D(inputImageTexture, textureCoordinate); 
    vec3 leftColor = texture2D(inputImageTexture, leftTextureCoordinate).rgb; 
    vec3 rightColor = texture2D(inputImageTexture, rightTextureCoordinate).rgb; 
    vec3 topColor = texture2D(inputImageTexture, topTextureCoordinate).rgb; 
    vec3 topRightColor = texture2D(inputImageTexture, topRightTextureCoordinate).rgb; 
    vec3 topLeftColor = texture2D(inputImageTexture, topLeftTextureCoordinate).rgb; 

    vec3 resultColor = topLeftColor * convolutionMatrix[0][0] + topColor * convolutionMatrix[0][1] + topRightColor * convolutionMatrix[0][2]; 
    resultColor += leftColor * convolutionMatrix[1][0] + centerColor.rgb * convolutionMatrix[1][1] + rightColor * convolutionMatrix[1][2]; 
    resultColor += bottomLeftColor * convolutionMatrix[2][0] + bottomColor * convolutionMatrix[2][1] + bottomRightColor * convolutionMatrix[2][2]; 

    gl_FragColor = vec4(resultColor, centerColor.a);
}

This reads each of the 9 colors and applies the weights from the 3x3 matrix that was supplied for the convolution.
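When writing a custom filter, it can help to have a CPU reference to check the shader output against. The following is a minimal sketch of the same weighted sum, assuming the image is a list of rows of (r, g, b) tuples and that out-of-range reads are clamped to the edge (an assumption standing in for the texture's wrap mode). The function name `convolve3x3` is hypothetical.

```python
def convolve3x3(image, matrix):
    """Apply a 3x3 weight matrix to an image given as rows of (r, g, b) tuples."""
    height, width = len(image), len(image[0])

    def texel(x, y):
        # Clamp to the nearest edge pixel (assumed wrap behaviour).
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
        return image[y][x]

    result = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = [0.0, 0.0, 0.0]
            for j in range(3):      # first index: top, middle, bottom
                for i in range(3):  # second index: left, center, right
                    weight = matrix[j][i]
                    r, g, b = texel(x + i - 1, y + j - 1)
                    acc[0] += r * weight
                    acc[1] += g * weight
                    acc[2] += b * weight
            row.append(tuple(acc))
        result.append(row)
    return result
```

A kernel with a single 1.0 in the center should return the input unchanged, which makes a quick sanity check before trying real weights.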

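That said, a 16x16 convolution is a fairly expensive operation: you are looking at 256 texture reads per pixel. On older devices (iPhone 4 or so), you got about 8 texture reads per pixel for free, provided they were non-dependent reads. Once you went past that, performance started to drop dramatically. Later GPUs improved on this significantly, though. The iPhone 5S, for example, handles more than 40 dependent texture reads per pixel for free. Even the heaviest shaders on 1080p video barely slow it down.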

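As sansuiso suggests, if you have a way to separate your kernel into horizontal and vertical passes (as can be done for a Gaussian blur kernel), you can get much better performance because of the dramatic reduction in texture reads. For a 16x16 kernel, you could drop from 256 reads to 32, and even those 32 reads would be much faster because each pass only samples 16 texels at a time.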
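To make the read counts concrete, here is a small sketch comparing a direct 2D pass with a horizontal pass followed by a vertical pass. It assumes a single-channel image stored as a list of rows, edge clamping, and a kernel that is exactly the outer product of a column vector and a row vector. All function names are hypothetical. Like the shader above, it applies the weights without flipping the kernel.

```python
def clamp(index, size):
    return min(max(index, 0), size - 1)


def convolve_2d(image, kernel):
    """Direct pass: len(kernel) * len(kernel[0]) reads per output pixel."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    yy = clamp(y + j - kh // 2, h)
                    xx = clamp(x + i - kw // 2, w)
                    acc += image[yy][xx] * kernel[j][i]
            out[y][x] = acc
    return out


def convolve_rows(image, taps):
    """Horizontal 1D pass: len(taps) reads per output pixel."""
    w, n = len(image[0]), len(taps)
    return [[sum(row[clamp(x + i - n // 2, w)] * taps[i] for i in range(n)) + 0.0
             for x in range(w)] for row in image]


def convolve_cols(image, taps):
    """Vertical 1D pass: len(taps) reads per output pixel."""
    h, w, n = len(image), len(image[0]), len(taps)
    return [[sum(image[clamp(y + j - n // 2, h)][x] * taps[j] for j in range(n)) + 0.0
             for x in range(w)] for y in range(h)]


def convolve_separable(image, col_taps, row_taps):
    return convolve_cols(convolve_rows(image, row_taps), col_taps)


def outer(col_taps, row_taps):
    """Build the 2D kernel that the two 1D passes are equivalent to."""
    return [[c * r for r in row_taps] for c in col_taps]


def reads_per_pixel(n):
    return {"direct": n * n, "separable": 2 * n}
```

With `n = 16`, `reads_per_pixel` gives 256 for the direct pass and 32 for the two 1D passes, matching the figures above. The equivalence only holds when the 2D kernel really is an outer product of two 1D kernels.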

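The crossover point at which doing this on the CPU with Accelerate becomes faster than doing it in OpenGL ES depends on the device you are running on. In general, the GPUs in iOS devices have outpaced the CPUs in performance growth over recent generations, so that bar has shifted further toward the GPU side on the last several iOS models.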

Source

2013-10-14 15:13:28

You can use Apple's Accelerate framework. It is available on both iOS and macOS, so you may be able to reuse your code later.

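To achieve the best performance, you may want to consider the following options:

- If your convolution kernel is separable, use a separable implementation. This is the case for some symmetric kernels (such as Gaussian convolution); more precisely, a kernel is separable when it can be written as the outer product of a vertical and a horizontal 1D kernel. This will save you a huge amount of computation time.
- If your images have power-of-two dimensions, consider using the FFT trick. A convolution in the spatial domain (complexity N^2) is equivalent to a multiplication in the Fourier domain (complexity N), with the transforms themselves costing on the order of N log N. So you can 1) FFT the image and the kernel, 2) multiply the results term by term, and 3) invert the FFT of the result. Since FFT algorithms are fast (for example, Apple's FFT in the Accelerate framework), this sequence of operations can give a performance boost.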
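The three FFT steps can be sketched in one dimension as follows. This is only an illustration using a plain recursive radix-2 FFT written for the example, not Accelerate's API, and the function names are hypothetical. It assumes a power-of-two signal length and a kernel no longer than the signal. Note that multiplying spectra gives a circular convolution, so the signal must be zero-padded if wrap-around at the borders is not wanted. For images, the same idea applies with a 2D transform (rows, then columns).

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); length must be a power of two."""
    n = len(values)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circular_convolve_fft(signal, kernel):
    n = len(signal)
    padded = list(kernel) + [0.0] * (n - len(kernel))
    # 1) transform both inputs, 2) multiply term by term, 3) invert.
    spectrum = [a * b for a, b in zip(fft(signal), fft(padded))]
    return [value.real / n for value in fft(spectrum, inverse=True)]


def circular_convolve_direct(signal, kernel):
    """Reference implementation used only to check the FFT version."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
            for i in range(n)]
```

Whether this beats a direct or separable implementation depends on the kernel and image sizes, so it is worth measuring on the target device.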

You can find more about optimizing image processing on iOS in this book; see here for details.

Source

2013-10-14 06:50:03 sansuiso
