NEON equivalents of SSE intrinsics

I am trying to convert C code into optimized code using NEON intrinsics.

Here is the C code, which operates on two operands rather than on vectors of operands.

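#include <stdint.h>

uint16_t mult_z216(uint16_t a, uint16_t b) {
    /* Widen before multiplying: a*b evaluated as int can overflow for large inputs. */
    unsigned int c1 = (unsigned int)a * b;
    if (c1) {
        int c1h = c1 >> 16;       /* high 16 bits of the product */
        int c1l = c1 & 0xffff;    /* low 16 bits of the product */
        return (c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff;
    }
    return (1 - a - b) & 0xffff;
}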

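An optimized SSE version of this operation has already been implemented. My NEON conversion so far, with the two missing steps marked ??: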

#define MULT_Z216_NEON(a, b, out) \
    temp = vorrq_u16(*a, *b); \
    /* ?? *out = NEON equivalent of _mm_mullo_epi16(a, b) */ \
    /* ?? *a   = NEON equivalent of _mm_mulhi_epu16(a, b) */ \
    *b = vqsubq_u16(*out, *a);                 /* saturating, like _mm_subs_epu16 */ \
    *b = vceqq_u16(*b, vdupq_n_u16(0x0000)); \
    *b = vshrq_n_u16(*b, 15); \
    *out = vsubq_u16(*out, *a); \
    *a = vceqq_u16(*out, vdupq_n_u16(0x0000)); \
    *out = vaddq_u16(*out, *b); \
    temp = vandq_u16(temp, *a); \
    *out = vsubq_u16(*out, temp);

The existing SSE version:

#define MULT_Z216_SSE(a, b, c) \
    t0  = _mm_or_si128((a), (b));          /* t0 = a | b (bitwise OR of the 128-bit values) */ \
    (c) = _mm_mullo_epi16((a), (b));       /* low 16 bits of each 16-bit product */ \
    (a) = _mm_mulhi_epu16((a), (b));       /* high 16 bits of each unsigned 16-bit product */ \
    (b) = _mm_subs_epu16((c), (a));        /* c - a on unsigned 16-bit lanes, saturating at 0 */ \
    (b) = _mm_cmpeq_epi16((b), C_0x0_XMM); /* per lane: 0xFFFF if b == 0, else 0x0 */ \
    (b) = _mm_srli_epi16((b), 15);         /* shift right by 15 bits: 0xFFFF becomes 1 */ \
    (c) = _mm_sub_epi16((c), (a));         /* c = c - a on 16-bit lanes (wrapping) */ \
    (a) = _mm_cmpeq_epi16((c), C_0x0_XMM); /* per lane: 0xFFFF if c == 0, else 0x0 */ \
    (c) = _mm_add_epi16((c), (b));         /* c = c + b on 16-bit lanes */ \
    t0  = _mm_and_si128(t0, (a));          /* keep a | b only in lanes where c was 0 */ \
    (c) = _mm_sub_epi16((c), t0);          /* c = c - t0 on 16-bit lanes */

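I have almost converted this using NEON intrinsics; I am only missing the NEON equivalents of _mm_mullo_epi16((a), (b)) and _mm_mulhi_epu16((a), (b)). Either I am misunderstanding something in NEON, or there are no such intrinsics. If there are no direct equivalents, how can these steps be achieved with NEON intrinsics?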

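UPDATE: I forgot to emphasize the following point: the operands of the function are uint16x8_t NEON vectors (each element is a uint16_t, i.e. an integer between 0 and 65535).

Someone suggested using the intrinsic vqdmulhq_s16(). Using it does not match the given implementation, because that multiply intrinsic interprets the vectors as signed values and produces incorrect output.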

Source

2012-07-02 Kami

If your values are greater than 32767, you need to use the widening multiply suggested below (vmull_u16). If all of your values are less than 32768, you can use vqdmulhq_s16. – BitBank

You can use:

uint32x4_t vmull_u16 (uint16x4_t, uint16x4_t)

which returns a vector of 32-bit products. If you want to split the result into high and low parts, you can use NEON intrinsics for that as well.
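As a minimal sketch of the widening-multiply approach, the helper below computes the low and high 16 bits of each product for eight lanes. The function names `mul_lo_hi_u16x8` and `mul_lo_hi_u16_arrays` are hypothetical, and splitting with the narrowing intrinsics `vmovn_u32` and `vshrn_n_u32` is one assumed way of doing it, not necessarily what the answer had in mind. A scalar fallback is included only so the sketch also builds on non-ARM targets.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical helper: per-lane low and high 16 bits of the unsigned product a*b. */
static inline void mul_lo_hi_u16x8(uint16x8_t a, uint16x8_t b,
                                   uint16x8_t *lo, uint16x8_t *hi)
{
    /* Widening multiplies: four lanes each, 16 x 16 -> 32 bits. */
    uint32x4_t p_lo = vmull_u16(vget_low_u16(a),  vget_low_u16(b));
    uint32x4_t p_hi = vmull_u16(vget_high_u16(a), vget_high_u16(b));

    /* vmovn_u32 keeps the low 16 bits of each 32-bit product. */
    *lo = vcombine_u16(vmovn_u32(p_lo), vmovn_u32(p_hi));
    /* vshrn_n_u32(x, 16) shifts right by 16 and narrows: the high 16 bits. */
    *hi = vcombine_u16(vshrn_n_u32(p_lo, 16), vshrn_n_u32(p_hi, 16));
}
#endif

/* Array wrapper (hypothetical) so the result can be inspected from plain C. */
void mul_lo_hi_u16_arrays(const uint16_t a[8], const uint16_t b[8],
                          uint16_t lo[8], uint16_t hi[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint16x8_t vlo, vhi;
    mul_lo_hi_u16x8(vld1q_u16(a), vld1q_u16(b), &vlo, &vhi);
    vst1q_u16(lo, vlo);
    vst1q_u16(hi, vhi);
#else
    /* Scalar fallback with the same per-lane meaning. */
    for (int i = 0; i < 8; ++i) {
        uint32_t p = (uint32_t)a[i] * b[i];
        lo[i] = (uint16_t)p;
        hi[i] = (uint16_t)(p >> 16);
    }
#endif
}
```

The low half alone could also be obtained with `vmulq_u16(a, b)`; the widened product is only needed for the high half.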

Source

2012-07-02 18:30:50

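That instruction is a 16x16=32 multiply (it widens the output). There are more suitable instructions (see my answer). – BitBank

@BitBank: The OP needs both the high and the low 16 bits, so a 32-bit result is required. The doubling/saturating multiply loses precision, so it cannot be used as a substitute. –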
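Putting the pieces together, a complete NEON port of the SSE sequence might look like the sketch below. This is an assumption-laden illustration rather than the poster's final code: the names `mult_z216_ref`, `mult_z216_neon` and `mult_z216_x8` are hypothetical, the high half is taken from `vmull_u16` via `vshrn_n_u32`, and the non-ARM branch is a scalar emulation of the same branchless steps so the sketch can be compiled and checked anywhere.

```c
#include <stdint.h>

/* Scalar reference, following the C code in the question. */
uint16_t mult_z216_ref(uint16_t a, uint16_t b)
{
    uint32_t c1 = (uint32_t)a * b;
    if (c1) {
        int32_t c1h = (int32_t)(c1 >> 16);
        int32_t c1l = (int32_t)(c1 & 0xffff);
        return (uint16_t)((c1l - c1h + ((c1l < c1h) ? 1 : 0)) & 0xffff);
    }
    return (uint16_t)((1 - a - b) & 0xffff);
}

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Eight lanes at once, mirroring the SSE macro step by step. */
static inline uint16x8_t mult_z216_neon(uint16x8_t a, uint16x8_t b)
{
    const uint16x8_t zero = vdupq_n_u16(0);
    uint16x8_t t0 = vorrq_u16(a, b);                 /* a | b */
    uint16x8_t lo = vmulq_u16(a, b);                 /* like _mm_mullo_epi16 */
    uint16x8_t hi = vcombine_u16(                    /* like _mm_mulhi_epu16 */
        vshrn_n_u32(vmull_u16(vget_low_u16(a),  vget_low_u16(b)),  16),
        vshrn_n_u32(vmull_u16(vget_high_u16(a), vget_high_u16(b)), 16));
    /* 1 in lanes where lo <= hi (saturating subtract gives 0 there). */
    uint16x8_t le = vshrq_n_u16(vceqq_u16(vqsubq_u16(lo, hi), zero), 15);
    uint16x8_t c  = vsubq_u16(lo, hi);               /* wrapping lo - hi */
    uint16x8_t z  = vceqq_u16(c, zero);              /* 0xFFFF where lo == hi */
    c = vaddq_u16(c, le);
    return vsubq_u16(c, vandq_u16(t0, z));           /* subtract a|b where lo == hi */
}
#endif

/* Array wrapper (hypothetical) for eight operand pairs. */
void mult_z216_x8(const uint16_t a[8], const uint16_t b[8], uint16_t out[8])
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    vst1q_u16(out, mult_z216_neon(vld1q_u16(a), vld1q_u16(b)));
#else
    /* Scalar emulation of the same branchless lane operations. */
    for (int i = 0; i < 8; ++i) {
        uint32_t p  = (uint32_t)a[i] * b[i];
        uint16_t lo = (uint16_t)p;
        uint16_t hi = (uint16_t)(p >> 16);
        uint16_t le = (uint16_t)(lo <= hi);
        uint16_t c  = (uint16_t)(lo - hi);
        uint16_t z  = (c == 0) ? 0xFFFF : 0;
        out[i] = (uint16_t)(c + le - ((a[i] | b[i]) & z));
    }
#endif
}
```

Comparing `mult_z216_x8` against `mult_z216_ref` over a range of inputs, including 0 and values above 32767, is a reasonable way to check the port before relying on it.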

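vmulq_s16() is the equivalent of _mm_mullo_epi16. There is no exact equivalent of _mm_mulhi_epu16. The closest instruction is vqdmulhq_s16(), which is "saturating, doubling, multiply, return high part". It works only on signed 16-bit values, and to undo the doubling you need to divide either the input or the output by 2.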
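To see why the signed doubling multiply only matches for small values, here is a scalar model of a single lane. Both function names are hypothetical, and `qdmulh_s16_model` is a hand-written approximation of what vqdmulhq_s16() does per lane (sign-reinterpret, double, saturate, keep the high half), not a call into NEON.

```c
#include <stdint.h>

/* One lane of _mm_mulhi_epu16: high 16 bits of the unsigned product. */
uint16_t mulhi_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(((uint32_t)a * b) >> 16);
}

/* Scalar model of one lane of vqdmulhq_s16 applied to the same bit patterns. */
int16_t qdmulh_s16_model(uint16_t a, uint16_t b)
{
    /* Reinterpret the 16-bit patterns as signed values. */
    int32_t sa = (a & 0x8000) ? (int32_t)a - 65536 : (int32_t)a;
    int32_t sb = (b & 0x8000) ? (int32_t)b - 65536 : (int32_t)b;

    /* Double the product and saturate to 32 bits (only -32768 * -32768 saturates). */
    int64_t doubled = 2 * (int64_t)sa * sb;
    if (doubled > INT32_MAX)
        doubled = INT32_MAX;

    /* Keep the high 16 bits: a floor division by 65536. */
    int64_t q = doubled / 65536;
    if (doubled % 65536 < 0)
        q -= 1;
    return (int16_t)q;
}
```

With both inputs below 32768, halving the model's output gives the same value as `mulhi_u16`. With an input such as 40000, the pattern is read as a negative number and the two results no longer agree, which is the mismatch described in the question's update.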

Source

2012-07-02 22:02:13 BitBank

Since vqdmulhq_s16() takes signed inputs, GCC complains about incorrectly typed arguments... How can I efficiently convert from uint16x8_t to int16x8_t? – Kami

There are casting macros. Use vreinterpretq_s16_u16(). – BitBank

See my edit about the signed multiplication! – Kami
