L3 CPU cache Java benchmark shows strange results

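After reading this article I decided to check it on my laptop. The idea is to create arrays of size [1..40] MB and touch each one about 1024 times with a stride proportional to its size (for example, the step is 1024 for a 1 MB array, 2048 for a 2 MB array, and so on). My code is below: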

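import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

public class L3CacheBenchmark {

    @State(Scope.Benchmark)
    public static class P {

        // Array size in MB; JMH runs the benchmark once per value.
        @Param({
            "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",
            "11", "12", "13", "14", "15", "16", "17", "18", "19", "20",
            "21", "22", "23", "24", "25", "26", "27", "28", "29", "30",
            "31", "32", "33", "34", "35", "36", "37", "38", "39", "40"
        })
        public int size;
    }

    @State(Scope.Thread)
    public static class ThreadData {

        byte[] array;
        int len;

        @Setup
        public void setup(P p) {
            array = new byte[p.size * 1024 * 1024];
            len = array.length;
        }
    }

    @Benchmark
    public byte[] testMethod(ThreadData data) {
        // The stride is one less than len/1024, so the loop writes 1025 bytes.
        int step = (data.len / 1024) - 1;
        for (int k = 0; k < data.len; k += step) {
            data.array[k] = 1;
        }
        return data.array;
    }

}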

And the results:

Benchmark     (size) Mode Cnt  Score  Error Units 
L3CacheBenchmark.testMethod  1 thrpt 100 310521,031 ± 1124,590 ops/s 
L3CacheBenchmark.testMethod  2 thrpt 100 331853,495 ± 1124,547 ops/s 
L3CacheBenchmark.testMethod  3 thrpt 100 311499,659 ± 745,414 ops/s 
L3CacheBenchmark.testMethod  4 thrpt 100 290270,382 ± 8501,690 ops/s 
L3CacheBenchmark.testMethod  5 thrpt 100 212929,246 ± 14847,931 ops/s 
L3CacheBenchmark.testMethod  6 thrpt 100 315968,138 ± 4454,210 ops/s 
L3CacheBenchmark.testMethod  7 thrpt 100 209679,904 ± 26050,365 ops/s 
L3CacheBenchmark.testMethod  8 thrpt 100 60409,187 ± 212,548 ops/s 
L3CacheBenchmark.testMethod  9 thrpt 100 221290,756 ± 28970,586 ops/s 
L3CacheBenchmark.testMethod  10 thrpt 100 322865,687 ± 1545,967 ops/s 
L3CacheBenchmark.testMethod  11 thrpt 100 263153,747 ± 18497,624 ops/s 
L3CacheBenchmark.testMethod  12 thrpt 100 298683,205 ± 1277,032 ops/s 
L3CacheBenchmark.testMethod  13 thrpt 100 180984,220 ± 26611,649 ops/s 
L3CacheBenchmark.testMethod  14 thrpt 100 324815,938 ± 1657,303 ops/s 
L3CacheBenchmark.testMethod  15 thrpt 100 264965,412 ± 9335,923 ops/s 
L3CacheBenchmark.testMethod  16 thrpt 100 58830,825 ± 291,412 ops/s 
L3CacheBenchmark.testMethod  17 thrpt 100 255576,829 ± 7083,025 ops/s 
L3CacheBenchmark.testMethod  18 thrpt 100 324174,133 ± 2247,157 ops/s 
L3CacheBenchmark.testMethod  19 thrpt 100 212969,202 ± 18204,625 ops/s 
L3CacheBenchmark.testMethod  20 thrpt 100 295246,470 ± 1224,817 ops/s 
L3CacheBenchmark.testMethod  21 thrpt 100 251762,642 ± 23405,100 ops/s 
L3CacheBenchmark.testMethod  22 thrpt 100 323196,428 ± 2245,465 ops/s 
L3CacheBenchmark.testMethod  23 thrpt 100 254588,338 ± 23845,090 ops/s 
L3CacheBenchmark.testMethod  24 thrpt 100 53373,580 ± 252,183 ops/s 
L3CacheBenchmark.testMethod  25 thrpt 100 213220,459 ± 20440,716 ops/s 
L3CacheBenchmark.testMethod  26 thrpt 100 322625,597 ± 2076,341 ops/s 
L3CacheBenchmark.testMethod  27 thrpt 100 293643,720 ± 5260,010 ops/s 
L3CacheBenchmark.testMethod  28 thrpt 100 297432,240 ± 1186,920 ops/s 
L3CacheBenchmark.testMethod  29 thrpt 100 169277,701 ± 25040,239 ops/s 
L3CacheBenchmark.testMethod  30 thrpt 100 324230,899 ± 1579,103 ops/s 
L3CacheBenchmark.testMethod  31 thrpt 100 193981,979 ± 12478,424 ops/s 
L3CacheBenchmark.testMethod  32 thrpt 100 53761,030 ± 259,888 ops/s 
L3CacheBenchmark.testMethod  33 thrpt 100 213585,493 ± 23543,671 ops/s 
L3CacheBenchmark.testMethod  34 thrpt 100 325214,062 ± 1758,479 ops/s 
L3CacheBenchmark.testMethod  35 thrpt 100 306652,634 ± 2237,818 ops/s 
L3CacheBenchmark.testMethod  36 thrpt 100 297992,930 ± 1019,248 ops/s 
L3CacheBenchmark.testMethod  37 thrpt 100 181671,812 ± 21984,441 ops/s 
L3CacheBenchmark.testMethod  38 thrpt 100 321929,616 ± 1798,747 ops/s 
L3CacheBenchmark.testMethod  39 thrpt 100 251587,385 ± 12292,670 ops/s 
L3CacheBenchmark.testMethod  40 thrpt 100 49777,196 ± 227,620 ops/s

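As you can see, the throughput varies, and the most noticeable difference is for arrays whose size is a multiple of 8 MB: they are at least 4 times slower (about 60,000 ops/s at 8 MB versus about 320,000 ops/s at 10 MB). Also, for example, the 37 MB array is almost twice as slow as the 38 MB one. I cannot find a logical explanation for these findings.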

P.S. The CPU is an i7 4700mq with 6 MB of cache: http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-4700MQ%20Mobile%20processor.html

Why does this happen?

Source

2017-04-01 vmolchanov

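Do you have a question? – chrylis

Perhaps "What is the cause of this?" –

You should run the interesting cases through `perf stat` and compare them. On a laptop you also need to make sure the CPU runs at a fixed clock speed; thermal throttling can lead to inconsistent benchmark results. – the8472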

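You are seeing the effects of cache associativity.

Your CPU has a 256 KB, 8-way set-associative L2 cache per core. It can hold up to 256 KB / 64 B = 4096 cache lines, organized as 512 sets, and at most 8 lines may share the same index bits.

The benchmark loop writes to 1025 different addresses. Depending on the stride, however, these addresses may fall into a small number of sets, which causes conflicts and evictions in the cache. This is what happens for stride = 8191, 16383, 24575, and so on.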
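The set-conflict argument can be checked with a rough model. The sketch below is not from the original answer: it assumes 64-byte lines and 512 sets (256 KB / 64 B / 8 ways), pretends the array starts at address 0, and ignores the virtual-to-physical mapping, so it only approximates what the hardware does. The class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Rough model of which L2 sets the benchmark loop touches.
// Assumptions (not from the original post): 64-byte lines, 512 sets,
// array starts at address 0, virtual address == physical address.
class L2SetSketch {
    static final int LINE_SIZE = 64;
    static final int SETS = 256 * 1024 / LINE_SIZE / 8; // 512 sets

    // Counts how many of the touched cache lines map to each set.
    static Map<Integer, Integer> linesPerSet(int sizeMb) {
        int len = sizeMb * 1024 * 1024;
        int step = (len / 1024) - 1; // same stride as the benchmark
        Map<Integer, Integer> counts = new HashMap<>();
        for (int k = 0; k < len; k += step) {
            int set = (k / LINE_SIZE) % SETS;
            counts.merge(set, 1, Integer::sum);
        }
        return counts;
    }

    static int distinctSets(int sizeMb) {
        return linesPerSet(sizeMb).size();
    }

    static int maxLinesInOneSet(int sizeMb) {
        return linesPerSet(sizeMb).values().stream()
                .mapToInt(Integer::intValue).max().orElse(0);
    }

    public static void main(String[] args) {
        for (int size : new int[] {8, 9}) {
            System.out.println(size + " MB: " + distinctSets(size)
                    + " sets, up to " + maxLinesInOneSet(size) + " lines in one set");
        }
    }
}
```

Under these assumptions the 8 MB case packs its 1025 lines into 65 sets with up to 16 lines competing for the 8 ways of a set, while the 9 MB case spreads them over all 512 sets with at most 3 lines each.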

To confirm this theory, re-run the JMH benchmark with the -prof perfnorm option. Here are the statistics for size = 8 and size = 9. Note the LLC-stores counter: about 1035 per operation for size = 8, and almost none for size = 9.

L3CacheBenchmark.testMethod:CPI      8 thrpt  1.173 #/op 
L3CacheBenchmark.testMethod:L1-dcache-load-misses 8 thrpt 1048.088 #/op 
L3CacheBenchmark.testMethod:L1-dcache-loads   8 thrpt 1073.767 #/op 
L3CacheBenchmark.testMethod:L1-dcache-store-misses 8 thrpt 1049.491 #/op 
L3CacheBenchmark.testMethod:L1-dcache-stores   8 thrpt 1060.069 #/op 
L3CacheBenchmark.testMethod:L1-icache-load-misses 8 thrpt  1.209 #/op 
L3CacheBenchmark.testMethod:LLC-load-misses   8 thrpt  0.082 #/op 
L3CacheBenchmark.testMethod:LLC-loads    8 thrpt  1.399 #/op 
L3CacheBenchmark.testMethod:LLC-store-misses   8 thrpt  0.077 #/op 
L3CacheBenchmark.testMethod:LLC-stores    8 thrpt 1035.877 #/op 
L3CacheBenchmark.testMethod:branch-misses   8 thrpt  1.234 #/op 
L3CacheBenchmark.testMethod:branches     8 thrpt 2096.674 #/op 
L3CacheBenchmark.testMethod:cycles     8 thrpt 13520.964 #/op 
L3CacheBenchmark.testMethod:dTLB-load-misses   8 thrpt  0.057 #/op 
L3CacheBenchmark.testMethod:dTLB-loads    8 thrpt 1086.355 #/op 
L3CacheBenchmark.testMethod:dTLB-store-misses  8 thrpt  0.020 #/op 
L3CacheBenchmark.testMethod:dTLB-stores    8 thrpt 1068.579 #/op 
L3CacheBenchmark.testMethod:iTLB-load-misses   8 thrpt  0.044 #/op 
L3CacheBenchmark.testMethod:iTLB-loads    8 thrpt  0.018 #/op 
L3CacheBenchmark.testMethod:instructions    8 thrpt 11530.742 #/op 
L3CacheBenchmark.testMethod:stalled-cycles-backend 8 thrpt 8315.437 #/op 
L3CacheBenchmark.testMethod:stalled-cycles-frontend 8 thrpt 10359.447 #/op 

L3CacheBenchmark.testMethod:CPI      9 thrpt  0.871 #/op 
L3CacheBenchmark.testMethod:L1-dcache-load-misses 9 thrpt 1055.973 #/op 
L3CacheBenchmark.testMethod:L1-dcache-loads   9 thrpt 1068.958 #/op 
L3CacheBenchmark.testMethod:L1-dcache-store-misses 9 thrpt 1045.480 #/op 
L3CacheBenchmark.testMethod:L1-dcache-stores   9 thrpt 1057.328 #/op 
L3CacheBenchmark.testMethod:L1-icache-load-misses 9 thrpt  1.108 #/op 
L3CacheBenchmark.testMethod:LLC-load-misses   9 thrpt  0.174 #/op 
L3CacheBenchmark.testMethod:LLC-loads    9 thrpt  0.304 #/op 
L3CacheBenchmark.testMethod:LLC-store-misses   9 thrpt  0.045 #/op 
L3CacheBenchmark.testMethod:LLC-stores    9 thrpt  0.350 #/op 
L3CacheBenchmark.testMethod:branch-misses   9 thrpt  1.072 #/op 
L3CacheBenchmark.testMethod:branches     9 thrpt 2099.846 #/op 
L3CacheBenchmark.testMethod:cycles     9 thrpt 10041.724 #/op 
L3CacheBenchmark.testMethod:dTLB-load-misses   9 thrpt  0.086 #/op 
L3CacheBenchmark.testMethod:dTLB-loads    9 thrpt 1073.633 #/op 
L3CacheBenchmark.testMethod:dTLB-store-misses  9 thrpt  0.045 #/op 
L3CacheBenchmark.testMethod:dTLB-stores    9 thrpt 1054.587 #/op 
L3CacheBenchmark.testMethod:iTLB-load-misses   9 thrpt  0.044 #/op 
L3CacheBenchmark.testMethod:iTLB-loads    9 thrpt  0.037 #/op 
L3CacheBenchmark.testMethod:instructions    9 thrpt 11529.996 #/op 
L3CacheBenchmark.testMethod:stalled-cycles-backend 9 thrpt 3439.278 #/op 
L3CacheBenchmark.testMethod:stalled-cycles-frontend 9 thrpt 6888.714 #/op

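The most notable difference is in LLC-stores: for size = 8 the stores do not fit in the L2 cache and go to L3.

BTW, your benchmark cannot measure the effects of the L3 cache, because it touches only a small amount of data (about 64 KB). To make a fair test, you need to read and write the entire range of the allocated array.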
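One way to touch the whole array, shown only as a sketch, is to step through it one cache line at a time so that the working set equals the array size. The 64-byte line size and the helper name are assumptions, not part of the original answer, and the loop would still need to be wrapped in a JMH benchmark method.

```java
// Hypothetical helper: modifies one byte in every 64-byte line of the array,
// so every cache line of the allocation is read and written.
// LINE_SIZE = 64 is an assumption about the CPU.
class FullRangeSketch {
    static final int LINE_SIZE = 64;

    static int touchEveryLine(byte[] array) {
        int touched = 0;
        for (int k = 0; k < array.length; k += LINE_SIZE) {
            array[k]++; // read-modify-write of one byte per line
            touched++;
        }
        return touched;
    }

    public static void main(String[] args) {
        byte[] array = new byte[8 * 1024 * 1024];
        System.out.println(touchEveryLine(array) + " lines touched");
    }
}
```

With this access pattern, an 8 MB array touches 131072 lines instead of 1025, so arrays larger than the 6 MB L3 can no longer stay in cache between iterations.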

Source

2017-04-02 03:18:02 apangin

There is a great presentation by Sergey Kuksenko, [Quantum Performance Effects II: Beyond the Core](https://www.youtube.com/watch?v=A-K1F3KtPsY). Demo 5 is about exactly this kind of problem. – apangin

Thanks, this presentation shows me that I have a lot to fix. – vmolchanov
