24) HW3 solution#

Part 1: Optimizing dot product (30%)#

(5%) In the runs below, the compile-time unrolling-factor version performs about as well as the dot_opt_even_odd version, while the true runtime version is slower, close to the dot_ref baseline; neither reaches the performance of dot_opt2 or dot_opt3.

In addition to your commentary, answer the following questions:

  1. (5%) Is this code correct for all values of n?

    • No. For odd n, we incur an out-of-bounds access when we perform the second product, a[i+1] * b[i+1]: on the final iteration the loop reads one element past the end of each array.

  2. (5%) Can this code change the numerically computed result?

    • Yes, because of floating-point arithmetic (FPA). We are changing the grouping of the additions, and if any of the addends is extremely small (or large) relative to the others, we will suffer roundoff errors. Also, as you’ve learned, the C standard does not allow floating-point arithmetic to be reordered, precisely because reordering may change the computed values (floating-point addition is not associative).

  3. (5%) Can you extend this technique to increase performance further (longer vector registers and/or more instruction-level parallelism)?

    • Yes. Instead of an unrolling factor of 2, we can chunk the data into larger blocks and apply SIMD-style operations (as we do in Part 2).

  4. (5%) Can you make the unrolling factor 2 a compile-time constant, perhaps by using an inner loop?

    • Yes, we can do that (see the dot_opt_comp_time_m function).

  5. (5%) Could that unrolling factor be a run-time parameter?

    • Yes, we can do that (see the dot_opt_run_time_m function). There are a couple of ways this can play out.

      • If the runtime parameter is defined in, say, main, the compiler gets “smart” and treats it the same way as a compile-time constant. Keep in mind that we compile with the optimization flag -O3, so if that unrolling factor is only used in a single call to our function from main, the compiler knows the value at compile time and, because the function is called only once, may simply propagate the runtime value as if it were a compile-time constant.

      • But if the unrolling factor is a true runtime value, e.g. passed via a command-line argument that the compiler cannot know in advance, then, yes, the performance of the runtime version is lower than that of its compile-time counterpart.
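To make the even-odd loop from Question 1 safe for all n, one option is to run the unrolled loop only over full pairs and pick up any leftover element in a scalar tail. A minimal sketch (not part of the graded solution; the function name is mine):

```c
#include <stddef.h>

// Unroll by 2, but only over full pairs; a scalar tail picks up the final
// element when n is odd, avoiding the out-of-bounds read at a[i+1].
double dot_even_odd_safe(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    sum0 += a[i+0] * b[i+0];
    sum1 += a[i+1] * b[i+1];
  }
  if (i < n)  // at most one element remains
    sum0 += a[i] * b[i];
  return sum0 + sum1;
}
```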

! gcc -O3 -march=native -fopenmp  ../c_codes/hw3_solution/dot.c   -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000
  Name  	flops	ticks	flops/tick
 dot_ref	20000	119592	    0.17	
 dot_ref	20000	118928	    0.17	
 dot_ref	20000	118836	    0.17	
 dot_ref	20000	118546	    0.17	
 dot_ref	20000	118776	    0.17	
 dot_ref	20000	118562	    0.17	
 dot_ref	20000	118448	    0.17	
 dot_ref	20000	118484	    0.17	
 dot_ref	20000	118360	    0.17	
 dot_ref	20000	118412	    0.17	
dot_opt1	20000	1016736	    0.02	
dot_opt1	20000	52460	    0.38	
dot_opt1	20000	46042	    0.43	
dot_opt1	20000	45026	    0.44	
dot_opt1	20000	44474	    0.45	
dot_opt1	20000	44284	    0.45	
dot_opt1	20000	44320	    0.45	
dot_opt1	20000	44260	    0.45	
dot_opt1	20000	43464	    0.46	
dot_opt1	20000	44458	    0.45	
dot_opt2	20000	46100	    0.43	
dot_opt2	20000	45326	    0.44	
dot_opt2	20000	44222	    0.45	
dot_opt2	20000	44602	    0.45	
dot_opt2	20000	43978	    0.45	
dot_opt2	20000	46296	    0.43	
dot_opt2	20000	44100	    0.45	
dot_opt2	20000	46328	    0.43	
dot_opt2	20000	44364	    0.45	
dot_opt2	20000	44884	    0.45	
dot_opt3	20000	43628	    0.46	
dot_opt3	20000	41364	    0.48	
dot_opt3	20000	42590	    0.47	
dot_opt3	20000	40096	    0.50	
dot_opt3	20000	41600	    0.48	
dot_opt3	20000	40272	    0.50	
dot_opt3	20000	42280	    0.47	
dot_opt3	20000	41864	    0.48	
dot_opt3	20000	40860	    0.49	
dot_opt3	20000	40044	    0.50	
dot_opt_even_odd	20000	64004	    0.31	
dot_opt_even_odd	20000	64466	    0.31	
dot_opt_even_odd	20000	114844	    0.17	
dot_opt_even_odd	20000	66252	    0.30	
dot_opt_even_odd	20000	64298	    0.31	
dot_opt_even_odd	20000	64574	    0.31	
dot_opt_even_odd	20000	64980	    0.31	
dot_opt_even_odd	20000	64512	    0.31	
dot_opt_even_odd	20000	64618	    0.31	
dot_opt_even_odd	20000	64618	    0.31	
dot_opt_comp_time_m	20000	118654	    0.17	
dot_opt_comp_time_m	20000	68176	    0.29	
dot_opt_comp_time_m	20000	67624	    0.30	
dot_opt_comp_time_m	20000	67250	    0.30	
dot_opt_comp_time_m	20000	67262	    0.30	
dot_opt_comp_time_m	20000	67478	    0.30	
dot_opt_comp_time_m	20000	67544	    0.30	
dot_opt_comp_time_m	20000	67570	    0.30	
dot_opt_comp_time_m	20000	67430	    0.30	
dot_opt_comp_time_m	20000	67416	    0.30	
dot_opt_run_time_m	20000	111820	    0.18	
dot_opt_run_time_m	20000	110708	    0.18	
dot_opt_run_time_m	20000	110612	    0.18	
dot_opt_run_time_m	20000	110440	    0.18	
dot_opt_run_time_m	20000	110112	    0.18	
dot_opt_run_time_m	20000	153508	    0.13	
dot_opt_run_time_m	20000	111034	    0.18	
dot_opt_run_time_m	20000	110758	    0.18	
dot_opt_run_time_m	20000	110572	    0.18	
dot_opt_run_time_m	20000	110360	    0.18	

Part 2: Optimizing block inner product (70%)#

  • What does it mean to reorder loops? Will it help or hurt performance?

The two lines that originally use j as the slowest index and k as the fastest index to perform the pairwise products:

for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {

tell us that the same row of the matrix \(A^T\) is multiplied by all columns of \(B\), and we store the result in \(C\) row-wise.

If we swapped these two lines (see the bdot_opt3_simd_reordered function, where we do this), we would multiply the same column of \(B\) by all rows of \(A^T\), and we would store the result in \(C\) column-wise.

The effect of doing so is negligible on performance (on my architecture) and might bring only a very small advantage on some other architectures. It only changes the order in which the 32 pairwise dot products are performed, but the majority of the work in this code is done inside each individual pairwise dot product (20000 FLOPs each).

  • Does it help to change the layout in memory (see the aistride, ajstride, bistride, and bkstride parameters in the code)?

No. Without further algorithmic changes in the functions that actually perform the dot products, it does not help. First of all, the only place the stride parameters can be changed is the call to the init_bdot function. Instead of the original call:

init_bdot(args.length, a, n, 1, b, 1, n);

for each matrix, one of the two stride arguments (n or 1) has to remain 1 so that the matrices A and B are filled without any gaps, as in:

init_bdot(args.length, a, 1, n, b, n, 1);

This would store A column-wise and B row-wise (effectively transposing the matrices). If we did this, then for the algorithm to still produce the same pairwise dot products we would also need further algorithmic changes. And even with those changes in place, performance would not improve: we would be back to the original access pattern.

One might also think of using smaller strides, such as 1, J and 1, K. Again, this would only change how the data entries are laid out; without further algorithmic changes in the functions that call the pairwise dot products (which are implemented to operate on entire contiguous vectors), it would not be beneficial.
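The “further algorithmic changes” mentioned above essentially amount to making the dot-product kernels stride-aware. A hypothetical sketch (dot_strided is my name, not part of the solution code) of what such a kernel would look like:

```c
#include <stddef.h>

// A pairwise dot product over strided data: with both strides equal to 1
// this reduces to the ordinary contiguous dot product; with a stride of J
// (or K) it walks a transposed layout, jumping over gaps between reads,
// which is why the layout change alone does not help performance.
double dot_strided(size_t n, const double *a, size_t astride,
                   const double *b, size_t bstride) {
  double sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i*astride] * b[i*bstride];
  return sum;
}
```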

  • Try using the #pragma omp simd directive seen in class and the compiler option -fopenmp-simd.

See the bdot_opt*_simd versions and related results (where * is 1, 2, 3). Adding the #pragma omp simd directive significantly helps only the bdot_opt3 version, and leaves the bdot_opt1 and bdot_opt2 results essentially unchanged.

! gcc -O3 -march=native -fopenmp-simd -fopenmp ../c_codes/hw3_solution/dot.c -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000 -b
Name    	flops	ticks	flops/tick
bdot_ref	640000	3743062	    0.17	
bdot_ref	640000	3746338	    0.17	
bdot_ref	640000	3713740	    0.17	
bdot_ref	640000	3782576	    0.17	
bdot_ref	640000	3710182	    0.17	
bdot_ref	640000	3733034	    0.17	
bdot_ref	640000	3776882	    0.17	
bdot_ref	640000	3732790	    0.17	
bdot_ref	640000	3707136	    0.17	
bdot_ref	640000	3712336	    0.17	
bdot_opt1	640000	2406724	    0.27	
bdot_opt1	640000	1503544	    0.43	
bdot_opt1	640000	1401484	    0.46	
bdot_opt1	640000	1348650	    0.47	
bdot_opt1	640000	1387612	    0.46	
bdot_opt1	640000	1391460	    0.46	
bdot_opt1	640000	1613596	    0.40	
bdot_opt1	640000	1479270	    0.43	
bdot_opt1	640000	1362592	    0.47	
bdot_opt1	640000	1391172	    0.46	
bdot_opt2	640000	1362786	    0.47	
bdot_opt2	640000	1380144	    0.46	
bdot_opt2	640000	1371448	    0.47	
bdot_opt2	640000	1385212	    0.46	
bdot_opt2	640000	1361832	    0.47	
bdot_opt2	640000	1386840	    0.46	
bdot_opt2	640000	1363798	    0.47	
bdot_opt2	640000	1378826	    0.46	
bdot_opt2	640000	1364462	    0.47	
bdot_opt2	640000	1367900	    0.47	
bdot_opt3	640000	1348248	    0.47	
bdot_opt3	640000	1312500	    0.49	
bdot_opt3	640000	1336318	    0.48	
bdot_opt3	640000	1319916	    0.48	
bdot_opt3	640000	1343640	    0.48	
bdot_opt3	640000	1321822	    0.48	
bdot_opt3	640000	1348920	    0.47	
bdot_opt3	640000	1322138	    0.48	
bdot_opt3	640000	1332408	    0.48	
bdot_opt3	640000	1317314	    0.49	
bdot_opt1_simd	640000	1387644	    0.46	
bdot_opt1_simd	640000	1363788	    0.47	
bdot_opt1_simd	640000	1469780	    0.44	
bdot_opt1_simd	640000	1318188	    0.49	
bdot_opt1_simd	640000	1330876	    0.48	
bdot_opt1_simd	640000	1365258	    0.47	
bdot_opt1_simd	640000	1320088	    0.48	
bdot_opt1_simd	640000	1416358	    0.45	
bdot_opt1_simd	640000	1320894	    0.48	
bdot_opt1_simd	640000	1342492	    0.48	
bdot_opt2_simd	640000	1308918	    0.49	
bdot_opt2_simd	640000	1340988	    0.48	
bdot_opt2_simd	640000	1306204	    0.49	
bdot_opt2_simd	640000	1357556	    0.47	
bdot_opt2_simd	640000	1306916	    0.49	
bdot_opt2_simd	640000	1337742	    0.48	
bdot_opt2_simd	640000	1311456	    0.49	
bdot_opt2_simd	640000	1337606	    0.48	
bdot_opt2_simd	640000	1303892	    0.49	
bdot_opt2_simd	640000	1325636	    0.48	
bdot_opt3_simd	640000	561452	    1.14	
bdot_opt3_simd	640000	557208	    1.15	
bdot_opt3_simd	640000	581866	    1.10	
bdot_opt3_simd	640000	560198	    1.14	
bdot_opt3_simd	640000	552394	    1.16	
bdot_opt3_simd	640000	554614	    1.15	
bdot_opt3_simd	640000	585024	    1.09	
bdot_opt3_simd	640000	561434	    1.14	
bdot_opt3_simd	640000	565898	    1.13	
bdot_opt3_simd	640000	556436	    1.15	
bdot_opt3_simd_reordered	640000	561168	    1.14	
bdot_opt3_simd_reordered	640000	587682	    1.09	
bdot_opt3_simd_reordered	640000	561772	    1.14	
bdot_opt3_simd_reordered	640000	556896	    1.15	
bdot_opt3_simd_reordered	640000	557428	    1.15	
bdot_opt3_simd_reordered	640000	597996	    1.07	
bdot_opt3_simd_reordered	640000	552732	    1.16	
bdot_opt3_simd_reordered	640000	555214	    1.15	
bdot_opt3_simd_reordered	640000	579458	    1.10	
bdot_opt3_simd_reordered	640000	596628	    1.07	

Code:#

#include "rdtsc.h"

#include <argp.h>
#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Args {
  size_t length;
  size_t nreps;
  bool block;
  size_t unroll_factor;
};

static struct argp_option options[] = {
  {"length", 'n', "size_t", 0, "Length of each vector"},
  {"nreps", 'r', "size_t", 0, "Number of repetitions"},
  {"block", 'b', NULL, 0, "Compute block dot products (versus a single dot product)"},
  {"unroll_factor", 'm', "size_t", 0, "Runtime unrolling factor for dot product"},
};

static error_t parse_opt (int key, char *arg, struct argp_state *state)
{
  struct Args *args = state->input;
  switch (key) {
  case ARGP_KEY_INIT:
    args->length = 100;
    args->nreps = 10;
    args->block = false;
    args->unroll_factor = 4;
    break;
  case 'n':
    args->length = strtol(arg, NULL, 10);
    break;
  case 'r':
    args->nreps = strtol(arg, NULL, 10);
    break;
  case 'b':
    args->block = true;
    break;
  case 'm':
    args->unroll_factor = strtol(arg, NULL, 10);
    break;
  default:
    return ARGP_ERR_UNKNOWN;
  }
  return 0;
}

// This part answers Part 1
#define COMP_TIME_M 4

// dot_opt_comp_time_m uses a compile-time unrolling factor (Part 1, Q4)
double dot_opt_comp_time_m(size_t n, const double *a, const double *b) {

  // check that n is evenly divisible by the compile-time unrolling factor
  if (n % COMP_TIME_M != 0) {
    printf("Error: n must be evenly divisible by the compile-time unrolling factor\n");
    return -1;
  }

  // one partial sum per unrolled lane
  double sum = 0, sums[COMP_TIME_M];
  memset(sums, 0, sizeof(sums));

  for (size_t i=0; i<n; i+=COMP_TIME_M) {
    for (size_t j=0; j<COMP_TIME_M; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  for (size_t j=0; j<COMP_TIME_M; j++) {
    sum += sums[j];
  }
  return sum;
}

// dot_opt_run_time_m uses a runtime unrolling factor (Part 1, Q5)
double dot_opt_run_time_m(size_t n, const double *a, const double *b, const size_t m) {

  // check that n is evenly divisible by the runtime unrolling factor
  if (n % m != 0) {
    printf("Error: n must be evenly divisible by the runtime unrolling factor\n");
    return -1;
  }

  // one partial sum per unrolled lane (a VLA, since m is only known at runtime)
  double sum = 0, sums[m];
  memset(sums, 0, sizeof(sums));

  for (size_t i=0; i<n; i+=m) {
    for (size_t j=0; j<m; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  for (size_t j=0; j<m; j++) {
    sum += sums[j];
  }
  return sum;
}
// This part ends the answer to Part 1

__attribute((noinline))
static double dot_ref(size_t n, const double *a, const double *b) {
  double sum = 0;
  for (size_t i=0; i<n; i++)
    sum += a[i] * b[i];
  return sum;
}

// The following even-odd function given in the assignment tries to alleviate some data dependency from one loop iteration to the next.
// Note: this function is NOT correct for all values of n. For odd n we incur an out-of-bounds access when we perform the second partial sum, sum1 += a[i+1] * b[i+1].
double dot_opt_even_odd(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  for (size_t i=0; i<n; i+=2) {
    sum0 += a[i+0] * b[i+0];
    sum1 += a[i+1] * b[i+1];
  }
  return sum0 + sum1;
}

__attribute((noinline))
static double dot_opt1(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt1_simd is the dot_opt1 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt1_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp simd
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt2(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt2_simd is the dot_opt2 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt2_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for simd
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt3(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}

// dot_opt3_simd is the dot_opt3 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt3_simd(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for simd reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}
static void report_dot(const char *name, ticks_t start_ticks, size_t flops, double result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (fabs(result - flops) > 1e-10)
    printf("Result %f failed to validate with expected value %ld\n", result, flops);
  printf("%8s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_DOT(f) do {                                              \
    for (int rep=0; rep<args.nreps; rep++) {                            \
      ticks_t ticks_start = rdtsc();                                    \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b)); \
    }                                                                   \
  } while (0)

// The following macro tests the performance of the runtime loop-unrolling factor (Part 1, Q5)
#define REPORT_DOT_RUNTIME(f, m) do {                                   \
    for (int rep=0; rep<args.nreps; rep++) {                            \
      ticks_t ticks_start = rdtsc();                                    \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b, m)); \
    }                                                                   \
  } while (0)
// Dimensions of the matrices for block dot products.
#define J 8
#define K 4

// Performs the operation
//   C = A * B
// where A and B have shape (J,n) and (n,K) respectively.
// This reference version stores A as row-major and B as column-major.
static void bdot_ref(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_ref(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1 internally calls dot_opt1
static void bdot_opt1(size_t n, const double *a, const double *b, double *c) {

  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1_simd internally calls dot_opt1_simd
static void bdot_opt1_simd(size_t n, const double *a, const double *b, double *c) {

  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2 internally calls dot_opt2
static void bdot_opt2(size_t n, const double *a, const double *b, double *c) {

  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2_simd internally calls dot_opt2_simd
static void bdot_opt2_simd(size_t n, const double *a, const double *b, double *c) {

  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3 internally calls dot_opt3
static void bdot_opt3(size_t n, const double *a, const double *b, double *c) {

  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd internally calls dot_opt3_simd
static void bdot_opt3_simd(size_t n, const double *a, const double *b, double *c) {

  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd_reordered performs the J*K (=32) pairwise products with the j,k loop order swapped
static void bdot_opt3_simd_reordered(size_t n, const double *a, const double *b, double *c) {

  for (size_t k=0; k<K; k++) {
    for (size_t j=0; j<J; j++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}


static void init_bdot(size_t n, double *a, size_t ajstride, size_t aistride,
                      double *b, size_t bistride, size_t bkstride) {
  for (size_t i=0; i<n; i++) {
    for (size_t j=0; j<J; j++)
      a[i*aistride + j*ajstride] = 1000*(i+1) + j+1;
    for (size_t k=0; k<K; k++)
      b[i*bistride + k*bkstride] = 1./(1000*(i+1) + k+1);
  }
}

static void report_bdot(const char *name, ticks_t start_ticks, size_t flops,
                        const double *result, int jstride, int kstride,
                        const double *ref_result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (result && ref_result && result != ref_result) {
    for (int j=0; j<J; j++) {
      for (int k=0; k<K; k++) {
        if (fabs(result[j*jstride + k*kstride] - ref_result[j*K+k]) > 1e-10) {
          printf("Result[%d,%d] = %f failed to validate with expected value %f\n", j, k, result[j*jstride + k*kstride], ref_result[j*K+k]);
        }
      }
    }
  }
  printf("%s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_BDOT(f, c, jstride, kstride, c_ref) do {                 \
    for (int rep=0; rep<args.nreps; rep++) {                            \
      ticks_t ticks_start = rdtsc();                                    \
      f(args.length, a, b, c);                                          \
      report_bdot(#f, ticks_start, 2*J*K*args.length, c, jstride, kstride, c_ref); \
    }                                                                   \
  } while (0)

int main(int argc, char **argv) {
  struct Args args;
  struct argp argp = {options, parse_opt, NULL, NULL};
  argp_parse(&argp, argc, argv, 0, 0, &args);
  size_t n = args.length;

  switch (args.block) {
  case false: {
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    for (size_t i=0; i<n; i++) {
      a[i] = 2.*(i+1);
      b[i] = 1./(i+1);
    }

    printf("  Name  \tflops\tticks\tflops/tick\n");
    REPORT_DOT(dot_ref);
    REPORT_DOT(dot_opt1);
    REPORT_DOT(dot_opt2);
    REPORT_DOT(dot_opt3);
    // To answer Part 1, we test the execution of the even-odd and compile-time/runtime loop-unrolling versions
    REPORT_DOT(dot_opt_even_odd);
    REPORT_DOT(dot_opt_comp_time_m);
    REPORT_DOT_RUNTIME(dot_opt_run_time_m, args.unroll_factor);

    free(a); free(b);
  } break;
  case true: {
    // Initialize the matrices (as flattened vectors)
    double *a = malloc(J * n * sizeof(double));
    double *b = malloc(K * n * sizeof(double));
    double *c = malloc(J * K * sizeof(double));
    double *c_ref = malloc(J * K * sizeof(double));

    printf("Name    \tflops\tticks\tflops/tick\n");
    init_bdot(args.length, a, n, 1, b, 1, n);
    REPORT_BDOT(bdot_ref, c_ref, K, 1, c_ref);

    // You may initialize a and b differently and call more variants by editing
    // the two lines below, or by creating new variants.
    init_bdot(args.length, a, n, 1, b, 1, n);
    // To answer Part 2, we test the execution of the three blocked opt versions (with and without SIMD)
    REPORT_BDOT(bdot_opt1, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt1_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd_reordered, c, K, 1, c_ref);

    free(a); free(b); free(c); free(c_ref);
  } break;
  }
  return 0;
}