24) HW3 solution#

Part 1: Optimizing dot product (30%)#

(5%) The compile/runtime unrolling factor versions achieve performance similar to the dot_opt2 version, but not as good as the dot_opt3 version.

In addition to your commentary, answer the following questions:

  1. (5%) Is this code correct for all values of n?

    • No. For odd n we incur an out-of-bounds access when we perform the second product, a[i+1] * b[i+1].
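A minimal sketch of how the unrolled loop can be made safe for all n, assuming a cleanup step for the leftover element (the name dot_opt_even_odd_safe is hypothetical, not part of the provided code):

```c
#include <stddef.h>

// Unroll-by-2 dot product that is safe for odd n: the main loop only runs
// while a full pair (i, i+1) is in bounds; a cleanup step handles the
// single leftover element when n is odd.
double dot_opt_even_odd_safe(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    sum0 += a[i+0] * b[i+0];
    sum1 += a[i+1] * b[i+1];
  }
  if (i < n)               // n odd: one element remains
    sum0 += a[i] * b[i];
  return sum0 + sum1;
}
```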

  2. (5%) Can this code change the numerically computed result?

    • Yes, because of floating-point arithmetic (FPA): we are adding an extra operation, and if any of the numbers being added were extremely small (or large), we would suffer from roundoff errors. Also, as you’ve learned, the C standard does not allow floating-point arithmetic to be reordered, because reordering may change the computed values (non-associativity).
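A tiny illustration of this non-associativity (the helper names sum_left and sum_right are hypothetical): summing 1.0, 1e20, and -1e20 gives 0.0 in one order and 1.0 in the other, because 1.0 is smaller than half an ulp of 1e20 and is absorbed when added to it first.

```c
// Floating-point addition is not associative: the two groupings below can
// return different results for the same three doubles.
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1.0, b = 1e20, c = -1e20: sum_left gives 0.0 (the 1.0 vanishes into 1e20), while sum_right gives 1.0 (b + c cancels exactly first).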

  3. (5%) Can you extend this technique to increase performance further (longer vector registers and/or more instruction-level parallelism)?

    • Yes. Instead of an unrolling factor of 2, we can chunk the data into larger blocks and apply SIMD-like operations (as we do in Part 2).
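A sketch of the more-ILP direction (the name dot_unroll4 is hypothetical, and for simplicity it assumes n is a multiple of 4): four independent accumulators remove the loop-carried dependency on a single sum, which is what lets the compiler overlap the multiply-adds and map the body onto vector registers under -O3 -march=native.

```c
#include <stddef.h>

// Unroll-by-4 with four independent partial sums: no accumulator depends
// on another within an iteration, exposing instruction-level parallelism.
// Assumes n is a multiple of 4.
double dot_unroll4(size_t n, const double *a, const double *b) {
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (size_t i = 0; i < n; i += 4) {
    s0 += a[i+0] * b[i+0];
    s1 += a[i+1] * b[i+1];
    s2 += a[i+2] * b[i+2];
    s3 += a[i+3] * b[i+3];
  }
  return (s0 + s1) + (s2 + s3);
}
```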

  4. (5%) Can you make the unrolling factor 2 a compile-time constant, perhaps by using an inner loop?

    • Yes, we can do that (see the dot_opt_comp_time_m function).

  5. (5%) Could that unrolling factor be a run-time parameter?

    • Yes, we can do that (see the dot_opt_run_time_m function). There are a couple of different ways this can be achieved.

      • If someone defined the runtime parameter in, say, main, then the compiler would get “smart” and treat it the same way as a compile-time constant. Keep in mind that we are compiling with the optimization flag -O3, so if we only used that unrolling factor in the call to our function in main, the compiler would know at compile time the value we are passing; and because we call that function only once, it might simply substitute the runtime parameter as if it were a constant known at compile time.

      • But if someone passed a true runtime unrolling factor via a command-line argument, which the compiler cannot know in advance, then yes, the performance of the runtime version would be lower than that of its compile-time counterpart.

! gcc -O3 -march=native -fopenmp  ../c_codes/hw3_solution/dot.c   -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000
  Name  	flops	ticks	flops/tick
 dot_ref	20000	57800	    0.35	
 dot_ref	20000	57616	    0.35	
 dot_ref	20000	57626	    0.35	
 dot_ref	20000	57632	    0.35	
 dot_ref	20000	57564	    0.35	
 dot_ref	20000	57562	    0.35	
 dot_ref	20000	57614	    0.35	
 dot_ref	20000	57634	    0.35	
 dot_ref	20000	57560	    0.35	
 dot_ref	20000	57548	    0.35	
dot_opt1	20000	1757938	    0.01	
dot_opt1	20000	581676	    0.03	
dot_opt1	20000	696132	    0.03	
dot_opt1	20000	885786	    0.02	
dot_opt1	20000	865362	    0.02	
dot_opt1	20000	787246	    0.03	
dot_opt1	20000	599558	    0.03	
dot_opt1	20000	63432	    0.32	
dot_opt1	20000	67446	    0.30	
dot_opt1	20000	581704	    0.03	
dot_opt2	20000	1080788	    0.02	
dot_opt2	20000	874322	    0.02	
dot_opt2	20000	610828	    0.03	
dot_opt2	20000	1017028	    0.02	
dot_opt2	20000	885956	    0.02	
dot_opt2	20000	766450	    0.03	
dot_opt2	20000	991940	    0.02	
dot_opt2	20000	1187360	    0.02	
dot_opt2	20000	82834	    0.24	
dot_opt2	20000	71780	    0.28	
dot_opt3	20000	112104	    0.18	
dot_opt3	20000	667324	    0.03	
dot_opt3	20000	486988	    0.04	
dot_opt3	20000	651768	    0.03	
dot_opt3	20000	611812	    0.03	
dot_opt3	20000	658816	    0.03	
dot_opt3	20000	521840	    0.04	
dot_opt3	20000	587960	    0.03	
dot_opt3	20000	763112	    0.03	
dot_opt3	20000	79924	    0.25	
dot_opt_even_odd	20000	44448	    0.45	
dot_opt_even_odd	20000	43978	    0.45	
dot_opt_even_odd	20000	44952	    0.44	
dot_opt_even_odd	20000	44940	    0.45	
dot_opt_even_odd	20000	44824	    0.45	
dot_opt_even_odd	20000	44940	    0.45	
dot_opt_even_odd	20000	44444	    0.45	
dot_opt_even_odd	20000	44510	    0.45	
dot_opt_even_odd	20000	44020	    0.45	
dot_opt_even_odd	20000	44188	    0.45	
dot_opt_comp_time_m	20000	96980	    0.21	
dot_opt_comp_time_m	20000	34592	    0.58	
dot_opt_comp_time_m	20000	34528	    0.58	
dot_opt_comp_time_m	20000	34572	    0.58	
dot_opt_comp_time_m	20000	34418	    0.58	
dot_opt_comp_time_m	20000	34344	    0.58	
dot_opt_comp_time_m	20000	34192	    0.58	
dot_opt_comp_time_m	20000	34320	    0.58	
dot_opt_comp_time_m	20000	34184	    0.59	
dot_opt_comp_time_m	20000	34284	    0.58	
dot_opt_run_time_m	20000	100080	    0.20	
dot_opt_run_time_m	20000	99796	    0.20	
dot_opt_run_time_m	20000	99760	    0.20	
dot_opt_run_time_m	20000	99720	    0.20	
dot_opt_run_time_m	20000	99724	    0.20	
dot_opt_run_time_m	20000	99696	    0.20	
dot_opt_run_time_m	20000	99716	    0.20	
dot_opt_run_time_m	20000	99740	    0.20	
dot_opt_run_time_m	20000	99700	    0.20	
dot_opt_run_time_m	20000	99704	    0.20	

Part 2: Optimizing block inner product (70%)#

  • What does it mean to reorder loops? Will it help or hurt performance?

The two lines that originally use j as the slowest index and k as the fastest index to perform the pairwise products:

for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {

tell us that the same row of the matrix \(A^T\) is multiplied by all columns of \(B\), and we store the result in \(C\) row-wise.

If we swapped these two lines (see the bdot_opt3_simd_reordered function, where we do this), we would multiply the same column of \(B\) by all rows of \(A^T\), and we would store the result in \(C\) column-wise.

The effect of doing so is actually negligible on performance (on my architecture) and might bring only a very minimal advantage on some other architectures. If you think about it, the swap only changes the order in which the 32 pairwise dot products are performed, while the majority of the work in this code is done inside each individual pairwise dot product (20000 FLOPs each).

  • Does it help to change the layout in memory (see the aistride, ajstride, bistride, and bkstride parameters in the code)?

No, changing the layout without further algorithmic changes in the functions that actually perform the dot products does not help. First of all, the only place the stride parameters can be changed is the call to the init_bdot function. Instead of calling init_bdot as we called it originally:

init_bdot(args.length, a, n, 1, b, 1, n);

one argument in each stride pair has to be kept at 1 to fill the whole matrices A and B without any gaps, as in:

init_bdot(args.length, a, 1, n, b, n, 1);

This would make A be stored column-wise and B row-wise (effectively transposing the matrices). For the algorithm to still produce the same pairwise dot products, we would then need further algorithmic changes; and even if we made those changes so the results stayed correct, it would not help performance, as we would be back in the original situation.

One might also have thought of using smaller strides, such as 1, J and 1, K. Again, however, this would only change how the data entries are laid out; without further algorithmic changes in the functions that actually call the pairwise dot products (which are implemented to compute the dot product of entire vectors), it would not be beneficial.
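To see concretely what swapping a stride pair does, here is a small standalone sketch that fills an A-like block twice, mimicking the A-filling half of init_bdot, once with (ajstride, aistride) = (n, 1) and once with (1, J), and checks that the two layouts are exact transposes of each other. The sizes JJ and NN and the helper names are illustration-only, chosen to avoid the J and K macros in the real code.

```c
#include <stddef.h>

enum { JJ = 3, NN = 4 };  // small illustration sizes (stand-ins for J and n)

// Mimics the A-filling half of init_bdot: a[i*aistride + j*ajstride].
static void fill_a(double *a, size_t ajstride, size_t aistride) {
  for (size_t i = 0; i < NN; i++)
    for (size_t j = 0; j < JJ; j++)
      a[i*aistride + j*ajstride] = 1000*(i+1) + j+1;
}

// Returns 1 iff the (NN, 1) layout is the element-wise transpose of (1, JJ).
int layouts_are_transposed(void) {
  double row_major[JJ*NN], col_major[JJ*NN];
  fill_a(row_major, NN, 1);   // each row j of A occupies NN contiguous entries
  fill_a(col_major, 1, JJ);   // each column i of A occupies JJ contiguous entries
  for (size_t i = 0; i < NN; i++)
    for (size_t j = 0; j < JJ; j++)
      if (row_major[j*NN + i] != col_major[i*JJ + j]) return 0;
  return 1;
}
```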

  • Try using the #pragma omp simd directive seen in class and the compiler option -fopenmp-simd.

See all the bdot_opt*_simd versions and related results (where * is 1,2,3). The addition of the #pragma omp simd directive only significantly helps the bdot_opt3 optimized version, and leaves the bdot_opt1 and bdot_opt2 results essentially unaltered.
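The directive itself is one line; a minimal standalone use looks like the kernel below (the same pattern as the dot_opt3_simd loop body). Without -fopenmp-simd or -fopenmp the pragma is simply ignored, so the function stays correct either way:

```c
#include <stddef.h>

// Dot product with the loop marked for explicit vectorization. The
// reduction(+:sum) clause tells the compiler that the accumulations into
// sum may be reassociated across SIMD lanes, which plain C semantics
// would otherwise forbid.
double dot_simd(size_t n, const double *a, const double *b) {
  double sum = 0;
  #pragma omp simd reduction(+:sum)
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```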

! gcc -O3 -march=native -fopenmp-simd -fopenmp ../c_codes/hw3_solution/dot.c -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000 -b
Name    	flops	ticks	flops/tick
bdot_ref	640000	1553330	    0.41	
bdot_ref	640000	1482816	    0.43	
bdot_ref	640000	1473188	    0.43	
bdot_ref	640000	2491487	    0.26	
bdot_ref	640000	984068	    0.65	
bdot_ref	640000	1002864	    0.64	
bdot_ref	640000	980176	    0.65	
bdot_ref	640000	979916	    0.65	
bdot_ref	640000	1002372	    0.64	
bdot_ref	640000	983882	    0.65	
bdot_opt1	640000	2992558	    0.21	
bdot_opt1	640000	1881600	    0.34	
bdot_opt1	640000	1447478	    0.44	
bdot_opt1	640000	1534196	    0.42	
bdot_opt1	640000	1449300	    0.44	
bdot_opt1	640000	1481620	    0.43	
bdot_opt1	640000	1558464	    0.41	
bdot_opt1	640000	1465784	    0.44	
bdot_opt1	640000	1427708	    0.45	
bdot_opt1	640000	2893732	    0.22	
bdot_opt2	640000	3470022	    0.18	
bdot_opt2	640000	2293348	    0.28	
bdot_opt2	640000	2234964	    0.29	
bdot_opt2	640000	2242024	    0.29	
bdot_opt2	640000	2182178	    0.29	
bdot_opt2	640000	2213002	    0.29	
bdot_opt2	640000	2189554	    0.29	
bdot_opt2	640000	2258316	    0.28	
bdot_opt2	640000	2259706	    0.28	
bdot_opt2	640000	2250576	    0.28	
bdot_opt3	640000	2737568	    0.23	
bdot_opt3	640000	1605498	    0.40	
bdot_opt3	640000	1600662	    0.40	
bdot_opt3	640000	1587474	    0.40	
bdot_opt3	640000	1579622	    0.41	
bdot_opt3	640000	1580854	    0.40	
bdot_opt3	640000	1587928	    0.40	
bdot_opt3	640000	1795260	    0.36	
bdot_opt3	640000	1656106	    0.39	
bdot_opt3	640000	1523864	    0.42	
bdot_opt1_simd	640000	1655884	    0.39	
bdot_opt1_simd	640000	1654972	    0.39	
bdot_opt1_simd	640000	1656084	    0.39	
bdot_opt1_simd	640000	1657640	    0.39	
bdot_opt1_simd	640000	1624626	    0.39	
bdot_opt1_simd	640000	1581086	    0.40	
bdot_opt1_simd	640000	1623642	    0.39	
bdot_opt1_simd	640000	1671630	    0.38	
bdot_opt1_simd	640000	1575028	    0.41	
bdot_opt1_simd	640000	1586240	    0.40	
bdot_opt2_simd	640000	2220036	    0.29	
bdot_opt2_simd	640000	2211386	    0.29	
bdot_opt2_simd	640000	2267582	    0.28	
bdot_opt2_simd	640000	2267962	    0.28	
bdot_opt2_simd	640000	2270632	    0.28	
bdot_opt2_simd	640000	2211588	    0.29	
bdot_opt2_simd	640000	2207122	    0.29	
bdot_opt2_simd	640000	2232456	    0.29	
bdot_opt2_simd	640000	2240136	    0.29	
bdot_opt2_simd	640000	2225774	    0.29	
bdot_opt3_simd	640000	1489792	    0.43	
bdot_opt3_simd	640000	1467498	    0.44	
bdot_opt3_simd	640000	1378286	    0.46	
bdot_opt3_simd	640000	1399970	    0.46	
bdot_opt3_simd	640000	1330830	    0.48	
bdot_opt3_simd	640000	1436224	    0.45	
bdot_opt3_simd	640000	1444604	    0.44	
bdot_opt3_simd	640000	1393742	    0.46	
bdot_opt3_simd	640000	1316022	    0.49	
bdot_opt3_simd	640000	1350594	    0.47	
bdot_opt3_simd_reordered	640000	1312090	    0.49	
bdot_opt3_simd_reordered	640000	1354006	    0.47	
bdot_opt3_simd_reordered	640000	1335878	    0.48	
bdot_opt3_simd_reordered	640000	1303826	    0.49	
bdot_opt3_simd_reordered	640000	1390632	    0.46	
bdot_opt3_simd_reordered	640000	1333384	    0.48	
bdot_opt3_simd_reordered	640000	1371484	    0.47	
bdot_opt3_simd_reordered	640000	1316464	    0.49	
bdot_opt3_simd_reordered	640000	1303554	    0.49	
bdot_opt3_simd_reordered	640000	1357476	    0.47	

Code:#

#include "rdtsc.h"

#include <argp.h>
#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Args {
  size_t length;
  size_t nreps;
  bool block;
  size_t unroll_factor;
};

static struct argp_option options[] = {
  {"length", 'n', "size_t", 0, "Length of each vector"},
  {"nreps", 'r', "size_t", 0, "Number of repetitions"},
  {"block", 'b', NULL, 0, "Compute block dot products (versus a single dot product)"},
  {"unroll_factor", 'm', "size_t", 0, "Runtime unrolling factor for dot product"},
};

static error_t parse_opt (int key, char *arg, struct argp_state *state)
{
  struct Args *args = state->input;
  switch (key) {
  case ARGP_KEY_INIT:
    args->length = 100;
    args->nreps = 10;
    args->block = false;
    args->unroll_factor = 4;
    break;
  case 'n':
    args->length = strtol(arg, NULL, 10);
    break;
  case 'r':
    args->nreps = strtol(arg, NULL, 10);
    break;
  case 'b':
    args->block = true;
    break;
  case 'm':
    args->unroll_factor = strtol(arg, NULL, 10);
    break;
  default:
    return ARGP_ERR_UNKNOWN;
  }
  return 0;
}

// This part is to answer Part 1
#define COMP_TIME_M 4

// dot_opt_comp_time_m uses a compile-time unrolling factor (Part 1, Q4)
double dot_opt_comp_time_m(size_t n, const double *a, const double *b) {

  // check that n is divided evenly by the compile-time unrolling factor
  if (n % COMP_TIME_M != 0) {
    printf("Error, the compile-time unrolling factor must divide n evenly\n");
    return -1;
  }

  // one partial sum per unrolled lane; the array is sized by the unrolling
  // factor, since only indices 0..COMP_TIME_M-1 are ever written
  double sum = 0, sums[COMP_TIME_M];
  memset(sums, 0, sizeof(sums));

  for (size_t i=0; i<n; i+=COMP_TIME_M) {
    for (size_t j=0; j<COMP_TIME_M; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  for (size_t j=0; j<COMP_TIME_M; j++) {
    sum += sums[j];
  }
  return sum;
}

// dot_opt_run_time_m uses a runtime unrolling factor (Part 1, Q5)
double dot_opt_run_time_m(size_t n, const double *a, const double *b, const size_t m) {

  // check that n is divided evenly by the runtime unrolling factor
  if (n % m != 0) {
    printf("Error, the runtime unrolling factor must divide n evenly\n");
    return -1;
  }

  // one partial sum per unrolled lane (a variable-length array, since m is
  // only known at runtime)
  double sum = 0, sums[m];

  // initialize the dynamically created sums array to zero
  memset(sums, 0, sizeof(sums));
  for (size_t i=0; i<n; i+=m) {
    for (size_t j=0; j<m; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  for (size_t j=0; j<m; j++) {
    sum += sums[j];
  }
  return sum;
}
// This part ends to answer Part 1

__attribute((noinline))
static double dot_ref(size_t n, const double *a, const double *b) {
  double sum = 0;
  for (size_t i=0; i<n; i++)
    sum += a[i] * b[i];
  return sum;
}

// The following even-odd function given in the assignment tries to alleviate some data dependency from one loop iteration to another.
// Note: the following function is NOT correct for all values of n. In fact, for odd n we incur an out-of-bounds access when we perform the second partial sum sum1 += a[i+1] * b[i+1]
double dot_opt_even_odd(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  for (size_t i=0; i<n; i+=2) {
    sum0 += a[i+0] * b[i+0];
    sum1 += a[i+1] * b[i+1];
  }
  return sum0 + sum1;
}

__attribute((noinline))
static double dot_opt1(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt1_simd is the dot_opt1 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt1_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp simd
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt2(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt2_simd is the dot_opt2 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt2_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for simd
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt3(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}

// dot_opt3_simd is the dot_opt3 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt3_simd(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for simd reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}

static void report_dot(const char *name, ticks_t start_ticks, size_t flops, double result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (fabs(result - flops) > 1e-10)
    printf("Result %f failed to validate with expected value %ld\n", result, flops);
  printf("%8s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_DOT(f) do {                                              \
    for (int rep=0; rep<args.nreps; rep++) {                            \
      ticks_t ticks_start = rdtsc();                                    \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b)); \
    }                                                                   \
  } while (0)

// The following macro is to test the performance of the runtime loop unrolling factor to answer Part 1.5
#define REPORT_DOT_RUNTIME(f, m) do {                                      \
    for (int rep=0; rep<args.nreps; rep++) {                               \
      ticks_t ticks_start = rdtsc();                                       \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b, m)); \
    }                                                                      \
  } while (0)

// Dimensions of the matrices for block dot products.
#define J 8
#define K 4

// Performs the operation
//   C = A * B
// where A and B have shape (J,n) and (n,K) respectively.
// This reference version stores A as row-major and B as column-major.
static void bdot_ref(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_ref(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1 internally calls dot_opt1
static void bdot_opt1(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1_simd internally calls dot_opt1_simd
static void bdot_opt1_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2 internally calls dot_opt2
static void bdot_opt2(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2_simd internally calls dot_opt2_simd
static void bdot_opt2_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3 internally calls dot_opt3
static void bdot_opt3(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd internally calls dot_opt3_simd
static void bdot_opt3_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd_reordered performs the J * K (=32) pairwise products swapping the k,j indices
static void bdot_opt3_simd_reordered(size_t n, const double *a, const double *b, double *c) {
  for (size_t k=0; k<K; k++) {
    for (size_t j=0; j<J; j++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

static void init_bdot(size_t n, double *a, size_t ajstride, size_t aistride,
                      double *b, size_t bistride, size_t bkstride) {
  for (size_t i=0; i<n; i++) {
    for (size_t j=0; j<J; j++)
      a[i*aistride + j*ajstride] = 1000*(i+1) + j+1;
    for (size_t k=0; k<K; k++)
      b[i*bistride + k*bkstride] = 1./(1000*(i+1) + k+1);
  }
}

static void report_bdot(const char *name, ticks_t start_ticks, size_t flops,
                        const double *result, int jstride, int kstride,
                        const double *ref_result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (result && ref_result && result != ref_result) {
    for (int j=0; j<J; j++) {
      for (int k=0; k<K; k++) {
        if (fabs(result[j*jstride + k*kstride] - ref_result[j*K+k]) > 1e-10) {
          printf("Result[%d,%d] = %f failed to validate with expected value %f\n", j, k, result[j*jstride + k*kstride], ref_result[j*K+k]);
        }
      }
    }
  }
  printf("%s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_BDOT(f, c, jstride, kstride, c_ref) do {                 \
    for (int rep=0; rep<args.nreps; rep++) {                            \
      ticks_t ticks_start = rdtsc();                                    \
      f(args.length, a, b, c);                                          \
      report_bdot(#f, ticks_start, 2*J*K*args.length, c, jstride, kstride, c_ref); \
    }                                                                   \
  } while (0)

int main(int argc, char **argv) {
  struct Args args;
  struct argp argp = {options, parse_opt, NULL, NULL};
  argp_parse(&argp, argc, argv, 0, 0, &args);
  size_t n = args.length;

  switch (args.block) {
  case false: {
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    for (size_t i=0; i<n; i++) {
      a[i] = 2.*(i+1);
      b[i] = 1./(i+1);
    }

    printf("  Name  \tflops\tticks\tflops/tick\n");
    REPORT_DOT(dot_ref);
    REPORT_DOT(dot_opt1);
    REPORT_DOT(dot_opt2);
    REPORT_DOT(dot_opt3);
    // To answer Part 1, we test the execution of the even-odd and compile/runtime loop unrolling factor versions
    REPORT_DOT(dot_opt_even_odd);
    REPORT_DOT(dot_opt_comp_time_m);
    REPORT_DOT_RUNTIME(dot_opt_run_time_m, args.unroll_factor);

    free(a); free(b);
  } break;
  case true: {
    // Initialize the matrices (as flattened vectors)
    double *a = malloc(J * n * sizeof(double));
    double *b = malloc(K * n * sizeof(double));
    double *c = malloc(J * K * sizeof(double));
    double *c_ref = malloc(J * K * sizeof(double));

    printf("Name    \tflops\tticks\tflops/tick\n");
    init_bdot(args.length, a, n, 1, b, 1, n);
    REPORT_BDOT(bdot_ref, c_ref, K, 1, c_ref);

    // You may initialize a and b differently and call more variants by editing
    // the two lines below, or by creating new variants.
    init_bdot(args.length, a, n, 1, b, 1, n);
    // To answer Part 2, we test the execution of the different blocked opt versions
    REPORT_BDOT(bdot_opt1, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt1_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd_reordered, c, K, 1, c_ref);

    free(a); free(b); free(c); free(c_ref);
  } break;
  }
  return 0;
}