24) HW3 solution#
Part 1: Optimizing dot product (30%)#
(5%) The compile/runtime unrolling factor versions achieve performance similar to the `dot_opt2` version, but not as good as the `dot_opt3` version.
In addition to your commentary, answer the following questions:
(5%) Is this code correct for all values of n?
No. For odd `n`, we incur an out-of-bounds access when we perform the second product, `a[i+1] * b[i+1]`.
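A minimal sketch of a safe variant, peeling off the trailing element when `n` is odd (the name `dot_even_odd_safe` is hypothetical and not part of the benchmarked code):

#include <stddef.h>

double dot_even_odd_safe(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  size_t i;
  for (i = 0; i + 1 < n; i += 2) {  // stop before the last element when n is odd
    sum0 += a[i] * b[i];
    sum1 += a[i+1] * b[i+1];
  }
  if (i < n) sum0 += a[i] * b[i];   // remainder element for odd n
  return sum0 + sum1;
}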
(5%) Can this code change the numerically computed result?
Yes, because of floating-point arithmetic (FPA). We are adding an extra operation, and if any of the addends is extremely small (or large) relative to the others, we will suffer from roundoff errors. Also, as you've learned, the C standard does not allow floating-point arithmetic to be reordered, because reordering may change the computed values (floating-point addition is not associative).
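As a minimal, standalone illustration (not part of the assignment code), the order of additions alone can change a double-precision result:

#include <stdio.h>

int main(void) {
  double big = 1.0e16, small = 1.0;
  // At magnitude 1e16 the spacing between adjacent doubles exceeds 1, so the
  // small addend is lost if it is absorbed into the large value first.
  printf("%g\n", (big + small) - big);  // prints 0
  printf("%g\n", (big - big) + small);  // prints 1
  return 0;
}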
(5%) Can you extend this technique to increase performance further (longer vector registers and/or more instruction-level parallelism)?
Yes. Instead of an unrolling factor of 2, we can definitely consider chunking the data into larger blocks and applying SIMD-like operations (as we do in Part 2).
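For instance, a minimal sketch that combines a wider unrolling factor with an explicit vectorization hint (the name `dot_unroll8_simd` is hypothetical; it assumes `n` is a multiple of 8 and compilation with `-O3 -march=native -fopenmp-simd`):

#include <stddef.h>

double dot_unroll8_simd(size_t n, const double *a, const double *b) {
  double sums[8] = {0};                // 8 independent accumulators
  for (size_t i = 0; i < n; i += 8) {
    #pragma omp simd
    for (size_t j = 0; j < 8; j++)
      sums[j] += a[i+j] * b[i+j];
  }
  double sum = 0;
  for (size_t j = 0; j < 8; j++) sum += sums[j];
  return sum;
}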
(5%) Can you make the unrolling factor 2 a compile-time constant, perhaps by using an inner loop?
Yes, we can do that (see the `dot_opt_comp_time_m` function).
(5%) Could that unrolling factor be a run-time parameter?
Yes, we can do that (see the `dot_opt_run_time_m` function). There are a couple of different ways this can be achieved. If someone defined the runtime parameter as a hard-coded value in, say, `main`, then the compiler would get "smart" and treat it the same way as a compile-time constant. Keep in mind that we are compiling with the optimization flag `-O3`, so if the only use of that unrolling factor were a literal in the call to our function in `main`, the compiler would know the value at compile time, and because we call that function in only one place, it might substitute the runtime parameter as if it were a constant known at compile time. But if someone passed a true runtime unrolling factor via a command-line argument, which the compiler cannot know in advance, then, yes, the performance of the runtime version would be lower than that of its compile-time counterpart.
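For example, a genuinely runtime factor could be passed through the `-m` option our code defines (here 8, which divides n = 10000 evenly; the run below uses the default factor of 4):

OMP_NUM_THREADS=4 ./dot -r 10 -n 10000 -m 8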
! gcc -O3 -march=native -fopenmp ../c_codes/hw3_solution/dot.c -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000
Name flops ticks flops/tick
dot_ref 20000 119592 0.17
dot_ref 20000 118928 0.17
dot_ref 20000 118836 0.17
dot_ref 20000 118546 0.17
dot_ref 20000 118776 0.17
dot_ref 20000 118562 0.17
dot_ref 20000 118448 0.17
dot_ref 20000 118484 0.17
dot_ref 20000 118360 0.17
dot_ref 20000 118412 0.17
dot_opt1 20000 1016736 0.02
dot_opt1 20000 52460 0.38
dot_opt1 20000 46042 0.43
dot_opt1 20000 45026 0.44
dot_opt1 20000 44474 0.45
dot_opt1 20000 44284 0.45
dot_opt1 20000 44320 0.45
dot_opt1 20000 44260 0.45
dot_opt1 20000 43464 0.46
dot_opt1 20000 44458 0.45
dot_opt2 20000 46100 0.43
dot_opt2 20000 45326 0.44
dot_opt2 20000 44222 0.45
dot_opt2 20000 44602 0.45
dot_opt2 20000 43978 0.45
dot_opt2 20000 46296 0.43
dot_opt2 20000 44100 0.45
dot_opt2 20000 46328 0.43
dot_opt2 20000 44364 0.45
dot_opt2 20000 44884 0.45
dot_opt3 20000 43628 0.46
dot_opt3 20000 41364 0.48
dot_opt3 20000 42590 0.47
dot_opt3 20000 40096 0.50
dot_opt3 20000 41600 0.48
dot_opt3 20000 40272 0.50
dot_opt3 20000 42280 0.47
dot_opt3 20000 41864 0.48
dot_opt3 20000 40860 0.49
dot_opt3 20000 40044 0.50
dot_opt_even_odd 20000 64004 0.31
dot_opt_even_odd 20000 64466 0.31
dot_opt_even_odd 20000 114844 0.17
dot_opt_even_odd 20000 66252 0.30
dot_opt_even_odd 20000 64298 0.31
dot_opt_even_odd 20000 64574 0.31
dot_opt_even_odd 20000 64980 0.31
dot_opt_even_odd 20000 64512 0.31
dot_opt_even_odd 20000 64618 0.31
dot_opt_even_odd 20000 64618 0.31
dot_opt_comp_time_m 20000 118654 0.17
dot_opt_comp_time_m 20000 68176 0.29
dot_opt_comp_time_m 20000 67624 0.30
dot_opt_comp_time_m 20000 67250 0.30
dot_opt_comp_time_m 20000 67262 0.30
dot_opt_comp_time_m 20000 67478 0.30
dot_opt_comp_time_m 20000 67544 0.30
dot_opt_comp_time_m 20000 67570 0.30
dot_opt_comp_time_m 20000 67430 0.30
dot_opt_comp_time_m 20000 67416 0.30
dot_opt_run_time_m 20000 111820 0.18
dot_opt_run_time_m 20000 110708 0.18
dot_opt_run_time_m 20000 110612 0.18
dot_opt_run_time_m 20000 110440 0.18
dot_opt_run_time_m 20000 110112 0.18
dot_opt_run_time_m 20000 153508 0.13
dot_opt_run_time_m 20000 111034 0.18
dot_opt_run_time_m 20000 110758 0.18
dot_opt_run_time_m 20000 110572 0.18
dot_opt_run_time_m 20000 110360 0.18
Part 2: Optimizing block inner product (70%)#
What does it mean to reorder loops? Will it help or hurt performance?
The two lines that originally use `j` as the slowest index and `k` as the fastest index to perform the pairwise products:
for (size_t j=0; j<J; j++) {
for (size_t k=0; k<K; k++) {
tell us that the same row of the matrix \(A^T\) is multiplied by all columns of \(B\), and we store the result in \(C\) row-wise.
If we swapped these two lines (see the `bdot_opt3_simd_reordered` function, where we do this), we would multiply the same column of \(B\) by all rows of \(A^T\), and we would store the result in \(C\) column-wise.
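For reference, the swapped loop nest in that function reads:

for (size_t k=0; k<K; k++) {
  for (size_t j=0; j<J; j++) {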
The effect of doing so is actually negligible on performance (on my architecture) and might bring only a very minimal advantage on some other architectures. If you think about it, the swap only changes the order in which the 32 pairwise dot products are performed, while the majority of the work in this code is done inside each single pairwise dot product (20000 FLOPs each).
Does it help to change the layout in memory (see the `aistride`, `ajstride`, `bistride`, and `bkstride` parameters in the code)?
No: changing the layout without further algorithmic changes in the functions where we actually perform the dot products does not help. First of all, the only place the stride parameters can be changed is the call to the `init_bdot` function, which originally reads:

init_bdot(args.length, a, n, 1, b, 1, n);

i.e., `A` is stored row-wise and `B` column-wise, and both arrays are filled without gaps. To change the layout while still filling the whole matrices `A` and `B` (which hold exactly `J * n` and `K * n` doubles) without gaps or out-of-bounds writes, for each matrix one stride must stay `1` while the other matches the extent of the fast-moving index, which leads to the smaller stride patterns `1, J` and `K, 1`:

init_bdot(args.length, a, 1, J, b, K, 1);

This would make `A` be stored column-wise and `B` stored row-wise (effectively transposing the storage of both matrices). If we did this, then for the algorithm to still produce the same pairwise dot products, we would also need further algorithmic changes: the dot-product functions are implemented to walk two entire contiguous vectors, so they would have to walk strided data instead. And even if we made those changes so that the results stayed correct, it would not help performance; we would only have changed how the data entries are laid out in memory, and we would be back to the original situation.
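Concretely, those further algorithmic changes would amount to strided dot-product kernels. A minimal sketch (the name `dot_strided` is hypothetical and not part of the benchmark) could look like:

#include <stddef.h>

// Dot product of two logically contiguous vectors whose consecutive entries
// are astride (resp. bstride) doubles apart in memory.
double dot_strided(size_t n, const double *a, size_t astride,
                   const double *b, size_t bstride) {
  double sum = 0;
  for (size_t i = 0; i < n; i++)
    sum += a[i*astride] * b[i*bstride];
  return sum;
}

With the interleaved layout above, entry \(C_{jk}\) would be computed as `dot_strided(n, &a[j], J, &b[k], K)`.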
Try using the `#pragma omp simd` directive seen in class and the compiler option `-fopenmp-simd`.
See all the `bdot_opt*_simd` versions and the related results (where `*` is 1, 2, 3). The addition of the `#pragma omp simd` directive significantly helps only the `bdot_opt3` optimized version, and leaves the `bdot_opt1` and `bdot_opt2` results essentially unaltered. This is consistent with how the accumulation is expressed: only `dot_opt3_simd` declares it as a `reduction(+:sum)`, which gives the compiler license to vectorize with private partial sums, whereas `dot_opt1_simd` and `dot_opt2_simd` accumulate into the shared array element `sums[id]`, a loop-carried memory dependence that inhibits vectorization.
! gcc -O3 -march=native -fopenmp-simd -fopenmp ../c_codes/hw3_solution/dot.c -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000 -b
Name flops ticks flops/tick
bdot_ref 640000 3743062 0.17
bdot_ref 640000 3746338 0.17
bdot_ref 640000 3713740 0.17
bdot_ref 640000 3782576 0.17
bdot_ref 640000 3710182 0.17
bdot_ref 640000 3733034 0.17
bdot_ref 640000 3776882 0.17
bdot_ref 640000 3732790 0.17
bdot_ref 640000 3707136 0.17
bdot_ref 640000 3712336 0.17
bdot_opt1 640000 2406724 0.27
bdot_opt1 640000 1503544 0.43
bdot_opt1 640000 1401484 0.46
bdot_opt1 640000 1348650 0.47
bdot_opt1 640000 1387612 0.46
bdot_opt1 640000 1391460 0.46
bdot_opt1 640000 1613596 0.40
bdot_opt1 640000 1479270 0.43
bdot_opt1 640000 1362592 0.47
bdot_opt1 640000 1391172 0.46
bdot_opt2 640000 1362786 0.47
bdot_opt2 640000 1380144 0.46
bdot_opt2 640000 1371448 0.47
bdot_opt2 640000 1385212 0.46
bdot_opt2 640000 1361832 0.47
bdot_opt2 640000 1386840 0.46
bdot_opt2 640000 1363798 0.47
bdot_opt2 640000 1378826 0.46
bdot_opt2 640000 1364462 0.47
bdot_opt2 640000 1367900 0.47
bdot_opt3 640000 1348248 0.47
bdot_opt3 640000 1312500 0.49
bdot_opt3 640000 1336318 0.48
bdot_opt3 640000 1319916 0.48
bdot_opt3 640000 1343640 0.48
bdot_opt3 640000 1321822 0.48
bdot_opt3 640000 1348920 0.47
bdot_opt3 640000 1322138 0.48
bdot_opt3 640000 1332408 0.48
bdot_opt3 640000 1317314 0.49
bdot_opt1_simd 640000 1387644 0.46
bdot_opt1_simd 640000 1363788 0.47
bdot_opt1_simd 640000 1469780 0.44
bdot_opt1_simd 640000 1318188 0.49
bdot_opt1_simd 640000 1330876 0.48
bdot_opt1_simd 640000 1365258 0.47
bdot_opt1_simd 640000 1320088 0.48
bdot_opt1_simd 640000 1416358 0.45
bdot_opt1_simd 640000 1320894 0.48
bdot_opt1_simd 640000 1342492 0.48
bdot_opt2_simd 640000 1308918 0.49
bdot_opt2_simd 640000 1340988 0.48
bdot_opt2_simd 640000 1306204 0.49
bdot_opt2_simd 640000 1357556 0.47
bdot_opt2_simd 640000 1306916 0.49
bdot_opt2_simd 640000 1337742 0.48
bdot_opt2_simd 640000 1311456 0.49
bdot_opt2_simd 640000 1337606 0.48
bdot_opt2_simd 640000 1303892 0.49
bdot_opt2_simd 640000 1325636 0.48
bdot_opt3_simd 640000 561452 1.14
bdot_opt3_simd 640000 557208 1.15
bdot_opt3_simd 640000 581866 1.10
bdot_opt3_simd 640000 560198 1.14
bdot_opt3_simd 640000 552394 1.16
bdot_opt3_simd 640000 554614 1.15
bdot_opt3_simd 640000 585024 1.09
bdot_opt3_simd 640000 561434 1.14
bdot_opt3_simd 640000 565898 1.13
bdot_opt3_simd 640000 556436 1.15
bdot_opt3_simd_reordered 640000 561168 1.14
bdot_opt3_simd_reordered 640000 587682 1.09
bdot_opt3_simd_reordered 640000 561772 1.14
bdot_opt3_simd_reordered 640000 556896 1.15
bdot_opt3_simd_reordered 640000 557428 1.15
bdot_opt3_simd_reordered 640000 597996 1.07
bdot_opt3_simd_reordered 640000 552732 1.16
bdot_opt3_simd_reordered 640000 555214 1.15
bdot_opt3_simd_reordered 640000 579458 1.10
bdot_opt3_simd_reordered 640000 596628 1.07
Code:#
#include "rdtsc.h"

#include <argp.h>
#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Args {
  size_t length;
  size_t nreps;
  bool block;
  size_t unroll_factor;
};

static struct argp_option options[] = {
  {"length", 'n', "size_t", 0, "Length of each vector"},
  {"nreps", 'r', "size_t", 0, "Number of repetitions"},
  {"block", 'b', NULL, 0, "Compute block dot products (versus a single dot product)"},
  {"unroll_factor", 'm', "size_t", 0, "Runtime unrolling factor for dot product"},
};

static error_t parse_opt (int key, char *arg, struct argp_state *state)
{
  struct Args *args = state->input;
  switch (key) {
  case ARGP_KEY_INIT:
    args->length = 100;
    args->nreps = 10;
    args->block = false;
    args->unroll_factor = 4;
    break;
  case 'n':
    args->length = strtol(arg, NULL, 10);
    break;
  case 'r':
    args->nreps = strtol(arg, NULL, 10);
    break;
  case 'b':
    args->block = true;
    break;
  case 'm':
    args->unroll_factor = strtol(arg, NULL, 10);
    break;
  default:
    return ARGP_ERR_UNKNOWN;
  }
  return 0;
}

// This part answers Part 1
#define COMP_TIME_M 4

// dot_opt_comp_time_m uses a compile-time unrolling factor (Part 1, Q4)
double dot_opt_comp_time_m(size_t n, const double *a, const double *b) {

  // check that n is divided evenly by the compile-time unrolling factor
  if (n % COMP_TIME_M != 0) {
    printf("Error: n must be divisible by the compile-time unrolling factor\n");
    return -1;
  }

  // one partial accumulator per slot of the unrolling factor
  double sum = 0, sums[COMP_TIME_M];
  memset(sums, 0, sizeof(sums));

  for (size_t i=0; i<n; i+=COMP_TIME_M) {
    for (size_t j=0; j<COMP_TIME_M; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  // reduce the partial accumulators into the final sum
  for (size_t j=0; j<COMP_TIME_M; j++) {
    sum += sums[j];
  }
  return sum;
}

// dot_opt_run_time_m uses a runtime unrolling factor (Part 1, Q5)
double dot_opt_run_time_m(size_t n, const double *a, const double *b, const size_t m) {

  // check that n is divided evenly by the runtime unrolling factor
  if (n % m != 0) {
    printf("Error: n must be divisible by the runtime unrolling factor\n");
    return -1;
  }

  // one partial accumulator per slot of the unrolling factor (a VLA, since m
  // is only known at run time); initialize it to zero
  double sum = 0, sums[m];
  memset(sums, 0, sizeof(sums));

  for (size_t i=0; i<n; i+=m) {
    for (size_t j=0; j<m; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  // reduce the partial accumulators into the final sum
  for (size_t j=0; j<m; j++) {
    sum += sums[j];
  }
  return sum;
}
// End of the additions for Part 1

__attribute((noinline))
static double dot_ref(size_t n, const double *a, const double *b) {
  double sum = 0;
  for (size_t i=0; i<n; i++)
    sum += a[i] * b[i];
  return sum;
}

// The following even-odd function given in the assignment tries to alleviate
// some data dependency from one loop iteration to the next.
// Note: this function is NOT correct for all values of n. For odd n we incur
// an out-of-bounds access when we perform the second partial sum,
// sum1 += a[i+1] * b[i+1].
double dot_opt_even_odd(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  for (size_t i=0; i<n; i+=2) {
    sum0 += a[i+0] * b[i+0];
    sum1 += a[i+1] * b[i+1];
  }
  return sum0 + sum1;
}

__attribute((noinline))
static double dot_opt1(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    // thread id accumulates the elements with index i = id (mod 4)
    int id = omp_get_thread_num();
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt1_simd is the dot_opt1 version with the #pragma omp simd directive added
__attribute((noinline))
static double dot_opt1_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp simd
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt2(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt2_simd is the dot_opt2 version with the #pragma omp simd directive added
__attribute((noinline))
static double dot_opt2_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for simd
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt3(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}

// dot_opt3_simd is the dot_opt3 version with the #pragma omp simd directive added
__attribute((noinline))
static double dot_opt3_simd(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for simd reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}

static void report_dot(const char *name, ticks_t start_ticks, size_t flops, double result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (fabs(result - flops) > 1e-10)
    printf("Result %f failed to validate with expected value %ld\n", result, flops);
  printf("%8s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_DOT(f) do { \
    for (int rep=0; rep<args.nreps; rep++) { \
      ticks_t ticks_start = rdtsc(); \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b)); \
    } \
  } while (0)

// The following macro tests the performance of the runtime loop-unrolling factor (Part 1, Q5)
#define REPORT_DOT_RUNTIME(f, m) do { \
    for (int rep=0; rep<args.nreps; rep++) { \
      ticks_t ticks_start = rdtsc(); \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b, m)); \
    } \
  } while (0)

// Dimensions of the matrices for block dot products.
#define J 8
#define K 4

// Performs the operation
//   C = A * B
// where A and B have shape (J,n) and (n,K) respectively.
// This reference version stores A as row-major and B as column-major.
static void bdot_ref(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_ref(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1 internally calls dot_opt1
static void bdot_opt1(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1_simd internally calls dot_opt1_simd
static void bdot_opt1_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2 internally calls dot_opt2
static void bdot_opt2(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2_simd internally calls dot_opt2_simd
static void bdot_opt2_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3 internally calls dot_opt3
static void bdot_opt3(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd internally calls dot_opt3_simd
static void bdot_opt3_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd_reordered performs the J * K (= 32) pairwise products with the j,k indices swapped
static void bdot_opt3_simd_reordered(size_t n, const double *a, const double *b, double *c) {
  for (size_t k=0; k<K; k++) {
    for (size_t j=0; j<J; j++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

static void init_bdot(size_t n, double *a, size_t ajstride, size_t aistride,
                      double *b, size_t bistride, size_t bkstride) {
  for (size_t i=0; i<n; i++) {
    for (size_t j=0; j<J; j++)
      a[i*aistride + j*ajstride] = 1000*(i+1) + j+1;
    for (size_t k=0; k<K; k++)
      b[i*bistride + k*bkstride] = 1./(1000*(i+1) + k+1);
  }
}

static void report_bdot(const char *name, ticks_t start_ticks, size_t flops,
                        const double *result, int jstride, int kstride,
                        const double *ref_result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (result && ref_result && result != ref_result) {
    for (int j=0; j<J; j++) {
      for (int k=0; k<K; k++) {
        if (fabs(result[j*jstride + k*kstride] - ref_result[j*K+k]) > 1e-10) {
          printf("Result[%d,%d] = %f failed to validate with expected value %f\n", j, k, result[j*jstride + k*kstride], ref_result[j*K+k]);
        }
      }
    }
  }
  printf("%s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_BDOT(f, c, jstride, kstride, c_ref) do { \
    for (int rep=0; rep<args.nreps; rep++) { \
      ticks_t ticks_start = rdtsc(); \
      f(args.length, a, b, c); \
      report_bdot(#f, ticks_start, 2*J*K*args.length, c, jstride, kstride, c_ref); \
    } \
  } while (0)

int main(int argc, char **argv) {
  struct Args args;
  struct argp argp = {options, parse_opt, NULL, NULL};
  argp_parse(&argp, argc, argv, 0, 0, &args);
  size_t n = args.length;

  switch (args.block) {
  case false: {
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    for (size_t i=0; i<n; i++) {
      a[i] = 2.*(i+1);
      b[i] = 1./(i+1);
    }

    printf(" Name \tflops\tticks\tflops/tick\n");
    REPORT_DOT(dot_ref);
    REPORT_DOT(dot_opt1);
    REPORT_DOT(dot_opt2);
    REPORT_DOT(dot_opt3);
    // To answer Part 1, we also test the even-odd version and the
    // compile-time/runtime loop-unrolling-factor versions
    REPORT_DOT(dot_opt_even_odd);
    REPORT_DOT(dot_opt_comp_time_m);
    REPORT_DOT_RUNTIME(dot_opt_run_time_m, args.unroll_factor);

    free(a); free(b);
  } break;
  case true: {
    // Initialize the matrices (as flattened vectors)
    double *a = malloc(J * n * sizeof(double));
    double *b = malloc(K * n * sizeof(double));
    double *c = malloc(J * K * sizeof(double));
    double *c_ref = malloc(J * K * sizeof(double));

    printf("Name \tflops\tticks\tflops/tick\n");
    init_bdot(args.length, a, n, 1, b, 1, n);
    REPORT_BDOT(bdot_ref, c_ref, K, 1, c_ref);

    // You may initialize a and b differently and call more variants by editing
    // the two lines below, or by creating new variants.
    init_bdot(args.length, a, n, 1, b, 1, n);
    // To answer Part 2, we test the execution of the three blocked opt
    // versions, their simd variants, and the reordered variant
    REPORT_BDOT(bdot_opt1, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt1_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd_reordered, c, K, 1, c_ref);

    free(a); free(b); free(c); free(c_ref);
  } break;
  }
  return 0;
}