24) HW3 solution#
Part 1: Optimizing dot product (30%)#
(5%) The compile/runtime unrolling factor versions achieve performance similar to the dot_opt2 version, but not as good as the dot_opt3 version.
In addition to your commentary, answer the following questions:
(5%) Is this code correct for all values of n?
No. For odd n, we incur an out-of-bounds access when we perform the second product, a[i+1] * b[i+1].
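One way to make the even-odd version correct for every n is to stop the unrolled loop while a full pair remains and handle the leftover element with a scalar cleanup iteration. A minimal sketch (the function name dot_even_odd_safe is hypothetical and not part of the graded code):

#include <stddef.h>

// Sketch: even-odd unrolling plus a scalar cleanup so that odd n is handled safely
double dot_even_odd_safe(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {   // only iterate while a full pair remains
    sum0 += a[i] * b[i];
    sum1 += a[i+1] * b[i+1];
  }
  if (i < n)                    // leftover element when n is odd
    sum0 += a[i] * b[i];
  return sum0 + sum1;
}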
(5%) Can this code change the numerically computed result?
Yes, because of floating-point arithmetic (FPA). We are adding an extra operation, and if some of the addends are extremely small (or large) relative to the others, we will suffer from roundoff errors. Also, as you have learned, the C standard does not allow floating-point arithmetic to be reordered, because reordering may change the computed values (non-associativity).
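As a tiny standalone illustration (not part of the homework code) of why reassociating a sum can change the computed value in double precision:

// Illustrative only: floating-point addition is not associative in double precision.
#include <stdio.h>

int main(void) {
  double big = 1e16, small = 1.0;
  double left  = (big + small) + small;   // small is absorbed twice -> 1e16
  double right = big + (small + small);   // the 2.0 survives        -> 1e16 + 2
  printf("%.1f vs %.1f\n", left, right);  // prints two different values
  return 0;
}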
(5%) Can you extend this technique to increase performance further (longer vector registers and/or more instruction-level parallelism)?
Yes. Instead of an unrolling factor of 2, we can chunk the data into larger blocks and apply SIMD-like operations (as we do in Part 2).
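For example, one could unroll by 4 with independent accumulators, which gives the compiler more instruction-level parallelism and maps naturally onto wider vector registers. A minimal sketch, assuming n is a multiple of 4 (the name dot_unroll4 is hypothetical):

#include <stddef.h>

// Sketch: unroll by 4 with independent partial sums (assumes n % 4 == 0)
double dot_unroll4(size_t n, const double *a, const double *b) {
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (size_t i = 0; i < n; i += 4) {
    s0 += a[i+0] * b[i+0];
    s1 += a[i+1] * b[i+1];
    s2 += a[i+2] * b[i+2];
    s3 += a[i+3] * b[i+3];
  }
  return (s0 + s1) + (s2 + s3);
}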
(5%) Can you make the unrolling factor 2 a compile-time constant, perhaps by using an inner loop?
Yes, we can do that (see the dot_opt_comp_time_m function).
(5%) Could that unrolling factor be a run-time parameter?
Yes, we can do that (see the dot_opt_run_time_m function). There are a couple of different ways this can be achieved. If someone defined the runtime parameter in, say, main, then the compiler would get "smart" and treat it the same way as a compile-time constant. Keep in mind that we are compiling with the optimization flag -O3, so if we only used that unrolling factor in the call to our function in main, the compiler would know at compile time the value we are passing and, because we call that function only once, it might simply substitute the runtime value as if it were a constant known at compile time. But if someone added a true runtime unrolling factor, passed via a command-line argument that the compiler cannot know in advance, then yes, the performance of the runtime version would be lower than that of its compile-time counterpart.
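For reference, a truly runtime factor is supplied to this code through the -m option parsed by argp (see the listing below); a run such as the following (output omitted) passes a value the compiler cannot fold into a constant:
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000 -m 8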
! gcc -O3 -march=native -fopenmp ../c_codes/hw3_solution/dot.c -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000
Name flops ticks flops/tick
dot_ref 20000 57800 0.35
dot_ref 20000 57616 0.35
dot_ref 20000 57626 0.35
dot_ref 20000 57632 0.35
dot_ref 20000 57564 0.35
dot_ref 20000 57562 0.35
dot_ref 20000 57614 0.35
dot_ref 20000 57634 0.35
dot_ref 20000 57560 0.35
dot_ref 20000 57548 0.35
dot_opt1 20000 1757938 0.01
dot_opt1 20000 581676 0.03
dot_opt1 20000 696132 0.03
dot_opt1 20000 885786 0.02
dot_opt1 20000 865362 0.02
dot_opt1 20000 787246 0.03
dot_opt1 20000 599558 0.03
dot_opt1 20000 63432 0.32
dot_opt1 20000 67446 0.30
dot_opt1 20000 581704 0.03
dot_opt2 20000 1080788 0.02
dot_opt2 20000 874322 0.02
dot_opt2 20000 610828 0.03
dot_opt2 20000 1017028 0.02
dot_opt2 20000 885956 0.02
dot_opt2 20000 766450 0.03
dot_opt2 20000 991940 0.02
dot_opt2 20000 1187360 0.02
dot_opt2 20000 82834 0.24
dot_opt2 20000 71780 0.28
dot_opt3 20000 112104 0.18
dot_opt3 20000 667324 0.03
dot_opt3 20000 486988 0.04
dot_opt3 20000 651768 0.03
dot_opt3 20000 611812 0.03
dot_opt3 20000 658816 0.03
dot_opt3 20000 521840 0.04
dot_opt3 20000 587960 0.03
dot_opt3 20000 763112 0.03
dot_opt3 20000 79924 0.25
dot_opt_even_odd 20000 44448 0.45
dot_opt_even_odd 20000 43978 0.45
dot_opt_even_odd 20000 44952 0.44
dot_opt_even_odd 20000 44940 0.45
dot_opt_even_odd 20000 44824 0.45
dot_opt_even_odd 20000 44940 0.45
dot_opt_even_odd 20000 44444 0.45
dot_opt_even_odd 20000 44510 0.45
dot_opt_even_odd 20000 44020 0.45
dot_opt_even_odd 20000 44188 0.45
dot_opt_comp_time_m 20000 96980 0.21
dot_opt_comp_time_m 20000 34592 0.58
dot_opt_comp_time_m 20000 34528 0.58
dot_opt_comp_time_m 20000 34572 0.58
dot_opt_comp_time_m 20000 34418 0.58
dot_opt_comp_time_m 20000 34344 0.58
dot_opt_comp_time_m 20000 34192 0.58
dot_opt_comp_time_m 20000 34320 0.58
dot_opt_comp_time_m 20000 34184 0.59
dot_opt_comp_time_m 20000 34284 0.58
dot_opt_run_time_m 20000 100080 0.20
dot_opt_run_time_m 20000 99796 0.20
dot_opt_run_time_m 20000 99760 0.20
dot_opt_run_time_m 20000 99720 0.20
dot_opt_run_time_m 20000 99724 0.20
dot_opt_run_time_m 20000 99696 0.20
dot_opt_run_time_m 20000 99716 0.20
dot_opt_run_time_m 20000 99740 0.20
dot_opt_run_time_m 20000 99700 0.20
dot_opt_run_time_m 20000 99704 0.20
Part 2: Optimizing block inner product (70%)#
What does it mean to reorder loops? Will it help or hurt performance?
The two lines that originally use j as the slowest index and k as the fastest index to perform the pair-wise products:
for (size_t j=0; j<J; j++) {
for (size_t k=0; k<K; k++) {
tell us that the same row of the matrix \(A^T\) is multiplied by all columns of \(B\), and we store the result in \(C\) row-wise.
If we swapped these two lines (see the bdot_opt3_simd_reordered function where we do this), we would multiply the same column of \(B\) by all rows of \(A^T\) and store the result in \(C\) column-wise.
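Concretely, the reordered nest in bdot_opt3_simd_reordered is:

for (size_t k=0; k<K; k++) {
  for (size_t j=0; j<J; j++) {
    c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
  }
}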
The effect of doing so is actually negligible on performance (on my architecture) and might bring only a very minimal advantage on some other architectures. If you think about it, this only changes the order in which the 32 pair-wise dot products are performed, while the bulk of the work is done inside each individual dot product (20000 FLOPs each).
Does it help to change the layout in memory (see the aistride, ajstride, bistride, and bkstride parameters in the code)?
No. Doing so without further algorithmic changes in the functions where we actually perform the dot products does not help. First of all, the only place the stride parameters can be changed is the call to the init_bdot function. Instead of calling init_bdot as we called it originally:
init_bdot(args.length, a, n, 1, b, 1, n);
one of the two arguments between n and 1 has to be kept at 1 to fill the whole matrices A and B without any gaps, as in:
init_bdot(args.length, a, 1, n, b, n, 1);
This would make A be stored column-wise and B stored row-wise (effectively transposing the matrices). If we did this, then for the algorithm to still produce the same pair-wise dot products we would also need to make further algorithmic changes. And even if we made those changes so that the results were correct, it would not help performance: we would simply be back to the original situation.
One might also have thought of using smaller blocks/stride patterns, such as 1, J and 1, K. Again, however, this would only change how the data entries are laid out, and without further algorithmic changes in the functions that actually call the pair-wise dot products (which are implemented to perform the dot product of the entire vectors), it would not be beneficial.
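To make the stride parameters concrete, here is how init_bdot addresses memory under the two calls above (this just spells out the indexing a[i*aistride + j*ajstride] and b[i*bistride + k*bkstride] from the listing):

// Original call  init_bdot(n, a, n, 1, b, 1, n):
//   A entry (j,i) lives at a[j*n + i]  -> row j of A is contiguous
//   B entry (i,k) lives at b[k*n + i]  -> column k of B is contiguous
// Swapped call   init_bdot(n, a, 1, n, b, n, 1):
//   A entry (j,i) lives at a[i*n + j]  -> A stored column-wise (row j is strided by n)
//   B entry (i,k) lives at b[i*n + k]  -> B stored row-wise (column k is strided by n)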
Try using the #pragma omp simd directive seen in class and the compiler option -fopenmp-simd.
See all the bdot_opt*_simd versions and the related results (where * is 1, 2, 3). Adding the #pragma omp simd directive only significantly helps the bdot_opt3 optimized version, and it leaves the bdot_opt1 and bdot_opt2 results essentially unaltered.
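For reference, the only change in dot_opt3_simd relative to dot_opt3 is the simd clause on the worksharing/reduction loop:

#pragma omp for simd reduction(+:sum)
for (size_t i=0; i<n; i++)
  sum += a[i] * b[i];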
! gcc -O3 -march=native -fopenmp-simd -fopenmp ../c_codes/hw3_solution/dot.c -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000 -b
Name flops ticks flops/tick
bdot_ref 640000 1553330 0.41
bdot_ref 640000 1482816 0.43
bdot_ref 640000 1473188 0.43
bdot_ref 640000 2491487 0.26
bdot_ref 640000 984068 0.65
bdot_ref 640000 1002864 0.64
bdot_ref 640000 980176 0.65
bdot_ref 640000 979916 0.65
bdot_ref 640000 1002372 0.64
bdot_ref 640000 983882 0.65
bdot_opt1 640000 2992558 0.21
bdot_opt1 640000 1881600 0.34
bdot_opt1 640000 1447478 0.44
bdot_opt1 640000 1534196 0.42
bdot_opt1 640000 1449300 0.44
bdot_opt1 640000 1481620 0.43
bdot_opt1 640000 1558464 0.41
bdot_opt1 640000 1465784 0.44
bdot_opt1 640000 1427708 0.45
bdot_opt1 640000 2893732 0.22
bdot_opt2 640000 3470022 0.18
bdot_opt2 640000 2293348 0.28
bdot_opt2 640000 2234964 0.29
bdot_opt2 640000 2242024 0.29
bdot_opt2 640000 2182178 0.29
bdot_opt2 640000 2213002 0.29
bdot_opt2 640000 2189554 0.29
bdot_opt2 640000 2258316 0.28
bdot_opt2 640000 2259706 0.28
bdot_opt2 640000 2250576 0.28
bdot_opt3 640000 2737568 0.23
bdot_opt3 640000 1605498 0.40
bdot_opt3 640000 1600662 0.40
bdot_opt3 640000 1587474 0.40
bdot_opt3 640000 1579622 0.41
bdot_opt3 640000 1580854 0.40
bdot_opt3 640000 1587928 0.40
bdot_opt3 640000 1795260 0.36
bdot_opt3 640000 1656106 0.39
bdot_opt3 640000 1523864 0.42
bdot_opt1_simd 640000 1655884 0.39
bdot_opt1_simd 640000 1654972 0.39
bdot_opt1_simd 640000 1656084 0.39
bdot_opt1_simd 640000 1657640 0.39
bdot_opt1_simd 640000 1624626 0.39
bdot_opt1_simd 640000 1581086 0.40
bdot_opt1_simd 640000 1623642 0.39
bdot_opt1_simd 640000 1671630 0.38
bdot_opt1_simd 640000 1575028 0.41
bdot_opt1_simd 640000 1586240 0.40
bdot_opt2_simd 640000 2220036 0.29
bdot_opt2_simd 640000 2211386 0.29
bdot_opt2_simd 640000 2267582 0.28
bdot_opt2_simd 640000 2267962 0.28
bdot_opt2_simd 640000 2270632 0.28
bdot_opt2_simd 640000 2211588 0.29
bdot_opt2_simd 640000 2207122 0.29
bdot_opt2_simd 640000 2232456 0.29
bdot_opt2_simd 640000 2240136 0.29
bdot_opt2_simd 640000 2225774 0.29
bdot_opt3_simd 640000 1489792 0.43
bdot_opt3_simd 640000 1467498 0.44
bdot_opt3_simd 640000 1378286 0.46
bdot_opt3_simd 640000 1399970 0.46
bdot_opt3_simd 640000 1330830 0.48
bdot_opt3_simd 640000 1436224 0.45
bdot_opt3_simd 640000 1444604 0.44
bdot_opt3_simd 640000 1393742 0.46
bdot_opt3_simd 640000 1316022 0.49
bdot_opt3_simd 640000 1350594 0.47
bdot_opt3_simd_reordered 640000 1312090 0.49
bdot_opt3_simd_reordered 640000 1354006 0.47
bdot_opt3_simd_reordered 640000 1335878 0.48
bdot_opt3_simd_reordered 640000 1303826 0.49
bdot_opt3_simd_reordered 640000 1390632 0.46
bdot_opt3_simd_reordered 640000 1333384 0.48
bdot_opt3_simd_reordered 640000 1371484 0.47
bdot_opt3_simd_reordered 640000 1316464 0.49
bdot_opt3_simd_reordered 640000 1303554 0.49
bdot_opt3_simd_reordered 640000 1357476 0.47
Code:#
1#include "rdtsc.h"
2
3#include <argp.h>
4#include <math.h>
5#include <omp.h>
6#include <stdbool.h>
7#include <stdio.h>
8#include <stdlib.h>
9#include <string.h>
10
11struct Args {
12 size_t length;
13 size_t nreps;
14 bool block;
15 size_t unroll_factor;
16};
17
18static struct argp_option options[] = {
19 {"length", 'n', "size_t", 0, "Length of each vector"},
20 {"nreps", 'r', "size_t", 0, "Number of repetitions"},
21 {"block", 'b', NULL, 0, "Compute block dot products (versus a single dot product)"},
22 {"unroll_factor", 'm', "size_t", 0, "Runtime unrolling factor for dot product"},
23};
24
25static error_t parse_opt (int key, char *arg, struct argp_state *state)
26{
27 struct Args *args = state->input;
28 switch (key) {
29 case ARGP_KEY_INIT:
30 args->length = 100;
31 args->nreps = 10;
32 args->block = false;
33 args->unroll_factor = 4;
34 break;
35 case 'n':
36 args->length = strtol(arg, NULL, 10);
37 break;
38 case 'r':
39 args->nreps = strtol(arg, NULL, 10);
40 break;
41 case 'b':
42 args->block = true;
43 break;
44 case 'm':
45 args->unroll_factor = strtol(arg, NULL, 10);
46 break;
47 default:
48 return ARGP_ERR_UNKNOWN;
49 }
50 return 0;
51}
52
53// The following functions were added to answer Part 1
54#define COMP_TIME_M 4
55
56// dot_opt_comp_time_m uses a compile-time unrolling factor (Part 1, Q4)
57double dot_opt_comp_time_m(size_t n, const double *a, const double *b) {
58
59 // check if n is divided evenly by the compile-time unrolling factor
60 if (n % COMP_TIME_M != 0) {
61 printf("Error: the compile-time unrolling factor must evenly divide n\n");
62 return -1;
63 }
64
65 double sum = 0, sums[n / COMP_TIME_M];
66 memset(sums, 0, sizeof(sums));
67
68 for (size_t i=0; i<n; i+=COMP_TIME_M) {
69 for (size_t j=0; j<COMP_TIME_M; j++) {
70 sums[j] += a[i+j] * b[i+j];
71 }
72 }
73 for (size_t j=0; j < n/COMP_TIME_M; j++) {
74 sum += sums[j];
75 }
76 return sum;
77}
78
79// dot_opt_run_time_m uses a runtime unrolling factor (Part 1, Q5)
80double dot_opt_run_time_m(size_t n, const double *a, const double *b, const size_t m) {
81
82 // check if n is divided evenly by the runtime unrolling factor m
83 if (n % m != 0) {
84 printf("Error: the runtime unrolling factor must evenly divide n\n");
85 return -1;
86 }
87
88 double sum = 0, sums[n / m];
89
90 // initialize the dynamically created sums array to zero
91 memset(sums, 0, sizeof(sums));
92 // printf("Before for-loop \n");
93 for (size_t i=0; i<n; i+=m) {
94 // printf("Inside first for-loop \n");
95 for (size_t j=0; j < m; j++) {
96 // printf("Inside 2nd for-loop. I is %d, j is %d \n", i, j);
97 sums[j] += a[i+j] * b[i+j];
98 }
99 }
100 for (size_t j=0; j < n/m; j++) {
101 // printf("Inside the reduction for-loop. j is %d \n", j);
102 sum += sums[j];
103 }
104 return sum;
105}
106// End of the functions added to answer Part 1
107
108__attribute((noinline))
109static double dot_ref(size_t n, const double *a, const double *b) {
110 double sum = 0;
111 for (size_t i=0; i<n; i++)
112 sum += a[i] * b[i];
113 return sum;
114}
115
116// The following even-odd function, given in the assignment, tries to alleviate the data dependency between consecutive loop iterations.
117// Note: the following function is NOT correct for all values of n. In fact, for odd n we incur an out-of-bounds access when we perform the second partial sum, sum1 += a[i+1] * b[i+1]
118double dot_opt_even_odd(size_t n, const double *a, const double *b) {
119 double sum0 = 0, sum1 = 0;
120 for (size_t i=0; i<n; i+=2) {
121 sum0 += a[i+0] * b[i+0];
122 sum1 += a[i+1] * b[i+1];
123 }
124 return sum0 + sum1;
125}
126
127__attribute((noinline))
128static double dot_opt1(size_t n, const double *a, const double *b) {
129 double sums[4] = {};
130 omp_set_num_threads(4);
131 #pragma omp parallel
132 {
133 int id = omp_get_thread_num();
134 for (size_t i=id; i<n; i+=4)
135 sums[id] += a[i] * b[i];
136 }
137 for (size_t j=1; j<4; j++) sums[0] += sums[j];
138 return sums[0];
139}
140
141// dot_opt1_simd is the dot_opt1 version to which we added the #pragma omp simd directive
142__attribute((noinline))
143static double dot_opt1_simd(size_t n, const double *a, const double *b) {
144 double sums[4] = {};
145 omp_set_num_threads(4);
146 #pragma omp parallel
147 {
148 int id = omp_get_thread_num();
149 #pragma omp simd
150 for (size_t i=id; i<n; i+=4)
151 sums[id] += a[i] * b[i];
152 }
153 #pragma omp simd
154 for (size_t j=1; j<4; j++) sums[0] += sums[j];
155 return sums[0];
156}
157
158__attribute((noinline))
159static double dot_opt2(size_t n, const double *a, const double *b) {
160 double sums[4] = {};
161 omp_set_num_threads(4);
162 #pragma omp parallel
163 {
164 int id = omp_get_thread_num();
165 #pragma omp for
166 for (size_t i=0; i<n; i++)
167 sums[id] += a[i] * b[i];
168 }
169 for (size_t j=1; j<4; j++) sums[0] += sums[j];
170 return sums[0];
171}
172
173// dot_opt2_simd is the dot_opt2 version to which we added the #pragma omp simd directive
174__attribute((noinline))
175static double dot_opt2_simd(size_t n, const double *a, const double *b) {
176 double sums[4] = {};
177 omp_set_num_threads(4);
178 #pragma omp parallel
179 {
180 int id = omp_get_thread_num();
181 #pragma omp for simd
182 for (size_t i=0; i<n; i++)
183 sums[id] += a[i] * b[i];
184 }
185 #pragma omp simd
186 for (size_t j=1; j<4; j++) sums[0] += sums[j];
187 return sums[0];
188}
189
190__attribute((noinline))
191static double dot_opt3(size_t n, const double *a, const double *b) {
192 double sum = 0;
193 omp_set_num_threads(4);
194 #pragma omp parallel
195 {
196 #pragma omp for reduction(+:sum)
197 for (size_t i=0; i<n; i++)
198 sum += a[i] * b[i];
199 }
200 return sum;
201}
202
203// dot_opt3_simd is the dot_opt3 version to which we added the #pragma omp simd directive
204__attribute((noinline))
205static double dot_opt3_simd(size_t n, const double *a, const double *b) {
206 double sum = 0;
207 omp_set_num_threads(4);
208 #pragma omp parallel
209 {
210 #pragma omp for simd reduction(+:sum)
211 for (size_t i=0; i<n; i++)
212 sum += a[i] * b[i];
213 }
214 return sum;
215}
216
217static void report_dot(const char *name, ticks_t start_ticks, size_t flops, double result) {
218 ticks_t ticks = rdtsc() - start_ticks;
219 double rate = 1.*flops / ticks;
220 if (fabs(result - flops) > 1e-10)
221 printf("Result %f failed to validate with expected value %ld\n", result, flops);
222 printf("%8s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
223}
224
225#define REPORT_DOT(f) do { \
226 for (int rep=0; rep<args.nreps; rep++) { \
227 ticks_t ticks_start = rdtsc(); \
228 report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b)); \
229 } \
230 } while (0)
231
232// The following macro tests the performance of the runtime unrolling factor version (Part 1, Q5)
233#define REPORT_DOT_RUNTIME(f, m) do { \
234 for (int rep=0; rep<args.nreps; rep++) { \
235 ticks_t ticks_start = rdtsc(); \
236 report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b, m)); \
237 } \
238} while (0)
239
240// Dimensions of the matrices for block dot products.
241#define J 8
242#define K 4
243
244// Performs the operation
245// C = A * B
246// where A and B have shape (J,n) and (n,K) respectively.
247// This reference version stores A as row-major and B as column-major.
248static void bdot_ref(size_t n, const double *a, const double *b, double *c) {
249 for (size_t j=0; j<J; j++) {
250 for (size_t k=0; k<K; k++) {
251 c[j*K+k] = dot_ref(n, &a[j*n], &b[k*n]);
252 }
253 }
254}
255
256// bdot_opt1 internally calls dot_opt1
257static void bdot_opt1(size_t n, const double *a, const double *b, double *c) {
258
259 for (size_t j=0; j<J; j++) {
260 for (size_t k=0; k<K; k++) {
261 c[j*K+k] = dot_opt1(n, &a[j*n], &b[k*n]);
262 }
263 }
264}
265
266// bdot_opt1_simd internally calls dot_opt1_simd
267static void bdot_opt1_simd(size_t n, const double *a, const double *b, double *c) {
268
269 for (size_t j=0; j<J; j++) {
270 for (size_t k=0; k<K; k++) {
271 c[j*K+k] = dot_opt1_simd(n, &a[j*n], &b[k*n]);
272 }
273 }
274}
275
276// bdot_opt2 internally calls dot_opt2
277static void bdot_opt2(size_t n, const double *a, const double *b, double *c) {
278
279 for (size_t j=0; j<J; j++) {
280 for (size_t k=0; k<K; k++) {
281 c[j*K+k] = dot_opt2(n, &a[j*n], &b[k*n]);
282 }
283 }
284}
285
286// bdot_opt2_simd internally calls dot_opt2_simd
287static void bdot_opt2_simd(size_t n, const double *a, const double *b, double *c) {
288
289 for (size_t j=0; j<J; j++) {
290 for (size_t k=0; k<K; k++) {
291 c[j*K+k] = dot_opt2_simd(n, &a[j*n], &b[k*n]);
292 }
293 }
294}
295
296// bdot_opt3 internally calls dot_opt3
297static void bdot_opt3(size_t n, const double *a, const double *b, double *c) {
298
299 for (size_t j=0; j<J; j++) {
300 for (size_t k=0; k<K; k++) {
301 c[j*K+k] = dot_opt3(n, &a[j*n], &b[k*n]);
302 }
303 }
304}
305
306// bdot_opt3_simd internally calls dot_opt3_simd
307static void bdot_opt3_simd(size_t n, const double *a, const double *b, double *c) {
308
309 for (size_t j=0; j<J; j++) {
310 for (size_t k=0; k<K; k++) {
311 c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
312 }
313 }
314}
315
316// bdot_opt3_simd_reordered performs the J * K (=32) pairwise products swapping the k,j indices
317static void bdot_opt3_simd_reordered(size_t n, const double *a, const double *b, double *c) {
318
319 for (size_t k=0; k<K; k++) {
320 for (size_t j=0; j<J; j++) {
321 c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
322 }
323 }
324}
325
326
327static void init_bdot(size_t n, double *a, size_t ajstride, size_t aistride,
328 double *b, size_t bistride, size_t bkstride) {
329 for (size_t i=0; i<n; i++) {
330 for (size_t j=0; j<J; j++)
331 a[i*aistride + j*ajstride] = 1000*(i+1) + j+1;
332 for (size_t k=0; k<K; k++)
333 b[i*bistride + k*bkstride] = 1./(1000*(i+1) + k+1);
334 }
335}
336
337static void report_bdot(const char *name, ticks_t start_ticks, size_t flops,
338 const double *result, int jstride, int kstride,
339 const double *ref_result) {
340 ticks_t ticks = rdtsc() - start_ticks;
341 double rate = 1.*flops / ticks;
342 if (result && ref_result && result != ref_result) {
343 for (int j=0; j<J; j++) {
344 for (int k=0; k<K; k++) {
345 if (fabs(result[j*jstride + k*kstride] - ref_result[j*K+k]) > 1e-10) {
346 printf("Result[%d,%d] = %f failed to validate with expected value %f\n", j, k, result[j*jstride + k*kstride], ref_result[j*K+k]);
347 }
348 }
349 }
350 }
351 printf("%s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
352}
353
354#define REPORT_BDOT(f, c, jstride, kstride, c_ref) do { \
355 for (int rep=0; rep<args.nreps; rep++) { \
356 ticks_t ticks_start = rdtsc(); \
357 f(args.length, a, b, c); \
358 report_bdot(#f, ticks_start, 2*J*K*args.length, c, jstride, kstride, c_ref); \
359 } \
360 } while (0)
361
362int main(int argc, char **argv) {
363 struct Args args;
364 struct argp argp = {options, parse_opt, NULL, NULL};
365 argp_parse(&argp, argc, argv, 0, 0, &args);
366 size_t n = args.length;
367
368 switch (args.block) {
369 case false: {
370 double *a = malloc(n * sizeof(double));
371 double *b = malloc(n * sizeof(double));
372 for (size_t i=0; i<n; i++) {
373 a[i] = 2.*(i+1);
374 b[i] = 1./(i+1);
375 }
376
377 printf(" Name \tflops\tticks\tflops/tick\n");
378 REPORT_DOT(dot_ref);
379 REPORT_DOT(dot_opt1);
380 REPORT_DOT(dot_opt2);
381 REPORT_DOT(dot_opt3);
382 // To answer Part 1, we also time the even-odd and the compile-time/runtime unrolling factor versions
383 REPORT_DOT(dot_opt_even_odd);
384 REPORT_DOT(dot_opt_comp_time_m);
385 REPORT_DOT_RUNTIME(dot_opt_run_time_m, args.unroll_factor);
386
387 free(a); free(b);
388 } break;
389 case true: {
390 // Initialize the matrices (as flattened vectors)
391 double *a = malloc(J * n * sizeof(double));
392 double *b = malloc(K * n * sizeof(double));
393 double *c = malloc(J * K * sizeof(double));
394 double *c_ref = malloc(J * K * sizeof(double));
395
396 printf("Name \tflops\tticks\tflops/tick\n");
397 init_bdot(args.length, a, n, 1, b, 1, n);
398 REPORT_BDOT(bdot_ref, c_ref, K, 1, c_ref);
399
400 // You may initialize a and b differently and call more variants by editing
401 // the two lines below, or by creating new variants.
402 init_bdot(args.length, a, n, 1, b, 1, n);
403 // To answer Part 2, we time the blocked opt versions, their simd variants, and the reordered variant
404 REPORT_BDOT(bdot_opt1, c, K, 1, c_ref);
405 REPORT_BDOT(bdot_opt2, c, K, 1, c_ref);
406 REPORT_BDOT(bdot_opt3, c, K, 1, c_ref);
407 REPORT_BDOT(bdot_opt1_simd, c, K, 1, c_ref);
408 REPORT_BDOT(bdot_opt2_simd, c, K, 1, c_ref);
409 REPORT_BDOT(bdot_opt3_simd, c, K, 1, c_ref);
410 REPORT_BDOT(bdot_opt3_simd_reordered, c, K, 1, c_ref);
411
412 free(a); free(b); free(c); free(c_ref);
413 } break;
414 }
415 return 0;
416}