24) HW3 solution#

Part 1: Optimizing dot product (30%)#

(5%) The compile/runtime unrolling factor versions achieve performance similar to the dot_opt2 version, but not as good as the dot_opt3 version.

In addition to your commentary, answer the following questions:

  1. (5%) Is this code correct for all values of n?

    • No. For odd n we incur an out-of-bounds access when we perform the second product, a[i+1] * b[i+1].
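A minimal sketch of how the unrolled loop can be made safe for all n, assuming a cleanup step for the leftover element (the name dot_opt_even_odd_safe is hypothetical, not part of the provided code):

```c
#include <stddef.h>

// Unroll-by-2 dot product that is safe for odd n: the main loop only runs
// while a full pair (i, i+1) is in bounds; a cleanup step handles the
// single leftover element when n is odd.
double dot_opt_even_odd_safe(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  size_t i = 0;
  for (; i + 1 < n; i += 2) {
    sum0 += a[i+0] * b[i+0];
    sum1 += a[i+1] * b[i+1];
  }
  if (i < n)               // n odd: one element remains
    sum0 += a[i] * b[i];
  return sum0 + sum1;
}
```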

  2. (5%) Can this code change the numerically computed result?

    • Yes, because of floating-point arithmetic (FPA): we are adding an extra operation, and if any of the numbers being added were extremely small (or large), we would suffer from roundoff errors. Also, as you’ve learned, the C standard does not allow floating-point arithmetic to be reordered, because reordering may change the computed values (non-associativity).
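A tiny illustration of this non-associativity (the helper names sum_left and sum_right are hypothetical): summing 1.0, 1e20, and -1e20 gives 0.0 in one order and 1.0 in the other, because 1.0 is smaller than half an ulp of 1e20 and is absorbed when added to it first.

```c
// Floating-point addition is not associative: the two groupings below can
// return different results for the same three doubles.
double sum_left(double a, double b, double c)  { return (a + b) + c; }
double sum_right(double a, double b, double c) { return a + (b + c); }
```

With a = 1.0, b = 1e20, c = -1e20: sum_left gives 0.0 (the 1.0 vanishes into 1e20), while sum_right gives 1.0 (b + c cancels exactly first).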

  3. (5%) Can you extend this technique to increase performance further (longer vector registers and/or more instruction-level parallelism)?

    • Yes. Instead of an unrolling factor of 2, we can chunk the data into larger blocks and apply SIMD-like operations (as we do in Part 2).
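A sketch of the more-ILP direction (the name dot_unroll4 is hypothetical, and for simplicity it assumes n is a multiple of 4): four independent accumulators remove the loop-carried dependency on a single sum, which is what lets the compiler overlap the multiply-adds and map the body onto vector registers under -O3 -march=native.

```c
#include <stddef.h>

// Unroll-by-4 with four independent partial sums: no accumulator depends
// on another within an iteration, exposing instruction-level parallelism.
// Assumes n is a multiple of 4.
double dot_unroll4(size_t n, const double *a, const double *b) {
  double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (size_t i = 0; i < n; i += 4) {
    s0 += a[i+0] * b[i+0];
    s1 += a[i+1] * b[i+1];
    s2 += a[i+2] * b[i+2];
    s3 += a[i+3] * b[i+3];
  }
  return (s0 + s1) + (s2 + s3);
}
```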

  4. (5%) Can you make the unrolling factor 2 a compile-time constant, perhaps by using an inner loop?

    • Yes, we can do that (see the dot_opt_comp_time_m function).

  5. (5%) Could that unrolling factor be a run-time parameter?

    • Yes, we can do that (see the dot_opt_run_time_m function). There are a couple of different ways this can be achieved.

      • If someone defined the runtime parameter in, say, main, then the compiler would get “smart” and treat it the same way as a compile-time constant. Keep in mind that we are compiling with the optimization flag -O3, so if we only used that unrolling factor in the call to our function in main, the compiler would know at compile time the value we are passing; and because we call that function only once, it might simply substitute the runtime parameter as if it were a constant known at compile time.

      • But if someone passed a true runtime unrolling factor via a command-line argument, which the compiler cannot know in advance, then yes, the performance of the runtime version would be lower than that of its compile-time counterpart.

! gcc -O3 -march=native -fopenmp  ../c_codes/hw3_solution/dot.c   -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000
  Name  	flops	ticks	flops/tick
 dot_ref	20000	57800	    0.35	
 dot_ref	20000	57616	    0.35	
 dot_ref	20000	57626	    0.35	
 dot_ref	20000	57632	    0.35	
 dot_ref	20000	57564	    0.35	
 dot_ref	20000	57562	    0.35	
 dot_ref	20000	57614	    0.35	
 dot_ref	20000	57634	    0.35	
 dot_ref	20000	57560	    0.35	
 dot_ref	20000	57548	    0.35	
dot_opt1	20000	1757938	    0.01	
dot_opt1	20000	581676	    0.03	
dot_opt1	20000	696132	    0.03	
dot_opt1	20000	885786	    0.02	
dot_opt1	20000	865362	    0.02	
dot_opt1	20000	787246	    0.03	
dot_opt1	20000	599558	    0.03	
dot_opt1	20000	63432	    0.32	
dot_opt1	20000	67446	    0.30	
dot_opt1	20000	581704	    0.03	
dot_opt2	20000	1080788	    0.02	
dot_opt2	20000	874322	    0.02	
dot_opt2	20000	610828	    0.03	
dot_opt2	20000	1017028	    0.02	
dot_opt2	20000	885956	    0.02	
dot_opt2	20000	766450	    0.03	
dot_opt2	20000	991940	    0.02	
dot_opt2	20000	1187360	    0.02	
dot_opt2	20000	82834	    0.24	
dot_opt2	20000	71780	    0.28	
dot_opt3	20000	112104	    0.18	
dot_opt3	20000	667324	    0.03	
dot_opt3	20000	486988	    0.04	
dot_opt3	20000	651768	    0.03	
dot_opt3	20000	611812	    0.03	
dot_opt3	20000	658816	    0.03	
dot_opt3	20000	521840	    0.04	
dot_opt3	20000	587960	    0.03	
dot_opt3	20000	763112	    0.03	
dot_opt3	20000	79924	    0.25	
dot_opt_even_odd	20000	44448	    0.45	
dot_opt_even_odd	20000	43978	    0.45	
dot_opt_even_odd	20000	44952	    0.44	
dot_opt_even_odd	20000	44940	    0.45	
dot_opt_even_odd	20000	44824	    0.45	
dot_opt_even_odd	20000	44940	    0.45	
dot_opt_even_odd	20000	44444	    0.45	
dot_opt_even_odd	20000	44510	    0.45	
dot_opt_even_odd	20000	44020	    0.45	
dot_opt_even_odd	20000	44188	    0.45	
dot_opt_comp_time_m	20000	96980	    0.21	
dot_opt_comp_time_m	20000	34592	    0.58	
dot_opt_comp_time_m	20000	34528	    0.58	
dot_opt_comp_time_m	20000	34572	    0.58	
dot_opt_comp_time_m	20000	34418	    0.58	
dot_opt_comp_time_m	20000	34344	    0.58	
dot_opt_comp_time_m	20000	34192	    0.58	
dot_opt_comp_time_m	20000	34320	    0.58	
dot_opt_comp_time_m	20000	34184	    0.59	
dot_opt_comp_time_m	20000	34284	    0.58	
dot_opt_run_time_m	20000	100080	    0.20	
dot_opt_run_time_m	20000	99796	    0.20	
dot_opt_run_time_m	20000	99760	    0.20	
dot_opt_run_time_m	20000	99720	    0.20	
dot_opt_run_time_m	20000	99724	    0.20	
dot_opt_run_time_m	20000	99696	    0.20	
dot_opt_run_time_m	20000	99716	    0.20	
dot_opt_run_time_m	20000	99740	    0.20	
dot_opt_run_time_m	20000	99700	    0.20	
dot_opt_run_time_m	20000	99704	    0.20	

Part 2: Optimizing block inner product (70%)#

  • What does it mean to reorder loops? Will it help or hurt performance?

The two lines that originally use j as the slowest index and k as the fastest index to perform the pairwise products:

for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {

tell us that the same row of the matrix \(A^T\) is multiplied by all columns of \(B\), and we store the result in \(C\) row-wise.

If we swapped these two lines (see the bdot_opt3_simd_reordered function, where we do this), we would multiply the same column of \(B\) by all rows of \(A^T\), and we would store the result in \(C\) column-wise.

The effect of doing so is actually negligible on performance (on my architecture) and might bring only a very minimal advantage on some other architectures. If you think about it, the swap only changes the order in which the 32 pairwise dot products are performed, while the majority of the work in this code is done inside each individual pairwise dot product (20000 FLOPs each).

  • Does it help to change the layout in memory (see the aistride, ajstride, bistride, and bkstride parameters in the code)?

No, changing the layout without further algorithmic changes in the functions that actually perform the dot products does not help. First of all, the only place the stride parameters can be changed is the call to the init_bdot function. Instead of calling init_bdot as we called it originally:

init_bdot(args.length, a, n, 1, b, 1, n);

one argument in each stride pair has to be kept at 1 to fill the whole matrices A and B without any gaps, as in:

init_bdot(args.length, a, 1, n, b, n, 1);

This would make A be stored column-wise and B row-wise (effectively transposing the matrices). For the algorithm to still produce the same pairwise dot products, we would then need further algorithmic changes; and even if we made those changes so the results stayed correct, it would not help performance, as we would be back in the original situation.

One might also have thought of using smaller strides, such as 1, J and 1, K. Again, however, this would only change how the data entries are laid out; without further algorithmic changes in the functions that actually call the pairwise dot products (which are implemented to compute the dot product of entire vectors), it would not be beneficial.
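To see concretely what swapping a stride pair does, here is a small standalone sketch that fills an A-like block twice, mimicking the A-filling half of init_bdot, once with (ajstride, aistride) = (n, 1) and once with (1, J), and checks that the two layouts are exact transposes of each other. The sizes JJ and NN and the helper names are illustration-only, chosen to avoid the J and K macros in the real code.

```c
#include <stddef.h>

enum { JJ = 3, NN = 4 };  // small illustration sizes (stand-ins for J and n)

// Mimics the A-filling half of init_bdot: a[i*aistride + j*ajstride].
static void fill_a(double *a, size_t ajstride, size_t aistride) {
  for (size_t i = 0; i < NN; i++)
    for (size_t j = 0; j < JJ; j++)
      a[i*aistride + j*ajstride] = 1000*(i+1) + j+1;
}

// Returns 1 iff the (NN, 1) layout is the element-wise transpose of (1, JJ).
int layouts_are_transposed(void) {
  double row_major[JJ*NN], col_major[JJ*NN];
  fill_a(row_major, NN, 1);   // each row j of A occupies NN contiguous entries
  fill_a(col_major, 1, JJ);   // each column i of A occupies JJ contiguous entries
  for (size_t i = 0; i < NN; i++)
    for (size_t j = 0; j < JJ; j++)
      if (row_major[j*NN + i] != col_major[i*JJ + j]) return 0;
  return 1;
}
```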

  • Try using the #pragma omp simd directive seen in class and the compiler option -fopenmp-simd.

See all the bdot_opt*_simd versions and related results (where * is 1,2,3). The addition of the #pragma omp simd directive only significantly helps the bdot_opt3 optimized version, and leaves the bdot_opt1 and bdot_opt2 results essentially unaltered.
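The directive itself is one line; a minimal standalone use looks like the kernel below (the same pattern as the dot_opt3_simd loop body). Without -fopenmp-simd or -fopenmp the pragma is simply ignored, so the function stays correct either way:

```c
#include <stddef.h>

// Dot product with the loop marked for explicit vectorization. The
// reduction(+:sum) clause tells the compiler that the accumulations into
// sum may be reassociated across SIMD lanes, which plain C semantics
// would otherwise forbid.
double dot_simd(size_t n, const double *a, const double *b) {
  double sum = 0;
  #pragma omp simd reduction(+:sum)
  for (size_t i = 0; i < n; i++)
    sum += a[i] * b[i];
  return sum;
}
```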

! gcc -O3 -march=native -fopenmp-simd -fopenmp ../c_codes/hw3_solution/dot.c -o dot
! OMP_NUM_THREADS=4 ./dot -r 10 -n 10000 -b
Name    	flops	ticks	flops/tick
bdot_ref	640000	1553330	    0.41	
bdot_ref	640000	1482816	    0.43	
bdot_ref	640000	1473188	    0.43	
bdot_ref	640000	2491487	    0.26	
bdot_ref	640000	984068	    0.65	
bdot_ref	640000	1002864	    0.64	
bdot_ref	640000	980176	    0.65	
bdot_ref	640000	979916	    0.65	
bdot_ref	640000	1002372	    0.64	
bdot_ref	640000	983882	    0.65	
bdot_opt1	640000	2992558	    0.21	
bdot_opt1	640000	1881600	    0.34	
bdot_opt1	640000	1447478	    0.44	
bdot_opt1	640000	1534196	    0.42	
bdot_opt1	640000	1449300	    0.44	
bdot_opt1	640000	1481620	    0.43	
bdot_opt1	640000	1558464	    0.41	
bdot_opt1	640000	1465784	    0.44	
bdot_opt1	640000	1427708	    0.45	
bdot_opt1	640000	2893732	    0.22	
bdot_opt2	640000	3470022	    0.18	
bdot_opt2	640000	2293348	    0.28	
bdot_opt2	640000	2234964	    0.29	
bdot_opt2	640000	2242024	    0.29	
bdot_opt2	640000	2182178	    0.29	
bdot_opt2	640000	2213002	    0.29	
bdot_opt2	640000	2189554	    0.29	
bdot_opt2	640000	2258316	    0.28	
bdot_opt2	640000	2259706	    0.28	
bdot_opt2	640000	2250576	    0.28	
bdot_opt3	640000	2737568	    0.23	
bdot_opt3	640000	1605498	    0.40	
bdot_opt3	640000	1600662	    0.40	
bdot_opt3	640000	1587474	    0.40	
bdot_opt3	640000	1579622	    0.41	
bdot_opt3	640000	1580854	    0.40	
bdot_opt3	640000	1587928	    0.40	
bdot_opt3	640000	1795260	    0.36	
bdot_opt3	640000	1656106	    0.39	
bdot_opt3	640000	1523864	    0.42	
bdot_opt1_simd	640000	1655884	    0.39	
bdot_opt1_simd	640000	1654972	    0.39	
bdot_opt1_simd	640000	1656084	    0.39	
bdot_opt1_simd	640000	1657640	    0.39	
bdot_opt1_simd	640000	1624626	    0.39	
bdot_opt1_simd	640000	1581086	    0.40	
bdot_opt1_simd	640000	1623642	    0.39	
bdot_opt1_simd	640000	1671630	    0.38	
bdot_opt1_simd	640000	1575028	    0.41	
bdot_opt1_simd	640000	1586240	    0.40	
bdot_opt2_simd	640000	2220036	    0.29	
bdot_opt2_simd	640000	2211386	    0.29	
bdot_opt2_simd	640000	2267582	    0.28	
bdot_opt2_simd	640000	2267962	    0.28	
bdot_opt2_simd	640000	2270632	    0.28	
bdot_opt2_simd	640000	2211588	    0.29	
bdot_opt2_simd	640000	2207122	    0.29	
bdot_opt2_simd	640000	2232456	    0.29	
bdot_opt2_simd	640000	2240136	    0.29	
bdot_opt2_simd	640000	2225774	    0.29	
bdot_opt3_simd	640000	1489792	    0.43	
bdot_opt3_simd	640000	1467498	    0.44	
bdot_opt3_simd	640000	1378286	    0.46	
bdot_opt3_simd	640000	1399970	    0.46	
bdot_opt3_simd	640000	1330830	    0.48	
bdot_opt3_simd	640000	1436224	    0.45	
bdot_opt3_simd	640000	1444604	    0.44	
bdot_opt3_simd	640000	1393742	    0.46	
bdot_opt3_simd	640000	1316022	    0.49	
bdot_opt3_simd	640000	1350594	    0.47	
bdot_opt3_simd_reordered	640000	1312090	    0.49	
bdot_opt3_simd_reordered	640000	1354006	    0.47	
bdot_opt3_simd_reordered	640000	1335878	    0.48	
bdot_opt3_simd_reordered	640000	1303826	    0.49	
bdot_opt3_simd_reordered	640000	1390632	    0.46	
bdot_opt3_simd_reordered	640000	1333384	    0.48	
bdot_opt3_simd_reordered	640000	1371484	    0.47	
bdot_opt3_simd_reordered	640000	1316464	    0.49	
bdot_opt3_simd_reordered	640000	1303554	    0.49	
bdot_opt3_simd_reordered	640000	1357476	    0.47	

Code:#

#include "rdtsc.h"

#include <argp.h>
#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct Args {
  size_t length;
  size_t nreps;
  bool block;
  size_t unroll_factor;
};

static struct argp_option options[] = {
  {"length", 'n', "size_t", 0, "Length of each vector"},
  {"nreps", 'r', "size_t", 0, "Number of repetitions"},
  {"block", 'b', NULL, 0, "Compute block dot products (versus a single dot product)"},
  {"unroll_factor", 'm', "size_t", 0, "Runtime unrolling factor for dot product"},
};

static error_t parse_opt (int key, char *arg, struct argp_state *state)
{
  struct Args *args = state->input;
  switch (key) {
  case ARGP_KEY_INIT:
    args->length = 100;
    args->nreps = 10;
    args->block = false;
    args->unroll_factor = 4;
    break;
  case 'n':
    args->length = strtol(arg, NULL, 10);
    break;
  case 'r':
    args->nreps = strtol(arg, NULL, 10);
    break;
  case 'b':
    args->block = true;
    break;
  case 'm':
    args->unroll_factor = strtol(arg, NULL, 10);
    break;
  default:
    return ARGP_ERR_UNKNOWN;
  }
  return 0;
}

// This part is to answer Part 1
#define COMP_TIME_M 4

// dot_opt_comp_time_m uses a compile-time unrolling factor (Part 1, Q4)
double dot_opt_comp_time_m(size_t n, const double *a, const double *b) {

  // check that n is divided evenly by the compile-time unrolling factor
  if (n % COMP_TIME_M != 0) {
    printf("Error, the compile-time unrolling factor must divide n evenly\n");
    return -1;
  }

  // one partial sum per unrolled lane; the array is sized by the unrolling
  // factor, since only indices 0..COMP_TIME_M-1 are ever written
  double sum = 0, sums[COMP_TIME_M];
  memset(sums, 0, sizeof(sums));

  for (size_t i=0; i<n; i+=COMP_TIME_M) {
    for (size_t j=0; j<COMP_TIME_M; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  for (size_t j=0; j<COMP_TIME_M; j++) {
    sum += sums[j];
  }
  return sum;
}

// dot_opt_run_time_m uses a runtime unrolling factor (Part 1, Q5)
double dot_opt_run_time_m(size_t n, const double *a, const double *b, const size_t m) {

  // check that n is divided evenly by the runtime unrolling factor
  if (n % m != 0) {
    printf("Error, the runtime unrolling factor must divide n evenly\n");
    return -1;
  }

  // one partial sum per unrolled lane (a variable-length array, since m is
  // only known at runtime)
  double sum = 0, sums[m];

  // initialize the dynamically created sums array to zero
  memset(sums, 0, sizeof(sums));
  for (size_t i=0; i<n; i+=m) {
    for (size_t j=0; j<m; j++) {
      sums[j] += a[i+j] * b[i+j];
    }
  }
  for (size_t j=0; j<m; j++) {
    sum += sums[j];
  }
  return sum;
}
// This part ends to answer Part 1

__attribute((noinline))
static double dot_ref(size_t n, const double *a, const double *b) {
  double sum = 0;
  for (size_t i=0; i<n; i++)
    sum += a[i] * b[i];
  return sum;
}

// The following even-odd function given in the assignment tries to alleviate some data dependency from one loop iteration to another.
// Note: the following function is NOT correct for all values of n. In fact, for odd n we incur an out-of-bounds access when we perform the second partial sum sum1 += a[i+1] * b[i+1]
double dot_opt_even_odd(size_t n, const double *a, const double *b) {
  double sum0 = 0, sum1 = 0;
  for (size_t i=0; i<n; i+=2) {
    sum0 += a[i+0] * b[i+0];
    sum1 += a[i+1] * b[i+1];
  }
  return sum0 + sum1;
}

__attribute((noinline))
static double dot_opt1(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt1_simd is the dot_opt1 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt1_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp simd
    for (size_t i=id; i<n; i+=4)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt2(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

// dot_opt2_simd is the dot_opt2 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt2_simd(size_t n, const double *a, const double *b) {
  double sums[4] = {};
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    #pragma omp for simd
    for (size_t i=0; i<n; i++)
      sums[id] += a[i] * b[i];
  }
  #pragma omp simd
  for (size_t j=1; j<4; j++) sums[0] += sums[j];
  return sums[0];
}

__attribute((noinline))
static double dot_opt3(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}

// dot_opt3_simd is the dot_opt3 version to which we added the #pragma omp simd directive
__attribute((noinline))
static double dot_opt3_simd(size_t n, const double *a, const double *b) {
  double sum = 0;
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    #pragma omp for simd reduction(+:sum)
    for (size_t i=0; i<n; i++)
      sum += a[i] * b[i];
  }
  return sum;
}

static void report_dot(const char *name, ticks_t start_ticks, size_t flops, double result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (fabs(result - flops) > 1e-10)
    printf("Result %f failed to validate with expected value %ld\n", result, flops);
  printf("%8s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_DOT(f) do {                                              \
    for (int rep=0; rep<args.nreps; rep++) {                            \
      ticks_t ticks_start = rdtsc();                                    \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b)); \
    }                                                                   \
  } while (0)

// The following macro is to test the performance of the runtime loop unrolling factor to answer Part 1.5
#define REPORT_DOT_RUNTIME(f, m) do {                                      \
    for (int rep=0; rep<args.nreps; rep++) {                               \
      ticks_t ticks_start = rdtsc();                                       \
      report_dot(#f, ticks_start, 2*args.length, f(args.length, a, b, m)); \
    }                                                                      \
  } while (0)

// Dimensions of the matrices for block dot products.
#define J 8
#define K 4

// Performs the operation
//   C = A * B
// where A and B have shape (J,n) and (n,K) respectively.
// This reference version stores A as row-major and B as column-major.
static void bdot_ref(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_ref(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1 internally calls dot_opt1
static void bdot_opt1(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt1_simd internally calls dot_opt1_simd
static void bdot_opt1_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt1_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2 internally calls dot_opt2
static void bdot_opt2(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt2_simd internally calls dot_opt2_simd
static void bdot_opt2_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt2_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3 internally calls dot_opt3
static void bdot_opt3(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd internally calls dot_opt3_simd
static void bdot_opt3_simd(size_t n, const double *a, const double *b, double *c) {
  for (size_t j=0; j<J; j++) {
    for (size_t k=0; k<K; k++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

// bdot_opt3_simd_reordered performs the J * K (=32) pairwise products swapping the k,j indices
static void bdot_opt3_simd_reordered(size_t n, const double *a, const double *b, double *c) {
  for (size_t k=0; k<K; k++) {
    for (size_t j=0; j<J; j++) {
      c[j*K+k] = dot_opt3_simd(n, &a[j*n], &b[k*n]);
    }
  }
}

static void init_bdot(size_t n, double *a, size_t ajstride, size_t aistride,
                      double *b, size_t bistride, size_t bkstride) {
  for (size_t i=0; i<n; i++) {
    for (size_t j=0; j<J; j++)
      a[i*aistride + j*ajstride] = 1000*(i+1) + j+1;
    for (size_t k=0; k<K; k++)
      b[i*bistride + k*bkstride] = 1./(1000*(i+1) + k+1);
  }
}

static void report_bdot(const char *name, ticks_t start_ticks, size_t flops,
                        const double *result, int jstride, int kstride,
                        const double *ref_result) {
  ticks_t ticks = rdtsc() - start_ticks;
  double rate = 1.*flops / ticks;
  if (result && ref_result && result != ref_result) {
    for (int j=0; j<J; j++) {
      for (int k=0; k<K; k++) {
        if (fabs(result[j*jstride + k*kstride] - ref_result[j*K+k]) > 1e-10) {
          printf("Result[%d,%d] = %f failed to validate with expected value %f\n", j, k, result[j*jstride + k*kstride], ref_result[j*K+k]);
        }
      }
    }
  }
  printf("%s\t%ld\t%lld\t%8.2f\t\n", name, flops, ticks, rate);
}

#define REPORT_BDOT(f, c, jstride, kstride, c_ref) do {                 \
    for (int rep=0; rep<args.nreps; rep++) {                            \
      ticks_t ticks_start = rdtsc();                                    \
      f(args.length, a, b, c);                                          \
      report_bdot(#f, ticks_start, 2*J*K*args.length, c, jstride, kstride, c_ref); \
    }                                                                   \
  } while (0)

int main(int argc, char **argv) {
  struct Args args;
  struct argp argp = {options, parse_opt, NULL, NULL};
  argp_parse(&argp, argc, argv, 0, 0, &args);
  size_t n = args.length;

  switch (args.block) {
  case false: {
    double *a = malloc(n * sizeof(double));
    double *b = malloc(n * sizeof(double));
    for (size_t i=0; i<n; i++) {
      a[i] = 2.*(i+1);
      b[i] = 1./(i+1);
    }

    printf("  Name  \tflops\tticks\tflops/tick\n");
    REPORT_DOT(dot_ref);
    REPORT_DOT(dot_opt1);
    REPORT_DOT(dot_opt2);
    REPORT_DOT(dot_opt3);
    // To answer Part 1, we test the execution of the even-odd and compile/runtime loop unrolling factor versions
    REPORT_DOT(dot_opt_even_odd);
    REPORT_DOT(dot_opt_comp_time_m);
    REPORT_DOT_RUNTIME(dot_opt_run_time_m, args.unroll_factor);

    free(a); free(b);
  } break;
  case true: {
    // Initialize the matrices (as flattened vectors)
    double *a = malloc(J * n * sizeof(double));
    double *b = malloc(K * n * sizeof(double));
    double *c = malloc(J * K * sizeof(double));
    double *c_ref = malloc(J * K * sizeof(double));

    printf("Name    \tflops\tticks\tflops/tick\n");
    init_bdot(args.length, a, n, 1, b, 1, n);
    REPORT_BDOT(bdot_ref, c_ref, K, 1, c_ref);

    // You may initialize a and b differently and call more variants by editing
    // the two lines below, or by creating new variants.
    init_bdot(args.length, a, n, 1, b, 1, n);
    // To answer Part 2, we test the execution of the different blocked opt versions
    REPORT_BDOT(bdot_opt1, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt1_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt2_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd, c, K, 1, c_ref);
    REPORT_BDOT(bdot_opt3_simd_reordered, c, K, 1, c_ref);

    free(a); free(b); free(c); free(c_ref);
  } break;
  }
  return 0;
}