23) Collective communication#

Last time:

  • Blocking and non-blocking point-to-point communications

Today:

  1. MPI Collective Communication

  2. Minimum Spanning Trees

1. MPI Collective Communication#

Tip

Resources for the lecture:

  1. Article: Chan et al., Collective communication: theory, practice, and experience

  2. Lecture Notes from the University of Texas at Austin: Robert van de Geijn (RVDG) Collective Communication: Theory and Practice

In the previous lecture we considered point-to-point communication, that is communication between two MPI ranks. In this lecture we will consider communication that involves all the ranks in a communicator.

Reductions#

Recall that reductions are a case of collective communication operations.

An operator is a reduction operator if:

  1. It can reduce an array to a single scalar value.

  2. The final result can be obtained from the results of the partial tasks that were created.

These two requirements are satisfied for commutative and associative operators that are applied to all array elements.

Example#

Suppose we have an array \(x = [1,2,3,4,5,6,7,8]\). The sum of this array can be computed serially by sequentially reducing the array into a single sum using the + operator.

Starting the summation from the beginning of the array yields:

\[ (((((((1 + 2) + 3) + 4) + 5) + 6) + 7) + 8) \]

Since + is both commutative and associative, it is a reduction operator. Therefore this reduction can be performed in parallel using several processes/cores, where each process/core computes the sum of a subset of the array, and the reduction operator merges the results.

Using a binary tree reduction would allow \(4\) processes/cores to compute:

\[ \underbrace{ \underbrace{ \underbrace{(1 + 2)}_{p_0} + \underbrace{(3 + 4)}_{p_1}}_{p_0} + \underbrace{ \underbrace{(5 + 6)}_{p_2} + \underbrace{(7 + 8)}_{p_3}}_{p_2} }_{p_0} \]

So a total of \(4\) cores can be used to compute the sum in \(\log_2 n = \log_2 8 = 3\) steps instead of the \(7\) steps required by the serial version.

Of course the result is the same, but only because of the associativity of the reduction operator. Commutativity would matter if a master core distributed work to several processors, since the partial results could then arrive back at the master in any order; commutativity guarantees that the combined result is still the same.
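A quick way to convince yourself of this in plain serial Julia (no MPI involved) is to compare the left-to-right reduction with the pairwise tree grouping used above:

x = [1, 2, 3, 4, 5, 6, 7, 8]

# Left-to-right (serial) reduction: ((((((1+2)+3)+4)+5)+6)+7)+8
serial_sum = foldl(+, x)

# Pairwise "tree" grouping, like the 4-process reduction above
tree_sum = ((x[1] + x[2]) + (x[3] + x[4])) + ((x[5] + x[6]) + (x[7] + x[8]))

@assert serial_sum == tree_sum == 36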

  • See Fig. 1 from Chan et al., Collective communication: theory, practice, and experience.

Types of possible collective communication:

  • Broadcast: one rank sends data to all the other ranks (RVDG: 93, MPI_Bcast)

  • Reduce(-to-one): Combine (e.g., sum, max/min, etc.) information from all ranks to one rank (RVDG: 94, MPI_Reduce)

  • Scatter: One rank sends a distinct piece of data to each of the other ranks (RVDG: 96, MPI_Scatter)

  • Gather: All ranks send data to one rank (RVDG: 97, MPI_Gather)

  • Allgather: All ranks send information to all ranks, so every rank ends up with the full collection (RVDG: 99, MPI_Allgather)

  • Reduce-scatter: Reduce and scatter out the reduced result (RVDG: 101, MPI_Reduce_scatter)

  • Allreduce: All ranks combine information from all ranks and the result is available to all ranks (RVDG: 102, MPI_Allreduce)

Note that there are pairs of reciprocal/dual operations:

  • Broadcast/Reduce(-to-one) (RVDG: 95)

  • Scatter/Gather (RVDG: 98)

  • Allgather/Reduce-scatter (RVDG: 101)

Allreduce is the only operation that does not have a dual (or it can be viewed as its own dual).

Two broad classes of collective operations (Chan et al., 1752):

  • Data redistribution operations: Broadcast, scatter, gather, and allgather. These operations move data between processors.

  • Data consolidation operations: Reduce(-to-one), reduce–scatter, and allreduce. These operations consolidate contributions from different processors by applying a reduction operation. We will only consider reduction operations that are both commutative and associative.
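All of these operations are available as built-in MPI collectives, which MPI.jl wraps. As a minimal sketch of what calling a few of them looks like (written in the same MPI.jl API style as the course codes; exact signatures vary between MPI.jl releases):

using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# Broadcast: everyone ends up with rank 0's buffer contents
buf = [rank]
MPI.Bcast!(buf, 0, comm)

# Reduce(-to-one): sum of all ranks' values, result only on rank 0
total = MPI.Reduce(rank, +, 0, comm)

# Allreduce: the same sum, but every rank gets the result
total_everywhere = MPI.Allreduce(rank, +, comm)

rank == 0 && println("Bcast got ", buf[1], ", Reduce got ", total,
                     ", Allreduce got ", total_everywhere)

MPI.Finalize()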

2. Minimum Spanning Trees (MST)#

2.1 Broadcast#

We want to perform a broadcast operation, i.e., we want to send a message from a root rank to all ranks.

Naive Algorithm#

A naive broadcast just has the root send the message to each rank.
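Under the latency-bandwidth cost model used by Chan et al. (sending a message of \(n\) items between two ranks costs \(\alpha + n\beta\), and a rank sends one message at a time), the root issues all \(p-1\) messages itself, so the naive broadcast costs approximately

\[ T_{\text{naive}}(p, n) \approx (p - 1)(\alpha + n\beta), \]

which grows linearly with the number of ranks \(p\).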

You can find the following code at julia_codes/module6-3/naivebcast.jl.

# Naive broadcast just has the root send the message to each rank
function naivebcast!(buf, root, mpicomm)
  # Figure out who we are
  mpirank = MPI.Comm_rank(mpicomm)

  # If I am the root, send the message to everyone
  if mpirank == root
    # How many total ranks are there?
    mpisize = MPI.Comm_size(mpicomm)

    # Create an array for the requests
    reqs = Array{MPI.Request}(undef, mpisize)

    # Loop through the ranks and send the message
    for n = 1:mpisize
      # MPI uses 0-based indexing for ranks
      neighbor = n-1

      # If it's me, just set my request to NULL (i.e., a no-op)
      if neighbor == mpirank
        reqs[n] = MPI.REQUEST_NULL
      else # otherwise send the message to the neighbor
        reqs[n] = MPI.Isend(buf, neighbor, 7, mpicomm)
      end
    end
    # Wait on all the requests
    MPI.Waitall!(reqs)
  else # Since we are not the root, we receive
    MPI.Recv!(buf, root, 7, mpicomm)
  end
end

And the following testing code at julia_codes/module6-3/naivebcast_test.jl.

using MPI
include("naivebcast.jl")

let
  # Initialize MPI
  MPI.Init()

  # store the communicator
  mpicomm = MPI.COMM_WORLD

  # Get some MPI info
  mpirank = MPI.Comm_rank(mpicomm)
  mpisize = MPI.Comm_size(mpicomm)

  # Divide all ranks halfway to determine the root
  root = div(mpisize, 2) # Integer division

  # create a buffer for the communication
  buf = [mpirank]

  # have the root broadcast the message to everyone
  naivebcast!(buf, root, mpicomm)

  # check to make sure we got the right message
  @assert buf[1] == root

  MPI.Barrier(mpicomm)
  time = @elapsed begin
    naivebcast!(buf, root, mpicomm)
    MPI.Barrier(mpicomm)
  end

  # Let's print the execution time.
  # Shorthand for:
  #=
  if mpirank == 0
    print("Execution time: ", time, "\n")
  end
  =#
  mpirank == 0 && print("Execution time: ", time, "\n")

  # shut down MPI
  MPI.Finalize()
end

Minimum Spanning Tree Algorithm#

See Chan et al., Fig. 3(a) and Fig. 4(a); RVDG, pages 108-171.
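Under the same cost model as above, the MST broadcast completes in \(\lceil \log_2 p \rceil\) steps, and in each step the full message is sent, so its cost is approximately

\[ T_{\text{mst}}(p, n) \approx \lceil \log_2 p \rceil (\alpha + n\beta), \]

which is why we expect it to outperform the naive broadcast as the number of ranks grows.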

You can find the following code at julia_codes/module6-3/mstbcast.jl.

# Example implementation of MSTBcast (Fig. 3(a)) from
# Chan, E., Heimlich, M., Purkayastha, A. and van de Geijn, R. (2007),
# Collective communication: theory, practice, and experience. Concurrency
# Computat.: Pract. Exper., 19: 1749–1783. doi:10.1002/cpe.1206
#
# In this minimum spanning tree (MST) broadcast algorithm:
#   - Divide the ranks into two (almost) equal groups
#   - The root sends the data to one rank in the other group (called the dest)
#   - Recurse on the two groups with root and dest being the "roots" of their
#     respective groups
#
#  For nine ranks with root = 1 the algorithm would be (letters just represent
#  who sends/recvs data)
#
#  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
#  ---------------------------------
#    | x |   |   |   |   |   |   |
#    | a |   |   |   |   |   |   | a  1->8
#    | a |   |   | a | b |   |   | b  1->4, 8->5
#    | a | a | c | c | d | d | b | b  1->2, 4->3, 5->6, 8->7
#  a | a | x | x | x | x | x | x | x  1->0
function mstbcast!(buf, root, mpicomm;
                   left = 0, right = MPI.Comm_size(mpicomm)-1)

  # If there is no one else, let's get outta here!
  left == right && return
  # Shorthand for:
  #=
  if left == right
    return
  end
  =#

  # Determine the split
  mid = div(left + right, 2) # integer division

  # Whom do I send to?
  dest = (root <= mid) ? right : left
  # Shorthand for:
  #=
  if root <= mid
    dest = right
  else
    dest = left
  end
  =#

  # Figure out who we are
  mpirank = MPI.Comm_rank(mpicomm)

  # If I'm the root or the dest, send or receive (respectively)
  req = MPI.REQUEST_NULL
  if mpirank == root
    req = MPI.Isend(buf, dest, 7, mpicomm)
  elseif mpirank == dest
    MPI.Recv!(buf, root, 7, mpicomm)
  end

  # Recursion:
  # I'm in the left group and the root is my new root
  if mpirank <= mid && root <= mid
    mstbcast!(buf, root, mpicomm; left=left, right=mid)
  # I'm in the left group and the dest is my new root
  elseif mpirank <= mid && root > mid
    mstbcast!(buf, dest, mpicomm; left=left, right=mid)
  # I'm in the right group and the dest is my new root
  elseif mpirank > mid && root <= mid
    mstbcast!(buf, dest, mpicomm; left=mid + 1, right=right)
  # I'm in the right group and the root is my new root
  elseif mpirank > mid && root > mid
    mstbcast!(buf, root, mpicomm; left=mid + 1, right=right)
  end

  # Make sure all my sends are done before I get outta dodge
  MPI.Wait!(req)
end

And the following testing code at julia_codes/module6-3/mstbcast_test.jl.

using MPI
include("mstbcast.jl")

let
  # Initialize MPI
  MPI.Init()

  # store the communicator
  mpicomm = MPI.COMM_WORLD

  # Get some MPI info
  mpirank = MPI.Comm_rank(mpicomm)
  mpisize = MPI.Comm_size(mpicomm)

  # Divide all ranks halfway to determine the root
  root = div(mpisize, 2) # Integer division

  # create a buffer for the communication
  buf = [mpirank]

  # have the root broadcast the message to everyone
  mstbcast!(buf, root, mpicomm)

  # check to make sure we got the right message
  @assert buf[1] == root

  MPI.Barrier(mpicomm)
  time = @elapsed begin
    mstbcast!(buf, root, mpicomm)
    MPI.Barrier(mpicomm)
  end

  # Let's print the execution time.
  # Shorthand for:
  #=
  if mpirank == 0
    print("Execution time: ", time, "\n")
  end
  =#
  mpirank == 0 && print("Execution time: ", time, "\n")

  # shut down MPI
  MPI.Finalize()
end

Let’s compare them#

You can find the driver for the comparison code at julia_codes/module6-3/bcast_compare.jl.

using MPI
include("naivebcast.jl")
include("mstbcast.jl")
let
  # Initialize MPI
  MPI.Init()

  # store the communicator
  mpicomm = MPI.COMM_WORLD

  # Get some MPI info
  mpirank = MPI.Comm_rank(mpicomm)
  mpisize = MPI.Comm_size(mpicomm)

  # Divide all ranks halfway to determine the root
  root = div(mpisize, 2) # Integer division

  # create a buffer for the communication
  buf = [mpirank]

  # have the root broadcast the message once first (warm-up, so we do not
  # time JIT compilation)
  mstbcast!(buf, root, mpicomm)
  naivebcast!(buf, root, mpicomm)

  mst_t1 = time_ns() # time_ns() returns the current time in nanoseconds
  mstbcast!(buf, root, mpicomm)
  mst_t2 = time_ns()
  nve_t1 = time_ns()
  naivebcast!(buf, root, mpicomm)
  nve_t2 = time_ns()

  mpirank == 0 && print("Elapsed time for the naive algorithm: \n")
  mpirank == 0 && @show (nve_t2 - nve_t1) * 1e-9
  mpirank == 0 && print("Elapsed time for the mst algorithm: \n")
  mpirank == 0 && @show (mst_t2 - mst_t1) * 1e-9

  # shut down MPI
  MPI.Finalize()
end
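Timing a single call with no synchronization between the ranks is fairly noisy. If you want a more careful comparison, one option (a sketch, not part of the course codes; the helper time_bcast is a name made up here) is to put barriers around the measurement and average over many repetitions:

# Hypothetical helper: barrier-synchronized, averaged timing of a broadcast
function time_bcast(bcast!, buf, root, mpicomm; ntrials = 100)
  MPI.Barrier(mpicomm)
  t1 = time_ns()
  for _ = 1:ntrials
    bcast!(buf, root, mpicomm)
  end
  MPI.Barrier(mpicomm)
  t2 = time_ns()
  return (t2 - t1) * 1e-9 / ntrials # average seconds per broadcast
end

# Usage inside the driver (all ranks must call it; only rank 0 prints):
# t_naive = time_bcast(naivebcast!, buf, root, mpicomm)
# t_mst   = time_bcast(mstbcast!, buf, root, mpicomm)
# mpirank == 0 && @show t_naive t_mst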

Minimum Spanning Tree Reduce?#

For a Minimum Spanning Tree (MST) reduce, see Chan et al., Fig. 3(b) and Fig. 4(b); RVDG, pages 172-184.
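As a rough sketch of the idea (not from the course codes; the name mstreduce!, the op argument, and the element-wise combination are choices made here for illustration), the recursion of mstbcast! can be mirrored: first reduce each half onto its local root, then have the dest of the other half send its partial result to the root, which combines the two.

# Hypothetical MST reduce sketch mirroring mstbcast! (illustrative only)
function mstreduce!(buf, op, root, mpicomm;
                    left = 0, right = MPI.Comm_size(mpicomm)-1)
  # Only one rank in this group: its buf already holds its contribution
  left == right && return

  # Determine the split and the partner ("dest") in the half opposite the root
  mid  = div(left + right, 2) # integer division
  dest = (root <= mid) ? right : left

  # Figure out who we are
  mpirank = MPI.Comm_rank(mpicomm)

  # First recurse: reduce each half onto its local root (root or dest)
  if mpirank <= mid && root <= mid
    mstreduce!(buf, op, root, mpicomm; left=left, right=mid)
  elseif mpirank <= mid && root > mid
    mstreduce!(buf, op, dest, mpicomm; left=left, right=mid)
  elseif mpirank > mid && root <= mid
    mstreduce!(buf, op, dest, mpicomm; left=mid + 1, right=right)
  elseif mpirank > mid && root > mid
    mstreduce!(buf, op, root, mpicomm; left=mid + 1, right=right)
  end

  # Then combine the two halves: dest sends its partial result to root
  if mpirank == dest
    MPI.Send(buf, root, 7, mpicomm)
  elseif mpirank == root
    tmp = similar(buf)
    MPI.Recv!(tmp, dest, 7, mpicomm)
    buf .= op.(buf, tmp) # element-wise combination, e.g. op = +
  end
end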

  • Note: It is possible to write mstbcast! (and a corresponding MST reduce) without recursion (although we won't see it in this class), and doing so with blocking communication is easier than with non-blocking communication.