Julia comes ready for parallelizing your code out of the box; no external libraries are needed.

SIMD

Single Instruction, Multiple Data (SIMD) is a method of parallelization in which a single operation is applied to multiple values simultaneously.

SIMD is useful when the following conditions are true:

  • Each iteration of the loop is independent, meaning iterations do not use values from previous iterations.
  • The loop body does not have branches (e.g. if/else statements) or function calls.
  • The number of iterations of the loop is known ahead of time.
  • Bounds checking is disabled, as it may introduce branches.

To add SIMD to a loop we can simply use @simd:

function sum_vectors!(x, y, z)
   n = length(x)
   # @inbounds disables bounds checking; @simd allows the compiler to
   # reorder and vectorize the independent iterations
   @inbounds @simd for i in 1:n
      x[i] = y[i] + z[i]
   end
   return x
end

a = rand(Float32, 1_000_000);
b = rand(Float32, 1_000_000);
c = similar(a);

@time sum_vectors!(c, a, b);
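
To check whether the loop really vectorized, you can inspect the generated code; vector types such as <8 x float> in the output indicate SIMD instructions (the exact width depends on your CPU):

julia> @code_llvm sum_vectors!(c, a, b)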

Concurrent programming with tasks

  • A Julia program typically runs on a single thread, on a single processing core; that is, programs are synchronous. But a computer can run multiple tasks at the same time.
  • Asynchronous processing means defining computing tasks that can be executed without depending on other tasks. A task is a unit of work, like a function, that can be stopped and restarted before it finishes. The differences between a task and a function are that a task can be interrupted while it is running, and that starting a task carries very little overhead.
  • These features can make tasks appear to run simultaneously, even though only one task actually runs on the processor at any given time. But because two or more tasks can take turns, each being executed in parts, they seem to run in parallel.
  • For a task to be stopped, it should first signal the task scheduler that it is ready to be stopped. This signal is given with the yield function.
  • To define an asynchronous task, use the @async macro.
  • Using tasks leads to performance gain only when each task is not fully using the CPU because tasks are not really running in parallel. A typical case where tasks need to wait without using the CPU is during reading from and writing to disk.
  • As an example, the loop below takes only about 1 second to finish, instead of the 5 seconds it would take without tasks:
    julia> @time @sync for i in 1:5
             @async sleep(1)
           end
    
  • In this case, sleep yields to the task scheduler, letting it know that it may switch to another task. Not all functions yield automatically; for those that do not, we must call yield explicitly.
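  • For example, a CPU-bound task can yield by hand inside its loop so that other tasks get a chance to run (a minimal sketch; spin is just an illustrative name):
    julia> function spin()
             for _ in 1:3
               sum(rand(1_000_000))  # some CPU-bound work
               yield()               # hand control back to the scheduler
             end
           end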
  • Tasks can be managed manually too. To define a task, use the Task constructor, which requires a function with no arguments. For example:
    julia> t = Task(() -> sin(0.6))
    
  • This task is created but not yet run. You can check whether the task has started or finished with the istaskstarted and istaskdone functions, respectively. current_task() returns the task that is currently running.
  • To start a task, we hand it to the scheduler, which decides when it is best to run:
    julia> schedule(t)
    
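  • Once scheduled, the task runs to completion, and its return value can be retrieved with fetch:
    julia> istaskdone(t)
    true
    julia> fetch(t)
    0.5646424733950354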

Shared memory multithreading

  • Threads are sequences of instructions that can run independently. Unlike tasks, the number of threads you can run is limited. On most modern CPUs you get at most twice as many hardware threads as physical cores (because of hyper-threading technology), but the actual parallel capacity equals the number of cores, because hyper-threading is a “simulated parallelism” that provides little benefit when each thread is CPU-intensive.
  • The number of threads Julia uses is fixed before Julia starts; by default it is 1. To change it, set the JULIA_NUM_THREADS environment variable before launching Julia:
    $ export JULIA_NUM_THREADS=2
    

    or by using the -t/--threads command line argument.
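
  • You can verify the thread count from within the running session:
    julia> Threads.nthreads()
    2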

  • Starting Julia with 3 threads (julia --threads 3), we can see that the iterations of a loop run on different threads. The @threads macro and the threadid function live in the Base.Threads module:
    julia> a = zeros(10)
      10-element Vector{Float64}:
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
      0.0
    julia> using Base.Threads

    julia> @threads for i in 1:10
             a[i] = threadid()
           end
    julia> a
      10-element Vector{Float64}:
      1.0
      1.0
      1.0
      1.0
      2.0
      2.0
      2.0
      3.0
      3.0
      3.0
    
  • With multithreading, multiple processing cores run at the same time while accessing the whole of the computer's memory. A problem can happen here when two threads try to modify the same piece of memory. For example, you could not simply put rand, a random number generator, inside a loop and add @threads in front of the loop: the random number generator updates its internal state each time it produces a random number, so multiple threads would try to change the same piece of memory simultaneously. (Recent Julia versions avoid this particular case by giving each task its own default RNG, but the general problem applies to any shared mutable state.) A fix for this example is to give each thread its own independent random number generator, as sketched below.
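    A minimal sketch of the fix, creating one seeded generator per thread:
    julia> using Base.Threads, Random

    julia> rngs = [MersenneTwister(seed) for seed in 1:nthreads()];

    julia> a = zeros(10);

    julia> @threads for i in 1:10
             a[i] = rand(rngs[threadid()])  # each thread touches only its own RNG
           end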
  • You do not always have to use @threads and manage memory carefully yourself. Some libraries are designed to take advantage of all available threads by default. A commonly used example is OpenBLAS, which handles matrix operations. If your code includes significant amounts of matrix multiplication, it will perform better on a machine with more cores. You do not even need to start Julia with multiple threads for this, because OpenBLAS manages its threads independently, as shown below.
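    You can query and adjust the number of BLAS threads through the LinearAlgebra standard library (the value reported depends on your machine):
    julia> using LinearAlgebra

    julia> BLAS.get_num_threads()
    8

    julia> BLAS.set_num_threads(4)  # limit BLAS to 4 threads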

Distributed computing

  • Distributed computing is a way to run parallel computations on multiple CPU cores or on clusters of computers without having to worry about the processes sharing memory.
  • This makes it a conceptually simpler model than multithreading.

Start multiple processes in Julia

You can either start Julia with multiple worker processes:

$ julia -p 2

This starts Julia with two additional worker processes.

Or start them from within Julia:

julia> using Distributed
julia> addprocs(2)

The procs() function returns the IDs of running Julia processes.
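
For example (the exact IDs depend on how many workers you added):

julia> procs()
3-element Vector{Int64}:
 1
 2
 3

julia> workers()   # only the workers, without the master process
2-element Vector{Int64}:
 2
 3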

Preparing code on all processors

To run code on multiple cores, we need to make that code available on all of them, using the @everywhere macro from the Distributed package. For example, to load a package on all processes:

julia> using Distributed
julia> addprocs(2)
julia> @everywhere using Random
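
The same macro can be used to define functions on every process; remotecall_fetch can then run such a function on a specific worker (myfun is just an illustrative name):

julia> @everywhere myfun(x) = x^2
julia> remotecall_fetch(myfun, 2, 3.0)   # run myfun(3.0) on worker 2
9.0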

Parallel map

The parallel map function, pmap, is good for situations where you want to parallelize a loop in which each iteration performs a large amount of computation.

julia> using Distributed
julia> addprocs(14)
julia> @everywhere using LinearAlgebra

julia> x = [rand(100,100) for i in 1:10];

julia> @time map(svd, x);
  0.034314 seconds (102 allocations: 4.706 MiB)

julia> @time pmap(svd, x);
  0.004444 seconds (1.48 k allocations: 1.595 MiB)

pmap(fn, input) runs the function fn on each element of input. This works well when each iteration returns its result independently of the other iterations. But when we want each iteration to update an existing array, we cannot use standard arrays, because each process gets its own copy of the array, as the example below shows.
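
A quick demonstration of the problem:

julia> a = zeros(3);

julia> pmap(i -> a[i] = myid(), 1:3);

julia> a   # unchanged: each worker modified its own private copy
3-element Vector{Float64}:
 0.0
 0.0
 0.0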

The SharedArrays package provides an array type that can be shared among multiple processes on the same machine. Using it is simple: we just call SharedArray instead of Array:

using Distributed, SharedArrays
@everywhere using SharedArrays   # the workers need the package too

S = SharedArray{Float64}((100, 100, 12))

pmap(i -> S[i] = myid(), eachindex(S))

The final call records, in each element of S, the ID of the worker that processed that element; the pmap goes through every element of the array, and because S is shared, the master process sees all the updates.
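
To check which workers participated:

julia> unique(S)   # the set of worker IDs that wrote into S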

Tags: julia-lang