Running Distributed Simulations

Generating input files

To use QXContexts we first generate a set of input files using the generate_simulation_files function. This function accepts any circuit type for which the convert_to_tnc function has been implemented. This includes QXZoo and YaoBlocks circuits, see the Circuits with QXZoo and YaoQX tutorial for more details on creating such circuits. The following example will use a 49 qubit Random Quantum Circuit (RQC) of depth 16.

using QXTools

circ = create_rqc_circuit(7, 7, 16, 42, final_h=true)
output_args = output_params_dict(49, 100)
generate_simulation_files(circ, "rqc_7_7_16", time=10, output_args=output_args)

The above example will run the contraction finding algorithm for 10 seconds and write input files rqc_7_7_16.qx, rqc_7_7_16.yaml and rqc_7_7_16.jld2. By default two bonds will be sliced and the simulation files will specify that simulation should be run for 100 random bitstrings. Further details on the format and contents of these files is described in the User Guide and further details of the slicing and output sampling options of the generate_simulation_files function are described in the Miscellaneous Features tutorial.

Running the simulation

Once the input files have been generated, simulations can be run using the bin/qxrun.jl script which can be run directly from the command line (not the Julia REPL). To run a simulation using the input files generated in the example above one would use

julia --project bin/qxrun.jl -d rqc_7_7_16.qx -o rqc_7_7_16_output.jld2

This will run the simulation and write the output to the rqc_7_7_16_output.jld2 file.

To run simulations in parallel the MPI.jl package must first be installed and the mpiexecjl utility installed. This can be done with the following commands:

]add MPI
import MPI
MPI.install_mpiexecjl()

On compute clusters it is advised to use the system provided MPI installation. For more details on this see the official MPI.jl documentation here.

Once MPI.jl has been installed and configured the simulation example described above can be run in parallel on two processes with

mpiexecjl --project -n 2 julia  bin/qxrun.jl -d rqc_7_7_16.qx -o rqc_7_7_16_output.jld2 -m

where the -n 2 specifies the number of processes to use and the -m option enabled MPI.

Measuring the time and speedup

When running the above example we would expect to see a reduction in the time taken when using more processes, however this is not what we observe in practice. This is due to the fact that with Julia when running code for the first time the majority of the time is taken up by compilation. In the above example since we are only calculating amplitudes for 10 bitstrings there are not enough computations after this compilation to observe a speedup. In practice for circuits of this size one would be calculating amplitudes for millions of bitstrings. It is also possible to reduce this startup time by compiling a custom system image as decribed in this documentation for QXContexts.

To see more clearly the time taken for compilation vs computation one can run the examples above using the -t flag. When this flag is used the simulation will be performed twice and timing information collected from each of these runs. (We also add -b 1 which sets the number of BLAS threads to 1 to see clearly the effect of increasing the number of processes without changing the contention between cores that would result when threading is also used.) Using a single process with the following command

mpiexecjl --project -n 1 julia bin/qxrun.jl -d rqc_7_7_16.qx -o rqc_7_7_16_output.jld2 -m  -t -b 1

results in the following timing output

 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations
                               ──────────────────────   ───────────────────────
       Tot / % measured:          1127914s / 0.01%          15.5GiB / 90.5%

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 Simulation                 1    52.6s  80.5%   52.6s   13.0GiB  93.0%  13.0GiB
 Init sampler               1    10.7s  16.3%   10.7s    759MiB  5.29%   759MiB
   Parse input files        1    7.80s  11.9%   7.80s    562MiB  3.91%   562MiB
   Create Context           1    1.34s  2.05%   1.34s    147MiB  1.03%   147MiB
   Create sampler           1   73.5ms  0.11%  73.5ms   3.88MiB  0.03%  3.88MiB
 Write results              1    2.12s  3.24%   2.12s    251MiB  1.75%   251MiB
 ──────────────────────────────────────────────────────────────────────────────
 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations
                               ──────────────────────   ───────────────────────
       Tot / % measured:            16.1s / 100%            5.95GiB / 100%

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 Simulation                 1    16.0s   100%   16.0s   5.94GiB  100%   5.94GiB
 Init sampler               1   60.2ms  0.37%  60.2ms   11.8MiB  0.19%  11.8MiB
   Parse input files        1   41.5ms  0.26%  41.5ms   4.80MiB  0.08%  4.80MiB
   Create Context           1   15.6ms  0.10%  15.6ms   6.78MiB  0.11%  6.78MiB
   Create sampler           1   69.3μs  0.00%  69.3μs   2.08KiB  0.00%  2.08KiB
 Write results              1   1.60ms  0.01%  1.60ms   72.1KiB  0.00%  72.1KiB
 ──────────────────────────────────────────────────────────────────────────────

The first set of timings include the precompilation whereas the second set are for the computations alone. From this we see that the simulation part of the code took 16 seconds when using a single process. Running on two processes with

mpiexecjl --project -n 2 julia bin/qxrun.jl -d rqc_7_7_16.qx -o rqc_7_7_16_output.jld2 -m  -t -b 1

we get

 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations
                               ──────────────────────   ───────────────────────
       Tot / % measured:          1129782s / 0.01%          12.5GiB / 88.2%

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 Simulation                 1    46.5s  78.7%   46.5s   10.1GiB  91.1%  10.1GiB
 Init sampler               1    10.5s  17.7%   10.5s    759MiB  6.72%   759MiB
   Parse input files        1    7.62s  12.9%   7.62s    562MiB  4.97%   562MiB
   Create Context           1    1.36s  2.31%   1.36s    147MiB  1.30%   147MiB
   Create sampler           1   74.0ms  0.13%  74.0ms   3.88MiB  0.03%  3.88MiB
 Write results              1    2.10s  3.57%   2.10s    251MiB  2.22%   251MiB
 ──────────────────────────────────────────────────────────────────────────────
 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations
                               ──────────────────────   ───────────────────────
       Tot / % measured:            8.53s / 100%            2.98GiB / 100%

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 Simulation                 1    8.45s  99.0%   8.45s   2.97GiB  100%   2.97GiB
 Init sampler               1   79.9ms  0.94%  79.9ms   11.8MiB  0.39%  11.8MiB
   Parse input files        1   38.2ms  0.45%  38.2ms   4.80MiB  0.16%  4.80MiB
   Create Context           1   17.0ms  0.20%  17.0ms   6.78MiB  0.22%  6.78MiB
   Create sampler           1   27.5μs  0.00%  27.5μs   2.08KiB  0.00%  2.08KiB
 Write results              1   2.43ms  0.03%  2.43ms   72.1KiB  0.00%  72.1KiB
 ──────────────────────────────────────────────────────────────────────────────

where we see that the simulation only took 8.45s, almost half the time. Running for four processes with

mpiexecjl --project -n 4 julia bin/qxrun.jl -d rqc_7_7_16.qx -o rqc_7_7_16_output.jld2 -m  -t -b 1

results in

 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations
                               ──────────────────────   ───────────────────────
       Tot / % measured:          1128856s / 0.01%          11.0GiB / 86.6%

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 Simulation                 1    55.6s  79.7%   55.6s   8.57GiB  89.7%  8.57GiB
 Init sampler               1    12.0s  17.2%   12.0s    759MiB  7.76%   759MiB
   Parse input files        1    8.58s  12.3%   8.58s    562MiB  5.74%   562MiB
   Create Context           1    1.60s  2.30%   1.60s    147MiB  1.50%   147MiB
   Create sampler           1   88.1ms  0.13%  88.1ms   3.88MiB  0.04%  3.88MiB
 Write results              1    2.11s  3.03%   2.11s    251MiB  2.57%   251MiB
 ──────────────────────────────────────────────────────────────────────────────
 ──────────────────────────────────────────────────────────────────────────────
                                       Time                   Allocations
                               ──────────────────────   ───────────────────────
       Tot / % measured:            4.63s / 100%            1.50GiB / 100%

 Section               ncalls     time   %tot     avg     alloc   %tot      avg
 ──────────────────────────────────────────────────────────────────────────────
 Simulation                 1    4.56s  98.5%   4.56s   1.48GiB  99.2%  1.48GiB
 Init sampler               1   68.9ms  1.49%  68.9ms   11.8MiB  0.77%  11.8MiB
   Parse input files        1   46.3ms  1.00%  46.3ms   4.80MiB  0.31%  4.80MiB
   Create Context           1   17.9ms  0.39%  17.9ms   6.78MiB  0.44%  6.78MiB
   Create sampler           1   22.6μs  0.00%  22.6μs   2.08KiB  0.00%  2.08KiB
 Write results              1   2.01ms  0.04%  2.01ms   72.1KiB  0.00%  72.1KiB
  ──────────────────────────────────────────────────────────────────────────────

which shows the simulation taking 4.56s, almost half again. This shows that we achieve approximately linear speedup in the time taken for the actual simulation with increasing numbers of processes. For some scaling to larger numbers of processes see the results included in arXiv:2110.09894.