Render on nbviewer Launch Binder

Benchmarks

In [1]:
# NBVAL_IGNORE_OUTPUT
%load_ext watermark
import os
import qutip
import numpy as np
from numpy import pi
import matplotlib
import matplotlib.pylab as plt
import newtonprop
import snakeviz
import numba
%watermark -v --iversions
qutip       4.3.1
numpy       1.15.4
matplotlib  3.0.2
matplotlib.pylab  1.15.4
newtonprop  0.1.0
snakeviz    1.0.0
numba       0.41.0
CPython 3.6.7
IPython 7.2.0

Although it only works interactively, snakeviz is a very good way to visualize profiling data:

In [2]:
%load_ext snakeviz

The benchmarks are based on the Examples, so we rerun the entire Example notebook in this context:

In [3]:
%%capture
%run example.ipynb

Propagation runtime

In [4]:
tlist = np.linspace(0, 10, 100)

First, we measure how long the propagation with QuTiP takes. This happens mostly in compiled code, so it is pretty fast.

In [5]:
%%timeit
qutip.mesolve(L, rho0, tlist)
15.6 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [6]:
%%timeit
propagate_expm(L, rho0, tlist)
39.1 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The Newton propagator, being a reference implementation, is written in pure Python and is roughly two orders of magnitude slower. To make the comparison fair, we limit the precision to \(10^{-8}\), which is roughly the precision of mesolve.

In [7]:
%%timeit
propagate(L, rho0, tlist, zero_qutip, norm_qutip, inner_qutip, tol=1e-8)
3.85 s ± 282 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
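
As a rough sanity check on such a tolerance, the effective precision of an adaptive ODE solver can be gauged by integrating a problem with a known solution and measuring the final error. A minimal sketch with scipy on a toy decay equation (not the Liouvillian from the example; the tolerances are illustrative):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy decay ODE with known solution y(t) = exp(-t/2), standing in
# for the master equation, so the solver error can be measured directly.
f = lambda t, y: -0.5 * y
sol = solve_ivp(f, (0, 10), [1.0], rtol=1e-6, atol=1e-8)
err = abs(sol.y[0, -1] - np.exp(-5.0))  # small; governed by rtol/atol
```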

When using lower-level data types, things get considerably faster:

In [8]:
%%timeit
propagate(apply_cythonized_L, rho0_data, tlist, zero_vectorized, norm_vectorized, inner_vectorized, tol=1e-8)
209 ms ± 6.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [9]:
%%timeit
propagate(L_vectorized, rho0_vectorized, tlist, zero_vectorized, norm_vectorized, inner_vectorized, tol=1e-8)
210 ms ± 14.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Profiling

We can profile how much time is spent in the various routines, comparing mesolve and different variations of the Newton propagator. See https://docs.python.org/3/library/profile.html#instant-user-s-manual for the meaning of the columns.

mesolve

First, we look at QuTiP’s mesolve:

In [10]:
stats = %prun -q -r qutip.mesolve(L, rho0, tlist);

We can look at the top-level routines in which we spend the most time cumulatively, that is, including sub-calls:

In [11]:
stats.sort_stats('cumtime').print_stats(10);
         8378 function calls (8174 primitive calls) in 0.023 seconds

   Ordered by: cumulative time
   List reduced from 105 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.023    0.023 {built-in method builtins.exec}
        1    0.000    0.000    0.023    0.023 <string>:1(<module>)
        1    0.000    0.000    0.023    0.023 mesolve.py:80(mesolve)
        1    0.000    0.000    0.023    0.023 mesolve.py:775(_mesolve_const)
        1    0.001    0.001    0.023    0.023 mesolve.py:985(_generic_ode_solve)
       99    0.000    0.000    0.009    0.000 _ode.py:396(integrate)
       99    0.008    0.000    0.008    0.000 _ode.py:989(run)
      102    0.001    0.000    0.008    0.000 qobj.py:211(__init__)
      112    0.000    0.000    0.004    0.000 qobj.py:1927(type)
      310    0.001    0.000    0.004    0.000 fromnumeric.py:64(_wrapreduction)


Or at the bottom-level routines where the time is actually spent:

In [12]:
stats.sort_stats('tottime').print_stats(10);
         8378 function calls (8174 primitive calls) in 0.023 seconds

   Ordered by: internal time
   List reduced from 105 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       99    0.008    0.000    0.008    0.000 _ode.py:989(run)
      310    0.002    0.000    0.002    0.000 {method 'reduce' of 'numpy.ufunc' objects}
      100    0.002    0.000    0.003    0.000 {qutip.cy.spconvert.dense2D_to_fastcsr_fmode}
        1    0.001    0.001    0.023    0.023 mesolve.py:985(_generic_ode_solve)
      611    0.001    0.000    0.001    0.000 {built-in method numpy.core.multiarray.array}
      203    0.001    0.000    0.003    0.000 fastsparse.py:47(__init__)
      310    0.001    0.000    0.004    0.000 fromnumeric.py:64(_wrapreduction)
      100    0.001    0.000    0.001    0.000 superoperator.py:227(vec2mat)
      102    0.001    0.000    0.008    0.000 qobj.py:211(__init__)
      301    0.001    0.000    0.001    0.000 qobj.py:1934(shape)


The runtime is dominated by the ODE solver and by sparse matrix operations.

If we’re working interactively, we could use snakeviz to analyze further details:

In [13]:
#%snakeviz qutip.mesolve(L, rho0, tlist)

qutip-propagation

Next, we look at the Newton propagator operating on high-level qutip objects:

In [14]:
#%snakeviz propagate(L, rho0, tlist, zero_qutip, norm_qutip, inner_qutip, tol=1e-8)
In [15]:
stats = %prun -q -r propagate(L, rho0, tlist, zero_qutip, norm_qutip, inner_qutip, tol=1e-8);

In [16]:
stats.sort_stats('cumtime').print_stats(10);
         3902775 function calls (3810705 primitive calls) in 4.882 seconds

   Ordered by: cumulative time
   List reduced from 182 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.882    4.882 {built-in method builtins.exec}
        1    0.000    0.000    4.882    4.882 <string>:1(<module>)
        1    0.000    0.000    4.882    4.882 <ipython-input-3-fc96ec8d0dfb>:10(propagate)
       99    0.006    0.000    4.881    0.049 propagator.py:186(step)
       99    0.052    0.001    4.874    0.049 propagator.py:314(_step)
       99    0.137    0.001    3.991    0.040 propagator.py:26(_arnoldi)
    35937    0.141    0.000    1.640    0.000 qobj.py:211(__init__)
     5445    0.019    0.000    1.583    0.000 qobj.py:465(__sub__)
     7623    0.084    0.000    1.568    0.000 qobj.py:360(__add__)
    47817    0.042    0.000    1.288    0.000 qobj.py:1927(type)


In [17]:
stats.sort_stats('tottime').print_stats(10);
         3902775 function calls (3810705 primitive calls) in 4.882 seconds

   Ordered by: internal time
   List reduced from 182 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   110385    0.552    0.000    0.552    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    74547    0.357    0.000    0.737    0.000 fastsparse.py:47(__init__)
   272547    0.289    0.000    0.289    0.000 {built-in method numpy.core.multiarray.array}
   109395    0.233    0.000    0.860    0.000 fromnumeric.py:64(_wrapreduction)
     7623    0.157    0.000    0.509    0.000 fastsparse.py:74(_binopt)
    35937    0.141    0.000    1.640    0.000 qobj.py:211(__init__)
     6435    0.139    0.000    0.217    0.000 {qutip.cy.spmath.zcsr_mult}
       99    0.137    0.001    3.991    0.040 propagator.py:26(_arnoldi)
    91476    0.122    0.000    1.106    0.000 dimensions.py:44(is_scalar)
   548262    0.117    0.000    0.162    0.000 {built-in method builtins.isinstance}


cythonized qutip-propagation

Lastly, we can look at the more efficient propagation using a cythonized application of the Liouvillian:

In [18]:
#%snakeviz propagate(apply_cythonized_L, rho0_data, tlist, zero_vectorized, norm_vectorized, inner_vectorized, tol=1e-8)
In [19]:
stats = %prun -q -r propagate(apply_cythonized_L, rho0_data, tlist, zero_vectorized, norm_vectorized, inner_vectorized, tol=1e-8);

In [20]:
stats.sort_stats('cumtime').print_stats(10);
         95441 function calls in 0.289 seconds

   Ordered by: cumulative time
   List reduced from 72 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.289    0.289 {built-in method builtins.exec}
        1    0.000    0.000    0.289    0.289 <string>:1(<module>)
        1    0.000    0.000    0.289    0.289 <ipython-input-3-fc96ec8d0dfb>:10(propagate)
       99    0.002    0.000    0.289    0.003 propagator.py:186(step)
       99    0.041    0.000    0.284    0.003 propagator.py:314(_step)
       99    0.067    0.001    0.201    0.002 propagator.py:26(_arnoldi)
      990    0.033    0.000    0.058    0.000 linalg.py:953(eigvals)
     1386    0.013    0.000    0.038    0.000 linalg.py:2203(norm)
     1287    0.002    0.000    0.036    0.000 <ipython-input-3-718732ac07f2>:4(norm_vectorized)
     3762    0.028    0.000    0.029    0.000 {built-in method numpy.core.multiarray.dot}


In [21]:
stats.sort_stats('tottime').print_stats(10);
         95441 function calls in 0.289 seconds

   Ordered by: internal time
   List reduced from 72 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       99    0.067    0.001    0.201    0.002 propagator.py:26(_arnoldi)
       99    0.041    0.000    0.284    0.003 propagator.py:314(_step)
      990    0.033    0.000    0.058    0.000 linalg.py:953(eigvals)
     3762    0.028    0.000    0.029    0.000 {built-in method numpy.core.multiarray.dot}
      990    0.016    0.000    0.016    0.000 {built-in method qutip.cy.spmatfuncs.cy_ode_rhs}
     1386    0.013    0.000    0.038    0.000 linalg.py:2203(norm)
     8613    0.011    0.000    0.014    0.000 defmatrix.py:186(__getitem__)
     5445    0.008    0.000    0.008    0.000 {built-in method numpy.core.multiarray.vdot}
       99    0.008    0.000    0.008    0.000 propagator.py:142(_extend_leja)
     1089    0.006    0.000    0.006    0.000 {method 'reduce' of 'numpy.ufunc' objects}


We now see that the (Python-level) _arnoldi and _step routines of the Newton propagator come out as the bottleneck, as the application of the Liouvillian has become efficient (running at C speed).

Note that the Newton implementation, in particular the _extend_leja function, has been sped up through the use of numba. Without that, _extend_leja would dominate the runtime.
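
The kind of speedup numba gives can be sketched on a toy divided-differences loop (a hypothetical stand-in, not the actual _extend_leja code): decorating the pure-Python loop with numba.njit compiles it to machine code on first call.

```python
import numpy as np

try:
    import numba
    njit = numba.njit
except ImportError:  # fall back to pure Python if numba is unavailable
    njit = lambda func: func

@njit
def divided_diffs(x, f):
    # Newton divided differences, computed in place (toy illustration)
    n = len(x)
    d = f.copy()
    for j in range(1, n):
        for i in range(n - 1, j - 1, -1):
            d[i] = (d[i] - d[i - 1]) / (x[i] - x[i - j])
    return d

# Newton-form coefficients of the interpolant of x**2 + 1 at x = 0, 1, 2
coeffs = divided_diffs(np.array([0.0, 1.0, 2.0]), np.array([1.0, 2.0, 5.0]))
```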

Number of function evaluations

Since the Newton propagator is ultimately still limited by its pure-Python implementation, it is fairer to measure the runtime in terms of the average number of applications of the Liouvillian per time step. This is under the assumption that in an efficient implementation (and for large Hilbert spaces), applying the Liouvillian is the dominating factor.

The number of applications depends on the chosen precision and on the length of the time step: longer time steps tend to be more efficient, since only then do we reach the regime where the fast convergence of the Newton series kicks in.
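
This trend can be illustrated with a toy stand-in: counting the number of Taylor terms of \(e^{z}\) needed to reach a given tolerance (the Newton series converges differently in detail, but shows the same qualitative behavior of fewer applications per unit time for longer steps):

```python
import numpy as np

def taylor_terms(z, tol=1e-8, max_terms=500):
    # number of Taylor terms of exp(z) for a relative error below tol
    term, total, exact = 1.0 + 0j, 1.0 + 0j, np.exp(z)
    for n in range(1, max_terms):
        term *= z / n
        total += term
        if abs(total - exact) < tol * abs(exact):
            return n + 1
    return max_terms

# terms per unit time shrink as the step dt grows
terms_per_time = {dt: taylor_terms(-1j * dt) / dt for dt in (1, 5, 10)}
```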

We construct a dummy Qobj that counts its own applications to a state:

In [22]:
class CountingQobj(qutip.Qobj):
    def __init__(self, *args, **kwargs):
        self.counter = 0
        super().__init__(*args, **kwargs)
    def __mul__(self, other):
        if isinstance(other, qutip.Qobj):
            self.counter += 1
        return super().__mul__(other)
In [23]:
def count_applications_newton(nt, tol=1e-8):
    tlist = np.linspace(0, 10, nt)
    L_count = CountingQobj(L)
    propagate(L_count, rho0, tlist, zero_qutip, norm_qutip, inner_qutip, tol=tol)
    return L_count.counter / len(tlist)
In [24]:
count_applications_newton(nt=10)
Out[24]:
23.0
In [25]:
count_applications_newton(nt=100)
Out[25]:
9.9

To compare this to the average number of applications in mesolve, we use a trimmed-down version of the mesolve routine:

In [26]:
import scipy
from qutip.solver import Options
from qutip.superoperator import mat2vec
from qutip.cy.spmatfuncs import cy_ode_rhs
from qutip.mesolve import _generic_ode_solve
from qutip.ui.progressbar import BaseProgressBar


def mesolve(L, rho0, tlist):
    opt = Options()

    def func(t, rho, data, ind, ptr):
        func.counter += 1
        return cy_ode_rhs(t, rho, data, ind, ptr)

    func.counter = 0

    r = scipy.integrate.ode(func)
    r.set_f_params(L.data.data, L.data.indices, L.data.indptr)
    r.set_integrator('zvode', method=opt.method, order=opt.order,
                     atol=opt.atol, rtol=opt.rtol, nsteps=opt.nsteps,
                     first_step=opt.first_step, min_step=opt.min_step,
                     max_step=opt.max_step)
    initial_vector = mat2vec(rho0.full()).ravel('F')
    r.set_initial_value(initial_vector, tlist[0])
    dt = tlist[1] - tlist[0]
    for step in range(len(tlist)-1):
        r.integrate(r.t + dt)
    return func.counter


In [27]:
def count_applications_mesolve(nt):
    tlist = np.linspace(0, 10, nt)
    return mesolve(L, rho0, tlist) / len(tlist)
In [28]:
count_applications_mesolve(10)
Out[28]:
40.7
In [29]:
count_applications_mesolve(100)
Out[29]:
4.07