Improve SET_VAR_AT_COMPILE_TIME clarity #802

Description

@TysonRayJones

Context

The existing

SET_VAR_AT_COMPILE_TIME(type, newSymbol, templateSymbol, runtimeSymbol)

macro is used by both the CPU and GPU backends for compile-time optimisation: it sets newSymbol := templateSymbol (at compile-time) when templateSymbol is not equal to its sentinel "OFF" value of -1, and otherwise falls back to newSymbol := runtimeSymbol (resolving at run-time).

/*
 * COMPILE-TIME VARIABLE MACROS
 *
 * used by cpu_subroutines.cpp and gpu_subroutines to attempt to set
 * a variable to a value known at compile-time (like a templated function's
 * parameter), enabling compile-time optimisations of subsequent code which
 * uses the variable, such as loop unrolling. If the value is not known at
 * compile-time (compileval==-1 which indicates a templated function has
 * been called with more qubits than it has been explicitly instantiated and
 * optimised for), the runtime value is used, precluding optimisations.
 */
#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

This is (in my humble opinion) a beautifully succinct way to specialize and accelerate functions for fixed input sizes, at the cost of increased binary size. Just look at how the below function has one definition, yet seven compiled, bespoke instantiations!

template <int NumQuregs>
void cpu_statevec_setQuregToWeightedSum_sub(Qureg outQureg, vector<qcomp> coeffs, vector<Qureg> inQuregs) {

    // use cpu_qcomp arithmetic overloads (avoid qcomp's)
    cpu_qcomp* outAmps = getCpuQcompPtr(outQureg.cpuAmps);
    cpu_qcomp* inFacs = getCpuQcompPtr(coeffs.data());
    qindex numIts = outQureg.numAmpsPerNode;

    // use template param to compile-time unroll inner loop below
    SET_VAR_AT_COMPILE_TIME(int, numQuregs, NumQuregs, inQuregs.size());

    #pragma omp parallel for if(outQureg.isMultithreaded)
    for (qindex n=0; n<numIts; n++) {

        // unrolled when inQuregs.size() <= 5
        cpu_qcomp amp = getCpuQcomp(0, 0);
        for (int q=0; q<numQuregs; q++)
            amp += inFacs[q] * getCpuQcompPtr(inQuregs[q].cpuAmps)[n];

        // must not modify cpuAmps[n] before computing the amp since
        // outQureg can legally appear among inQuregs
        outAmps[n] = amp;
    }
}
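
For readers without the codebase to hand, here is a minimal, self-contained sketch of the same pattern (C++17). The function weightedSum and its integer arguments are hypothetical stand-ins invented for illustration, not part of the library; only the macro is copied from above.

```cpp
#include <cassert>
#include <vector>

// the macro exactly as quoted above
#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

// hypothetical example function (an assumption, not library code): computes
// sum_q coeffs[q] * vals[q], with the loop bound known at compile-time
// whenever NumTerms is not the -1 sentinel
template <int NumTerms>
int weightedSum(const std::vector<int>& coeffs, const std::vector<int>& vals) {

    // the template parameter must match the runtime size, unless it is the sentinel
    assert(NumTerms == -1 || NumTerms == static_cast<int>(coeffs.size()));

    // numTerms is NumTerms when NumTerms != -1, else the runtime size
    SET_VAR_AT_COMPILE_TIME(int, numTerms, NumTerms, static_cast<int>(coeffs.size()));

    int sum = 0;
    for (int q = 0; q < numTerms; q++)
        sum += coeffs[q] * vals[q];

    return sum;
}
```

Calling weightedSum<3>(coeffs, vals) and weightedSum<-1>(coeffs, vals) on three-element inputs gives the same result; only the former hands the compiler a constant loop bound it may choose to unroll.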

Sometimes, however, the quantity which appears in the (to-be-unrolled) loops, and which we actually want resolved at compile-time, is the sum of the template parameter and another constexpr expression. This happens, for example, when the control qubit list becomes intermingled with the target qubit list:

template <int NumCtrls>
void cpu_statevec_anyCtrlSwap_subA(Qureg qureg, ConstList64 ctrls, ConstList64 ctrlStates, int targ1, int targ2) {

    assert_numCtrlsMatchesNumCtrlStatesAndTemplateParam(ctrls.size(), ctrlStates.size(), NumCtrls);

    // use cpu_qcomp (in lieu of qcomp) even though no arithmetic happens below - just for consistency!
    cpu_qcomp* amps = getCpuQcompPtr(qureg.cpuAmps);

    // each control qubit halves the number of iterations, each of which modifies 2 amplitudes, and skips 2
    qindex numIts = qureg.numAmpsPerNode / powerOf2(2 + ctrls.size());

    auto sortedQubits = util_getSorted(ctrls, {targ2, targ1});
    auto qubitStateMask = util_getBitMask(ctrls, ctrlStates, {targ2, targ1}, {0, 1});

    // use template param to compile-time unroll loop in insertBits()
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrls.size());
    int numQubitBits = numCtrlBits + 2;

    #pragma omp parallel for if(qureg.isMultithreaded)
    for (qindex n=0; n<numIts; n++) {

        // i01 = nth local index where ctrls are active, targ2=0 and targ1=1
        qindex i01 = insertBitsWithMaskedValues(n, sortedQubits.data(), numQubitBits, qubitStateMask);
        qindex i10 = flipTwoBits(i01, targ2, targ1);

        std::swap(amps[i01], amps[i10]);
    }
}

The above establishes sometimes-compile-time-known numQubitBits via:

 SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrls.size()); 
 int numQubitBits = numCtrlBits + 2; 

after which numCtrlBits is never used again. It is tempting, but invalid, to replace these two lines with the single line

 SET_VAR_AT_COMPILE_TIME(int, numQubitBits, NumCtrls + 2, ctrls.size() + 2);

because this breaks our detection of the sentinel value NumCtrls == -1. The macro would compare NumCtrls + 2 (which is then 1) against -1, never find a match, and so never fall back to ctrls.size() + 2.
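
The failure can be reproduced in isolation. The sketch below (C++17) wraps both initialisations in two hypothetical helper functions, whose names and plain int argument are assumptions made for illustration, and shows that the single-line form silently yields 1 when the sentinel is passed.

```cpp
// the macro exactly as quoted above
#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

// hypothetical helper: the valid two-line initialisation
template <int NumCtrls>
int getNumQubitBitsTwoLines(int numRuntimeCtrls) {
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, numRuntimeCtrls);
    int numQubitBits = numCtrlBits + 2;
    return numQubitBits;
}

// hypothetical helper: the tempting but INVALID single-line initialisation.
// When NumCtrls == -1, the macro tests (NumCtrls + 2 == -1), i.e. (1 == -1),
// which is false, so it wrongly assigns the "compile-time" value 1
template <int NumCtrls>
int getNumQubitBitsOneLine(int numRuntimeCtrls) {
    SET_VAR_AT_COMPILE_TIME(int, numQubitBits, NumCtrls + 2, numRuntimeCtrls + 2);
    return numQubitBits;
}
```

With seven runtime controls and the sentinel, getNumQubitBitsTwoLines<-1>(7) returns 9 whereas getNumQubitBitsOneLine<-1>(7) returns 1. The two agree whenever NumCtrls is not the sentinel, which is exactly why tests that only exercise the explicitly instantiated sizes would not catch the mistake.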

Problem

The "two line solution" above is arguably fine for the CPU backend, but causes additional ugliness in the GPU backend. This is because the kernels in gpu_kernels.cuh necessarily receive the control and target qubits pre-sorted in a single merged list, so an additional -2 ends up floating around the macro! For example, the GPU kernel corresponding to the above function receives a single ctrlsAndTargs list:

template <int NumCtrls> 
__global__ void kernel_statevec_anyCtrlSwap_subA(
    gpu_qcomp* amps, qindex numThreads, 
    _GRID_CONST_OPT const List64 ctrlsAndTargs, qindex ctrlsAndTargsMask, 
    int targ1, int targ2
) {
    GET_THREAD_IND(n, numThreads);

    // beware ctrlsAndTargs contains the two targets
    constexpr int numTargs = 2;

    // use template param to compile-time unroll loop in insertBits()
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - numTargs);
    int numQubitBits = numCtrlBits + numTargs;

    // i01 = nth local index where ctrls are active, targ2=0 and targ1=1
    qindex i01 = insertBitsWithMaskedValues(n, ctrlsAndTargs.data(), numQubitBits, ctrlsAndTargsMask);
    qindex i10 = flipTwoBits(i01, targ2, targ1);

    // swap amps
    gpu_qcomp amp01 = amps[i01];
    amps[i01] = amps[i10];
    amps[i10] = amp01;
}

So now it takes three lines to establish numQubitBits:

constexpr int numTargs = 2;
SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - numTargs);
int numQubitBits = numCtrlBits + numTargs;

Of course, this can be shrunk to an even less readable two lines:

SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - 2);
int numQubitBits = numCtrlBits + 2;

This is a very confusing variable initialisation process, which is prone to bugs and off-by-one errors; it is likely responsible for this narrowly avoided bug, which would not have been covered by our unit tests, eep!

Solution

The reason we cannot reduce variable initialisation to a one-liner is that the templateSymbol slot in SET_VAR_AT_COMPILE_TIME acts as both an indicator of whether or not to use the template value (by being -1 or otherwise), and the actual template value (when used).

The "smallest" fix is to change the macro (or define a new one) so that it simply separates those responsibilities into different slots:

#define SET_VAR_AT_COMPILE_TIME(type, name, usecompileval, compileval, runtimeval) \
    type name; \
    if constexpr (usecompileval) \
        name = compileval; \
    else \
        name = runtimeval;

With this change, the three-liner above (with numQubitBits shortened to numBits) of

constexpr int numTargs = 2;
SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - numTargs);
int numBits = numCtrlBits + numTargs;

becomes a two-liner

constexpr int numTargs = 2;
SET_VAR_AT_COMPILE_TIME(int, numBits, NumCtrls!=-1, NumCtrls+numTargs, ctrlsAndTargs.size());

or even a defensible one-liner

SET_VAR_AT_COMPILE_TIME(int, numBits, NumCtrls!=-1, NumCtrls+2, ctrlsAndTargs.size());
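
As a sanity check that the five-argument form preserves behaviour, the sketch below (C++17) places the old and new macros side by side. The _OLD and _NEW suffixes, the two wrapper functions and the plain int argument standing in for ctrlsAndTargs.size() are all hypothetical, introduced only so that both variants can live in one file.

```cpp
// existing four-argument macro, renamed here (hypothetically) with an _OLD suffix
#define SET_VAR_AT_COMPILE_TIME_OLD(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

// proposed five-argument macro, renamed here (hypothetically) with a _NEW suffix
#define SET_VAR_AT_COMPILE_TIME_NEW(type, name, usecompileval, compileval, runtimeval) \
    type name; \
    if constexpr (usecompileval) \
        name = compileval; \
    else \
        name = runtimeval;

// hypothetical wrapper around the current three-line initialisation
template <int NumCtrls>
int getNumBitsOld(int numCtrlsAndTargs) {
    constexpr int numTargs = 2;
    SET_VAR_AT_COMPILE_TIME_OLD(int, numCtrlBits, NumCtrls, numCtrlsAndTargs - numTargs);
    int numBits = numCtrlBits + numTargs;
    return numBits;
}

// hypothetical wrapper around the proposed one-line initialisation
template <int NumCtrls>
int getNumBitsNew(int numCtrlsAndTargs) {
    SET_VAR_AT_COMPILE_TIME_NEW(int, numBits, NumCtrls != -1, NumCtrls + 2, numCtrlsAndTargs);
    return numBits;
}
```

Both wrappers return 5 for a merged list of three controls and two targets, whether instantiated with NumCtrls = 3 or with the sentinel -1.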

The five-argument macro remains unsatisfying, however, and may be no clearer than forgoing the macro entirely:

int numBits;
if constexpr (NumCtrls == -1)
    numBits = ctrlsAndTargs.size();
else
    numBits = NumCtrls + 2; // 2 targs

Find a clean solution to improve readability of gpu_kernels.cuh, which can also be used by cpu_subroutines.cpp.
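
One possible direction, offered only as a sketch and not as an agreed design, is to replace the macro with a small templated helper that keeps the sentinel test on the raw template parameter and applies the offset afterwards. The helper name getCompileTimeOrRuntimeVal, its Offset parameter and the wrapper getNumBits are hypothetical. It is also an assumption that compilers would inline the call and unroll downstream loops as readily as with the macro, and a CUDA build would presumably need the helper marked as callable from device code.

```cpp
// hypothetical helper (not part of the codebase): returns TemplateVal + Offset
// when TemplateVal is known (i.e. is not the -1 sentinel), else the runtime value
template <int TemplateVal, int Offset = 0>
constexpr int getCompileTimeOrRuntimeVal(int runtimeVal) {
    if constexpr (TemplateVal == -1)
        return runtimeVal;
    else
        return TemplateVal + Offset;
}

// the non-sentinel case is a genuine constant expression
static_assert(getCompileTimeOrRuntimeVal<3, 2>(0) == 5);

// hypothetical wrapper mirroring the kernel's initialisation; the int
// argument stands in for ctrlsAndTargs.size()
template <int NumCtrls>
int getNumBits(int numCtrlsAndTargs) {
    int numBits = getCompileTimeOrRuntimeVal<NumCtrls, 2>(numCtrlsAndTargs);
    return numBits;
}
```

The call site becomes a single ordinary declaration, and the -1 comparison can no longer be corrupted by arithmetic on the template argument. Whether this matches the macro's performance would need to be measured on both backends.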
