Improve `SET_VAR_AT_COMPILE_TIME` clarity

## Context

The existing 
```cpp
SET_VAR_AT_COMPILE_TIME(type, newSymbol, templateSymbol, runtimeSymbol)
```
macro is used by both the CPU and GPU backends for compile-time optimisation: it sets `newSymbol := templateSymbol` (at compile-time) when `templateSymbol` is not equal to its sentinel "OFF" value of `-1`, and otherwise falls back to `newSymbol := runtimeSymbol` (resolving at run-time). 

https://github.com/QuEST-Kit/QuEST/blob/922c555825ca0c75a8b3d76d4757c048f3bcd9f6/quest/src/core/accelerator.hpp#L128-L145
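To make the behaviour concrete, here is a minimal sketch of the pattern in plain C++17. The macro has the same shape as the linked definition, but `sumFirst` and its parameters are hypothetical names invented for illustration and are not part of QuEST.

```cpp
#include <cassert>
#include <vector>

// same shape as the linked macro: -1 is the sentinel meaning "not known at compile-time"
#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

// hypothetical function: sums the first 'num' elements, where 'num' is
// the template parameter N unless N is the sentinel -1
template <int N>
int sumFirst(const std::vector<int>& vals, int runtimeNum) {

    // assumed convention: callers pass a runtime value consistent with N
    assert(N == -1 || N == runtimeNum);

    SET_VAR_AT_COMPILE_TIME(int, num, N, runtimeNum);

    // when N != -1, the loop bound is a constant which the compiler may unroll
    int sum = 0;
    for (int i = 0; i < num; i++)
        sum += vals[i];
    return sum;
}
```

Calling `sumFirst<2>(vals, 2)` takes the compile-time branch, while `sumFirst<-1>(vals, 4)` falls back to the runtime value.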

This is (in my humble opinion) a beautifully succinct way to specialize and accelerate functions for fixed input sizes, at the cost of increased binary size. Just look at how the below function has _one_ definition, yet _seven_ compiled, bespoke instantiations!

https://github.com/QuEST-Kit/QuEST/blob/922c555825ca0c75a8b3d76d4757c048f3bcd9f6/quest/src/cpu/cpu_subroutines.cpp#L1148-L1172
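The instantiations come about because a caller picks the template argument from the runtime size. The sketch below assumes the seven cases are sizes `0` to `5` plus the shared `-1` fallback; `weightedSum` and `weightedSumDispatch` are hypothetical names, and QuEST's actual dispatch mechanism may well differ.

```cpp
#include <cassert>
#include <vector>

#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

// hypothetical kernel: one definition, compiled once per template argument
template <int NumTerms>
double weightedSum(const std::vector<double>& coeffs, const std::vector<double>& vals) {
    assert(coeffs.size() == vals.size());

    SET_VAR_AT_COMPILE_TIME(int, numTerms, NumTerms, static_cast<int>(coeffs.size()));

    double sum = 0;
    for (int q = 0; q < numTerms; q++)
        sum += coeffs[q] * vals[q];
    return sum;
}

// hypothetical dispatcher: small sizes get bespoke instantiations,
// all larger sizes share the runtime (-1) instantiation
double weightedSumDispatch(const std::vector<double>& coeffs, const std::vector<double>& vals) {
    switch (coeffs.size()) {
        case 0:  return weightedSum<0>(coeffs, vals);
        case 1:  return weightedSum<1>(coeffs, vals);
        case 2:  return weightedSum<2>(coeffs, vals);
        case 3:  return weightedSum<3>(coeffs, vals);
        case 4:  return weightedSum<4>(coeffs, vals);
        case 5:  return weightedSum<5>(coeffs, vals);
        default: return weightedSum<-1>(coeffs, vals);
    }
}
```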

Sometimes, however, the quantity which appears in the (to-be-unrolled) loops, and which we _actually_ want resolved at compile-time, is the _sum_ of the template parameter and another `constexpr` expression. This happens, for example, when the control qubit list is merged with the target qubit list, as in:

https://github.com/QuEST-Kit/QuEST/blob/922c555825ca0c75a8b3d76d4757c048f3bcd9f6/quest/src/cpu/cpu_subroutines.cpp#L294-L321

The above establishes sometimes-compile-time-known `numQubitBits` via:
```cpp
 SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrls.size()); 
 int numQubitBits = numCtrlBits + 2; 
```
after which `numCtrlBits` is never used again. It is tempting but **invalid** to replace these two lines with the single line
```cpp
 SET_VAR_AT_COMPILE_TIME(int, numQubitBits, NumCtrls + 2, ctrls.size() + 2); 
```
because the macro would then compare `NumCtrls + 2` against the sentinel value. When `NumCtrls == -1`, that sum is `1`, so the sentinel is never detected and the fall-back to `ctrls.size() + 2` never happens.
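The failure is easy to reproduce in isolation. In this sketch (the function names are hypothetical, and the macro copies the shape of the linked definition), the one-line variant agrees with the two-line variant for every non-sentinel `NumCtrls`, but silently yields `1` when `NumCtrls` is `-1`.

```cpp
#include <cassert>

#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

// valid: test the sentinel first, then add the two targets
template <int NumCtrls>
int getNumQubitBitsTwoLines(int numCtrlsAtRuntime) {
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, numCtrlsAtRuntime);
    int numQubitBits = numCtrlBits + 2;
    return numQubitBits;
}

// invalid: when NumCtrls is -1, the macro tests (-1 + 2) == -1, which is
// false, so it assigns the "compile-time" value 1 instead of falling back
template <int NumCtrls>
int getNumQubitBitsOneLine(int numCtrlsAtRuntime) {
    SET_VAR_AT_COMPILE_TIME(int, numQubitBits, NumCtrls + 2, numCtrlsAtRuntime + 2);
    return numQubitBits;
}
```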

## Problem

The "two line solution" above is arguably fine for the CPU backend, but causes additional ugliness in the GPU backend. This is because the kernels in `gpu_kernels.cuh` necessarily receive the control and target qubits pre-sorted in a _single_ merged list. So there is an additional `-2` floating around the macro! For example, the corresponding GPU kernel to the above function [here](https://github.com/JPRichings/QuEST/blob/71268dc7306cc83cfa0281373eb9689f71e7151d/quest/src/gpu/gpu_kernels.cuh#L129-L147) receives a single `ctrlsAndTargs` list:
```cpp
template <int NumCtrls> 
__global__ void kernel_statevec_anyCtrlSwap_subA(
    gpu_qcomp* amps, qindex numThreads, 
    _GRID_CONST_OPT const List64 ctrlsAndTargs, qindex ctrlsAndTargsMask, 
    int targ1, int targ2
) {
    GET_THREAD_IND(n, numThreads);

    // beware ctrlsAndTargs contains the two targets
    constexpr int numTargs = 2;

    // use template param to compile-time unroll loop in insertBits()
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - numTargs);
    int numQubitBits = numCtrlBits + numTargs;

    // i01 = nth local index where ctrls are active, targ2=0 and targ1=1
    qindex i01 = insertBitsWithMaskedValues(n, ctrlsAndTargs.data(), numQubitBits, ctrlsAndTargsMask);
    qindex i10 = flipTwoBits(i01, targ2, targ1);

    // swap amps
    gpu_qcomp amp01 = amps[i01];
    amps[i01] = amps[i10];
    amps[i10] = amp01;
}
```

So now it takes _three_ lines to establish `numQubitBits`:
```cpp
constexpr int numTargs = 2;
SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - numTargs);
int numQubitBits = numCtrlBits + numTargs;
```
Of course, this can be shrunk to an even less readable two lines:
```cpp
SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - 2);
int numQubitBits = numCtrlBits + 2;
```

This variable initialisation process is very confusing, and is prone to bugs and off-by-one errors. It is likely responsible for [this](https://github.com/QuEST-Kit/QuEST/pull/739#discussion_r3449768957) narrowly avoided bug, which would not have been covered by our unit tests, eep! 
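As a generic illustration of how easily the `-2` can go astray (this is a hypothetical slip, not a reconstruction of the linked bug), consider forgetting the subtraction in the runtime slot. The two functions below agree for every explicitly instantiated `NumCtrls` and differ only on the `-1` fallback path, so tests which only exercise small control counts would not notice.

```cpp
#include <cassert>

#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
    type name; \
    if constexpr (compileval == -1) \
        name = runtimeval; \
    else \
        name = compileval;

constexpr int numTargs = 2;

// correct: the fallback removes the targets before they are re-added
template <int NumCtrls>
int getNumQubitBits(int numCtrlsAndTargs) {
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, numCtrlsAndTargs - numTargs);
    return numCtrlBits + numTargs;
}

// hypothetical slip: the fallback forgets the merged list already contains the targets
template <int NumCtrls>
int getNumQubitBitsBuggy(int numCtrlsAndTargs) {
    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, numCtrlsAndTargs);
    return numCtrlBits + numTargs;
}
```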


## Solution

We cannot reduce variable initialisation to a one-liner because the `templateSymbol` slot in `SET_VAR_AT_COMPILE_TIME` acts as _both_ an indicator of whether or not to use the template value (indicated by being `-1` or otherwise), _and_ the actual template value (when being used).

The "smallest" fix is to change the macro (or define a new one) so that it simply _separates_ those responsibilities into different slots.

```cpp
#define SET_VAR_AT_COMPILE_TIME(type, name, usecompileval, compileval, runtimeval) \
    type name; \
    if constexpr (usecompileval) \
        name = compileval; \
    else \
        name = runtimeval;
```

With this change, the three-liner above of
```cpp
constexpr int numTargs = 2;
SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrlsAndTargs.size() - numTargs);
int numBits = numCtrlBits + numTargs;
```
becomes a two-liner
```cpp
constexpr int numTargs = 2;
SET_VAR_AT_COMPILE_TIME(int, numBits, NumCtrls!=-1, NumCtrls+numTargs, ctrlsAndTargs.size());
```
or even a defensible one-liner
```cpp
SET_VAR_AT_COMPILE_TIME(int, numBits, NumCtrls!=-1, NumCtrls+2, ctrlsAndTargs.size());
```
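A quick sanity check of the five-slot macro in plain C++17 follows. `getNumBits` is a hypothetical stand-in for a kernel body, with `numCtrlsAndTargs` playing the role of `ctrlsAndTargs.size()`.

```cpp
#include <cassert>

// the five-slot macro proposed above: the condition and the value are separate slots
#define SET_VAR_AT_COMPILE_TIME(type, name, usecompileval, compileval, runtimeval) \
    type name; \
    if constexpr (usecompileval) \
        name = compileval; \
    else \
        name = runtimeval;

// hypothetical stand-in for a kernel which receives the merged ctrls-and-targs list
template <int NumCtrls>
int getNumBits(int numCtrlsAndTargs) {
    SET_VAR_AT_COMPILE_TIME(int, numBits, NumCtrls != -1, NumCtrls + 2, numCtrlsAndTargs);
    return numBits;
}
```

Note that the runtime slot no longer needs a compensating `- 2`, since nothing is added to the variable afterwards.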

This remains unsatisfying, however, and may be just as unclear as forgoing the macro entirely!
```cpp
int numBits;
if constexpr (NumCtrls == -1)
    numBits = ctrlsAndTargs.size();
else
    numBits = NumCtrls + 2; // 2 targs
```

Find a clean solution to improve readability of `gpu_kernels.cuh`, which can also be used by `cpu_subroutines.cpp`. 
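One possible direction, offered only as a sketch, is to replace the macro with a small function template which takes the offset as its own template argument. The name `getNumBitsOrFallback` and its parameters are hypothetical. In `gpu_kernels.cuh` it would presumably need device qualifiers, and whether compilers unroll downstream loops as reliably as with the macro is untested here.

```cpp
#include <cassert>

// hypothetical helper: NumCtrls is the usual template parameter (-1 = unknown),
// and NumExtra counts the extra qubits (e.g. targets) merged into the list
template <int NumCtrls, int NumExtra>
constexpr int getNumBitsOrFallback(int runtimeTotal) {
    if constexpr (NumCtrls == -1)
        return runtimeTotal;          // resolved at run-time
    else
        return NumCtrls + NumExtra;   // a compile-time constant
}

// the non-sentinel branch ignores its runtime argument entirely
static_assert(getNumBitsOrFallback<3, 2>(999) == 5);

// hypothetical stand-in for a kernel body using the helper
template <int NumCtrls>
int exampleKernelBody(int numCtrlsAndTargs) {
    int numBits = getNumBitsOrFallback<NumCtrls, 2>(numCtrlsAndTargs);
    return numBits;
}
```

A CPU caller holding only the control list would pass `ctrls.size() + 2` as the runtime total.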

	/*
	 * COMPILE-TIME VARIABLE MACROS
	 *
	 * used by cpu_subroutines.cpp and gpu_subroutines to attemptedly set
	 * a variable to a value known at compile-time (like a templated function's
	 * parameter), enabling compile-time optimisations of subsequent code which
	 * uses the variable such a loop unrolling. If the value is not known at
	 * compile-time (compileval==-1 which indicates a templated function has
	 * been called with more qubits than it has been explicitly instantiated and
	 * optimised for), the runtime value is used, precluding optimisations.
	 */

	#define SET_VAR_AT_COMPILE_TIME(type, name, compileval, runtimeval) \
	    type name; \
	    if constexpr (compileval == -1) \
	        name = runtimeval; \
	    else \
	        name = compileval;

	template <int NumQuregs>
	void cpu_statevec_setQuregToWeightedSum_sub(Qureg outQureg, vector<qcomp> coeffs, vector<Qureg> inQuregs) {

	    // use cpu_qcomp arithmetic overloads (avoid qcomp's)
	    cpu_qcomp* outAmps = getCpuQcompPtr(outQureg.cpuAmps);
	    cpu_qcomp* inFacs = getCpuQcompPtr(coeffs.data());

	    qindex numIts = outQureg.numAmpsPerNode;

	    // use template param to compile-time unroll inner loop below
	    SET_VAR_AT_COMPILE_TIME(int, numQuregs, NumQuregs, inQuregs.size());

	    #pragma omp parallel for if(outQureg.isMultithreaded)
	    for (qindex n=0; n<numIts; n++) {

	        // unrolled when inQuregs.size() <= 5
	        cpu_qcomp amp = getCpuQcomp(0, 0);
	        for (int q=0; q<numQuregs; q++)
	            amp += inFacs[q] * getCpuQcompPtr(inQuregs[q].cpuAmps)[n];

	        // must not modify cpuAmps[n] before computing the amp since
	        // outQureg can legally appear among inQuregs
	        outAmps[n] = amp;
	    }
	}

	template <int NumCtrls>
	void cpu_statevec_anyCtrlSwap_subA(Qureg qureg, ConstList64 ctrls, ConstList64 ctrlStates, int targ1, int targ2) {

	    assert_numCtrlsMatchesNumCtrlStatesAndTemplateParam(ctrls.size(), ctrlStates.size(), NumCtrls);

	    // use cpu_qcomp (in lieu of qcomp) even though no arithmetic happens below - just for consistency!
	    cpu_qcomp* amps = getCpuQcompPtr(qureg.cpuAmps);

	    // each control qubit halves the number of iterations, each of which modifies 2 amplitudes, and skips 2
	    qindex numIts = qureg.numAmpsPerNode / powerOf2(2 + ctrls.size());

	    auto sortedQubits = util_getSorted(ctrls, {targ2, targ1});
	    auto qubitStateMask = util_getBitMask(ctrls, ctrlStates, {targ2, targ1}, {0, 1});

	    // use template param to compile-time unroll loop in insertBits()
	    SET_VAR_AT_COMPILE_TIME(int, numCtrlBits, NumCtrls, ctrls.size());
	    int numQubitBits = numCtrlBits + 2;

	    #pragma omp parallel for if(qureg.isMultithreaded)
	    for (qindex n=0; n<numIts; n++) {

	        // i01 = nth local index where ctrls are active, targ2=0 and targ1=1
	        qindex i01 = insertBitsWithMaskedValues(n, sortedQubits.data(), numQubitBits, qubitStateMask);
	        qindex i10 = flipTwoBits(i01, targ2, targ1);

	        std::swap(amps[i01], amps[i10]);
	    }
	}

Improve `SET_VAR_AT_COMPILE_TIME` clarity #802
