Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
275 changes: 258 additions & 17 deletions lectures/numba.md
Original file line number Diff line number Diff line change
Expand Up @@ -436,6 +436,17 @@ For GPU-based parallelization, see our {doc}`lectures on JAX <jax_intro>`.

## Exercises

{ref}`speed_ex1` and {ref}`numba_ex3` both estimate $\pi$ by Monte Carlo from random samples in the unit square.

We generate them here and store them in `u_draws` and `v_draws` so that we can use them in both exercises and compare results

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pedagogical placement + nit. Generating u_draws/v_draws here, at the top of the Exercises section before speed_ex1 is even posed, partly gives away the approach the student is asked to discover. Consider introducing the shared draws closer to where they're first consumed (or framing them as "we'll reuse these in the solutions").

Also a missing period at the end of this line: "...compare results**.**"


```{code-cell} ipython3
n = 1_000_000
rng = np.random.default_rng()
u_draws = rng.uniform(size=n)
v_draws = rng.uniform(size=n)
```

```{exercise}
:label: speed_ex1

Expand All @@ -455,10 +466,11 @@ Here is one solution:

```{code-cell} ipython3
@jit
def calculate_pi(n=1_000_000):
def calculate_pi(u_draws, v_draws):
n = len(u_draws)
count = 0
for i in range(n):
u, v = np.random.uniform(0, 1), np.random.uniform(0, 1)
u, v = u_draws[i], v_draws[i]
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
Expand All @@ -471,19 +483,61 @@ Now let's see how fast it runs:

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

If we switch off JIT compilation by removing `@jit`, the code takes around

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coded timing number is likely stale now. The old calculate_pi drew randoms inside the loop (np.random.uniform(...) calls); the new version just indexes into pre-drawn arrays. Removing the RNG calls from the loop body changes the no-JIT Python baseline substantially, so the "~150×" ratio this sentence quotes — and the "2 orders of magnitude" claim two lines down — were measured against the old code and are probably no longer accurate.

More broadly, hard-coding exact multipliers into the lecture is fragile: they drift silently on every NumPy/Numba/hardware bump. Recommend re-measuring under the new code and softening to qualitative language, e.g. "more than two orders of magnitude faster on our machine" / "dramatically faster", rather than a precise figure.

150 times as long on our machine.

So we get a speed gain of 2 orders of magnitude by adding four characters.

The solution above takes one of two natural approaches: it *draws all the
random points first*, stores them in `u_draws` and `v_draws`, and then lets the
jitted function loop over them.

The other approach is to *draw each point inside the loop*.

To do this with a NumPy `Generator`, we pass `rng` in as an argument and call `rng.uniform()` inside the loop body

```{code-cell} ipython3
@jit
def calculate_pi_in_loop(rng, n):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name reuse is fragile/confusing. calculate_pi_in_loop is defined here as serial @jit, then redefined as @jit(parallel=True) in numba_ex_race. Likewise calculate_pi is serial in speed_ex1 and redefined parallel in numba_ex3. A reader jumping between dropdown solutions can't tell which version is live, and numba_ex_draw_speed (below) silently depends on the redefinition order. Consider distinct names (calculate_pi_parallel, calculate_pi_in_loop_parallel, ...).

count = 0
for i in range(n):
u, v = rng.uniform(), rng.uniform()
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
return (count / n) * 4
```

```{code-cell} ipython3
with qe.Timer():
calculate_pi_in_loop(rng, n)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfair timing comparison. calculate_pi(u_draws, v_draws) above is timed twice (so the 2nd run excludes JIT compilation), but calculate_pi_in_loop is timed once — so its single measurement includes compilation overhead. The prose then claims the two "run at similar speed," which this methodology can't reliably support. Add an untimed warm-up call first, or time it twice like the other.

```

In this serial setting the two approaches give equally good estimates and run at
similar speed, but they are not equivalent in *memory use*.

The first approach
must hold all $2n$ draws in memory at once --- two arrays of `n` floating point
numbers, or about `16n` bytes (around $1.6$ GB when `n = 100_000_000`).

The
second draws each point on demand and discards it, so its memory footprint does
not grow with `n`.

This might suggest that drawing inside the loop is the better default.

But as we

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting. There's a stray leading space before "But as we" (renders oddly), and the surrounding sentences are split across paragraph boundaries so paragraphs begin with a lone "The" ("The first approach / must hold...", "The / second draws..."). Please reflow these into normal paragraphs without the mid-sentence hard breaks.

will see in {ref}`numba_ex_race`, drawing inside the loop interacts
badly with parallelization.

```{solution-end}
```

Expand Down Expand Up @@ -550,10 +604,13 @@ p, q = 0.1, 0.2 # Prob of leaving low and high state respectively
Here's a pure Python version of the function

```{code-cell} ipython3
def compute_series(n):
n = 1_000_000
rng = np.random.default_rng()
U = rng.uniform(0, 1, size=n)

def compute_series(n, U):
x = np.empty(n, dtype=np.int64)
x[0] = 1 # Start in state 1
U = np.random.uniform(0, 1, size=n)
for t in range(1, n):
current_x = x[t-1]
if current_x == 0:
Comment thread
HumphreyYang marked this conversation as resolved.
Expand All @@ -567,8 +624,7 @@ Let's run this code and check that the fraction of time spent in the low
state is about 0.666

```{code-cell} ipython3
n = 1_000_000
x = compute_series(n)
x = compute_series(n, U)
print(np.mean(x == 0)) # Fraction of time x is in state 0
```

Expand All @@ -578,7 +634,7 @@ Now let's time it:

```{code-cell} ipython3
with qe.Timer():
compute_series(n)
compute_series(n, U)
```

Next let's implement a Numba version, which is easy
Expand All @@ -590,15 +646,15 @@ compute_series_numba = jit(compute_series)
Let's check we still get the right numbers

```{code-cell} ipython3
x = compute_series_numba(n)
x = compute_series_numba(n, U)
print(np.mean(x == 0))
```

Let's see the time

```{code-cell} ipython3
with qe.Timer():
compute_series_numba(n)
compute_series_numba(n, U)
```

This is a nice speed improvement for one line of code!
Expand Down Expand Up @@ -637,10 +693,11 @@ Here is one solution:

```{code-cell} ipython3
@jit(parallel=True)
def calculate_pi(n=1_000_000):
def calculate_pi(u_draws, v_draws):
n = len(u_draws)
count = 0
for i in prange(n):
u, v = np.random.uniform(0, 1), np.random.uniform(0, 1)
u, v = u_draws[i], v_draws[i]
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
Expand All @@ -653,12 +710,12 @@ Now let's see how fast it runs:

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

```{code-cell} ipython3
with qe.Timer():
calculate_pi()
calculate_pi(u_draws, v_draws)
```

By switching parallelization on and off (selecting `True` or
Expand All @@ -671,6 +728,187 @@ a factor of 2 or 3.
(If you are executing locally, you will get different numbers, depending mainly
on the number of CPUs on your machine.)

Notice that we drew all of the random points *before* the loop and passed them in

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same hard-coded-number concern as in speed_ex1: the "a factor of 2 or 3" speed-up quoted just above (line ~726) is a precise figure baked into the text. It's heavily machine/thread-dependent and easy to leave stale. Suggest qualitative phrasing ("a modest speed-up, depending on your core count").

as arrays, so the parallel loop only *reads* from memory.

Drawing the points *inside* the parallel loop instead is surprisingly delicate.


We investigate why, and how to do it safely, in
{ref}`numba_ex_race`.

```{solution-end}
```


```{exercise}
:label: numba_ex_race

In {ref}`numba_ex3` we drew all of the random points *before* the parallel loop.

It is tempting to instead draw each point *inside* the `prange` loop, by passing a generator `rng` in as an argument and calling `rng.uniform()` in the loop body.

Try it: the code should run and return a number close to $\pi$, yet there is a subtle bug in this approach.

Investigate as follows:

1. Call your function a few times with the *same* seed and check whether the result is reproducible.
2. Repeat the estimate many times across a range of sample sizes and compare its spread against a correct parallel version.

Then explain what goes wrong and give a correct way to draw inside a parallel loop.

Hint: try to use a legacy random function such as `np.random.uniform()` instead of a `Generator` and see what happens.
```

```{solution-start} numba_ex_race
:class: dropdown
```

Here is the tempting version.

We pass `rng` in as an argument and call it inside the `prange` loop.

```{code-cell} ipython3
n = 1_000_000
rng = np.random.default_rng()

@jit(parallel=True)
def calculate_pi_in_loop(rng, n):
count = 0
for i in prange(n):
u, v = rng.uniform(), rng.uniform()
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
return (count / n) * 4

calculate_pi_in_loop(rng, n)
```

The code runs without error and returns something close to $\pi$.

But something is silently wrong with the results.

Here, every thread draws from the *same* generator `rng`.

A generator produces each number by updating an internal state.

Under `prange`, many threads read and update that single state at once, with no coordination between them.

This is a [**data race**](https://docs.oracle.com/cd/E19205-01/820-0619/geojs/index.html).

It creates correlations across the draws and can even cause some draws to be duplicated in an
unpredictable way.

Two symptoms reveal the problem.

*Symptom 1: the result is no longer reproducible.*

A correct generator returns the same answer whenever it is given the same seed.

Because of the data race, the order in which threads happen to touch the shared state affects the stream of draws, so the answer is not reproducible even when the seed is fixed.

```{code-cell} ipython3
for seed in (1, 1, 1):
print(calculate_pi_in_loop(np.random.default_rng(seed), n))
```

Each call uses the same seed, yet the answers differ.

*Symptom 2: the estimator is far noisier than it should be.*

The duplicated and correlated draws carry less information than $n$ independent draws, so the *effective* sample size is much smaller than $n$.

The fix is to give each thread its own random state, which NumPy's legacy functions such as `np.random.uniform()` do automatically under Numba.

```{code-cell} ipython3
@jit(parallel=True)
def calculate_pi_legacy(n):
count = 0
for i in prange(n):
u, v = np.random.uniform(0, 1), np.random.uniform(0, 1)
d = np.sqrt((u - 0.5)**2 + (v - 0.5)**2)
if d < 0.5:
count += 1
return (count / n) * 4
```

To see the cost of the race, we repeat each estimate many times and plot its spread against the correct version as the sample size grows.

```{code-cell} ipython3
sample_sizes = np.logspace(3, 6, 10).astype(int)
num_reps = 20

methods = [("per-thread state (correct)",
lambda n: calculate_pi_legacy(n), 'C0'),
("shared generator in prange (data race)",
lambda n: calculate_pi_in_loop(np.random.default_rng(), n), 'C1')]

fig, ax = plt.subplots()
for label, estimate, color in methods:
draws = np.array([[estimate(int(m)) for _ in range(num_reps)]
for m in sample_sizes])
means, stds = draws.mean(axis=1), draws.std(axis=1)
ax.plot(sample_sizes, means, color=color, marker='o', ms=3, label=label)
ax.fill_between(sample_sizes, means - 2 * stds, means + 2 * stds,
color=color, alpha=0.2)
ax.axhline(np.pi, color='k', lw=0.8, ls='--', label=r'$\pi$')
ax.set_xscale('log')
ax.set_xlabel('number of samples')
ax.set_ylabel(r'estimate of $\pi$')
ax.legend()
plt.show()
```

Both bands are centered on $\pi$, but the band associated with the data race is much wider than the other one and narrows very slowly as the sample size grows.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two things on this data-race figure:

  1. Non-reproducible published output. This exercise intentionally produces different numbers/figure on every execution. QuantEcon runs notebooks in CI and publishes the rendered HTML, so this cell's output and the plot will change on every build. Worth an explicit maintainer decision (@mmcky / @HumphreyYang) — accept as a deliberate exception, or keep the race in prose while stabilizing the figure.
  2. Wording may overstate the effect. I reproduced the race on numba 0.61 / numpy 2.1 (8 threads): same-seed results do differ ✅, but the race std was only ~1.7× the legacy std at n=1e5, and the legacy std was already near the theoretical ideal. "much wider ... narrows very slowly" may be stronger than the rendered figure shows; the magnitude is thread-count-dependent. Suggest checking the figure on the build hardware and softening if needed.


The other safe option is the one from {ref}`numba_ex3`: draw the points before the loop so that the parallel loop only reads from memory.

```{solution-end}
```


```{exercise}
:label: numba_ex_draw_speed

We now have two correct ways to estimate $\pi$ in parallel.

One draws all the points *before* the loop, as in {ref}`numba_ex3`.

The other draws them *inside* the loop with legacy functions, as in {ref}`numba_ex_race`.

Compare their speed at `n = 100_000_000`, including the time spent generating the random points.
```

```{solution-start} numba_ex_draw_speed
:class: dropdown
```

We time each approach from start to finish, so the pre-draw version pays for building its arrays.

```{code-cell} ipython3
n = 100_000_000
rng = np.random.default_rng()

with qe.Timer():
u_draws = rng.uniform(size=n)
v_draws = rng.uniform(size=n)
calculate_pi(u_draws, v_draws)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution calls calculate_pi(u_draws, v_draws) expecting the parallel version from numba_ex3 — but that only holds because calculate_pi was redefined further up (see the naming note in speed_ex1). The comparison is correct only by top-to-bottom execution order. Renaming the parallel function would make this robust and self-documenting.

```

```{code-cell} ipython3
with qe.Timer():
calculate_pi_legacy(n)
```

Drawing inside the loop is much faster.

The pre-draw version generates its two arrays on a single thread before the loop begins.

The in-loop version spreads the random number generation across all threads instead.

It also avoids allocating two arrays of `n` numbers, so it saves both time and memory.

```{solution-end}
```

Expand All @@ -694,8 +932,8 @@ where

1. $\beta$ is a discount factor,
2. $n$ is the expiry date,
2. $K$ is the strike price and
3. $\{S_t\}$ is the price of the underlying asset at each time $t$.
3. $K$ is the strike price and
4. $\{S_t\}$ is the price of the underlying asset at each time $t$.

Suppose that `n, β, K = 20, 0.99, 100`.

Expand Down Expand Up @@ -750,6 +988,9 @@ $$

Using this fact, the solution can be written as follows.

Note that random draws are kept inside the inner loop rather than pre-allocated,
to avoid creating large shock arrays of size `M * n`.

Comment thread
HumphreyYang marked this conversation as resolved.

```{code-cell} ipython3
M = 10_000_000
Expand Down
Loading