Notes

Sample size calculation app

I'm currently working on a project that's looking at sampling for detection of infected units. Specifically, the case where we don't care about identifying which units are infected; we just don't want any!

The standard way of calculating sample size for this is to assume that the batch/lot you're inspecting contains a minimum number of infected units. We also can't test every unit, because the sampling is destructive, e.g. via PCR or ELISA.

This is a pretty easy problem. If we have a batch of size $n$ that contains $m$ infected units and take a sample (without replacement) of size $k$, then the number of infected units in the sample, $X$, has a hypergeometric distribution

\begin{align} X & \sim \text{Hypergeometric}(n, m, k) \\
\Pr(X = x) & = \frac{ {m \choose x} {n-m \choose k-x} }{ {n \choose k} } \end{align}

for $x \in \{\max(0, k-(n-m)), \ldots, \min(k, m)\}$.

Now, all we care about in detecting infected units is that we detect at least one of them, if any are present in the batch:

\begin{align} \Pr(X \geq 1) & = 1 - \Pr(X = 0) \\
& = 1 - \frac{ {n-m \choose k} }{ {n \choose k} }. \end{align}
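As a quick sanity check on the detection probability, here is a minimal Python sketch. It assumes, as in the text, that $n$ is the batch size, $m$ the number of infected units in the batch and $k$ the sample size, and that every infected unit in the sample is detected. The function name `detection_probability` is my own, not from any library.

```python
from math import comb


def detection_probability(n: int, m: int, k: int) -> float:
    """Probability that a sample of k units, drawn without replacement from a
    batch of n units containing m infected units, has at least one infected unit."""
    if not (0 <= m <= n and 0 <= k <= n):
        raise ValueError("need 0 <= m <= n and 0 <= k <= n")
    # Pr(X = 0): all k sampled units come from the n - m uninfected ones.
    # comb(n - m, k) is 0 when k > n - m, so detection is then certain.
    p_miss = comb(n - m, k) / comb(n, k)
    return 1 - p_miss
```

For example, with one infected unit in a batch of 10, sampling half the batch gives a 50% chance of detection.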

To find the sample size we need to inspect, we look for the smallest $k$ at which the detection probability $\Pr(X \geq 1)$ reaches the confidence ($1 - \alpha$) we want, given the batch/lot size $n$ and the number of infected units in the batch/lot, $m$.
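There is no tidy closed form for $k$, so one simple approach (a sketch, not necessarily what the app does) is to step through sample sizes until the detection probability reaches the target. Here $n$ is the batch size and $m$ the number of infected units, as in the text; the name `required_sample_size` and the linear search are my own choices for illustration.

```python
from math import comb


def required_sample_size(n: int, m: int, confidence: float) -> int:
    """Smallest sample size k such that the probability of seeing at least one
    of the m infected units in a batch of n is >= confidence."""
    if not (1 <= m <= n):
        raise ValueError("need 1 <= m <= n")
    if not (0 < confidence < 1):
        raise ValueError("confidence must be strictly between 0 and 1")
    for k in range(1, n + 1):
        # Probability that the sample misses every infected unit.
        p_miss = comb(n - m, k) / comb(n, k)
        if 1 - p_miss >= confidence:
            return k
    # Not reached for m >= 1: once k > n - m the miss probability is 0.
    return n
```

A linear search is fine for batch sizes in the thousands; for very large batches a bisection over $k$ would do the same job, since the detection probability is non-decreasing in $k$.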

Rather than the number of infected units in the batch/lot, $m$, it is common to specify a design prevalence ($p$) instead. This is the proportion of infected units in the batch. This poses a problem in the above specification: $pn$ is generally not an integer, and we can't have a non-integer number of infected units in the batch! OK then, we can just round the number of infected units down. In the case where our chosen design prevalence results in the number of infected units in the batch being 0, we should set $m=1$ so that we sample something! Thus, in terms of the design prevalence $p$, we have:

\begin{align*} m_{p} & = \max \left\{1, \left\lfloor p n \right\rfloor \right\} \\
p_{a} & = m_{p}/n \end{align*}

where $m_{p}$ is the number of infected units assumed under design prevalence $p$, and $p_{a}$ is the apparent prevalence: the prevalence the calculation actually uses once $m_{p}$ has been forced to a whole number of at least 1.
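The conversion from design prevalence to a whole number of infected units can be sketched in Python as below. It rounds $pn$ down and then takes the larger of that and 1, so that a very small prevalence still gives one infected unit, as described above. The function name `design_infected` is hypothetical, and going through `Fraction(str(p))` is my own workaround for binary floating point (e.g. `0.29 * 100` evaluates to just under 29).

```python
from fractions import Fraction
from math import floor


def design_infected(n: int, p: float) -> tuple[int, float]:
    """Return (m_p, p_a): the assumed number of infected units in a batch of
    size n at design prevalence p, and the resulting apparent prevalence."""
    if n < 1:
        raise ValueError("batch size n must be at least 1")
    if not (0 <= p <= 1):
        raise ValueError("design prevalence p must be in [0, 1]")
    # Use the decimal value as written, so 0.29 * 100 floors to 29, not 28.
    exact_p = Fraction(str(p))
    # Round down, but never below one infected unit.
    m_p = max(1, floor(exact_p * n))
    p_a = m_p / n
    return m_p, p_a
```

For a batch of 100 at a design prevalence of 0.1%, this gives one infected unit and an apparent prevalence of 1%, which shows how far $p_{a}$ can drift from $p$ in small batches.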

This is all straightforward. I whipped up the following shiny app to demonstrate:

Embedding shiny

  • during this write-up, I was looking up how to embed a shiny app
    • pretty simple by using an iframe:
<iframe src="link.to.hosted.shiny.app" style="border: none; width: 100%; height: 700px"></iframe>