An Interactive Tool for Understanding Experimental Design, Statistical Power, and Classification Metrics
Total Subjects: 200
Total Cost: $20,000
Confidence Level: 95%
Expected Error: ±13.9%
A well-designed experiment requires more than just testing whether sunblock prevents sunburn. We need to establish baselines and control for confounding factors. The simplest rigorous design uses four groups:
Sun + Sunblock
This is the primary test: Does the sunblock actually protect against sunburn when exposed to the sun?
Sun + No Sunblock
Confirms that the sun exposure is sufficient to cause burns. This establishes a baseline for expected damage and validates our experimental conditions.
No Sun + No Sunblock
Confirms that subjects don't spontaneously develop burns without sun exposure. This establishes the natural baseline state.
No Sun + Sunblock
Tests whether the sunblock itself causes any adverse skin reactions (redness, irritation) in the absence of sun. Critical for product safety and quality control!
Each control group serves a specific quality-control purpose.
By introducing a "Half Sun" exposure level, we add two more groups, bringing the total to six. This expansion serves important quality control functions.
The six groups now include all combinations of {No Sun, Half Sun, Full Sun} × {No Sunblock, Full Sunblock}.
Adding "Half Dose Sunblock" creates our complete nine-group design, providing comprehensive quality control coverage.
Factorial Design:
A k×m factorial design has k levels of factor A and m levels of factor B, creating k × m total groups.
For our 3×3 design: 3 sun levels × 3 sunblock levels = 9 groups
Total subjects needed = n × (k × m), where n is the sample size per group
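The bookkeeping above can be sketched in a few lines. The $100 per-subject cost is an assumption inferred from the headline figures (200 subjects, $20,000); everything else follows the 3×3 design directly.

```python
# Subject and cost totals for the 3x3 factorial design. The $100
# per-subject cost is an assumption inferred from the document's
# headline figures (200 subjects at $20,000 total).
SUN_LEVELS = ("No Sun", "Half Sun", "Full Sun")              # k = 3
SUNBLOCK_LEVELS = ("No Sunblock", "Half Dose", "Full Dose")  # m = 3

def factorial_totals(n_per_group, cost_per_subject=100):
    groups = len(SUN_LEVELS) * len(SUNBLOCK_LEVELS)   # k * m = 9
    subjects = n_per_group * groups                   # n * (k * m)
    return groups, subjects, subjects * cost_per_subject

groups, subjects, cost = factorial_totals(n_per_group=50)
print(groups, subjects, cost)  # 9 450 45000
```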
Main Effects:
$$\text{Effect of Sunblock} = \bar{Y}_{\text{with sunblock}} - \bar{Y}_{\text{without sunblock}}$$
$$\text{Effect of Sun} = \bar{Y}_{\text{with sun}} - \bar{Y}_{\text{without sun}}$$
where \(\bar{Y}\) represents the mean burn rate across all relevant groups.
Interaction Effects:
An interaction exists when the effect of one factor depends on the level of another factor.
$$\text{Interaction} = (\bar{Y}_{11} - \bar{Y}_{10}) - (\bar{Y}_{01} - \bar{Y}_{00})$$
If this value is significantly different from zero, the factors interact. For example, sunblock might be highly effective under full sun but provide minimal benefit under half sun.
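The interaction formula can be checked with a toy calculation. The burn rates below are invented for illustration, not measured data.

```python
# Hypothetical mean burn rates for a 2x2 corner of the design
# (first subscript: sun, second: sunblock). Illustrative numbers only.
y11 = 0.10  # full sun, with sunblock
y10 = 0.90  # full sun, no sunblock
y01 = 0.02  # no sun, with sunblock
y00 = 0.01  # no sun, no sunblock

effect_in_sun = y11 - y10   # sunblock effect under full sun: large
effect_no_sun = y01 - y00   # sunblock effect with no sun: negligible
interaction = effect_in_sun - effect_no_sun
print(round(interaction, 2))  # -0.81: the sunblock's effect depends on sun level
```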
If the difference in burn rates between groups is substantially larger than the uncertainty of our estimates, we can draw a firm conclusion. With n=50 per group and 95% confidence, our margin of error is ±13.9%, so any difference well beyond that range lets us confidently conclude the sunblock is effective.
In diagnostic testing and experimental validation, we classify outcomes into four categories based on what we expect versus what actually happens.
Definition: We expected a positive outcome, and we observed a positive outcome.
In sunblock testing: We expected a burn (sufficient sun exposure without adequate protection), and a burn occurred.
Interpretation: This confirms our prediction was correct—either no sunblock was applied, or the protection was insufficient to prevent the burn.
Example: A subject exposed to full sun with no sunblock gets burned. We predicted this would happen, and it did.
Formula Context: TP appears in the numerator of sensitivity, showing how many expected burns actually occurred.
Definition: We expected a negative outcome (no event), and we observed no event.
In sunblock testing: We expected no burn (either no sun exposure or adequate protection), and no burn occurred.
Interpretation: This confirms our prediction was correct. Either the sunblock worked, or there was no threatening exposure.
Example: A subject with no sun exposure and no sunblock has healthy skin. We predicted no burn, and there was none.
Formula Context: TN appears in the numerator of specificity, showing correct identification of non-burn cases.
Definition: We expected a negative outcome, but we observed a positive outcome.
In sunblock testing: We expected no burn, but a burn occurred anyway.
Interpretation: Something unexpected happened. This might indicate an adverse reaction to the sunblock itself, or contamination in the experimental setup.
Example: A subject with no sun exposure but full sunblock develops redness. This is alarming—the sunblock itself might be causing irritation!
Formula Context: FP appears in the denominator of specificity. High FP reduces specificity, indicating poor safety.
Definition: We expected a positive outcome, but we observed a negative outcome.
In sunblock testing: We expected a burn, but no burn occurred.
Interpretation: The intervention worked! The sunblock successfully prevented a burn that we expected would happen.
Example: A subject exposed to full sun with full sunblock remains burn-free. This is the desired outcome—the sunblock protected them.
Note: In medical testing, "false negative" usually means a bad outcome (missing a disease). Here, it's actually good—we "missed" predicting a burn because the sunblock worked!
Formula Context: FN appears in the denominator of sensitivity. High FN means low sensitivity (good for sunblock—it prevented many expected burns).
We organize these four outcomes into a 2×2 confusion matrix, with the expected outcome (burn / no burn) on one axis and the observed outcome on the other: TP and TN sit on the diagonal (correct predictions), while FP and FN are the off-diagonal surprises.
Sensitivity (Recall / True Positive Rate):
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
What it measures: Of all the cases where we expected a positive outcome (burn), what fraction actually had a positive outcome?
In sunblock testing: Of all exposures expected to cause burns, what percentage actually resulted in burns? Low sensitivity is good—it means the sunblock is preventing most expected burns (high FN count).
Range: 0 to 1 (or 0% to 100%)
Example: If sensitivity = 0.05 (5%), then only 5% of expected burns actually occur. The sunblock prevents 95% of expected burns.
Why it matters: Sensitivity tells us how "leaky" our protection is. In medical diagnostics, high sensitivity means we catch most cases. In sunblock, low sensitivity paradoxically means good protection—we're preventing most burns that would have occurred.
Specificity (True Negative Rate):
$$\text{Specificity} = \frac{TN}{TN + FP}$$
What it measures: Of all the cases where we expected a negative outcome (no burn), what fraction actually had no burn?
In sunblock testing: Of all cases where we didn't expect burns (no sun or adequate protection), what percentage correctly had no burns? High specificity means the sunblock doesn't cause problems—it doesn't create burns on its own.
Range: 0 to 1 (or 0% to 100%)
Example: If specificity = 0.99 (99%), then 99% of no-burn-expected cases correctly had no burns. Only 1% had unexpected burns (likely adverse reactions).
Why it matters: Specificity measures safety and precision. High specificity means few false alarms—the sunblock doesn't cause unexpected problems. In medical testing, high specificity means few healthy people are incorrectly diagnosed as sick.
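A quick sketch ties both metrics back to the four counts. The counts are invented, chosen to reproduce the 5% sensitivity and 99% specificity examples above.

```python
# Illustrative counts; "positive" means a burn was expected.
TP, FN = 3, 57    # of 60 expected burns, only 3 occurred (57 prevented)
TN, FP = 99, 1    # of 100 no-burn cases, 99 stayed clear (1 surprise burn)

sensitivity = TP / (TP + FN)   # low is good here: protection is not leaky
specificity = TN / (TN + FP)   # high is good: no problems caused by product
print(sensitivity, specificity)  # 0.05 0.99
```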
There's often a tradeoff between sensitivity and specificity. In sunblock testing:
A Receiver Operating Characteristic (ROC) curve plots sensitivity (True Positive Rate) on the y-axis versus 1-specificity (False Positive Rate) on the x-axis as we vary a decision threshold.
What is a threshold? For sunblock, the threshold might be "minimum SPF effectiveness"—the cutoff above which we predict a burn. If we set a high threshold (predict burns only for the most severe exposures), fewer cases are classified as positive: sensitivity falls while specificity rises.
If we set a low threshold (predict burns readily), more cases are classified as positive: sensitivity rises while specificity falls.
AUC Interpretation:
$$0.5 \leq \text{AUC} \leq 1.0$$
An AUC of 0.5 corresponds to random guessing, while 1.0 indicates perfect discrimination. (Values below 0.5 are possible and mean the classifier performs worse than chance—usually a sign its predictions are inverted.)
In sunblock testing: An AUC above 0.85 typically indicates the sunblock provides meaningful protection across various exposure levels and dosages.
Mathematical meaning: AUC represents the probability that a randomly chosen positive case (expected burn) ranks higher than a randomly chosen negative case (no burn expected) according to our scoring function.
Key Insight: The ROC curve and AUC summarize performance across all possible decision thresholds. A single sensitivity/specificity pair tells you performance at one threshold. The AUC tells you overall quality—how well the sunblock performs regardless of how strictly you define "adequate protection."
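The probabilistic reading of AUC—a randomly chosen positive outranks a randomly chosen negative—can be computed directly by comparing every positive/negative pair. The scores below are invented for illustration.

```python
# AUC as the fraction of positive/negative pairs ranked correctly,
# with ties counted as half a win. Scores are illustrative, not real data.
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

burn_scores = [0.9, 0.8, 0.75, 0.6]    # cases that actually burned
clear_scores = [0.7, 0.4, 0.3, 0.2]    # cases that stayed burn-free
print(auc(burn_scores, clear_scores))  # 0.9375
```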
Relationship to Precision and Recall:
Sensitivity is the same as Recall in machine learning contexts:
$$\text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN}$$
Precision (Positive Predictive Value) is different:
$$\text{Precision} = \frac{TP}{TP + FP}$$
Precision asks: "Of all the cases we predicted as positive, how many were actually positive?"
F1 Score (Harmonic Mean):
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The F1 score balances precision and recall, useful when you need a single metric that captures both.
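With the same style of counts, precision, recall, and F1 fall out in a few lines. The counts below are illustrative.

```python
# Illustrative counts for precision, recall, and F1.
TP, FP, FN = 40, 10, 20

precision = TP / (TP + FP)   # of predicted positives, fraction real: 0.8
recall = TP / (TP + FN)      # of real positives, fraction caught: ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.727: the harmonic mean sits nearer the weaker metric
```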
Small samples can be highly misleading. Imagine testing sunblock on just 5 people. Even if it works perfectly, you might see 1-2 burns by pure chance (perhaps they had pre-existing skin sensitivity). Conversely, a useless product might appear effective in a small trial just by luck.
Statistical power is the probability of detecting a true effect when one exists. Larger samples increase power, but with diminishing returns. The formulas below account for finite population correction when your sample is a substantial fraction of the total population.
Standard Error (Infinite Population):
$$SE = \sqrt{\frac{p(1-p)}{n}}$$
where \(p\) is the true population proportion (here, the burn rate) and \(n\) is the sample size.
Interpretation: The standard error measures how much sample proportions vary from the true population proportion. Larger samples have smaller standard errors—they give more precise estimates.
Why p=0.5 maximizes variance: The variance of a proportion is p(1-p), which is maximized when p=0.5. This gives us the most conservative (largest) standard error estimate.
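A short scan over p confirms the worst-case claim numerically: for fixed n, the standard error peaks at p = 0.5.

```python
import math

# SE of a sample proportion (infinite-population form); p = 0.5 is the
# conservative worst case because p*(1-p) is largest there.
def standard_error(p, n):
    return math.sqrt(p * (1 - p) / n)

n = 50
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(standard_error(p, n), 4))  # peaks at p = 0.5 (~0.0707)
```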
Finite Population Correction (FPC):
$$\text{FPC} = \sqrt{\frac{N-n}{N-1}}$$
$$SE_{\text{corrected}} = SE \times \text{FPC}$$
where \(N\) is the total population size and \(n\) is the sample size.
When to use: When the sampling fraction \(n/N\) exceeds 5%, this correction becomes important. It reduces the standard error because we're sampling a significant fraction of the entire population.
Why it matters: If you're testing 100 people from a population of 200, your estimates are much more precise than if you're testing 100 from a population of 100,000. The FPC accounts for this.
Limiting case: If n=N (you sample the entire population), FPC=0, and SE=0. You have perfect information with no sampling error.
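The correction and its limiting case are easy to verify numerically; the scenarios below (n = 100 from N = 200 versus N = 100,000, and the n = N extreme) mirror the examples above.

```python
import math

# Finite population correction applied to the proportion SE.
def corrected_se(p, n, N):
    se = math.sqrt(p * (1 - p) / n)
    fpc = math.sqrt((N - n) / (N - 1))
    return se * fpc

print(round(corrected_se(0.5, 100, 200), 4))      # sampling 50%: SE shrinks a lot
print(round(corrected_se(0.5, 100, 100_000), 4))  # sampling 0.1%: barely changes
print(corrected_se(0.5, 200, 200))                # n = N: exactly 0.0
```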
Margin of Error (MOE):
$$\text{MOE} = z_{\alpha/2} \times SE_{\text{corrected}}$$
where \(z_{\alpha/2}\) is the critical value from the standard normal distribution: 1.645 for 90% confidence, 1.96 for 95%, and 2.576 for 99%.
Interpretation: We're 95% confident (if using 95% confidence level) that the true population value falls within our sample estimate ± MOE.
Example: If we measure a 30% burn rate with MOE = ±7%, we're 95% confident the true burn rate is between 23% and 37%.
Why α/2: We split α (e.g., 0.05 for 95% confidence) between the two tails of the distribution, so each tail has α/2 = 0.025.
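Plugging in the document's own numbers (p = 0.5, n = 50, z = 1.96) reproduces the ±13.9% quoted throughout:

```python
import math

Z_95 = 1.96  # critical value for 95% confidence

def margin_of_error(p, n, z=Z_95):
    return z * math.sqrt(p * (1 - p) / n)

moe = margin_of_error(0.5, 50)
print(round(100 * moe, 1))  # 13.9 (percent), matching the header figure
```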
Statistical Power (Simplified):
$$\text{Power} = 1 - \beta$$
where \(\beta\) is the probability of Type II error (failing to detect a real effect).
For proportion tests, power can be approximated as:
$$\text{Power} \approx \Phi\left(\frac{|\text{Effect Size}| - z_{\alpha/2} \cdot SE}{SE}\right)$$
where \(\Phi\) is the standard normal cumulative distribution function, the effect size is the true difference we hope to detect, and \(SE\) is the standard error of the estimated effect.
Interpretation: Power is the probability that we'll detect an effect if it truly exists. Higher power means we're less likely to miss real effects.
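The approximation above can be evaluated with the standard normal CDF built from math.erf. The effect size and SE below are illustrative choices, roughly matching the n = 50, p = 0.5 setup used elsewhere in the document.

```python
import math

def phi(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def approx_power(effect, se, z_alpha=1.96):
    # Power ~ Phi((|effect| - z_{alpha/2} * SE) / SE), as given above.
    return phi((abs(effect) - z_alpha * se) / se)

# Illustrative: a 20-point effect measured with SE ~ 0.0707 (n = 50, p = 0.5).
print(round(approx_power(0.20, 0.0707), 2))  # roughly 0.8
```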
Factors that increase power: a larger sample size, a larger true effect size, lower outcome variance, and a less stringent significance level (larger \(\alpha\)).
Typical design targets—80% power to detect a 20% effect, 90% power to detect a 15% effect, or detecting small effects of 5-10%—demand progressively larger samples per group.
Doubling to n=100 would cost $90,000 but reduce MOE to ±10% and increase power to ~95%.
Sample Size Formula (Solving for n):
To achieve desired power for detecting a specific effect size:
$$n = \frac{(z_{\alpha/2} + z_\beta)^2 \cdot [p_1(1-p_1) + p_2(1-p_2)]}{(p_1 - p_2)^2}$$
where \(p_1\) and \(p_2\) are the expected proportions in the two groups, \(z_{\alpha/2}\) is the critical value for the chosen confidence level, and \(z_\beta\) is the critical value for the desired power (0.84 for 80% power, 1.28 for 90%).
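The formula is a one-liner in code. The defaults below assume 95% confidence (z = 1.96) and 80% power (z_beta = 0.84); the 50%-to-30% burn-rate scenario is an illustrative choice.

```python
import math

# Per-group sample size from the two-proportion formula above; defaults
# assume 95% confidence (z = 1.96) and 80% power (z_beta = 0.84).
def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a drop in burn rate from 50% to 30%:
print(n_per_group(0.5, 0.3))  # 91 subjects per group
```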
Here's an elegant mathematical analogy: Finding a real effect in an experiment is like finding where a polynomial function crosses the x-axis (its real roots). The function represents your measurement as you vary experimental conditions, and a "crossing" represents a detectable effect.
Consider a polynomial function of degree 4:
$$f(x) = ax^4 + bx^3 + cx^2 + dx + e$$
where \(a, b, c, d, e\) are coefficients that determine the shape of the curve.
Domain: \(x \in \mathbb{R}\) (all real numbers)
Range: \(f(x) \in \mathbb{R}\)
Roots: Values of \(x\) where \(f(x) = 0\)
Mathematical meaning: A real root is a point \(x_0 \in \mathbb{R}\) where \(f(x_0) = 0\). The graph crosses the x-axis at this point.
Experimental meaning: A real root represents a condition where your intervention makes a measurable, observable difference. The outcome crosses a threshold from "no effect" to "clear effect."
Example: In sunblock testing, a real root might represent the minimum SPF value where burn prevention becomes statistically significant. Below this threshold, burns occur; above it, they don't.
Visualization: On a graph of "burn rate vs. sunblock SPF," a real root is where the curve crosses the "acceptable burn rate" threshold.
Mathematical meaning: A complex root exists in \(\mathbb{C}\) (the complex number plane) but not in \(\mathbb{R}\). It has the form \(z = a + bi\) with \(b \neq 0\), where \(i = \sqrt{-1}\). When all of a polynomial's roots are complex, its graph never crosses the x-axis—it stays entirely above or below it.
Experimental meaning: A complex root represents "no observable effect." Your intervention might have some theoretical influence, but it never manifests as a detectable, measurable change in the real world.
Example: A sunblock with complex roots would never show statistically significant protection, no matter how you adjust the dosage or exposure—it simply doesn't work in observable reality.
Mathematical property: Complex roots of polynomials with real coefficients always come in conjugate pairs: if \(a + bi\) is a root, then \(a - bi\) is also a root.
Mathematical meaning: The function comes very close to zero (\(|f(x)| < \epsilon\) for small \(\epsilon\)) but doesn't quite cross. Mathematically, there might be real roots very close by with slightly different coefficients.
Experimental meaning: You almost detected an effect. With a larger sample size (more precision in measuring the function), you might have caught it. This is the signature of an underpowered experiment.
Example: Your sunblock reduces burns from 50% to 35%, but with n=20, the confidence interval is ±18%, so the result is "not significant." With n=100, you'd detect it clearly—the function would cross zero.
Numerical analysis parallel: Just as numerical root-finding requires sufficient precision (small step size), detecting effects requires sufficient statistical power (large sample size).
Sensitivity to Parameters: Just as changing polynomial coefficients can create or destroy roots, changing experimental parameters (sample size, measurement precision, control of confounds) can determine whether effects become detectable.
Example: Small coefficient changes: \(f(x) = x^4 - 4x^2 + 4 = (x^2 - 2)^2\) touches the x-axis at \(x = \pm\sqrt{2}\), but \(f(x) = x^4 - 4x^2 + 4.1\) has no real roots—just by adding 0.1!
Experimental parallel: Increasing n from 40 to 60 might be the difference between detecting and missing a 15% effect—a small change with big consequences.
Adjust the coefficients below and watch how the number of real roots changes. Notice how small changes in parameters can make effects appear or disappear—just like how small changes in experimental design or sample size can determine whether you detect a real effect.
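For the biquadratic family used in these examples (b = d = 0), the substitution u = x² reduces root-counting to the quadratic formula. A minimal sketch, pure standard library:

```python
import math

# Real roots (counted with multiplicity) of a*x^4 + c*x^2 + e, a != 0,
# via the substitution u = x^2 and the quadratic formula.
def real_root_count(a, c, e):
    disc = c * c - 4 * a * e
    if disc < 0:
        return 0  # both u-roots complex, so no real x at all
    u_roots = ((-c + math.sqrt(disc)) / (2 * a),
               (-c - math.sqrt(disc)) / (2 * a))
    return sum(2 for u in u_roots if u >= 0)  # each u >= 0 gives x = +/-sqrt(u)

print(real_root_count(1, -4, 4.0))  # 4: (x^2 - 2)^2 touches the axis
print(real_root_count(1, -4, 4.1))  # 0: adding 0.1 lifts it clear of the axis
```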
Fundamental Theorem of Algebra:
A polynomial of degree \(n\) has exactly \(n\) roots (counting multiplicity) in the complex number system \(\mathbb{C}\).
$$\deg(f) = n \implies f \text{ has } n \text{ roots in } \mathbb{C}$$
For our degree-4 polynomial: Total roots = 4 (some may be real, some may be complex)
$$\text{Real Roots} + \text{Complex Roots} = 4$$
Important property: Complex roots of polynomials with real coefficients always come in conjugate pairs. So for a degree-4 polynomial with real coefficients, you can have 4 real roots and 0 complex, 2 real and 2 complex (one conjugate pair), or 0 real and 4 complex (two conjugate pairs)—never an odd number of complex roots.
Why conjugate pairs? If \(z = a + bi\) is a root, then \(f(z) = 0\). Taking the complex conjugate: \(\overline{f(z)} = \overline{0} = 0\). Since coefficients are real, \(\overline{f(z)} = f(\overline{z})\), so \(f(a - bi) = 0\). Thus \(\overline{z} = a - bi\) is also a root.
Finding Roots Numerically:
For most polynomials, roots cannot be found by simple algebra—no general formula in radicals exists beyond degree 4, and even the cubic and quartic formulas are unwieldy. In practice, we use numerical methods:
Newton-Raphson Method:
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
Starting from an initial guess \(x_0\), this iteratively refines the estimate until \(|f(x_n)| < \epsilon\) for some tolerance \(\epsilon\).
Derivative for our polynomial:
$$f'(x) = 4ax^3 + 3bx^2 + 2cx + d$$
Experimental parallel: Just as Newton's method requires good initial estimates and sufficient iterations, experiments need good pilot studies to estimate effect sizes and adequate sample sizes to converge on the truth.
Convergence rate: Newton's method has quadratic convergence near simple roots—the number of correct digits roughly doubles each iteration. Sampling improves precision more slowly: the standard error scales as \(1/\sqrt{n}\), so you must quadruple the sample size to halve it.
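A bare-bones Newton iteration for the degree-4 polynomial, using the derivative formula above. This is a sketch, not a robust root-finder: it has no safeguards against f'(x) = 0 or divergence, and the test polynomial with roots 1, 2, 3, 4 is just a convenient example.

```python
# Newton-Raphson for f(x) = a*x^4 + b*x^3 + c*x^2 + d*x + e.
def newton(a, b, c, d, e, x0, tol=1e-10, max_iter=50):
    f  = lambda x: a*x**4 + b*x**3 + c*x**2 + d*x + e
    fp = lambda x: 4*a*x**3 + 3*b*x**2 + 2*c*x + d   # f'(x), as given above
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / fp(x)   # no safeguard: assumes f'(x) stays away from 0
    return x

# x^4 - 10x^3 + 35x^2 - 50x + 24 = (x-1)(x-2)(x-3)(x-4); start near 4.
root = newton(1, -10, 35, -50, 24, x0=4.5)
print(round(root, 6))  # 4.0
```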
Multiple Effects: A polynomial can have multiple real roots, just as an intervention can have multiple detectable effects at different parameter values. Our 3×3 sunblock design reveals effects at different dose/exposure combinations—multiple "crossings" of the effectiveness threshold.
Example: \(f(x) = (x-1)(x-2)(x-3)(x-4) = x^4 - 10x^3 + 35x^2 - 50x + 24\) has four real roots at x=1,2,3,4. In experimental terms, this represents four distinct conditions where an effect is detectable.
Continuous vs. Discrete: The polynomial is continuous (smoothly varying), while our 9-group design samples at discrete points. This is why factorial designs are powerful—they sample the "function" at strategic points to characterize its overall behavior.
Interpolation insight: With measurements at 9 strategically chosen points, we can fit a polynomial model and interpolate between them, predicting effects at untested combinations.
The Reality of Effects: A real root exists whether or not we can compute it accurately. Similarly, a true effect exists whether or not our experiment detects it. Increasing sample size is like increasing the resolution of our root-finding algorithm—we get closer to the truth.
Philosophical point: Just as \(\pi\) has infinite decimal digits but we only compute finitely many, true effects exist with infinite precision, but our experiments only measure them approximately.
You test a sunblock with n=30. You find a 12% reduction in burns, but it's "not significant" (p=0.08). You conclude "no effect."
A competitor tests the same sunblock with n=120. They find a 13% reduction and p=0.002, "highly significant."
What happened? The effect (the "real root") was always there. Your n=30 study had insufficient precision—like trying to find a root with a low-resolution graph. The function came close to zero, but you couldn't definitively say it crossed.
With n=120, the precision increased, and the crossing became clear. The effect didn't change—your ability to detect it did.
Polynomial analogy: It's like having \(f(x) = x^4 - 4x^2 + 3.9\) and trying to tell whether there are real roots near \(x = \pm\sqrt{2}\). The function dips only to \(-0.1\) there, so with coarse numerical precision you might miss the crossings. With fine precision, the four roots at \(x = \pm\sqrt{2 \pm \sqrt{0.1}}\) become visible.
Mathematical lesson: The roots of \(x^4 - 4x^2 = 0\) are easy to find exactly: \(x = 0\) (a double root) and \(x = \pm 2\). But for \(x^4 - 4x^2 + 3.9 = 0\), numerical methods are required, and low precision might incorrectly conclude "no real roots" when four actually exist.
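A sign-change scan makes the precision point concrete for \(f(x) = x^4 - 4x^2 + 3.9\), which dips only 0.1 below the axis near \(x = \pm\sqrt{2}\): a coarse grid sees no crossings at all, while a fine grid finds all four. The step sizes below are arbitrary choices for illustration.

```python
# Count sign changes of f on an evenly spaced grid over [lo, hi].
def sign_changes(f, lo, hi, steps):
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    ys = [f(x) for x in xs]
    return sum(1 for y0, y1 in zip(ys, ys[1:]) if y0 * y1 < 0)

# This quartic dips just 0.1 below the axis near x = +/-sqrt(2).
f = lambda x: x**4 - 4*x**2 + 3.9

print(sign_changes(f, -3, 3, 6))    # 0: step 1.0 sees only positive values
print(sign_changes(f, -3, 3, 600))  # 4: step 0.01 catches all four crossings
```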
This mathematical framework reminds us that experimental design is fundamentally about creating conditions where true effects become visible—where the "function" of our measurements crosses the threshold of detectability with sufficient clarity and confidence.