MCPower limitations and edge cases
MCPower's numbers are only as good as the assumptions and the method behind them. This page collects the places where MCPower is the wrong tool for the question, or where a result deserves a second look before it goes into a grant. None of these are hidden failure modes — each is a documented property of how the tool works, listed here so you can plan around it.
Power estimates carry simulation noise
Every power number is a Monte Carlo estimate, not an exact value. An estimate from \(n_\text{sims}\) simulations carries a standard error of about \(\sqrt{p(1-p)/n_\text{sims}}\), where \(p\) is the true power. At the default 1,600 simulations and 50% power that is roughly 1.25 percentage points, so a 95% interval spans about ±2.5 points. So don't over-read small differences: 78.9% vs 80.2% from separate runs is noise, not a finding. When two designs land close together and the distinction matters, raise the simulation count (see simulation settings).
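The arithmetic behind that rule can be checked by hand. The sketch below is plain Python, not MCPower code, and the function names are made up for illustration; it computes the standard error of one estimate and the z-score of the gap between two independent estimates from equally sized runs.

```python
import math


def mc_standard_error(power: float, n_sims: int) -> float:
    """Standard error of a power estimate from n_sims independent simulations."""
    return math.sqrt(power * (1.0 - power) / n_sims)


def difference_z(p1: float, p2: float, n_sims: int) -> float:
    """z-score of the gap between two independent estimates, each from n_sims runs."""
    se_diff = math.sqrt(
        mc_standard_error(p1, n_sims) ** 2 + mc_standard_error(p2, n_sims) ** 2
    )
    return abs(p1 - p2) / se_diff


# Standard error at 50% power with 1,600 simulations: about 0.0125 (1.25 points).
print(mc_standard_error(0.5, 1600))

# 78.9% vs 80.2% from separate runs: z is below 1, well inside the noise.
print(difference_z(0.789, 0.802, 1600))
```

A z-score below roughly 2 means the two estimates are not distinguishable at that simulation count.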
Results can differ from G*Power — by design
MCPower simulates the predictors anew in every dataset (a random-X, unconditional design), the way data actually arrives in observational and survey research. Analytic calculators such as G*Power instead treat the design matrix as fixed and known in advance (fixed-X), which is natural for fully controlled experiments. The two answer slightly different questions, so their power numbers can legitimately differ — most visibly at small samples. That gap is a framing difference, not an error in either tool.
The tests are large-sample tests (logistic and mixed)
OLS power uses the exact Student-t test, at any sample size. Logistic and mixed-model power use Wald z tests — the standard large-sample approximation — not likelihood-ratio tests or small-sample degree-of-freedom corrections (such as Satterthwaite). At reasonable sample sizes the difference is negligible; with very small samples, or a mixed design with only a handful of clusters, the z approximation runs slightly optimistic. Treat power estimates at those edges with extra caution, and prefer more clusters over more rows per cluster when you can.
Random slopes only on the primary grouping
In a mixed model with more than one grouping factor — crossed factors like
(1|subject) + (1|item), or a nested pair like (1|school/class) — random
slopes can be placed only on the first (primary) grouping. Any additional
grouping factor enters as a random intercept only: a single variance
component, set through its ICC. So (1 + treatment | subject) + (1 | item) is
expressible, but a varying treatment slope across items as well is not.
For the common case this is enough — you usually want a varying slope for your
focal grouping and intercepts for the secondary one. If your analysis genuinely
needs slopes on a second grouping (something tools like lme4 allow), MCPower
can't express it today; this is a planned addition, available on request
(see the roadmap).
Mixed-effects structure has depth limits
Beyond the random-slopes restriction above, the mixed-model structure is bounded
in two ways. They cover the designs people actually run, but stop short of full
lme4 generality:
- One level of nesting. A single nested grouping such as (1|school/class) is supported; a deeper chain like (1|school/class/student) is not.
- Crossed factors need a fixed number of clusters. When you add a crossed grouping, such as (1|subject) + (1|item), the primary grouping must be sized by number of clusters, not by cluster size, because a crossed factor is crossed against a fixed count of primary clusters. The app disables the "by cluster size" toggle in that case.
(There are also generous ceilings — at most seven extra grouping factors, and eight random-effect terms on the primary cluster — that realistic designs do not reach.)
Outcome families are limited
The engine fits three model families: continuous outcomes (OLS), binary outcomes (logistic regression), and clustered continuous outcomes (mixed models). Counts, ordinal scales, survival times, and multinomial outcomes are out of scope — power for those needs a different tool.
Some tests are only available for some models
Two test types are tied to particular model families:
- Post-hoc pairwise comparisons are OLS-only. Tukey-style all-pairs comparisons between a factor's levels are produced for continuous-outcome (OLS) models. They are not offered for logistic or mixed models, whose pairwise corrections behave differently. This could be extended to those families on request.
- The overall (omnibus) test covers OLS and unclustered logistic only. The single "is the model as a whole significant?" test — the F-test for OLS, the likelihood-ratio test for plain logistic regression — is not produced for mixed-effects or clustered-logistic models, where a well-behaved omnibus needs a different construction. Test individual terms, or a joint test of a chosen set of terms, instead. An omnibus for the mixed families could be added on request.
Correlations are between continuous and binary predictors only
The predictor correlation structure applies to continuous and binary predictors. Multi-level categorical factors cannot be entered into the correlation matrix — their dependence with other predictors is not something you set directly. Specify the correlations among the continuous and binary predictors you need linked, and leave factors out of the correlation specification.
Heterogeneity imposes a power ceiling
With heterogeneity turned on — the realistic and doomer scenarios, or any
custom heterogeneity > 0 — each simulated study draws its own true effect. At
large values some of those studies draw essentially no effect (or the wrong
sign), and those just can't be detected at any sample size. That puts a hard
ceiling on the maximum power you can reach, and it also makes the last stretch
up to that ceiling much harder. More data cannot lift power past it: the
ceiling is structural, not estimation noise, so check it before you commit to a
power target.
| heterogeneity | rough power ceiling | notes |
|---|---|---|
| 0.0 | none | default optimistic |
| 0.2 | ~99.99% | default realistic |
| 0.4 | ~99% | default doomer |
| 0.5 | ~98% | custom |
| 1.0 | ~84% | custom (extreme) |
This bites most on stringent designs. If you need very high power — say ≥ 99% —
at a strict alpha (for example α = 0.01 with a Bonferroni correction across ten
tests), a non-zero heterogeneity scenario can put your target above the
ceiling, and a required-sample-size search will report it as unreachable at any
N. When that happens, ask whether per-study heterogeneity is the right
assumption: for a tightly controlled, single-population, single-protocol study
where the effect is plausibly homogeneous, set heterogeneity = 0 in that
scenario (see scenario analysis) rather than
chasing a sample size that cannot move the ceiling.
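One way to get a feel for the ceilings in the table is a simple sign argument. This is an assumption made for illustration, not a description of MCPower's internals: suppose each study's true effect is drawn from a normal distribution centred on the nominal effect, with a standard deviation equal to heterogeneity times that effect. The share of studies whose effect keeps the intended sign is then Φ(1 / heterogeneity), which is a rough upper bound on power. The function name below is hypothetical.

```python
from statistics import NormalDist


def sign_ceiling(heterogeneity: float) -> float:
    """Share of studies whose drawn effect keeps the intended sign.

    Assumes (for illustration only) effect ~ Normal(mean, (heterogeneity * mean)^2).
    """
    if heterogeneity <= 0:
        return 1.0  # no heterogeneity, no ceiling
    return NormalDist().cdf(1.0 / heterogeneity)


for h in (0.2, 0.4, 0.5, 1.0):
    print(f"heterogeneity {h:.1f}: ceiling about {sign_ceiling(h):.2%}")
```

Under that assumption the values come out close to the table (about 84% at 1.0, about 98% at 0.5), which is why no sample size can close the remaining gap.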
Same seed, different parallelism — slightly different numbers
Reproducibility in MCPower is a per-machine, per-seed guarantee: same machine, same seed, same inputs, same numbers, every time. It is not a promise that every product walks the same random path — a run split across a different number of workers (the browser app, most visibly) draws different random numbers and lands on a slightly different estimate. The two results are statistically equivalent, within the Monte Carlo noise above, but not byte-identical. See why two runs aren't byte-identical.
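A toy illustration of the effect, using only Python's standard library. The seeding scheme and function name are hypothetical and are not how MCPower derives its random streams; the point is only that splitting the same seed across a different number of workers changes which random numbers are drawn.

```python
import random


def estimate_power(seed: int, n_sims: int, n_workers: int,
                   true_power: float = 0.8) -> float:
    """Toy estimate: each simulated study is significant with probability true_power."""
    per_worker = n_sims // n_workers
    hits = 0
    for worker in range(n_workers):
        # Hypothetical scheme: one independent stream per (seed, worker) pair.
        rng = random.Random(f"{seed}-{worker}")
        hits += sum(rng.random() < true_power for _ in range(per_worker))
    return hits / (per_worker * n_workers)


one_worker = estimate_power(seed=42, n_sims=1600, n_workers=1)
four_workers = estimate_power(seed=42, n_sims=1600, n_workers=4)

# Same seed and same worker count reproduce exactly; a different worker count
# gives a nearby estimate that need not match digit for digit.
print(one_worker, four_workers)
```

Both estimates target the same true power and differ only by Monte Carlo noise.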
App, Python, and R agree — to the last decimal that matters
The same analysis run on the desktop app, in Python, and in R gives the same answer. Within any one of them, a seeded run reproduces exactly — same seed, same inputs, same numbers, every time. Across them, results are identical for every practical purpose: the only difference is floating-point rounding in the last bits of each number, far below anything you would report.
The lone observable consequence would be a single significance call landing exactly on the α boundary (a p-value tied to α to the last bit) and flipping by one decision between two of the three. With continuous estimates that coincidence is vanishingly rare: on the order of once in a billion years of running. It is written down here for honesty, not because it is a concern you need to plan around; treat App, Python, and R as interchangeable.
Sparse factor levels at small N
A factor level needs observations to be estimable. If any level of a factor would receive fewer than 5 observations at a given sample size, MCPower excludes the whole factor from the model in that run: its effects report power 0, the other predictors are still analysed, and the result carries a diagnostic naming the factor and how often it was excluded.
With the default exact group allocation this is deterministic — you are told
before the simulation starts which factor is affected and, in a sample-size
search, the smallest N in the searched range that clears the minimum. (With
sampled allocation or uploaded data the counts vary per run, so there is no
up-front warning — the post-run diagnostics still report any exclusion.) As a
rule of thumb a level with proportion p needs roughly N ≥ 5 / p: a 5%
level needs about 100 observations just to be estimable, well before it has
any power.
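The rule of thumb is easy to apply to a whole factor at once. The helper names below are hypothetical, and the sketch only implements N ≥ 5 / p; MCPower's exact allocation may round counts differently, so treat the result as an approximation.

```python
import math
from fractions import Fraction


def min_n_for_level(proportion: str, min_obs: int = 5) -> int:
    """Smallest N with N * proportion >= min_obs.

    The proportion is passed as a string (e.g. "0.05") so the division is exact
    and not affected by floating-point rounding.
    """
    return math.ceil(min_obs / Fraction(proportion))


def min_n_for_factor(proportions: list[str], min_obs: int = 5) -> int:
    """The rarest level decides the smallest usable N for the whole factor."""
    return max(min_n_for_level(p, min_obs) for p in proportions)


print(min_n_for_level("0.05"))                    # 100
print(min_n_for_factor(["0.6", "0.3", "0.1"]))    # 50
```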
If you see this warning: increase N, raise the sparse level's proportion, or merge rare levels. See variable-types for setting factor proportions.
Upload size depends on where MCPower runs
Pilot-data uploads are capped by platform: up to 1,000,000 rows in the desktop app and the Python and R packages, but 10,000 rows in the browser app, whose memory budget as a browser tab is tighter. A dataset larger than 10,000 rows has to use the desktop app or a package. The same engine, models, and numbers run everywhere; only the upload ceiling differs — see which MCPower to use.
Uploaded data is a description, not a guarantee
Uploading pilot data shapes the data-generating process — the simulated predictors inherit your sample's distributions and dependence. They also inherit its flaws: a tiny, biased, or unrepresentative pilot produces a faithful simulation of an unrepresentative world, and the power number carries that bias forward. Uploading makes assumptions concrete; it cannot make them correct. See using empirical data.
For how the parts that are in scope get verified, see Validation.