Simulating Experimental Design for N=1 Causal Analysis
C.P. van der Velde.
[First website version 02-04-2025]
1.
Introduction
It can hardly be overestimated how important a role causal analysis plays
in everyday judgment at all levels,
from the personal and interpersonal to the social and political.
Many statements and discussions involve causal assertions,
even though they are not always recognized as such.
Take a - politically "safe" - example like "cats are lazy."
While this constitutes a categorization, predication and generalization,
it simultaneously makes a causal assertion, in the sense of "Being a cat causes laziness,"
or "Something about cats makes them lazy."
A further observation may be that many disagreements and discussions, conflicts, and even wars revolve
around causal attributions.
We must therefore conclude that people can identify very different causes and effects
for the same situations. Even judges with equal authority can make completely different judgments
about guilt and liability in the same legal dispute.
There is every reason to ask the meta-causal question:
what, in general, is the cause of the enormous diversity in causal judgments?
Undoubtedly, a huge variation in capacities, interests and motives plays an important role here.
But it must also be said that causal analysis is no easy task in many situations.
This applies to people - both laypeople and scientists -
but also to "smart" computer systems such as today's LLM-based AI variants.
Primitive Baseline:
The current causality model of LLM-based AI (and prior instances) is rather primitive and quite fuzzy,
often parroting flawed conventions (e.g., mistaking chronology or coincidence for causality,
and statistical "significance" for confirmation). This suggests deeper rewiring is needed,
undercutting claims of easy fixes.
In AI, a recurrent problem is to infer, attribute or deduce
causal relations
in cases of single instances.
Proving causality in an N=1 case - like "
this pill eased my headache" -
means confirming one event caused another in that particular situation.
Unlike large-sample studies with statistical backup, an N=1 scenario stands alone, no repeats to lean on.
In this essay we will explore how to
simulate a full experimental research design,
the kind scientists use for bigger samples, like
randomized controlled trials (RCTs), but tailored to
a single instance.
2.
Testing on Criteria for Causality in Large N Cases
Let's first look at the way
randomized controlled trials (RCTs) try to detect causal relations.
To that end, they rely on computing
correlation coefficients (most commonly the
Pearson formula, but also
Spearman,
Kendall's Tau, etc.) - or their derivatives like
Chi-square,
Student's t,
Fisher's F (or
F-ratio), etc.
These are often required - but by far
not sufficient - to infer a causal relation.
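As a concrete reference point, the explicit paired-sample calculation that any correlation claim presupposes can be sketched in a few lines of plain Python; the dose/pain numbers below are fabricated purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    assert n == len(ys) and n > 1, "need paired samples"
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Fabricated example data: pill dose (assumed independent variable)
# against reported pain level (assumed dependent variable).
doses = [0, 0, 1, 1, 2, 2, 3, 3]
pain = [8, 7, 6, 6, 4, 5, 2, 3]
print(round(pearson_r(doses, pain), 3))  # strong negative co-variation
```

Note that the number it prints describes symmetrical co-variation in these eight number pairs and nothing more: it neither states nor tests any causal dependency.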
Correlation versus Causation.
•
Of different domain.
Correlation does not constitute any relation in the referential domain, nor does it represent it
in a direct or complete way. It's just a
stand-alone quantitative metric of
symmetrical variation, or
co-variation across instances, between two or more sets of
numbers
- like measurement data of pill doses as assumed
independent variable (or
causes)
and pain levels as assumed
dependent variable (or
effects).
Being an
abstract pattern, it belongs to the
information domain, which lacks any causal rules
and is intrinsically distinct from domains where causality rules, like the
physical or
empirical
domain.
•
Many requirements for statistical validity.
To calculate
valid correlations you need, aside from a sufficiently high N,
to meet many additional requirements, like:
(·) Both experimental and control conditions/groups are observed (in social sciences
also a placebo condition/group), which are of reasonably comparable (balanced) size and
composition.
(·) Random assignment: All sample units are randomly picked from the population
and randomly assigned to conditions (groups);
experimenters/observers are randomly assigned to sample units and conditions.
(·) Double blind design: Neither the subjects nor the experimenters know
which subjects are in the test and control groups during the actual course of the experiments.
(·) Sample units being invariant and replications of each other.
(·) Incentives are administered independently and in controlled dosages.
(·) Test procedures, measurements, and immediate environments (setting of trials,
context of sample units) are kept 100% standardized, controlled and constant over sample size N.
(·) Any possible impact (i.e. variance) from other variables within the setting
(outside the immediate scope of research) being firmly excluded, i.e. held isolated or fixed,
to exclude them as possible confounders - or otherwise have them systematically varied,
accurately measured and taken into account as covariates in a multivariate analysis.
•
Of different order.
Correlations are inherently
bi-directional (the derivational direction of their derivatives,
like regression equations, being completely arbitrary to choose).
However, most causal dependencies in the 'real world' (
physical domain above
quantum levels)
are (mainly)
uni-directional (e.g., "
smoking causes cancer") and rarely or only partly
reciprocal in nature.
Building uni-directional models requires testing on additional criteria.
(·) Chronology being measured as opposed to simultaneity, and only one-directional
(to exclude e.g. reciprocity, common causal factors, etc.);
(·) Latency time being in accordance with the suspected mechanism, and fairly constant
over all cases, (e.g., to exclude intermediate factors).
•
Needs high levels to have predictive power.
The point often, if not almost always, overlooked - even massively, by academic scholars -
is that to assume causality you have to prove that the
explanatory/predictive power
of the relation is
better than random, the independent value
rendering at least 1 bit or more of information about the dependent value:
NB1 The widely overvalued criterion of "(statistical) significance"
is simply irrelevant to the question of predictive power. At the usual minimum (alpha) levels
of 1 to 5%, it is already attained at very low correlation values. Even then, it only tells us
that the effect-difference found is "not entirely attributable to chance",
so there is no considerable proof for refutation (yet) - which is not at all the same as
"therefore reasonable positive proof for confirmation".
NB2 Indeed, we should respect Popper's principle of not using affirmative results
as conclusive 'proof', confirmation, or an indication of the probability of the hypothesis being true,
but only as corroborative to the hypothesis, reflecting the degree to which it has been tested
and not yet been falsified. This is however entirely different from using a marginal probability
(like ≤ alpha) of not yet being falsified as an evidence-based corroboration - while
lacking evidence for that conclusion.
To elucidate this further: in most, if not all, cases, correlations passing the test of "
(statistical) significance" render proportions of explained variance far lower than 50%:
the variables not even being able to predict their mutual overall variance
(which still tells nothing about specific values) beyond the realm of pure chance.
Just surpassing the minimum level of
random guessing already requires very high correlation values.
•
No clue of missed covariates.
In general, when correlation is less than perfect (±1), the missing part hints at an
actual impact of
covariates, like alternative (
disjunct) or necessary (
conjunct) causal
factors - like water or absorption - but doesn't yet identify or measure this.
When relevant, a multitude of
covariates would necessitate a
multivariate analysis design to
incorporate their respective impacts.
•
No clue of covert confounders.
A correlation of whatever value doesn't in itself reveal whether it is "
clean" or
spurious:
in the latter case, it is to some extent "vexed" or "polluted" by "
hidden" variables, or
confounding
factors - which again may consist of alternative (
disjunct) or necessary (
conjunct) causal
factors - because crucial conditions were insufficiently ensured during the experiment (e.g.,
double-blindness,
placebo-control,
randomization,
balancing,
stratification,
matching/pairing,
standardization,
constancy,
isolation, etc.).
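The demand of "at least 1 bit" can be made quantitative under one standard modelling assumption (jointly normal variables, which the original text does not specify): the mutual information between two such variables with correlation r is I = -1/2 · log2(1 - r²) bits, so a full bit requires r² ≥ 0.75, i.e. |r| of roughly 0.866 or higher, while a typically "significant" r = 0.3 yields under a tenth of a bit.

```python
import math

def mutual_info_bits(r):
    """Mutual information (bits) of two jointly normal variables with
    correlation r: I = -1/2 * log2(1 - r^2). A model assumption, see text."""
    return -0.5 * math.log2(1.0 - r * r)

# Tabulate how much predictive information various correlation values carry.
for r in (0.1, 0.3, 0.5, 0.8, 0.866, 0.95):
    print(f"r = {r:.3f}  r^2 = {r * r:.3f}  I = {mutual_info_bits(r):.3f} bits")
```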
In short, correlation is never a conclusive indicator of causation.
In RCTs, a relatively high correlation value of e.g. 0.8 hints at a link, but still needs considerable
unpacking, testing and checking on many requirements before causation can be established. The many
nuances and complications are even multiplied and deepened in the context of
present-day LLM-based AI systems.
3.
Using
Correlation in AI
In the AI community, LLMs are often said to use "
implicit correlations", computing
correlation coefficients "
indirectly", or performing (some kind of) correlation-based calculation like
multivariate analysis, etc. "
inherently". This is however nonsense.
Correlation is a precise mathematical concept (Pearson, Spearman, Kendall, etc.) that requires
specific and
explicit calculation over paired samples of variables. If referred to as "
implicit", correlations should be
demonstrably derivable, and reproducible instantly,
at any moment, e.g., at a user's request or query.
Correlation versus AI.
•
Not standard repertoire in AI.
Current LLMs and pattern learning models do
not "use correlations"
in any technical or statistically valid sense
unless explicitly instructed to calculate them.
What AI machines do is "learn" complex relationships through training on large datasets and generalize
co-occurrence frequencies,
regardless of statistical criteria like representativeness,
predictive power and reliability. That is inevitably and fundamentally different from computing or using
any standard form of
explicit or
implicit correlation metric.
Using "
correlation" loosely to describe such learned statistical dependencies is
misleading
at best and
incorrect at worst.
•
Hard to obtain in AI.
Of course, AI systems have many trillions of examples available, so a huge N in principle.
In theory that would provide a near-perfect basis for computing correlations.
(·)
Data Sorting Nightmare.
Valid correlations require controlled, standardized conditions across massive N, which is near-impossible
for noisy, real-world data (e.g., social media posts). Sorting and categorizing trillions of examples
into experimental/control groups demands astronomical preprocessing, clashing with "near-term" optimism.
(·) Multivariate Mess.
Mapping real-world causality often needs multivariate analysis, including techniques like
multivariate regression, principal component analysis, and others,
that allow for the examination of interdependencies among several variables.
This however explodes computational and data requirements.
In ML, models like decision trees, random forests, and so-called "neural" networks are used to simulate
some kind of multivariate analysis by considering multiple input features to make predictions.
Thus, it's practically impossible for AI machines to sort such massive, noisy data
(e.g., trillions of examples) under the controlled conditions that are required,
also given the complications of uni-directional dependencies and multivariate analysis.
Even with huge N, ensuring invariant samples and stable contexts is a huge systemic hurdle.
•
Limited to Co-occurrences.
In NLP, dependency parsing involves identifying directed relationships between words in a sentence,
such as subject-verb-object structures. Its analysis remains restricted to
syntactic surface structure; it doesn't even build or reconstruct a solid grammatical composition
(and thus tends to commit numerous errors in detecting the scope of constituents, embeddings, etc.).
What it doesn't perform at all is analysis of
semantic structure,
logical or
reasoning structure,
psychological structure, and certainly not
causal structure.
•
Crawling history becomes sample.
In reality, the
samples that LLMs use consist of all instances of texts they processed
until a point in time: in principle, all thinkable ways people can talk about the world.
The
population about which LLMs perform their
predictions, however, consists of the entire "
real world" or even "
integral universe", including its
fundamental domains of
phenomena and
core dimensions of information: language/communication (or
syntax-to-semantics relations), logic (or
abstract patterns), causality,
and psychological structure (or
mental patterns).
This huge mismatch between
samples and
populations constitutes an immense gap
that seems impossible to
bridge in order to derive valid generalizations.
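To illustrate the gap between learned co-occurrence and genuine correlation, consider a toy sketch (the four-sentence "corpus" below is fabricated): a bare co-occurrence counter scores "pill" alongside "relief" in all four sentences, including the one that denies the link, because it registers surface co-presence rather than paired measurements, negation or meaning.

```python
from collections import Counter
from itertools import combinations

# Fabricated toy corpus; note the third sentence denies the pill-relief link.
corpus = [
    "took pill headache relief",
    "took pill no relief",
    "no pill relief after rest",
    "pill and water then relief",
]

# Count within-sentence word-pair co-occurrences, the kind of surface
# statistic pattern learners pick up.
pair_counts = Counter()
for sentence in corpus:
    for a, b in combinations(sorted(set(sentence.split())), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("pill", "relief")])  # 4: every sentence counts, even the denial
```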
4.
Testing on Criteria for Causality in N=1 Cases
The following isn't a loose collection of checks - it's a unified causal logic,
systematically testing every link to validate or debunk it. Here's the complete process,
tying research (N>1) to N=1 contexts.
•
Temporal sequence.
Causality demands order in time, chronology: effect follows cause.
In randomized controlled trials (RCTs), researchers log treatment (A) before outcome (B)
- drug given, then pain drops, tracked across many.
For N=1, verify: did the pill come before relief? A record like "took it at 3 PM, better by 3:20"
nails it. If relief hit first, the claim's sunk - unless the recorded timing is wrong.
• Mechanistic plausibility.
RCTs test if a process - like a drug's chemistry - links A to B.
In N=1, the question is: does science support it? If the pill's compound dulls inflammation, it's plausible.
"a shout stopped rain" isn't without a wild tie.
Real life situations however often lack tests from scientific research that reveal the mechanisms involved,
and need additional checks.
• Experimental condition.
This is crucial - does A trigger B?
RCTs give A to a treatment group and measure B - drug administered, pain fades.
Causality needs this active test.
For N=1, we test if B occurs with A present: "take pill, relief follows,"
maybe in a repeat (real or modeled).
Without this test, we can only guess if the pill did anything.
However, finding an affirmative example, a "true positive" case, is by far not enough.
We have to actively and purposefully search for any counter-example or "black swan" in the
experimental condition: can a situation be found, or is one reasonably possible, in which
A occurs without B arising (or varying)?
• Control condition.
RCTs skip A in a control group - if pain stays, the drug matters.
But it's wider: could water, not pill, have done it?
Thus we may rank covariates - alternative factors - by fit.
This test isolates A's necessity; skip either, and rivals cloud the picture.
For N=1, check if B stays out with A absent: "no pill, headache lingers," using a baseline.
In the control condition we also need to actively and purposefully search for any counter-example
or "black swan": can a situation be found, or is one reasonably possible, in which
A stays absent (or constant) but B nevertheless appears?
• Proportionality.
Research matches cause to effect - stats like effect size show a drug's impact scales with relief.
For N=1, ask: does the pill's dose fit the relief? A small pill easing a migraine works if potent;
a tap sinking a ship needs massive leverage (e.g., hull flaw). Causality demands balance
- effect scales to cause.
• Correlation.
In a true N=1 case, calculating a meaningful correlation is impossible. We may find one incident of
co-occurrence, which might show pill and relief align - possibly useful for further exploration,
yet very limited.
Sometimes a suitable correlation may already be available from prior research, to be applied
to the specific case at hand. In general however, such deductive applications
have numerous problems and complications, as we've discussed above.
• Fitting latency time.
Research measures delays - 30 minutes for a pill to work, averaged over patients.
For N=1, check: 20 minutes for relief fits pharmacology; 2 seconds doesn't. Off timing breaks the link
- causality needs a realistic pace.
• Checks on Intermediate causes.
RCTs model steps - drug boosts blood levels, then eases pain.
In an N=1 case, trace: "pill taken, absorbed, relief." No chain - like "yell caused blackout" -
weakens it. Causality often rides on these steps, not just start to finish.
• Checks on Common cause.
Trials control for a third factor - like stress driving both A and B.
This checks if A's the real driver or only a co-passenger.
• Consistency over replications.
RCTs replicate their findings - if the drug works across labs, it's solid. For N=1, test: does "
pill eases pain" hold per pharmacology everywhere? If it defies known rules, it's highly questionable.
Causality isn't a quirk - it fits reality's frame.
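Taken together, the criteria above can be sketched as a checklist over one case; all the names and the yes/no simplification below are illustrative assumptions, not part of the original design, and a full pass yields corroboration in Popper's sense rather than proof.

```python
from dataclasses import dataclass, fields

@dataclass
class N1Evidence:
    """One N=1 case, reduced (for illustration) to yes/no answers
    to the checks described above."""
    cause_before_effect: bool        # temporal sequence
    mechanism_plausible: bool        # mechanistic plausibility
    effect_with_cause: bool          # experimental condition
    no_effect_without_cause: bool    # control condition / baseline
    effect_proportional: bool        # proportionality
    latency_fits: bool               # fitting latency time
    chain_traceable: bool            # intermediate causes
    common_cause_excluded: bool      # common cause
    consistent_with_knowledge: bool  # consistency over replications

def causal_verdict(e: N1Evidence) -> str:
    """Name every failed criterion; corroborate only if none fail."""
    failed = [f.name for f in fields(e) if not getattr(e, f.name)]
    if not failed:
        return "causal claim corroborated (not proven)"
    return "causal claim weakened by: " + ", ".join(failed)

case = N1Evidence(True, True, True, True, True, True, True, False, True)
print(causal_verdict(case))  # flags the unexcluded common cause
```

The point of the sketch is the logic, not the booleans: each field corresponds to one criterion, and a single failed check names exactly where the causal chain breaks.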
5.
Wrap-Up
This N=1 design - sequence, mechanism, latency, experiment, control (with alternatives), proportionality,
correlation, intermediates, common cause, consistency - builds a causal chain. For "
pill eased headache,
" it's true if: pill's first, biology fits, timing's right,
pill triggers relief, no rest or water steals it, relief matches pill's power, and science backs it.
Correlation may flag confounders but can't seal it - only this full logic turns one case into proven cause.