Simulating Experimental Design for N=1 Causal Analysis
C.P. van der Velde.
[First website version 02-04-2025]
1.
Introduction
It can hardly be overestimated how important a role causal analysis plays
in everyday judgment at all levels,
from the personal and interpersonal to the social and political.
Many statements and discussions involve causal assertions,
even though they are not always recognized as such.
Take a - politically "safe" - example like "cats are lazy."
While this constitutes a categorization, predication and generalization,
it simultaneously makes a causal assertion, in the sense of "Being a cat causes laziness,"
or "Something about cats makes them lazy."
A further observation may be that many disagreements and discussions, conflicts, and even wars revolve
around causal attributions.
We must therefore conclude that people can identify very different causes and effects
for the same situations. Even judges with equal authority can make completely different judgments
about guilt and liability in the same legal dispute.
There is every reason to ask the meta-causal question:
what, in general, is the cause of the enormous diversity in causal judgments?
Undoubtedly, a huge variation in capacities, interests and motives plays an important role here.
But it must also be said that causal analysis is no easy task in many situations.
This applies to people - both laypeople and scientists -
but also to "smart" computer systems such as today's LLM-based AI variants.
Primitive Baseline:
The current causality model of LLM-based AI (and prior instances) is rather primitive and quite fuzzy,
often parroting flawed conventions (e.g., mistaking chronology or coincidence for causality,
and statistical "significance" for confirmation). This suggests deeper rewiring is needed,
undercutting claims of easy fixes.
In AI, a recurrent problem is to infer, attribute or deduce
causal relations
in cases of single instances.
Proving causality in an N=1 case - like "
this pill eased my headache" -
means confirming one event caused another in that particular situation.
Unlike large-sample studies with statistical backup, an N=1 scenario stands alone, no repeats to lean on.
In this essay we will explore how to
simulate a full experimental research design,
the kind scientists use for bigger samples, like
randomized controlled trials (RCTs), but tailored to
a single instance.
2.
Testing on Criteria for Causality in Large N Cases
Let's first look at the way
randomized controlled trials (RCTs) try to detect causal relations.
To that end, they rely on computing
correlation coefficients (most commonly the
Pearson formula, but also
Spearman,
Kendall's Tau, etc.) - or their derivatives like
Chi-square,
Student's t,
Fisher's F (or
F-ratio), etc.
These are often required - but by far
not sufficient - to infer a causal relation.
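As a concrete reference point, the explicit paired-sample calculation that any correlation claim presupposes can be sketched in a few lines of plain Python; the dose/pain numbers below are fabricated purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    assert n == len(ys) and n > 1, "need paired samples"
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Fabricated example data: pill dose (assumed independent variable)
# against reported pain level (assumed dependent variable).
doses = [0, 0, 1, 1, 2, 2, 3, 3]
pain = [8, 7, 6, 6, 4, 5, 2, 3]
print(round(pearson_r(doses, pain), 3))  # strong negative co-variation
```

Note that the number it prints describes symmetrical co-variation in these eight number pairs and nothing more: it neither states nor tests any causal dependency.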
Correlation versus Causation.
•
Of different domain.
Correlation does not constitute any relation in the referential domain, nor does it represent it
in a direct or complete way. It's just a
stand-alone quantitative metric of
symmetrical variation, or
co-variation across instances, between two or more sets of
numbers
- like measurement data of pill doses as assumed
independent variable (or
causes)
and pain levels as assumed
dependent variable (or
effects).
Being an
abstract pattern, it belongs to the
information domain, which lacks any causal rules
and is intrinsically distinct from domains where causality rules, like the
physical or
empirical
domain.
•
Many requirements for statistical validity.
To calculate
valid correlations you need, aside from a sufficiently high N,
to meet many additional requirements, like:
(·) Both experimental and control conditions/groups are observed (in social sciences
also a placebo condition/group), which are of reasonably comparable (balanced) size and
composition.
(·) Random assignment: All sample units are randomly picked from the population
and randomly assigned to conditions (groups);
experimenters/observers are randomly assigned to sample units and conditions.
(·) Double blind design: Neither the subjects nor the experimenters know
which subjects are in the test and control groups during the actual course of the experiments.
(·) Sample units being invariant and replications of each other.
(·) Incentives are administered independently and in controlled dosages.
(·) Test procedures, measurements, and immediate environments (setting of trials,
context of sample units) are kept 100% standardized, controlled and constant over sample size N.
(·) Any possible impact (i.e. variance) from other variables within the setting
(outside the immediate scope of research) being firmly excluded, i.e. held isolated or fixed,
to exclude them as possible confounders - or otherwise have them systematically varied,
accurately measured and taken into account as covariates in a multivariate analysis.
•
Of different order.
Correlations are inherently
bi-directional (the derivational direction of their derivatives,
like regression equations, being completely arbitrary to choose).
However, most causal dependencies in the 'real world' (
physical domain above
quantum levels)
are (mainly)
uni-directional (e.g., "
smoking causes cancer") and rarely or only partly
reciprocal in nature.
Building uni-directional models requires testing on additional criteria.
(·) Chronology being measured as opposed to simultaneity, and only one-directional
(to exclude e.g. reciprocity, common causal factors, etc.);
(·) Latency time being in accordance with the suspected mechanism, and fairly constant
over all cases, (e.g., to exclude intermediate factors).
•
Needs high levels to have predictive power.
The point often, if not almost always, overlooked - even massively, by academic scholars -
is that to assume causality you have to prove that the
explanatory/predictive power
of the relation is
better than random, the independent value
rendering at least 1 bit or more of information about the dependent value:
NB1 The widely overvalued criterion of "(statistical) significance"
is simply irrelevant to the question of predictive power. At the usual minimum (alpha) levels
of 1 to 5%, it is already attained at very low correlation values. Even then, it only tells us
that the effect-difference found is "not entirely attributable to chance",
so there is no considerable proof for refutation (yet) - which is not at all the same as
"therefore reasonable positive proof for confirmation".
NB2 Indeed, we should respect Popper's principle of not using affirmative results
as conclusive 'proof', confirmation, or an indication of the probability of the hypothesis being true,
but only as corroborative to the hypothesis, reflecting the degree to which it has been tested
and not yet been falsified. This is however entirely different from using a marginal probability
(like ≤ alpha) of not yet being falsified as an evidence-based corroboration - while
lacking evidence for that conclusion.
To elucidate this further: in most, if not all, cases, correlations passing the test of "
(statistical) significance" render proportions of explained variance far lower than 50%:
the variables not even being able to predict their mutual overall variance
(which still tells nothing about specific values) beyond the realm of pure chance.
Just surpassing the minimum level of
random guessing already requires very high correlation values.
•
No clue of missed covariates.
In general, when correlation is less than perfect (±1), the missing part hints at an
actual impact of
covariates, like alternative (
disjunct) or necessary (
conjunct) causal
factors - like water or absorption - but doesn't yet identify or measure this.
When relevant, a multitude of
covariates would necessitate a
multivariate analysis design to
incorporate their respective impacts.
•
No clue of covert confounders.
A correlation of whatever value doesn't in itself reveal whether it is "
clean" or
spurious:
in the latter case, it is to some extent "vexed" or "polluted" by "
hidden" variables, or
confounding
factors - which again may consist of alternative (
disjunct) or necessary (
conjunct) causal
factors - because crucial conditions were insufficiently ensured during the experiment (e.g.,
double-blindness,
placebo-control,
randomization,
balancing,
stratification,
matching/pairing,
standardization,
constancy,
isolation, etc.).
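The demand of "at least 1 bit" can be made quantitative under one standard modelling assumption (jointly normal variables, which the original text does not specify): the mutual information between two such variables with correlation r is I = -1/2 · log2(1 - r²) bits, so a full bit requires r² ≥ 0.75, i.e. |r| of roughly 0.866 or higher, while a typically "significant" r = 0.3 yields under a tenth of a bit.

```python
import math

def mutual_info_bits(r):
    """Mutual information (bits) of two jointly normal variables with
    correlation r: I = -1/2 * log2(1 - r^2). A model assumption, see text."""
    return -0.5 * math.log2(1.0 - r * r)

# Tabulate how much predictive information various correlation values carry.
for r in (0.1, 0.3, 0.5, 0.8, 0.866, 0.95):
    print(f"r = {r:.3f}  r^2 = {r * r:.3f}  I = {mutual_info_bits(r):.3f} bits")
```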
In short, correlation is never a conclusive indicator of causation.
In RCTs, a relatively high correlation value of e.g. 0.8 hints at a link, but still needs considerable
unpacking, testing and checking on many requirements before causation can be established. The many
nuances and complications are even multiplied and deepened in the context of
present-day LLM-based AI systems.
3.
Using
Correlation in AI
In the AI community, LLMs are often said to use "
implicit correlations", computing
correlation coefficients "
indirectly", or performing (some kind of) correlation-based calculation like
multivariate analysis, etc. "
inherently". This is however nonsense.
Correlation is a precise mathematical concept (Pearson, Spearman, Kendall, etc.) that requires
specific and
explicit calculation over paired samples of variables. If referred to as "
implicit", correlations should be
demonstrably derivable, and reproducible instantly,
at any moment, e.g., at a user's request or query.
Correlation versus AI.
•
Not standard repertoire in AI.
Current LLMs and pattern learning models do
not "use correlations"
in any technical or statistically valid sense
unless explicitly instructed to calculate them.
What AI machines do is "learn" complex relationships through training on large datasets and generalize
co-occurrence frequencies,
regardless of statistical criteria like representativeness,
predictive power and reliability. That is inevitably and fundamentally different from computing or using
any standard form of
explicit or
implicit correlation metric.
Using "
correlation" loosely to describe such learned statistical dependencies is
misleading
at best and
incorrect at worst.
•
Hard to obtain in AI.
Of course, AI systems have many trillions of examples available, so a huge N in principle.
In theory that would provide a near-perfect basis for computing correlations.
(·)
Data Sorting Nightmare.
Valid correlations require controlled, standardized conditions across massive N, which is near-impossible
for noisy, real-world data (e.g., social media posts). Sorting and categorizing trillions of examples
into experimental/control groups demands astronomical preprocessing, clashing with "near-term" optimism.
(·) Multivariate Mess.
Mapping real-world causality often needs multivariate analysis, including techniques like
multivariate regression, principal component analysis, and others,
that allow for the examination of interdependencies among several variables.
This however explodes computational and data requirements.
In ML, models like decision trees, random forests, and so-called "neural" networks are used to simulate
some kind of multivariate analysis by considering multiple input features to make predictions.
Thus, it's practically impossible for AI machines to sort such massive, noisy data
(e.g., trillions of examples) under the controlled conditions that are required,
also given the complications of uni-directional dependencies and multivariate analysis.
Even with huge N, ensuring invariant samples and stable contexts is a huge systemic hurdle.
•
Limited to Co-occurrences.
In NLP, dependency parsing involves identifying directed relationships between words in a sentence,
such as subject-verb-object structures. Its analysis remains restricted to
syntactic surface structure; it doesn't even build or reconstruct a solid grammatical composition
(and thus tends to commit numerous errors in detecting the scope of constituents, embeddings, etc.).
What it doesn't perform at all is analysis of
semantic structure,
logical or
reasoning structure,
psychological structure, and certainly not
causal structure.
•
Crawling history becomes sample.
In reality, the
samples that LLMs use consist of all instances of texts they processed
until a point in time: in principle, all thinkable ways people can talk about the world.
The
population about which LLMs perform their
predictions, however, consists of the entire "
real world" or even "
integral universe", including its
fundamental domains of
phenomena and
core dimensions of information: language/communication (or
syntax-to-semantics relations), logic (or
abstract patterns), causality,
and psychological structure (or
mental patterns).
This huge mismatch between
samples and
populations constitutes an immense gap
that seems impossible to
bridge in order to derive valid generalizations.
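To illustrate the gap between learned co-occurrence and genuine correlation, consider a toy sketch (the four-sentence "corpus" below is fabricated): a bare co-occurrence counter scores "pill" alongside "relief" in all four sentences, including the one that denies the link, because it registers surface co-presence rather than paired measurements, negation or meaning.

```python
from collections import Counter
from itertools import combinations

# Fabricated toy corpus; note the third sentence denies the pill-relief link.
corpus = [
    "took pill headache relief",
    "took pill no relief",
    "no pill relief after rest",
    "pill and water then relief",
]

# Count within-sentence word-pair co-occurrences, the kind of surface
# statistic pattern learners pick up.
pair_counts = Counter()
for sentence in corpus:
    for a, b in combinations(sorted(set(sentence.split())), 2):
        pair_counts[(a, b)] += 1

print(pair_counts[("pill", "relief")])  # 4: every sentence counts, even the denial
```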
4.
Testing on Criteria for Causality in N=1 Cases
The following isn't a loose collection of checks - it's a unified causal logic,
systematically testing every link to validate or debunk it. Here's the complete process,
tying research (N>1) to N=1 contexts.
•
Temporal sequence.
Causality demands order in time, chronology: effect follows cause.
In randomized controlled trials (RCTs), researchers log treatment (A) before outcome (B)
- drug given, then pain drops, tracked across many.
For N=1, verify: did the pill come before relief? A record like "took it at 3 PM, better by 3:20"
nails it. If relief hit first, the claim's sunk - unless the recorded timing is wrong.
• Mechanistic plausibility.
RCTs test if a process - like a drug's chemistry - links A to B.
In N=1, the question is: does science support it? If the pill's compound dulls inflammation, it's plausible.
"a shout stopped rain" isn't without a wild tie.
Real life situations however often lack tests from scientific research that reveal the mechanisms involved,
and need additional checks.
• Experimental condition.
This is crucial - does A trigger B?
RCTs give A to a treatment group and measure B - drug administered, pain fades.
Causality needs this active test.
For N=1, we test if B occurs with A present: "take pill, relief follows,"
maybe in a repeat (real or modeled).
Without this test, we can only guess if the pill did anything.
However, finding an affirmative example, a "true positive" case, is by far not enough.
We have to actively and purposefully search for any counter-example or "black swan" in the
experimental condition: can a situation be found, or is one reasonably possible, in which
A occurs without B arising (or varying)?
• Control condition.
RCTs skip A in a control group - if pain stays, the drug matters.
But it's wider: could water, not pill, have done it?
Thus we may rank covariates - alternative factors - by fit.
This test isolates A's necessity; skip either, and rivals cloud the picture.
For N=1, check if B stays out with A absent: "no pill, headache lingers," using a baseline.
In the control condition we also need to actively and purposefully search for any counter-example
or "black swan": can a situation be found, or is one reasonably possible, in which
A stays absent (or constant) but B nevertheless appears?
• Proportionality.
Research matches cause to effect - stats like effect size show a drug's impact scales with relief.
For N=1, ask: does the pill's dose fit the relief? A small pill easing a migraine works if potent;
a tap sinking a ship needs massive leverage (e.g., hull flaw). Causality demands balance
- effect scales to cause.
• Correlation.
In a true N=1 case, calculating a meaningful correlation is impossible. We may find one incident of
co-occurrence, which might show pill and relief align - possibly useful for further exploration,
yet very limited.
Sometimes a suitable correlation may already be available from prior research, to be applied
to the specific case at hand. In general however, such deductive applications
have numerous problems and complications, as we've discussed above.
• Fitting latency time.
Research measures delays - 30 minutes for a pill to work, averaged over patients.
For N=1, check: 20 minutes for relief fits pharmacology; 2 seconds doesn't. Off timing breaks the link
- causality needs a realistic pace.
• Checks on Intermediate causes.
RCTs model steps - drug boosts blood levels, then eases pain.
In an N=1 case, trace: "pill taken, absorbed, relief." No chain - like "yell caused blackout" -
weakens it. Causality often rides on these steps, not just start to finish.
• Checks on Common cause.
Trials control for a third factor - like stress driving both A and B.
This checks if A's the real driver or only a co-passenger.
• Consistency over replications.
RCTs replicate their findings - if the drug works across labs, it's solid. For N=1, test: does "
pill eases pain" hold per pharmacology everywhere? If it defies known rules, it's highly questionable.
Causality isn't a quirk - it fits reality's frame.
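Taken together, the criteria above can be sketched as a checklist over one case; all the names and the yes/no simplification below are illustrative assumptions, not part of the original design, and a full pass yields corroboration in Popper's sense rather than proof.

```python
from dataclasses import dataclass, fields

@dataclass
class N1Evidence:
    """One N=1 case, reduced (for illustration) to yes/no answers
    to the checks described above."""
    cause_before_effect: bool        # temporal sequence
    mechanism_plausible: bool        # mechanistic plausibility
    effect_with_cause: bool          # experimental condition
    no_effect_without_cause: bool    # control condition / baseline
    effect_proportional: bool        # proportionality
    latency_fits: bool               # fitting latency time
    chain_traceable: bool            # intermediate causes
    common_cause_excluded: bool      # common cause
    consistent_with_knowledge: bool  # consistency over replications

def causal_verdict(e: N1Evidence) -> str:
    """Name every failed criterion; corroborate only if none fail."""
    failed = [f.name for f in fields(e) if not getattr(e, f.name)]
    if not failed:
        return "causal claim corroborated (not proven)"
    return "causal claim weakened by: " + ", ".join(failed)

case = N1Evidence(True, True, True, True, True, True, True, False, True)
print(causal_verdict(case))  # flags the unexcluded common cause
```

The point of the sketch is the logic, not the booleans: each field corresponds to one criterion, and a single failed check names exactly where the causal chain breaks.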
5.
Wrap-Up
This N=1 design - sequence, mechanism, latency, experiment, control (with alternatives), proportionality,
correlation, intermediates, common cause, consistency - builds a causal chain. For "
pill eased headache,
" it's true if: pill's first, biology fits, timing's right,
pill triggers relief, no rest or water steals it, relief matches pill's power, and science backs it.
Correlation may flag confounders but can't seal it - only this full logic turns one case into proven cause.