As some of you will know, my interest in nutrition and medical science is driven primarily by my desire to keep beating the odds stacked against me by my own medical predispositions. Cardiovascular disease is quite common in my family. I myself had the first of multiple cardiovascular events in my early twenties. Judging from the survival rate of family members after such events, as well as from study data, I’m probably pretty darn lucky to still be alive today.
Over the last number of years I’ve been working hard to assist my luck by reading all I can about biochemistry, nutrition, epidemiology and anything related to cardiovascular disease, all primarily to find out what lines of action would be most likely to help me continue the lucky streak that I have been on for over two decades now. As I don’t have a background in either medical science or nutrition, and as I do have a solid background in forensics, computer science, information security, data science and data engineering, whenever I read a nutritional science or medical science paper, my first impression tends to be dominated by the (often poor) design of the trial, especially with respect to data structures, the (often substandard) use of statistical and other mathematical tooling as applied to the resulting data, and last but not least, the (often impetuous) proposed risk mitigation strategies. Some of these subjects I discussed in my earlier rant, so I won’t go deep into those again. But some I didn’t. Today I will try to focus primarily on causality claims and their implementations.
More often than not, both for my personal survival use and sometimes for use in my ongoing work on my ‘fat, veggies and feedback loops’ book, the paper itself isn’t that useful to me, given my limited knowledge of biochemistry and medical science. Instead, I will try to obtain the original data from the study so I can apply my own skill set to it and see just how strong the claims made in the paper are, as far as the data set allows for such analysis. That is often where my second issue with nutritional science and medical science rears its head: IRB firewalls and medical-qualification firewalls. Especially for US-based studies, many data sets, although often proclaimed to be open, are firewalled due to regulations regarding access to potentially sensitive information. In order for me to gain access to the data, even for personal survival usage, I would need to write a research proposal and get an Institutional Review Board to review and approve that proposal. While the raw data from EU-based studies is usually easier to get at, especially when I mention my background and make it clear it is for personal survival usage only, some interesting data sets from EU studies are still firewalled and inaccessible to me, mostly due to the reported need for med-sci qualifications. This while med-sci people apparently don’t need any real data-sci qualifications in order to write papers using inappropriate mathematics or applying sometimes flawed causality reasoning to data, but I won’t go into that again here. The main point is, some considerate nutritional and medical scientists do see fit to share their data with me, helping me to slightly improve my personal chances of surviving yet another year, and I am massively grateful for that. So much so, and some people may find this unethical, that I take great care not to bite the hand that may be keeping me alive. How different it is for US studies though.
Data is almost always completely and utterly impossible to come by for someone with my background if the study was conducted by a US-based institute. To the extent that I’ve decided to simply start ignoring medical and nutritional science papers originating from the US. This may sound completely stupid for someone in my position, but really it isn’t. Next to genetics, lifestyle and diet, stress can be a major risk factor for cardiovascular disease, something that my personal CV history seems to align with. I practice stress avoidance as a way of life. For me IRB firewalls have been a recurring source of stress in my struggle to gather potentially life-prolonging data from interesting studies. It would be ironic if the stress from failing to get my hands on life-prolonging data ended up being the thing that cuts my lifespan short. Better to just forget about the US altogether and focus on the more considerate and cooperative part of the nutritional and medical science world. But enough now about data firewalling. I want to focus on discussing causality reasoning in nutritional and medical science. Medical, but especially nutritional scientists will throw around causality claims, but are they justified?
To understand causality, and to understand where causality reasoning can go wrong, it is easiest to first look at a model we create ourselves. This may need a bit of explanation. You can simulate something based on your understanding of how things work. You can simulate something based on wild, unrealistic alternative laws of nature. And you can simulate something using ‘in between’ models. One thing these models all share is that you know exactly how the model works and no guesswork is needed on the models themselves. The great thing about running simulations with made-up parameters is that all links are indisputable and faulty reasoning can thus easily be exposed. Define a model, a stochastic model, of a fictitious reality. Introduce undisclosed sources of causality into the model and test the scientific process for discovering causal variables. If you find spurious causality in such a model and the scientific process justifies claiming it as truly causal, then you will have exposed a flaw in the scientific process for assessing causality.
To show what I mean, let us do a bit of a thought experiment. Let us define a simple yet still potentially confusing model. One of my hobbies is writing speculative fiction, and world building is one of the most fun parts of writing fiction. So let us do a bit of Sci-Fi world building, but with a causality reasoning twist to it. Let’s imagine a spaceship with immortal aliens crash-landed on a habitable planet. After having exhausted ideas for repairing the spaceship and leaving the planet they crash-landed on, the aliens have settled on the planet, living in a handful of communes. While living conditions were quite primitive for a while, the aliens have managed to bootstrap a bit of comforting technology in recent decades. Somehow however, while they were quite immortal in the past, immortal even while they were still living primitively after the crash on the planet, in the last decade aliens have started to die from previously unseen causes. The newly acquired causes of death, ordered from high to low mortality numbers, are:
- Dismemberment
- Suffocation
- Severe burns
So what could be the cause? While their technical resources on their new planet are still limited, the aliens are still quite adept at basic logical reasoning, so they set out to get to the bottom of the mysterious deaths. The first bit of data:
- Crash landed on planet 60 years ago.
- Number of deaths started rising quickly 10 years ago.
Not much to go on, but with a bit of inductive reasoning it seems only logical that prolonged exposure to something on this planet has to be causal to our aliens dying, right? So on this premise some of the aliens set out to methodically look at multiple aspects of their new planet. Their immortal species had been living on many planets for millennia without even a single death from apparently spontaneous dismemberment, the number one cause of death for the last decade, with a number of mortalities that keeps going up every year. First they looked at the food. As their new planet lacked animal life, their diet had been vegetarian since arrival, but historic records showed that large groups had lived for over a whole century on a purely vegan diet, so that theory could quickly be dismissed. On to the next one: atmospheric composition. The nitrogen levels on this planet were slightly lower compared to each of the other planets the species had inhabited, and both oxygen and different forms of oxidized carbon were slightly higher. Now that was something to work with. One of the communes, commune Gamma, had settled relatively high in the mountains, somewhere where, due to the thinner atmosphere, the amount of each of these three gases was lower. If the nitrogen was the problem, surely death rates should be higher in the mountain commune. If it was either the oxygen or the oxidized carbon, mortality would be lower. So it didn’t take long before the first epi data came in. The results were shocking. The mortality rates due to dismemberment, suffocation and severe burns were much, much lower than in any of the lower altitude communes.
- Commune Alpha: 100/100.000
- Commune Bravo: 110/100.000
- Commune Charlie: 70/100.000
- Commune Delta: 65/100.000
- Commune Gamma: 5/100.000
Apparently a big piece of the puzzle was solved. The nitrogen theory could be dismissed, but prolonged exposure to either oxidized carbon or oxygen had to be the causal factor here.
But how to figure out which one it was? A smart Bravo-commune doctor figured out a way to find out about the mechanisms. Tiny nano sensors were injected into over a hundred thousand Bravo-commune volunteers. The sensors continuously monitored levels of oxygen, carbon monoxide, carbon dioxide and carbon trioxide throughout the body. After a year the results came in from postmortem analysis of the sensor data. CO and CO2 levels shot up, while oxygen levels dropped significantly shortly before many cases of death. Verification against other sensor readings confirmed that this phenomenon did not occur to such an extent in a non-fatal way. So with oxygen actually going down, we could lay to rest the oxygen theory, while the oxidized carbon theory gained more traction. In fact, we now had three converging lines of evidence. Consilience. Something that according to many medical scientists seems to be all that is needed to prove causality:
- Historical data from other planets.
- The Gamma mountain community epi data.
- The Bravo-commune sensor data.
So this shows causality for oxidized carbon, right? Or does it? Well, ask any data engineer or data scientist while (s)he is drinking a soda and chances are you will have him/her laughing so hard that his/her nose would become an instant soda fountain. Ask a nutritional scientist who hasn’t scrolled down to the end of the page yet, and chances are a large percentage of them would agree with this being a solid causality claim. The smarter ones however will notice the problem: the starting hypothesis was prolonged exposure. How does the pre-death spike fit into that? The same goes for our aliens. While one group jumps to embrace the claim that oxidized carbon is causal to mortality, a small group starts working on an alternative theory: what if oxygen is causal after all? Looking at chemical mechanisms, a rise in oxidized carbon that coincides with a drop in oxygen could indicate it’s the actual oxidation process that leads to mortality, making oxygen, not oxidized carbon, the root cause of mortality.
As the dissident alien scientists going against the consensus are stirring up controversy, a large Randomized Controlled Trial is started to settle the issue once and for all. The participants are made to breathe a controlled gas mixture that has either its oxygen or its oxidized carbon levels lowered, to test the two competing theories head to head. A third group, the placebo group, receives a gas mixture that has the exact same composition as the planet’s atmosphere. To the surprise of everyone, there is absolutely no statistically significant difference between the mortality of the three groups. More surprising though is that all-cause mortality and mortality from suffocation are lower for each group than they are for the general population. Now instead of the two theories being reconsidered, the two camps entrench themselves behind their respective theories. The oxygen theory scientists see the results as debunking the oxidized carbon theory, while confirming their own theory in the sense that preventing the exposure of oxygen to ‘external’ oxidation is somehow protective. The oxidized carbon theory supporters simply extend their theory by reexamining the sensor data. They fit the data of both trials and come up with carbon monoxide as the new and improved risk marker. So now the new competing theories are:
- Carbon monoxide exposure is causal to all three mortality causes.
- Oxygen exposure, through its oxidative effect on carbon is causal to all three mortality causes.
So who is right? Is anyone? Was either of the two camps justified in claiming to have established causality? Let’s add a few more pieces to the puzzle.
Comparing different metrics from the five communes, a student discovers an absolutely massive correlation between electricity usage and mortality. Especially if commune Gamma is excluded, the association is close to a straight line between the remaining data points. A perfect linear association. Given that the statistical properties of the electricity-mortality link are quite clearly too strong to just ignore, scientists from both camps struggle to integrate electricity into their model. Electricity as prime cause seems impossible, as electricity has been used by our alien species for many, many millennia. And then there is the Commune Gamma paradox. Using electric heating, the Gamma community has the highest use of electricity while having the lowest mortality. The electricity can’t possibly be causal, right? Or can it?
Looking under the hood
So let’s look at the actual mechanisms now, which is quite easy in our case as we made the whole thing up to begin with. Let’s look at why all of the above is so terrible from a causality reasoning point of view, and why I think similar processes are likely to be present in nutritional and medical science in our non-made-up world. Let’s look at what our made-up causal factors actually were.
After the spaceship crashed and the communes were established, electrical wiring was installed in all the newly built houses according to the limited use of electricity in those days. As use of technology expanded over time, so did the use of electricity, occasionally straining the old wires beyond their originally designed capacity, causing insulation to partially melt, which in some cases could lead to sparks. While the mountain commune houses were built from rocks, the other communes used wood to build their houses. Wood that could allow the electrical sparks to start a small fire. As the aliens have found large deposits of natural gas, they have started using canisters filled with that gas for cooking and for heating their houses in winter. Often, the canisters are stored in a hallway cabinet a few feet away from a main electrical control panel. The fires in our made-up world are causal to each of the three causes of death. Severe burns through direct contact with the fire, suffocation through smoke inhalation and dismemberment through gas explosions.
Causality vs provable causality
So it was electricity after all that was causal here. Could the aliens have known this? No, they probably couldn’t, at least not using the data they were using. No one with a data science or data engineering background would have ever used the word ‘causal’ in any of the findings above. “Causal” is the strongest of many claims that can be made based on observed associations, and it is a claim that no data person will dare to make without massive evidence. Not just massive in the form of converging lines of independent evidence at that. There are additional requirements for truly backing up claims of causality. One thing fundamental to causation is that correlations between variables don’t really matter all that much in terms of weight of evidence. Yes, if there is causation, or rather if there is ‘provable’ causation, there will be correlation. That is, as simulations can be made to show, there could be a causal link even if there is a negative correlation. It is just that the statistical tools at our disposal are not even close to strong enough to ever truly prove causality under such conditions. A basic rule of common sense is: forget about trying to prove a causal link if there is no correlation or a negative correlation. There are causal links that cannot possibly be proven without stepping over insurmountable ethical issues, especially in medical science, and we need to accept that. Simulations can help prove the unprovability of causality, and I truly would like to advise simulation-savvy medical and nutritional scientists to add such practices to their causality reasoning toolbox. Simulate what you believe the world to behave like, then come up with a wide range of rule-sets that assign a non-causal yet strong association to what you suspect to be causal. Then run a shitload of simulations for each of your alternative models and look for spurious causality in your non-causal models. 
Given the limitations of computing power you will most likely run into, use quick and inaccurate statistical tests and patterns of natural selection first to find promising alternative worlds, before running serious simulations with proper statistics, but these are technical details that are irrelevant to understanding the basic ideas of proving non-provability. If you can prove that you can’t prove something with available data and mathematical tooling, then claiming causality based on that data becomes educated guessing.
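The claim that a true causal link can hide behind a negative correlation is easy to check in a made-up world. What follows is a minimal sketch under invented assumptions: a hidden confounder `z`, a true effect of `x` on `y` of +1.0 per unit, and a confounder effect of -3.0. None of these names or numbers come from any real study. Because we wrote the rules, we can also do what is often ethically impossible in reality: set `x` ourselves and watch what happens.

```python
import random


def pearson(a, b):
    """Plain Pearson correlation coefficient of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def simulate_world(n=5000, seed=7, intervene=False):
    """Return corr(x, y) in a toy world where x truly raises y by +1.0 per unit."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # hidden confounder (invented for this sketch)
        if intervene:
            x = rng.gauss(0, 1)  # we set x ourselves, independent of z
        else:
            x = z + rng.gauss(0, 0.5)  # left alone, z drives x
        y = 1.0 * x - 3.0 * z + rng.gauss(0, 0.5)  # x helps, z hurts more
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


observed_r = simulate_world(intervene=False)
intervened_r = simulate_world(intervene=True)
print(f"observational correlation:  {observed_r:+.2f}")
print(f"interventional correlation: {intervened_r:+.2f}")
```

In the observational run the correlation between `x` and `y` comes out clearly negative, even though every unit of `x` adds to `y` by construction. Only the run where `x` is assigned independently of `z` shows the positive link.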
Real tests for causality don’t just look at correlating variables; they (also) look at correlating time series. I won’t get into too much detail, but today there are multiple tools for exploring possible causal links: Bayesian methods, convergent cross mapping (CCM) and Granger causality are notable examples. I’m personally not a big fan of Granger, especially in the hands of the underqualified.
In fact, as far as nutritional science and medical science are concerned, I consider Granger to be the linear regression of causality tests. Just as the uninformed usage of linear regression will spawn many spurious correlations that receive undue weight, so will the uninformed use of Granger spawn spurious causality. We have multiple statistical tools at our disposal as far as causality is concerned. One tool is particularly useful for those of us who, like me, aren’t part of the relatively small group of data science demi-gods, and that is Monte Carlo.
Build a non-causal hypothesis spawner, generate time series and repeatedly run the causality tests on each generated time series. Doing so will get you a poor man’s meta distribution for your results and will help you weigh the risk of your finding being spurious. Basically you compensate for lack of mathematical aptitude with computer processing power. You are engineering your way through a science problem, but that’s OK, as the alternative is accepting a rather high risk of your mathematical hubris resulting in a spurious causal link being claimed as genuine.
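To make the idea concrete, here is a minimal sketch of such a spawner, under loud assumptions. The ‘non-causal worlds’ are pairs of independent random walks, and the ‘causality test’ is a deliberately naive lagged correlation (`lagged_score`, a hypothetical stand-in, not Granger or CCM). In practice you would plug in whatever test you actually intend to use; the function names and parameters are made up for illustration.

```python
import random


def random_walk(rng, length):
    """A driftless random walk; a stand-in for any trending measurement."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        series.append(value)
    return series


def lagged_score(x, y, lag=1):
    """Pearson correlation between x[t] and y[t + lag] (naive stand-in test)."""
    a, b = x[:-lag], y[lag:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / (var_a * var_b) ** 0.5


def null_distribution(n_worlds=500, length=100, seed=42):
    """Scores obtained in worlds where x and y are independent by construction."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_worlds):
        x = random_walk(rng, length)
        y = random_walk(rng, length)  # never looks at x
        scores.append(abs(lagged_score(x, y)))
    return sorted(scores)


def empirical_p(observed, null_scores):
    """Share of non-causal worlds scoring at least as high as the observation."""
    hits = sum(1 for s in null_scores if s >= abs(observed))
    return (hits + 1) / (len(null_scores) + 1)


null_scores = null_distribution()
share_strong = sum(s > 0.5 for s in null_scores) / len(null_scores)
print(f"non-causal worlds scoring above 0.5: {share_strong:.0%}")
print(f"empirical p for an observed score of 0.6: {empirical_p(0.6, null_scores):.2f}")
```

The sorted list of scores is the poor man’s meta distribution. If a sizeable share of worlds without any causal link produce a score as impressive as yours, your score is not evidence of much.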
In medical science, but significantly more so in nutritional science, the word ‘causal’ is used rather frivolously, and especially in nutritional science, the fear of spurious causality seems almost non-existent. You can have a bit of fun with Google by trying out a ‘causes’ search query with a random disease, food or nutrient, and chances are you will find some paper for many such queries. This can be particularly stressful and irritating if you are trying to minimize your personal risk. Today’s causal treatment can be tomorrow’s causal risk factor and vice versa in nutritional science. And I’m not even exaggerating here.
Let’s go back to our alien colony for a moment. Let’s have a short look at our real causality tree there and zoom in a bit. We start at the end of our tree. The mortality branches. The top cause of mortality from our list was dismemberment. One step up the causality branch we find explosions. Explosions that happen when fire comes into contact with our natural-gas canisters. From a time series perspective, the fire is clearly causal to the explosion that resulted in dismemberment. The gas canister however is a whole different matter and an interesting subject of debate. In forensics, the gas canister would be considered pre-conditionally accessory to the causality of fire with respect to the explosion. But this really is a matter of nuance in the use of terminology in different fields, so we accept that in nutritional science the fire and the canister apparently have the same claim to being causal. Doing so however moves the bar. Now, to prove causality for the gas canister without knowing about fire, we need to abandon the time series approach as that won’t get us anywhere, and we will need to focus on proving the pre-conditional nature of the canister. That is, we need to show that gas canisters are part of a class of things without which an as yet undetermined causal factor won’t result in an explosion. Fair enough. A problem though: without having established fire as causal, the causality of gas canisters could never be proven, period. Lacking the grounding in a time series by means of fire, the gas canisters would remain simply a very strong correlation and nothing would warrant their classification as causal at that point. We can say the canisters are causal according to the nutritional science definition of causal, but that is because we fully understand the mechanism and have identified fire as causal in the classical time-series definition of causal. 
Our aliens, lacking this knowledge, wouldn’t have had the data to classify the canisters as causal up until the point where they had identified the fire as (time-series) causal in the first place.
Post hoc ergo propter hoc
Now let’s look at the causal factors our aliens came up with themselves. CO and CO2 for starters. Here we come to identify a common causality reasoning fallacy known as “post hoc ergo propter hoc”: the spike in oxidized carbon comes just before death, so it might seem logical to assume it actually causes death. In our situation it turns out there was an unknown cause, fire, that was causing both the rise of these gases and the deaths. The fact that, for example, death by dismemberment most commonly is a later effect of the fire than the CO and CO2 changes creates the illusion that the CO and CO2 changes are time-series causal while they are not.
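The fallacy can be reproduced in a few lines. The sketch below is a made-up toy, not a model of any real data: the probabilities and the hypothetical `scrub_co` intervention (a device that removes the CO spike but leaves the fire alone) are invented for illustration. By construction, death depends on fire only.

```python
import random


def household_year(rng, scrub_co, p_fire=0.05, p_death_given_fire=0.5):
    """One household-year; fire is the only thing that can kill here."""
    fire = rng.random() < p_fire
    co_spike = fire and not scrub_co  # fire raises CO unless it is scrubbed
    death = fire and rng.random() < p_death_given_fire  # CO plays no role
    return co_spike, death


def run(n=100_000, scrub_co=False, seed=1):
    """Simulate n household-years and tally spikes and deaths."""
    rng = random.Random(seed)
    spikes = deaths = deaths_after_spike = 0
    for _ in range(n):
        co_spike, death = household_year(rng, scrub_co)
        spikes += co_spike
        deaths += death
        deaths_after_spike += co_spike and death
    return {
        "spikes": spikes,
        "deaths": deaths,
        "deaths_after_spike": deaths_after_spike,
        "mortality": deaths / n,
    }


observed = run(scrub_co=False)
scrubbed = run(scrub_co=True)
print("deaths preceded by a CO spike:",
      observed["deaths_after_spike"], "of", observed["deaths"])
print("mortality without scrubbing:", observed["mortality"])
print("mortality with scrubbing:   ", scrubbed["mortality"])
```

In the observational run every single death is preceded by a CO spike, which looks as causal as it gets. Removing the spike changes nothing about mortality, because the spike and the deaths share a cause rather than one causing the other.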
The danger of incomplete causality knowledge.
The second candidate our aliens came up with was oxygen. It would be totally justified to claim low O2 levels as causal in the suffocation branch of our mortality numbers. From that perspective, installing oxygen dispensers that trigger on dropping O2 levels could be thought of as a life-saving measure. Knowing the real cause and the mechanisms of fire however, we will all realize the dangers of this line of thought. Adding extra oxygen to a burning fire might help people not die as quickly from suffocation, but it will make the fire burn that much higher and will surely only increase all-cause mortality. Now what if it were discovered that O2, like our gas canisters, was pre-conditionally accessory to causality in two of our three causes of death? It may seem logical to conclude oxygen levels should be lowered. Without knowing exactly when O2 reduction would be beneficial, our aliens might simply opt to go for a lifetime exposure model. Lifetime oxygen levels could be regulated down to a level that would be detrimental to the overall health of our aliens. Trying to reduce lifetime exposure to oxygen may help lower mortality from fires, but it would be akin to using a death star to get rid of a cockroach infestation in your barn.
Root cause versus central cause and surrogates.
If we look further down the causality path, we see that the root cause of all our mortality is increased use of electricity. The electricity causes melting of the insulation of the wires, which causes sparking of the wires, which causes the wood of the houses to burn. While in our example we have seen only one cause of fire, it is quite possible that other circumstances could be leading to fires as well. As such, from a data engineering perspective, looking for root causes is considerably less likely to yield convincing causality evidence than looking for a central cause. Once a central cause is identified though, looking for root causes can become considerably easier. Still, spurious causality remains a possibility even for apparently strong central causes, so we should remain very wary about promoting any central cause, no matter how convincing, to the status of surrogate endpoint. Wrongfully assigning a surrogate status to a variable can potentially be disastrous. Nutritional science is plagued by surrogates. High blood pressure, high Body Mass Index, high Low Density Lipoprotein, all examples of variables that were promoted to the status of surrogate endpoints and have been spawning wild unvalidated claims from studies ever since. The danger of promoting a possibly centrally causal variable to a surrogate is that it creates a shortcut in the scientific process, keeping essential data from being gathered in the first place, thus making the initial hypothesis unfalsifiable through economic mechanisms.
An issue with proving causality through intervention arises from statefulness. In our example, cutting off the electricity when a decrease in oxygen levels is detected would have no effect on gas explosions. As cutting off electricity will not undo the fire having started, lack of knowledge about the statefulness would keep us from discovering electricity as a causal factor for our gas explosions. In fact, such an interventional study may end up claiming that it has shown convincingly that electricity isn’t a causal factor. On the other hand, we can’t just dismiss such outcomes. Basically any study that doesn’t agree with the idea of causality should be closely examined and not be frivolously discarded. So how to deal with such outcomes? Basically, this study would raise the bar for causality proof. We would need to identify the stateful component, either by identifying fire as part of the causal chain, or through time series analysis. When creating a causality model, it is important to consider the existence of a hidden Markov model.
I will not go into the technical details, but the basic idea is that statefulness can often be modeled as a finite set of states combined with a set of defined state transitions between those states. Each state transition can be described as a simple probability or a conditional probability, where external events (possible direct causal factors) and already present conditions (possible pre-conditional accessories to causality) are part of some function for the state transition. Looking at our alien communes, we can identify a set of states in our causal chains.
- A) A base problem-free state.
- B) A state where part of the insulation of the electrical wires has melted.
- C) A state where electrical sparks arc between exposed wires.
- D) A state where there is fire burning and smoke is forming.
- E) A state where fire is close to gas canisters.
- F) A state where gas canisters are exploding.
- G) A state where someone died from fire exposure.
- H) A state where someone died from suffocation.
- I) A state where someone died from dismemberment.
In our extremely simplified alien commune scenario, none of our mortality states are reachable from A without traversing the A->B->C->D states. In short we have the following chains leading to mortality.
It is in the conditional aspects of the probabilities of state transition that our causal factors hide. While old newspapers might induce a C->D transition, eventually leading to mortality, no amount of old newspapers has any impact whatsoever on the probability of the A->B transition. So is it justified to classify newspapers as a causal factor? Within the context of a Markov model it is. Classifying newspapers as causal without such Markov model context, however, will be highly misleading unless we identify the probability-influencing factors for the preceding state transitions. Let’s look a bit closer at the individual states and some variables that could influence state transition probabilities:
- Electrical current levels: A->B
- Voltage: B->C
- Dust: B->C
- Moisture: B->C / C->D
- Old papers: C->D / D->E
- Home construction (Wood/Stone): C->D / D->E
- Oxygen: C->D / D->E / *->H
- Gas canister presence: D->E
A danger of not looking at state is shown in the oxygen effects. In states C and D, oxygen is a causal factor. Yet in states D, E and F, it is a lack of oxygen that is a causal factor for H. Depending on the state and the mortality outcome looked at, intervention through active changes made to oxygen levels could either be protective or be the thing that triggers our state machine to transition to its next state down a potentially mortal path.
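The state machine above is small enough to simulate directly. The following is a minimal sketch with an ordinary, fully visible Markov chain; every transition probability and the `env` keys (`load`, `wood`, `canister`) are invented for illustration, and only a few of the listed variables are modeled. It is meant to show how conditional transition probabilities hide causal factors, not to reproduce the commune mortality figures.

```python
import random


def step(state, env, rng):
    """Advance one household by one time step (all probabilities made up)."""
    r = rng.random()
    if state == "A":  # intact wiring; current load drives insulation melt
        return "B" if r < 0.02 * env["load"] else "A"
    if state == "B":  # melted insulation; sparks follow sooner or later
        return "C" if r < 0.5 else "B"
    if state == "C":  # sparks; only a wooden house can catch fire
        return "D" if r < (0.5 if env["wood"] else 0.0) else "C"
    if state == "D":  # fire and smoke
        if r < 0.1:
            return "G"  # death from burns
        if r < 0.3:
            return "H"  # death from suffocation
        if r < 0.6 and env["canister"]:
            return "E"  # fire reaches the gas canisters
        return "D"
    if state == "E":  # fire close to canisters
        return "F" if r < 0.5 else "E"
    if state == "F":  # explosion
        return "I"  # death from dismemberment
    return state  # G, H and I are absorbing


def simulate(env, n=2000, steps=50, seed=0):
    """Count deaths per cause among n households after a number of steps."""
    rng = random.Random(seed)
    deaths = {"G": 0, "H": 0, "I": 0}
    for _ in range(n):
        state = "A"
        for _ in range(steps):
            state = step(state, env, rng)
        if state in deaths:
            deaths[state] += 1
    return deaths


wood_low = simulate({"load": 1, "wood": True, "canister": True})
wood_high = simulate({"load": 3, "wood": True, "canister": True})
stone_high = simulate({"load": 3, "wood": False, "canister": True})
no_canister = simulate({"load": 3, "wood": True, "canister": False})
for name, result in [("wood, low load", wood_low), ("wood, high load", wood_high),
                     ("stone, high load", stone_high), ("wood, no canister", no_canister)]:
    print(f"{name:18} {result}")
```

In this toy world a stone house with a high electrical load never gets past state C, which mirrors the Commune Gamma paradox: the factor that drives A->B is still there, but a later conditional transition has probability zero. Likewise, removing the canisters eliminates state I without touching the other two causes of death.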
Asymmetry of proof
An often overlooked concept in nutritional science is asymmetry of proof. It might not sound fair, but for proof of a hypothesis to hold, especially one claiming causality, trials and data sets are not a democracy. In principle each and every data set or trial that is well designed could claim a veto with respect to the validity of a hypothesis. Well, not completely like a veto; it is more like an alibi in a criminal trial. You will need to convincingly show an alibi to be unreliable even if multiple persuasive pieces of evidence point to your suspect being guilty. At a fundamental level, it takes less powerful evidence to get someone acquitted than it takes to get that person convicted of a crime. That is how the legal system works in most countries with a modern notion of law. Or at least how it is designed to work. And even if someone is in jail for a crime, if new evidence is brought to light that wasn’t known at the time this person was convicted, and if this new evidence casts serious doubt on the conviction, then in the legal system of many a country the convict will not just be freed from jail, he will actually receive a lump sum in compensation for the part of his sentence already served. Basically a big ‘sorry about that’ from the legal system. Science basically works the same way, or rather, was designed to work the same way. Think about a claim of causality as you think about a criminal getting convicted. If someone in your family died from unnatural causes, a tragedy in itself, wrongfully getting convicted for murdering your own family member, getting jailed for life and becoming an outcast in your own family would make this tragedy even less bearable. 
A legal system that works correctly won’t stop 100% of innocent people from getting punished for things they didn’t do, but it will work hard to keep the number of wrongful convictions within a few percent of all convictions, accepting that this will inevitably result in some guilty individuals walking free. While nutrition might seem like a victimless field, since nobody goes to jail when meat is wrongfully convicted of causing cancer, when bread is wrongfully convicted of causing the obesity epidemic or when saturated fat is wrongfully convicted of causing cardiovascular disease, we must remember the oxygen example. A suspect variable, sometimes even a causal variable, might actually be protective depending on state. Or might be protective with respect to other related causes of death or diseases. The nutritional equivalent of jailing the wrong suspect can be lethal. Especially if the conviction is made irreversible by declaring surrogate endpoints. A single piece of evidence can free someone convicted of a crime, even if that conviction took place based on dozens of pieces of circumstantial evidence as well as a few pieces of seemingly solid evidence. If you can’t explain the new evidence away, then the suspect should walk and the investigation should be reopened. This same concept applies to causality research in each and every field of science, apart from, apparently, nutritional science and a subset of medical science. Asymmetry of proof is one of the cornerstones of the scientific process. No, it’s not fair. Science is not a democracy in multiple ways. A single well designed study can yield a data set that blows away massive consilience as far as the indisputability of causality is concerned.
If we look at what in data science constitutes solid proof of a causal relationship, and especially if we look at the wording used in causality reasoning with respect to statefulness, we see a major disconnect when we look at the use of the same terminology in nutritional science, and to a lesser extent medical science. Surrogates used in nutritional science tend to lack the robust causality proof that would place them in an indisputable position in the causal chain. In fact, none of the most commonly used surrogate endpoints even comes close to having the level of proof needed to establish it as being causal. While conflicting use of the same terminology could be acceptable, the position of surrogate endpoint assigned to variables creates perverse economic filters that prohibit solid causality-geared research from continuing. Combined with an over-reliance on consilience, linear regression and Granger, and the overeager dismissal of the concept of asymmetry of proof on which the scientific process is built, this means that from a data-engineering point of view, nutritional science should be classified as a science of educated guesses. A soft science if ever there was one. One masquerading though as a hard field of science.