Why grouping fruit and veggies together in an interventional study is probably a bad idea.
In this blog post I want to look at nutrition groups. Specifically, I want to look, in an objective way, at the nutrition profile of fruit compared to other food groups. In nutrition studies, fruit is often grouped with vegetables, but is this actually a fair grouping? I want to use a public nutrition database and some basic Python pandas functionality to check whether it is justified.
Getting the data
We start off by getting some nutrition info from https://fineli.fi/fineli/fi/avoin-data
The unpacked zip file contains a number of csv files that we will load into pandas.
%matplotlib inline
import math
import numpy as np
import pandas
import matplotlib.pyplot as plt
component_value = pandas.read_csv("component_value.csv", sep=';', decimal=',')
food = pandas.read_csv("food.csv", sep=';', encoding='latin1')
foodname = pandas.read_csv("foodname_EN.csv", sep=';', encoding='latin1')
fuclass = pandas.read_csv("fuclass_EN.csv", sep=';')
component_value = component_value[component_value['EUFDNAME'].apply(lambda x: isinstance(x, (str)))]
eufdname = pandas.read_csv("eufdname_EN.csv", sep=';')
Normalizing the data
The next step is to normalize the nutrient data, so that from here on we can work with distances between normalized nutrient vectors.
To do this, we take the mean and standard deviation of each nutrient across the nutrition database and use them to convert the nutrient values to z-values. We create a new data frame with foods as rows and normalized nutrients as columns.
df = pandas.merge(left=food[["FOODID","FUCLASS"]], right=fuclass[["THSCODE", "DESCRIPT"]], \
how='left', left_on="FUCLASS", right_on="THSCODE")[["FOODID","DESCRIPT"]]
foodshort = foodname[["FOODID","FOODNAME"]]
df = pandas.merge(how='left', right=df, left=foodshort, left_on="FOODID", right_on="FOODID")
for comp in component_value["EUFDNAME"].unique():
    # All foods that have a value for this nutrient
    filtered = component_value[component_value["EUFDNAME"] == comp][["FOODID","BESTLOC"]]
    std = filtered.loc[:,"BESTLOC"].std(axis=0)
    mean = filtered.loc[:,"BESTLOC"].mean(axis=0)
    # z-value: how many standard deviations from the database mean
    filtered[comp] = (filtered["BESTLOC"] - mean) / std
    filtered = filtered[["FOODID", comp]]
    df = pandas.merge(left=df,right=filtered, how='left', left_on='FOODID', right_on='FOODID')
# A missing value becomes 0, which treats the food as average for that nutrient
df = df.fillna(0)
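To make the normalization step concrete, here is a minimal sketch of the same z-value idea on a toy table. The food names, nutrient columns and numbers are made up for illustration and do not come from the Fineli data; `zscore_columns` is a hypothetical helper, not part of the analysis above.

```python
import pandas as pd

# Toy per-100 g nutrient values; names and numbers are invented for illustration.
toy = pd.DataFrame(
    {"sugar": [10.0, 2.0, 60.0], "fibre": [2.0, 3.0, 1.0]},
    index=["fruit_x", "veg_y", "candy_z"],
)


def zscore_columns(table: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to mean 0 and sample standard deviation 1."""
    return (table - table.mean()) / table.std()


z = zscore_columns(toy)
print(z.round(3))
```

After this rescaling every nutrient carries the same weight in a distance calculation, regardless of whether it was originally measured in grams or micrograms.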
Food groups
Now that we have our normalized data, let's have a look at fruit as a group and see how that group compares to the other groups in our data set.
fruit = df.loc[df['DESCRIPT'] == 'Fruits']
vegies = df.loc[df['DESCRIPT'] == 'Vegetables']
# Mean nutrient vector per group; numeric_only skips the text columns
# (needed on recent pandas) and [1:] drops the leading FOODID entry.
fruit_mean = fruit.mean(numeric_only=True)
reference_vectordistance = np.linalg.norm((vegies.mean(numeric_only=True) - fruit_mean).values[1:])
rowlist = []
for foodtype in df["DESCRIPT"].unique():
    if foodtype != 'Fruits':
        other = df.loc[df['DESCRIPT'] == foodtype]
        vectordistance = np.linalg.norm((other.mean(numeric_only=True) - fruit_mean).values[1:])
        # Relative distance: 1.0 is the fruit-to-vegetables distance
        if vectordistance/reference_vectordistance < 10.0001:
            row = dict()
            row["foodtype"] = foodtype
            row["reldistance"] = vectordistance/reference_vectordistance
            rowlist.append(row)
peergroups = pandas.DataFrame(rowlist)
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(peergroups.sort_values(by=['reldistance']))
foodtype reldistance
106 Baby fruit and berry product 0.254584
24 Juices 0.324371
23 Fruit and berry salads 0.419500
88 Vegetable salads 0.433896
27 Juice drink 0.499597
89 Fruit and berry soups 0.512050
82 Fruit and berry dishes other than pies 0.572982
67 Other drinks 0.619280
84 Vegetable soups 0.652777
107 Baby vegetable product 0.655973
65 Soft drink with sugar 0.710715
20 Vegetable juices 0.716880
95 Pulse soups 0.729302
83 Potato dishes 0.742103
86 Cooked vegetables 0.744533
90 Vegetable sauces 0.753597
87 Vegetable dishes 0.764830
109 Baby fish dish 0.774588
68 Drinking water 0.778795
97 Meat soups 0.780833
42 Yoghurt 0.805619
93 Pulse sauces 0.806330
38 Milks skimmed 0.809094
108 Baby meat dish 0.825446
70 Porridge 0.829892
116 Sport drink 0.831736
100 Poultry soups 0.842000
41 Cultured milks 0.845855
59 Coffee 0.846159
60 Tea 0.850300
15 Cooked potatoes 0.889403
85 Mixed salads 0.897260
94 Pulse dishes 0.897811
37 Milks >2% fat 0.907533
39 Soured milks 0.912002
81 Milk desserts 0.920313
44 Milks <2% fat 0.928574
69 Drinks with artificial sweeteners 0.935823
112 Seafood soup 0.936404
79 Milk sauces 0.999358
17 Vegetables 1.000000
43 Quark 1.000320
120 Dietary supplement 1.013490
5 Savoury sauces 1.037593
25 Berries 1.054466
103 Fish soups 1.064546
16 Fried potatoes, French fries 1.106613
91 Prepared salads with mayonnaise 1.109879
78 Panncakes 1.132018
101 Poultry dishes 1.169911
19 Mushroom dishes 1.174364
49 Ice cream 1.194230
111 Seafood dishes, crustacean and molluscs 1.225429
96 Meat sauces 1.226042
92 Meat dishes 1.229997
105 Dessert sauces 1.237735
40 Fermented milk products, other 1.259492
102 Poultry sauces 1.285769
12 Rice 1.313129
45 Cream 1.329619
110 Seafood sauces 1.334021
118 Pizza 1.337806
73 Savoury bakery 1.343459
10 Pasta 1.345270
4 Condiments 1.378994
75 Sandwiches and burgers 1.403498
74 Sweet bakery 1.447861
26 Jams and marmalades 1.461037
66 Ciders 1.513203
104 Fish sauces 1.523617
77 Buns 1.609077
80 Egg dishes 1.611915
98 Fish dishes 1.682396
76 Wheat bread 1.689758
99 Sausages 1.708877
72 Bread, mixed flour 1.730758
21 Pulses 1.739442
46 Cheese, unripened, fresh cheese 1.739644
18 Canned vegetables 1.781269
117 Cold cuts, sausages 1.807501
57 Crustaceans and molluscs 1.904892
53 Cold cuts, meat 1.998762
71 Rye bread 2.126735
114 Savoury biscuits 2.146351
11 Sweet biscuits 2.227579
3 Miscellaneous ingredients 2.309356
52 Chicken and other birds 2.392425
64 Other alcoholic beverages 2.423097
7 Cereal bars 2.439470
48 Processed cheese 2.451552
50 Steaks and chops 2.459782
13 Breakfast cereals 2.524057
8 Flour 2.544412
22 Pulse products 2.647546
55 Fish 2.676576
47 Cheese, ripened cheese 2.809589
14 Savoury snacks 2.811795
61 Beers 2.856793
115 Meal replacements 2.947267
2 Chocolate 3.004876
62 Wines 3.229641
1 Confectionery 3.240278
34 Blended spread < 55 % 3.340992
113 Infant formulas and human milk 3.342981
36 Margarine and fat spread < 55% 3.349092
9 Nuts, seeds and dried fruits 3.505547
119 Sport food 3.511651
51 Meat products 3.523952
58 Egg 3.555769
56 Fish products 3.588763
33 Salad dressings and mayonnaises 3.628788
6 Spices 4.116614
29 Blended spread >= 55 % 4.244458
54 Offal dishes 4.397559
0 Sugar and syrups 4.436681
35 Margarine and fat spread >= 55% 4.517717
28 Butter, milk fat 5.055117
30 Cooking and industrial fat 5.283595
32 Animal fat 5.471986
31 Oils 7.149777
63 Spirits 8.044268
Notice how sugar-sweetened soft drinks (SSBs), as a group, are 29% closer to fruit, nutritionally, than vegetables are, at least according to our simple metric. Even drinking water and yoghurt are closer. That doesn't give us much to justify grouping fruit with vegetables in nutrition studies.
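The relative distance used in the table can be sketched in a few lines on toy data. This is only an illustration of the metric under stated assumptions: the groups and z-values below are invented, and `relative_distances` is a hypothetical helper rather than the code that produced the table.

```python
import numpy as np
import pandas as pd

# Toy z-scored nutrient table; groups and numbers are invented for illustration.
toy = pd.DataFrame({
    "group": ["Fruits", "Fruits", "Vegetables", "Vegetables", "Soft drinks"],
    "sugar": [0.5, 0.7, -0.5, -0.3, 0.9],
    "fibre": [0.1, 0.3, 0.8, 1.0, -0.4],
})


def relative_distances(table, target="Fruits", reference="Vegetables"):
    """Distance of each group mean to the target group mean,
    expressed in units of the target-to-reference distance."""
    means = table.groupby("group").mean(numeric_only=True)
    # Euclidean distance between each group mean and the target mean
    dist = np.sqrt(((means - means.loc[target]) ** 2).sum(axis=1))
    return (dist / dist[reference]).drop(target).sort_values()


rel = relative_distances(toy)
print(rel)
```

By construction the reference group always scores exactly 1.0, so any group with a value below 1.0 is closer to fruit than vegetables are.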
And this is only the distance between the means of these food groups. Next, let's pick a single fruit and compare it to individual foods outside the vegetables group.
A Specific fruit
We looked at this for groups; now let's look at a specific fruit, one of my own favorites: a melon. And let's not look at SSBs, drinking water and yoghurt, but at foods generally thought of as unhealthy, which few people would think of comparing to a healthy piece of fruit. We take a look at McDonald's food and at chocolates and see how they compare to a melon.
# The reference food is a honeydew melon
melon = df.loc[df['FOODNAME'] == 'HONEYDEW MELON, WITHOUT SKIN']
# Shift every nutrient column so that the melon sits at the origin
for header in df.columns:
    if header not in ["FOODID","FOODNAME","DESCRIPT"]:
        df[header] = df[header] - melon[header].values[0]
rowlist = []
for index,row in df.iterrows():
    name = row.values[1]
    foodtype = row.values[2]
    vector = row.values[3:].astype(float)
    # Distance to the melon, in units of the fruit-to-vegetables group distance
    distance = np.linalg.norm(vector)/reference_vectordistance
    if "MCDONALD" in name or foodtype == "Chocolate":
        row = dict()
        row["food"] = name
        row["distance"] = distance
        rowlist.append(row)
peerfood = pandas.DataFrame(rowlist)
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(peerfood.sort_values(by=['distance']))
distance food
21 0.956963 MILKSHAKE, VANILLA, MCDONALD'S
23 1.515775 HAMBURGER, MCFEAST, MCDONALD'S
10 1.589669 HAMBURGER, BEEF AND WHEAT ROLL, MCDONALD'S
11 1.647408 HAMBURGER, CHEESE BURGER, MCDONALD'S
13 1.647526 HAMBURGER, DOUBLE BURGER, BIG MAC, MCDONALD'S
12 1.728808 HAMBURGER, CHICKEN BURGER, MCDONALD'S
22 1.986608 HAMBURGER, DOUBLE CHEESE BURGER, MCDONALD'S
1 2.542666 CHOCOLATE CONFECTION FILLED WITH MARMALADE
14 2.915553 CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX
24 2.915631 CHOCOLATE CONFECTION FILLED WITH CHOCOLATE
18 2.978043 SUFFELI CHOCOLATE BAR, WAFFLE, TOFFEE FILLING ...
7 3.174270 CHOCOLATE BAR, LOW-FAT
6 3.224222 CHOCOLATE BAR WITH FILLING, AVERAGE
2 3.336511 CHOCOLATE BAR, AVERAGE
15 3.350807 SUFFELI PUFFI SNACKS,PUFFED CORN AND CHOCOLATE...
3 3.354410 CHOCOLATE, PLAIN, DARK CHOCOLATE
8 3.409912 CHOCOLATE, WHITE CHOCOLATE
0 3.431798 CHOCOLATE, AVERAGE
16 3.469495 CHOCOLATE NUT SPREAD
20 3.509146 CHOCOLATE, MILK CHOCOLATE WITH HAZELNUTS
4 3.763362 CHOCOLATE, MILK CHOCOLATE
17 3.789385 KINDER CHOCOLATE EGG
9 3.863914 RICE CHOCOLATE
19 4.175258 CHOCOLATE, PLAIN, DARK CHOCOLATE, 80%
5 5.557062 CHOCOLATE, ARTIFICIALLY SWEETENED
Notice that the milkshake's distance to the melon (0.96) is smaller than the distance between the average vegetable and the average fruit (1.0 by definition). Now let us pick a few nice ones from this list: the milkshake, the double cheeseburger and the Twix candy bar, and see how individual vegetables compare to these:
count1 = 0
count2 = 0
count3 = 0
tcount = 0
for index,row in df.iterrows():
    foodtype = row.values[2]
    vector = row.values[3:].astype(float)
    distance = np.linalg.norm(vector)/reference_vectordistance
    if "Vegetables" == foodtype:
        tcount += 1
        # Thresholds are the melon distances printed above
        if distance > 2.915553:  # Twix
            count3 +=1
        if distance > 1.986608:  # double cheeseburger
            count2 +=1
        if distance > 0.956963:  # milkshake
            count1 +=1
print("* A milkshake is nutritionally closer to a melon than", count1,"out of", tcount,"vegetables.")
print("* A double cheeseburger is nutritionally closer to a melon than", count2,"out of", tcount, "vegetables.")
print("* A Twix candy bar is nutritionally closer to a melon than", count3,"out of", tcount,"vegetables.")
* A milkshake is nutritionally closer to a melon than 71 out of 103 vegetables.
* A double cheeseburger is nutritionally closer to a melon than 26 out of 103 vegetables.
* A Twix candy bar is nutritionally closer to a melon than 13 out of 103 vegetables.
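The counting above can also be written without an explicit loop. The following is a minimal sketch on toy data, assuming a table that is already z-scored; the foods, groups and numbers are invented, and `count_farther` is a hypothetical helper rather than the code used for the results above.

```python
import numpy as np
import pandas as pd

# Toy z-scored nutrient table; foods, groups and numbers are invented.
toy = pd.DataFrame(
    {
        "group": ["Fruits", "Vegetables", "Vegetables", "Vegetables", "Drinks"],
        "sugar": [0.6, -0.5, -0.4, 0.2, 0.9],
        "fibre": [0.1, 1.0, 0.3, 0.2, -0.2],
    },
    index=["melon", "kale", "lettuce", "carrot", "milkshake"],
)


def count_farther(table, reference, challenger, group):
    """Count foods in `group` that lie farther from `reference`
    than `challenger` does; also return the group size."""
    nutrients = table.drop(columns="group")
    # Euclidean distance of every food to the reference food
    dist = np.sqrt(((nutrients - nutrients.loc[reference]) ** 2).sum(axis=1))
    in_group = table["group"] == group
    return int((dist[in_group] > dist[challenger]).sum()), int(in_group.sum())


farther, total = count_farther(toy, "melon", "milkshake", "Vegetables")
print(f"The milkshake is closer to the melon than {farther} out of {total} vegetables.")
```

Dividing by the fruit-to-vegetables reference distance is left out here because it scales both sides of the comparison equally and so does not change the counts.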
Does it still make sense to you to run interventional studies that put vegetables and fruits in the same group? I would argue it doesn't.
But then, maybe you don't trust the normalized nutrition vector. Let's have a quick look at what the normalized nutrient values actually look like for broccoli, kale, a Twix and a McDonald's milkshake. Because of the shift applied earlier, all values are relative to the melon.
compare = ["MILKSHAKE, VANILLA, MCDONALD'S", 'KALE', 'BROCCOLI', 'CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX']
part = df.loc[df['FOODNAME'].isin(compare)]
part = part.set_index('FOODNAME').drop(['FOODID','DESCRIPT'], axis=1)
# One column per food, one row per nutrient code
part = part.transpose().rename(columns={"CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX": "TWIX",
                                        "MILKSHAKE, VANILLA, MCDONALD'S": "MILKSHAKE"})
# Map nutrient codes (THSCODE) to their English descriptions
names = eufdname.drop(['LANG'], axis=1).set_index("THSCODE")
pandas.merge(how='left', right=names, left=part, left_index=True, right_index=True).set_index("DESCRIPT")
DESCRIPT | BROCCOLI | KALE | TWIX | MILKSHAKE |
---|---|---|---|---|
energy,calculated | -0.014435 | 0.008692 | 2.939874 | 0.280767 |
fat, total | 0.026593 | 0.046368 | 1.644287 | 0.140263 |
carbohydrate, available | -0.296969 | -0.291666 | 2.667469 | 0.195103 |
protein, total | 0.403420 | 0.255376 | 0.251675 | 0.276349 |
alcohol | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
organic acids, total | 0.100330 | -0.328305 | -0.446863 | -0.122195 |
sugar alcohols | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
sugars, total | -0.572030 | -0.562058 | 4.221419 | 0.366879 |
fructose | -0.299162 | -0.260447 | 0.716580 | -0.457189 |
galactose | -0.131067 | -0.131067 | -0.156703 | 2.268470 |
glucose | -0.277072 | -0.277072 | 1.086415 | 0.312071 |
lactose | 0.000000 | 0.000000 | 0.495146 | 0.566464 |
maltose | 0.056397 | 0.056397 | 0.479377 | 0.028199 |
sucrose | -0.539437 | -0.539437 | 4.571218 | 0.135884 |
starch, total | 0.008880 | 0.008880 | 0.518603 | 0.000000 |
fibre, total | 0.590525 | 1.290406 | 0.171471 | -0.174970 |
fibre, water-insoluble | 0.490057 | 0.285866 | 0.314453 | -0.163352 |
polysaccharides, non-cellulosic, water-soluble | 0.572664 | 0.572664 | 0.155437 | -0.163618 |
folate, total | 1.546457 | 1.643241 | 0.034257 | 0.031612 |
niacin equivalents, total | 0.319264 | 0.188485 | 0.116861 | 0.136289 |
niacin, preformed (nicotinic acid + nicotinamide) | 0.241677 | 0.281957 | -0.039877 | -0.079753 |
vitamers pyridoxine (hydrochloride) | 0.224716 | 1.303355 | -0.134830 | -0.044943 |
riboflavine | 0.657697 | 1.176932 | 0.169617 | 0.636928 |
thiamin (vitamin B1) | 0.369891 | 0.475574 | -0.065524 | 0.030120 |
vitamin A retinol activity equivalents | 0.080351 | 0.734835 | 0.025000 | 0.015968 |
carotenoids, total | 1.107892 | 16.470065 | -0.005260 | -0.018312 |
vitamin B-12 (cobalamin) | 0.000000 | 0.000000 | 0.011532 | 0.087641 |
vitamin C (ascorbic acid) | 2.999856 | 3.037288 | -0.475378 | -0.419766 |
vitamin D | 0.000000 | 0.000000 | 0.004403 | 0.004403 |
vitamin E alphatocopherol | 0.244547 | 1.993427 | 0.420917 | 0.007411 |
vitamin K, total | 2.221502 | 12.518243 | 0.049457 | 0.007459 |
calcium | 0.095603 | 1.051637 | 0.113176 | 0.487941 |
iron, total | 0.207805 | 0.221893 | 0.202170 | -0.016906 |
iodide (iodine) | -0.018761 | -0.018761 | -0.017968 | -0.016768 |
potassium | 0.469394 | 0.678014 | -0.061256 | -0.120452 |
magnesium | 0.197976 | 0.395951 | 0.327828 | 0.044940 |
salt | -0.012421 | -0.018237 | 0.108501 | 0.012049 |
phosphorus | 0.361836 | 0.200401 | 0.228235 | 0.406369 |
selenium, total | 0.177199 | 0.177199 | 0.027749 | 0.046444 |
zinc | 0.390879 | 0.188201 | 0.319218 | 0.290988 |
fatty acids, total | 0.013528 | 0.036739 | 1.618932 | 0.141260 |
fatty acids, total polyunsaturated | 0.036390 | 0.107449 | 0.432253 | -0.001229 |
fatty acids, total monounsaturated cis | 0.002320 | 0.003099 | 1.345681 | 0.080810 |
fatty acids, total saturated | 0.004420 | 0.009818 | 2.052560 | 0.233120 |
fatty acids, total trans | 0.000000 | 0.000000 | 0.241665 | 0.147224 |
fatty acids, total n-3 polyunsaturated | 0.073295 | 0.236547 | 0.023455 | -0.017662 |
fatty acids, total n-6 polyunsaturated | 0.009726 | 0.033363 | 0.535066 | 0.002518 |
fatty acid 18:2 cis,cis n-6 (linoleic acid) | 0.010242 | 0.035108 | 0.560131 | 0.001131 |
fatty acid 18:3 n-3 (alpha-linolenic acid) | 0.077831 | 0.251260 | 0.024947 | -0.018711 |
fatty acid 20:5 n-3 (EPA) | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
fatty acid 22:6 n-3 (DHA) | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
cholesterol (GC) | 0.000000 | 0.000000 | 0.107896 | 0.071424 |
sterols, total | 0.197817 | 0.039677 | 0.176732 | -0.010203 |
tryptophan | 0.256079 | 0.843467 | 0.256079 | 0.342292 |
I hope the simple analysis above shows and justifies my stance that grouping fruits and vegetables together in an interventional study is a horrible idea. Whatever the outcome, it will say very little about either fruit or vegetables.
Note that in this analysis I didn't put any weight on any of the nutrients, other than what the data set did by grouping or not grouping nutrients together. Also, the comparison is made per unit of weight. The results on a per-calorie basis differ in the details but lead to the same conclusion: a different set of groups turns up as closer to fruit than vegetables are, but fruits and vegetables still turn out to be very different, more different than many obviously unrelated food groups in this data set. As I didn't want to make this blog post longer than it already is, I omitted the per-kcal variant.
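For readers who want to try the per-kcal variant themselves, here is a minimal sketch of the rescaling step on toy data. It is an assumption about how such a variant could be set up, not the omitted analysis itself: the column names (`energy_kcal`, `sugar_g`, `fibre_g`), the numbers and the helper `per_100_kcal` are all hypothetical.

```python
import pandas as pd

# Toy per-100 g values; column names and numbers are hypothetical, not Fineli's.
per_100g = pd.DataFrame(
    {
        "energy_kcal": [50.0, 25.0, 500.0],
        "sugar_g": [10.0, 2.0, 50.0],
        "fibre_g": [2.0, 3.0, 1.0],
    },
    index=["fruit_x", "veg_y", "candy_z"],
)


def per_100_kcal(table, energy_col="energy_kcal"):
    """Re-express each nutrient per 100 kcal instead of per 100 g."""
    nutrients = table.drop(columns=energy_col)
    return nutrients.div(table[energy_col], axis=0) * 100.0


density = per_100_kcal(per_100g)
# The same z-value normalization as before, now on per-kcal values
z = (density - density.mean()) / density.std()
print(density)
```

Foods with zero energy, such as drinking water, would cause a division by zero here and would need to be excluded or handled separately.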
As you might have noticed, I am more comfortable with data than I am with biochemistry, so there might be major issues with analyzing the distance between different foods in the way that I did above. I'm here to learn, so if there are fundamental flaws with this way of looking at the data, please drop me a comment, or let me know on Twitter.
Wow! That is an impressive collection of data. I haven't had a chance to actually analyze your findings, but these are the kinds of lists that invite attention (mine, anyway). Compliments on this hard work and for looking at an accepted idea and challenging it with science.
Respect!
Interesting stuff. Data science FTW!