# Why grouping fruit and vegies together in an interventional study is probably a bad idea.

in steemstem •  2 months ago

In this blog post I want to look at nutrition groups. Specifically, I want to look, in an objective way, at the nutrition profile of fruit compared to other food groups. In nutrition studies, fruit is often grouped with vegetables, but is this actually a fair grouping? I want to use a public nutrition database and some basic Python pandas functionality to check whether this is justified.

# Getting the data

We start off by getting some nutrition info from https://fineli.fi/fineli/fi/avoin-data
The unpacked zip file contains a number of CSV files that we will load into pandas.

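```python
%matplotlib inline
import math
import numpy as np
import pandas
import matplotlib.pyplot as plt

# component_value (and, further down, food, foodname, fuclass and eufdname) are
# the DataFrames loaded from the unpacked CSV files.
# Keep only the rows whose component name is an actual string (drops missing names).
component_value = component_value[component_value['EUFDNAME'].apply(lambda x: isinstance(x, str))]
```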
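The snippet above uses `component_value` without showing how it was loaded. A minimal sketch of that step, assuming each table sits in a file named `<table>.csv`, separated by semicolons and encoded as Latin-1. The file names, separator and encoding are assumptions to check against the unpacked archive, and `load_tables` is a helper name made up for this example:

```python
from pathlib import Path

import pandas


def load_tables(data_dir, names, sep=";", encoding="latin-1"):
    """Read <name>.csv from data_dir for every name and return {name: DataFrame}.

    The naming scheme, separator and encoding are assumptions, not facts
    about the Fineli archive.
    """
    data_dir = Path(data_dir)
    return {
        name: pandas.read_csv(data_dir / f"{name}.csv", sep=sep, encoding=encoding)
        for name in names
    }


# Hypothetical usage, with "fineli" as the directory holding the unpacked files:
# tables = load_tables("fineli", ["component_value", "food", "foodname", "fuclass", "eufdname"])
# component_value = tables["component_value"]
```

Returning a dictionary keeps the sketch short; unpacking it into separate variables gives the names used in the rest of this post.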

# Normalizing the data

The next step is to normalize the nutrient data, so that from here on we can work with distances between normalized vectors.
To do this, we take the mean and standard deviation of each nutrient across the nutrition database and use them to convert the nutrient amounts to z-values. We create a new data frame with foods as rows and normalized nutrients as columns.

```python
# Attach the food-class description to every food ID.
df = pandas.merge(left=food[["FOODID", "FUCLASS"]], right=fuclass[["THSCODE", "DESCRIPT"]],
                  how='left', left_on="FUCLASS", right_on="THSCODE")[["FOODID", "DESCRIPT"]]
foodshort = foodname[["FOODID", "FOODNAME"]]

# Resulting column order: FOODID, FOODNAME, DESCRIPT.
df = pandas.merge(how='left', right=df, left=foodshort, left_on="FOODID", right_on="FOODID")

# Add one z-score column per nutrient component.
for comp in component_value["EUFDNAME"].unique():
    filtered = component_value[component_value["EUFDNAME"] == comp][["FOODID", "BESTLOC"]].copy()
    std = filtered["BESTLOC"].std()
    mean = filtered["BESTLOC"].mean()
    filtered[comp] = (filtered["BESTLOC"] - mean) / std
    filtered = filtered[["FOODID", comp]]
    df = pandas.merge(left=df, right=filtered, how='left', left_on='FOODID', right_on='FOODID')

# A missing value becomes 0, which is the mean on the z-score scale.
df = df.fillna(0)
```
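The loop above can be checked on a tiny made-up table. The following is a minimal sketch, not the code used for the results: it z-scores each component with `groupby(...).transform` and then pivots to one row per food. The three foods, the two component codes and all amounts are invented for illustration. Like the loop, it relies on pandas' default sample standard deviation.

```python
import pandas

# Long format, as in component_value: one row per (food, component) pair.
long = pandas.DataFrame({
    "FOODID": [1, 2, 3, 1, 2, 3],
    "EUFDNAME": ["SUGAR", "SUGAR", "SUGAR", "FAT", "FAT", "FAT"],
    "BESTLOC": [10.0, 20.0, 30.0, 0.0, 5.0, 10.0],
})

# z = (value - component mean) / component standard deviation
grouped = long.groupby("EUFDNAME")["BESTLOC"]
long["Z"] = (long["BESTLOC"] - grouped.transform("mean")) / grouped.transform("std")

# Wide format: foods as rows, normalized components as columns.
wide = long.pivot(index="FOODID", columns="EUFDNAME", values="Z").fillna(0)
print(wide)
```

Each column of `wide` has mean 0 and standard deviation 1, which is what makes nutrients measured in different units comparable in a single distance.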

# Food groups

Now that we have our normalized data, let's have a look at fruit as a group and see how that group compares to the other groups in our data set.
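Written out, the comparison uses a relative distance. The notation here is chosen for illustration and does not come from the data set: $\bar{z}_g$ stands for the mean normalized nutrient vector of food group $g$.

$$
d_{\mathrm{rel}}(g) = \frac{\lVert \bar{z}_g - \bar{z}_{\mathrm{fruit}} \rVert_2}{\lVert \bar{z}_{\mathrm{veg}} - \bar{z}_{\mathrm{fruit}} \rVert_2}
$$

Vegetables score exactly 1 by construction, and any group with a value below 1 sits closer to fruit than vegetables do.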

```python
fruit = df.loc[df['DESCRIPT'] == 'Fruits']
vegies = df.loc[df['DESCRIPT'] == 'Vegetables']
# Mean z-score vector of the fruit group; [1:] drops the FOODID column.
fruit_mean = fruit.mean(numeric_only=True).values[1:]
# Fruit-to-vegetables distance, used as the unit for all other distances.
reference_vectordistance = np.linalg.norm(vegies.mean(numeric_only=True).values[1:] - fruit_mean)

rowlist = []

for foodtype in df["DESCRIPT"].unique():
    if foodtype != 'Fruits':
        other = df.loc[df['DESCRIPT'] == foodtype]
        vectordistance = np.linalg.norm(other.mean(numeric_only=True).values[1:] - fruit_mean)
        if vectordistance/reference_vectordistance < 10.0001:
            row = dict()
            row["foodtype"] = foodtype
            row["reldistance"] = vectordistance/reference_vectordistance
            rowlist.append(row)

peergroups = pandas.DataFrame(rowlist)

with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(peergroups.sort_values(by=['reldistance']))
```
``````
                                    foodtype  reldistance
106             Baby fruit and berry product     0.254584
24                                    Juices     0.324371
23                    Fruit and berry salads     0.419500
27                               Juice drink     0.499597
89                     Fruit and berry soups     0.512050
82    Fruit and berry dishes other than pies     0.572982
67                              Other drinks     0.619280
84                           Vegetable soups     0.652777
107                   Baby vegetable product     0.655973
65                     Soft drink with sugar     0.710715
20                          Vegetable juices     0.716880
95                               Pulse soups     0.729302
83                             Potato dishes     0.742103
86                         Cooked vegetables     0.744533
90                          Vegetable sauces     0.753597
87                          Vegetable dishes     0.764830
109                           Baby fish dish     0.774588
68                            Drinking water     0.778795
97                                Meat soups     0.780833
42                                   Yoghurt     0.805619
93                              Pulse sauces     0.806330
38                             Milks skimmed     0.809094
108                           Baby meat dish     0.825446
70                                  Porridge     0.829892
116                              Sport drink     0.831736
100                            Poultry soups     0.842000
41                            Cultured milks     0.845855
59                                    Coffee     0.846159
60                                       Tea     0.850300
15                           Cooked potatoes     0.889403
94                              Pulse dishes     0.897811
37                             Milks >2% fat     0.907533
39                              Soured milks     0.912002
81                             Milk desserts     0.920313
44                             Milks <2% fat     0.928574
69         Drinks with artificial sweeteners     0.935823
112                             Seafood soup     0.936404
79                               Milk sauces     0.999358
17                                Vegetables     1.000000
43                                     Quark     1.000320
120                       Dietary supplement     1.013490
5                             Savoury sauces     1.037593
25                                   Berries     1.054466
103                               Fish soups     1.064546
16              Fried potatoes, French fries     1.106613
91           Prepared salads with mayonnaise     1.109879
78                                 Panncakes     1.132018
101                           Poultry dishes     1.169911
19                           Mushroom dishes     1.174364
49                                 Ice cream     1.194230
111  Seafood dishes, crustacean and molluscs     1.225429
96                               Meat sauces     1.226042
92                               Meat dishes     1.229997
105                           Dessert sauces     1.237735
40            Fermented milk products, other     1.259492
102                           Poultry sauces     1.285769
12                                      Rice     1.313129
45                                     Cream     1.329619
110                           Seafood sauces     1.334021
118                                    Pizza     1.337806
73                            Savoury bakery     1.343459
10                                     Pasta     1.345270
4                                 Condiments     1.378994
75                    Sandwiches and burgers     1.403498
74                              Sweet bakery     1.447861
66                                    Ciders     1.513203
104                              Fish sauces     1.523617
77                                      Buns     1.609077
80                                Egg dishes     1.611915
98                               Fish dishes     1.682396
99                                  Sausages     1.708877
21                                    Pulses     1.739442
46           Cheese, unripened, fresh cheese     1.739644
18                         Canned vegetables     1.781269
117                      Cold cuts, sausages     1.807501
57                  Crustaceans and molluscs     1.904892
53                           Cold cuts, meat     1.998762
114                         Savoury biscuits     2.146351
11                            Sweet biscuits     2.227579
3                  Miscellaneous ingredients     2.309356
52                   Chicken and other birds     2.392425
64                 Other alcoholic beverages     2.423097
7                                Cereal bars     2.439470
48                         Processed cheese      2.451552
50                          Steaks and chops     2.459782
13                         Breakfast cereals     2.524057
8                                      Flour     2.544412
22                            Pulse products     2.647546
55                                      Fish     2.676576
47                   Cheese, ripened cheese      2.809589
14                            Savoury snacks     2.811795
61                                     Beers     2.856793
115                        Meal replacements     2.947267
2                                  Chocolate     3.004876
62                                     Wines     3.229641
1                              Confectionery     3.240278
34                    Blended spread  < 55 %     3.340992
113           Infant formulas and human milk     3.342981
36           Margarine and fat spread  < 55%     3.349092
9               Nuts, seeds and dried fruits     3.505547
119                               Sport food     3.511651
51                             Meat products     3.523952
58                                       Egg     3.555769
56                             Fish products     3.588763
33           Salad dressings and mayonnaises     3.628788
6                                     Spices     4.116614
29                    Blended spread >= 55 %     4.244458
54                              Offal dishes     4.397559
0                           Sugar and syrups     4.436681
35           Margarine and fat spread >= 55%     4.517717
28                          Butter, milk fat     5.055117
30                Cooking and industrial fat     5.283595
32                                Animal fat     5.471986
31                                      Oils     7.149777
63                                   Spirits     8.044268
``````

Notice how sugar-sweetened beverages (SSBs) as a group are 29% closer, nutritionally, to fruit as a group than vegetables as a group are, at least according to our simple metric. Even drinking water and yoghurt are closer. This doesn't give us much to justify grouping fruits with vegetables in nutrition studies.
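The relative distance in the table can be reproduced on made-up numbers. The sketch below is a minimal illustration, not the code that produced the table: the function name `relative_group_distances`, the toy table and its two nutrient columns are assumptions for the example.

```python
import numpy as np
import pandas


def relative_group_distances(table, group_col, target, reference):
    """Distance of every group mean to the target group's mean,
    divided by the distance between the reference and target means."""
    means = table.groupby(group_col).mean(numeric_only=True)
    offsets = means - means.loc[target]
    distances = np.sqrt((offsets ** 2).sum(axis=1))
    relative = distances / distances.loc[reference]
    return relative.drop(target).sort_values()


# Made-up z-scores, only to show the mechanics.
toy = pandas.DataFrame({
    "DESCRIPT": ["Fruits", "Fruits", "Vegetables", "Vegetables", "Juices", "Juices"],
    "sugar": [1.0, 1.0, -1.0, -1.0, 0.5, 1.5],
    "fibre": [0.0, 0.0, 2.0, 2.0, -1.0, -1.0],
})

result = relative_group_distances(toy, "DESCRIPT", target="Fruits", reference="Vegetables")
print(result)

# Convert to "percent closer to fruit than vegetables are".
percent_closer = (1 - result) * 100
```

With this conversion, a relative distance of 0.71 corresponds to "29% closer", which is how the figure for soft drinks with sugar is read from the table.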

And this is only the distance between the means of these food groups. Next, let's pick a single fruit and compare it with individual foods outside the vegetables group.

# A specific fruit

We looked at this for groups; now let's look at a specific fruit: one of my own favorites, a melon. And let's not look at SSBs, drinking water and yoghurt, but at foods generally thought of as unhealthy, which few people would think of comparing to a healthy piece of fruit. We take a look at McDonald's food and at chocolates and see how they compare to a melon.

```python
melon = df.loc[df['FOODNAME'] == 'HONEYDEW MELON, WITHOUT SKIN']
rowlist = []
for index, row in df.iterrows():
    name = row.values[1]                     # FOODNAME
    foodtype = row.values[2]                 # DESCRIPT
    vector = row.values[3:].astype(float)    # normalized nutrient values
    # Length of this food's z-score vector (its distance from the database mean),
    # relative to the fruit-vegetable reference distance.
    # The melon row selected above is not used in this calculation.
    distance = np.linalg.norm(vector)/reference_vectordistance
    if "MCDONALD" in name or foodtype == "Chocolate":
        entry = dict()
        entry["food"] = name
        entry["distance"] = distance
        rowlist.append(entry)
peerfood = pandas.DataFrame(rowlist)
with pandas.option_context('display.max_rows', None, 'display.max_columns', None):
    print(peerfood.sort_values(by=['distance']))
```
``````
    distance                                               food
21  0.956963                     MILKSHAKE, VANILLA, MCDONALD'S
23  1.515775                     HAMBURGER, MCFEAST, MCDONALD'S
10  1.589669         HAMBURGER, BEEF AND WHEAT ROLL, MCDONALD'S
11  1.647408               HAMBURGER, CHEESE BURGER, MCDONALD'S
13  1.647526      HAMBURGER, DOUBLE BURGER, BIG MAC, MCDONALD'S
12  1.728808              HAMBURGER, CHICKEN BURGER, MCDONALD'S
22  1.986608        HAMBURGER, DOUBLE CHEESE BURGER, MCDONALD'S
1   2.542666         CHOCOLATE CONFECTION FILLED WITH MARMALADE
14  2.915553            CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX
24  2.915631         CHOCOLATE CONFECTION FILLED WITH CHOCOLATE
18  2.978043  SUFFELI CHOCOLATE BAR, WAFFLE, TOFFEE FILLING ...
7   3.174270                             CHOCOLATE BAR, LOW-FAT
6   3.224222                CHOCOLATE BAR WITH FILLING, AVERAGE
2   3.336511                             CHOCOLATE BAR, AVERAGE
15  3.350807  SUFFELI PUFFI SNACKS,PUFFED CORN AND CHOCOLATE...
3   3.354410                   CHOCOLATE, PLAIN, DARK CHOCOLATE
8   3.409912                         CHOCOLATE, WHITE CHOCOLATE
0   3.431798                                 CHOCOLATE, AVERAGE
20  3.509146           CHOCOLATE, MILK CHOCOLATE WITH HAZELNUTS
4   3.763362                          CHOCOLATE, MILK CHOCOLATE
17  3.789385                               KINDER CHOCOLATE EGG
9   3.863914                                     RICE CHOCOLATE
19  4.175258              CHOCOLATE, PLAIN, DARK CHOCOLATE, 80%
5   5.557062                  CHOCOLATE, ARTIFICIALLY SWEETENED
``````
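The loop above takes the norm of each row's own vector, which is that food's distance from the all-zero (database-average) point. To measure the distance from one chosen food instead, that food's vector has to be subtracted first. A minimal sketch of that variant, assuming the same column layout as `df` (`FOODID`, `FOODNAME`, `DESCRIPT`, then nutrients); the function name `distances_to_food` and the toy rows are made up, and the sketch was not used for any number in this post:

```python
import numpy as np
import pandas


def distances_to_food(table, target_name, reference_distance,
                      meta_cols=("FOODID", "FOODNAME", "DESCRIPT")):
    """Euclidean distance of every food to the named food, in units of reference_distance."""
    nutrients = table.drop(columns=list(meta_cols)).astype(float)
    target = nutrients[table["FOODNAME"] == target_name].iloc[0]
    distance = np.sqrt(((nutrients - target) ** 2).sum(axis=1)) / reference_distance
    out = pandas.DataFrame({"food": table["FOODNAME"], "distance": distance})
    return out.sort_values("distance").reset_index(drop=True)


# Made-up z-scores for three foods and two nutrients.
toy = pandas.DataFrame({
    "FOODID": [1, 2, 3],
    "FOODNAME": ["MELON", "MILKSHAKE", "CHOCOLATE"],
    "DESCRIPT": ["Fruits", "Other drinks", "Chocolate"],
    "sugar": [0.5, 0.5, 3.5],
    "fat": [-0.5, 0.5, 3.5],
})

ranking = distances_to_food(toy, "MELON", reference_distance=2.0)
print(ranking)
```

The target food always ends up at distance 0, which is a quick sanity check that the subtraction is in place.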

Notice that a milkshake is closer to a melon than the average vegetable. Now let us pick a few nice ones from this list: the milkshake, the double cheeseburger and the Twix candy bar, and see how different vegetables compare to these:

```python
count1 = 0   # vegetables with a larger distance than the milkshake
count2 = 0   # vegetables with a larger distance than the double cheeseburger
count3 = 0   # vegetables with a larger distance than the Twix
tcount = 0   # total number of vegetables
for index, row in df.iterrows():
    name = row.values[1]
    foodtype = row.values[2]
    vector = row.values[3:].astype(float)
    # Same distance measure as in the previous block.
    distance = np.linalg.norm(vector)/reference_vectordistance
    if "Vegetables" == foodtype:
        tcount += 1
        if distance > 2.915553:
            count3 += 1
        if distance > 1.986608:
            count2 += 1
        if distance > 0.956963:
            count1 += 1
print("* A milkshake is nutritionally closer to a melon than", count1, "out of", tcount, "vegetables.")
print("* A double cheeseburger is nutritionally closer to a melon than", count2, "out of", tcount, "vegetables.")
print("* A Twix candy bar is nutritionally closer to a melon than", count3, "out of", tcount, "vegetables.")
```
``````
* A milkshake is nutritionally closer to a melon than 71 out of 103 vegetables.
* A double cheeseburger is nutritionally closer to a melon than 26 out of 103 vegetables.
* A Twix candy bar is nutritionally closer to a melon than 13 out of 103 vegetables.
``````
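The counting itself comes down to one comparison per threshold. A small sketch with made-up vegetable distances: only the three thresholds are taken from the table of McDonald's foods and chocolates, and `count_farther` is a name chosen for this example.

```python
def count_farther(distances, threshold):
    """Number of distances that are strictly larger than the threshold."""
    return sum(1 for d in distances if d > threshold)


# Made-up distances for four vegetables, only to show the mechanics.
vegetable_distances = [0.4, 1.2, 2.5, 3.1]

thresholds = {
    "milkshake": 0.956963,
    "double cheeseburger": 1.986608,
    "Twix": 2.915553,
}

counts = {name: count_farther(vegetable_distances, t) for name, t in thresholds.items()}
for name, n in counts.items():
    print(name, "beats", n, "out of", len(vegetable_distances), "vegetables")
```

Because the thresholds increase, the counts can only stay equal or go down from the milkshake to the Twix, which matches the pattern 71, 26, 13 in the output.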

Does it still make sense to you to run interventional studies that put vegetables and fruits in the same group? I would argue it doesn't.

But then, maybe you don't trust the normalized nutrition vector. Let's have a quick look at what the normalized nutrition values actually look like for broccoli, kale, a Twix and a McDonald's milkshake.

```python
compare = ["MILKSHAKE, VANILLA, MCDONALD'S", 'KALE', 'BROCCOLI', 'CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX']
part = df.loc[df['FOODNAME'].isin(compare)]
part = part.set_index('FOODNAME').drop(['FOODID', 'DESCRIPT'], axis=1)
# Nutrients as rows, the four foods as columns, with shorter column names.
part = part.transpose().rename(columns={"CHOCOLATE BAR, CARAMEL AND COOKIE, TWIX": "TWIX",
                                        "MILKSHAKE, VANILLA, MCDONALD'S": "MILKSHAKE"})
# Look-up table from nutrient code to its description.
names = eufdname.drop(['LANG'], axis=1).rename(columns={"THSCODE": "FOODNAME"}).set_index("FOODNAME")

# Replace the nutrient codes in the index by their descriptions.
pandas.merge(how='left', right=names, left=part, left_index=True, right_index=True).set_index("DESCRIPT")
```

| DESCRIPT | BROCCOLI | KALE | TWIX | MILKSHAKE |
|:---|---:|---:|---:|---:|
| energy,calculated | -0.014435 | 0.008692 | 2.939874 | 0.280767 |
| fat, total | 0.026593 | 0.046368 | 1.644287 | 0.140263 |
| carbohydrate, available | -0.296969 | -0.291666 | 2.667469 | 0.195103 |
| protein, total | 0.403420 | 0.255376 | 0.251675 | 0.276349 |
| alcohol | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| organic acids, total | 0.100330 | -0.328305 | -0.446863 | -0.122195 |
| sugar alcohols | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| sugars, total | -0.572030 | -0.562058 | 4.221419 | 0.366879 |
| fructose | -0.299162 | -0.260447 | 0.716580 | -0.457189 |
| galactose | -0.131067 | -0.131067 | -0.156703 | 2.268470 |
| glucose | -0.277072 | -0.277072 | 1.086415 | 0.312071 |
| lactose | 0.000000 | 0.000000 | 0.495146 | 0.566464 |
| maltose | 0.056397 | 0.056397 | 0.479377 | 0.028199 |
| sucrose | -0.539437 | -0.539437 | 4.571218 | 0.135884 |
| starch, total | 0.008880 | 0.008880 | 0.518603 | 0.000000 |
| fibre, total | 0.590525 | 1.290406 | 0.171471 | -0.174970 |
| fibre, water-insoluble | 0.490057 | 0.285866 | 0.314453 | -0.163352 |
| polysaccharides, non-cellulosic, water-soluble | 0.572664 | 0.572664 | 0.155437 | -0.163618 |
| folate, total | 1.546457 | 1.643241 | 0.034257 | 0.031612 |
| niacin equivalents, total | 0.319264 | 0.188485 | 0.116861 | 0.136289 |
| niacin, preformed (nicotinic acid + nicotinamide) | 0.241677 | 0.281957 | -0.039877 | -0.079753 |
| vitamers pyridoxine (hydrochloride) | 0.224716 | 1.303355 | -0.134830 | -0.044943 |
| riboflavine | 0.657697 | 1.176932 | 0.169617 | 0.636928 |
| thiamin (vitamin B1) | 0.369891 | 0.475574 | -0.065524 | 0.030120 |
| vitamin A retinol activity equivalents | 0.080351 | 0.734835 | 0.025000 | 0.015968 |
| carotenoids, total | 1.107892 | 16.470065 | -0.005260 | -0.018312 |
| vitamin B-12 (cobalamin) | 0.000000 | 0.000000 | 0.011532 | 0.087641 |
| vitamin C (ascorbic acid) | 2.999856 | 3.037288 | -0.475378 | -0.419766 |
| vitamin D | 0.000000 | 0.000000 | 0.004403 | 0.004403 |
| vitamin E alphatocopherol | 0.244547 | 1.993427 | 0.420917 | 0.007411 |
| vitamin K, total | 2.221502 | 12.518243 | 0.049457 | 0.007459 |
| calcium | 0.095603 | 1.051637 | 0.113176 | 0.487941 |
| iron, total | 0.207805 | 0.221893 | 0.202170 | -0.016906 |
| iodide (iodine) | -0.018761 | -0.018761 | -0.017968 | -0.016768 |
| potassium | 0.469394 | 0.678014 | -0.061256 | -0.120452 |
| magnesium | 0.197976 | 0.395951 | 0.327828 | 0.044940 |
| salt | -0.012421 | -0.018237 | 0.108501 | 0.012049 |
| phosphorus | 0.361836 | 0.200401 | 0.228235 | 0.406369 |
| selenium, total | 0.177199 | 0.177199 | 0.027749 | 0.046444 |
| zinc | 0.390879 | 0.188201 | 0.319218 | 0.290988 |
| fatty acids, total | 0.013528 | 0.036739 | 1.618932 | 0.141260 |
| fatty acids, total polyunsaturated | 0.036390 | 0.107449 | 0.432253 | -0.001229 |
| fatty acids, total monounsaturated cis | 0.002320 | 0.003099 | 1.345681 | 0.080810 |
| fatty acids, total saturated | 0.004420 | 0.009818 | 2.052560 | 0.233120 |
| fatty acids, total trans | 0.000000 | 0.000000 | 0.241665 | 0.147224 |
| fatty acids, total n-3 polyunsaturated | 0.073295 | 0.236547 | 0.023455 | -0.017662 |
| fatty acids, total n-6 polyunsaturated | 0.009726 | 0.033363 | 0.535066 | 0.002518 |
| fatty acid 18:2 cis,cis n-6 (linoleic acid) | 0.010242 | 0.035108 | 0.560131 | 0.001131 |
| fatty acid 18:3 n-3 (alpha-linolenic acid) | 0.077831 | 0.251260 | 0.024947 | -0.018711 |
| fatty acid 20:5 n-3 (EPA) | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| fatty acid 22:6 n-3 (DHA) | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| cholesterol (GC) | 0.000000 | 0.000000 | 0.107896 | 0.071424 |
| sterols, total | 0.197817 | 0.039677 | 0.176732 | -0.010203 |
| tryptophan | 0.256079 | 0.843467 | 0.256079 | 0.342292 |

I hope the simple analysis above shows and justifies my stance that grouping fruits and vegetables together in an interventional study is a horrible idea. Whatever the outcome, it will say very little about either fruit or vegetables.

Note that in this analysis I didn't put any weight on any of the nutrients, other than what the data set already did by grouping or not grouping nutrients together. Also, the comparison is made per unit of weight. The results on a per-calorie basis differ in the details but lead to the same conclusion: different in that other groups turn up as closer to fruit than vegetables, but the same in that fruits and vegetables turn out to be very different, and more different than many other obviously unrelated food groups in this data set. As I didn't want to make this blog post longer than it already is, I omitted the per-kcal variant.
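For readers who want to try the per-calorie variant themselves, the only extra step is to rescale the amounts before normalizing. A minimal sketch on made-up numbers, not the omitted variant itself: the column names `energy_kcal` and `sugar_g`, the helper `per_100_kcal` and the decision to drop foods without energy are all assumptions for illustration.

```python
import pandas


def per_100_kcal(per_100g, energy_col="energy_kcal"):
    """Rescale per-100-g amounts to per-100-kcal amounts.

    Foods without energy (such as water) are dropped, because the ratio
    is undefined for them.
    """
    with_energy = per_100g[per_100g[energy_col] > 0]
    amounts = with_energy.drop(columns=[energy_col])
    return amounts.div(with_energy[energy_col], axis=0) * 100


# Made-up per-100-g values for three foods.
toy = pandas.DataFrame(
    {"energy_kcal": [50.0, 500.0, 0.0], "sugar_g": [10.0, 50.0, 0.0]},
    index=["melon", "chocolate", "water"],
)

per_kcal = per_100_kcal(toy)
print(per_kcal)
```

In this toy table the chocolate has five times the sugar of the melon per 100 g, but only half as much per 100 kcal, which illustrates why the ranking of neighbours can change between the two bases.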

As you might have noticed, I am more comfortable with data than I am with biochemistry, so there might be major issues with analyzing the distance between different foods in the way that I did above. I'm here to learn, so if there are fundamental flaws with this way of looking at the data, please drop me a comment, or let me know on Twitter.

·  2 months ago

Wow! That is an impressive collection of data. I haven't had a chance to actually analyze your findings, but these are the kinds of lists that invite attention (mine, anyway). Compliments on this hard work and for looking at an accepted idea and challenging it with science.
Respect!

·  2 months ago

Interesting stuff. Data science FTW!
