A post mortem analysis of a Data Science approach for determining the existence and decay patterns of the Higgs boson.
This post originally appeared on kasperfred.com
In 2013, the CERN LHC ATLAS team released a dataset of simulated proton-proton collisions, some of which resulted in a 125 GeV Higgs boson that would decay to two taus. Others resulted in two top quarks decaying to either a lepton or a tau, or in a W boson decaying to either an electron or a muon and tau pair.
The problem was to construct a neural network that correctly separates these events using features that can either be measured directly in the accelerator or be derived from the measurements.
The final network, consisting of only a single hidden layer, was able to predict the existence of the Higgs boson with an accuracy of 99.997% over 5 positive predictions (the network thinking that the Higgs exists). For such a simple network, I find that number rather impressive.
This piece summarizes my immediate reflections following the project.
How to organize large machine learning projects
This was the first large project I've done. All my previous projects have been small enough to live in a single file. It's also the first time I've worked with multiple, very different models for the same problem.
Furthermore, the dataset didn't come with a standard approach, as is the case with many of the old staples such as the MNIST and Iris datasets. Another unforeseen problem was that the dataset was so large that I ran into GPU memory issues, which had to be addressed.
This change in scope made it necessary to find a better way of organizing the code. While you can find the code on GitHub, I'd like to highlight a few things that worked out well, and some that didn't.
- I abstracted the data processing away from the actual code. I found this push to specify common transformations and data handling in a layer of abstraction above the code to be very valuable. Though I didn't have time to do it, this could also have been done for the actual models. The abstraction layer comes with a few benefits. For one, it's implementation-independent, so one specification may be implemented using different frameworks depending on whether it's for a production or a research environment.
Now, this might not work for everything: depending on the level of abstraction, it might not accommodate new, exotic models, or it might be so general that it doesn't save much time over just coding things up. Even so, I do see potential for such a common language.
- I separated each model into its own Python module and placed them in a 'models' folder. All these models were then accessed from the main module, where they were trained and evaluated. The trained models were then saved into another folder. This worked great. I didn't find any problems with this approach, and it's what I'll continue to do in the future.
- I created a class around each of the models. The idea was to get a common sklearn-like interface for training and evaluation. Furthermore, I wanted to split the different parts of the model into distinct sections, so as to isolate the dependencies. It worked alright, but there was more shared code between the models than I'd like. Some of this could have been mitigated by having the models inherit from a meta model class, but due to the differences in implementation, I'd likely need multiple meta classes. This is probably the part of the scaling process that I'm the least satisfied with.
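To make the last point more concrete, here is a minimal sketch of what a shared parent class with an sklearn-like interface could look like. It is not the code from the project: `BaseModel`, `MajorityClassModel`, and the `_train`/`_predict` hooks are hypothetical names chosen for illustration, and the toy model merely stands in for a real network.

```python
from abc import ABC, abstractmethod


class BaseModel(ABC):
    """Hypothetical shared parent: every model gets the same fit/predict/score API."""

    def __init__(self, name):
        self.name = name
        self.is_fitted = False

    @abstractmethod
    def _train(self, X, y):
        """Framework-specific training code lives in the subclass."""

    @abstractmethod
    def _predict(self, X):
        """Framework-specific inference code lives in the subclass."""

    def fit(self, X, y):
        # Shared input check, written once instead of once per model.
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of rows")
        self._train(X, y)
        self.is_fitted = True
        return self  # returning self allows chaining, as in scikit-learn

    def predict(self, X):
        if not self.is_fitted:
            raise RuntimeError(f"{self.name} must be fitted before predicting")
        return self._predict(X)

    def score(self, X, y):
        # Shared evaluation code: plain accuracy.
        predictions = self.predict(X)
        correct = sum(int(p == t) for p, t in zip(predictions, y))
        return correct / len(y)


class MajorityClassModel(BaseModel):
    """Toy stand-in for a real network: always predicts the most common label."""

    def _train(self, X, y):
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count)

    def _predict(self, X):
        return [self.majority_ for _ in X]


model = MajorityClassModel("baseline").fit([[0], [1], [2]], [1, 1, 0])
print(model.predict([[5], [6]]))
```

With this layout, only `_train` and `_predict` differ between models, while validation and scoring are shared. Whether a single parent is enough, or several are needed as suspected above, depends on how different the underlying implementations are.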
In conclusion, the scaling went okay, everything considered, but there's still a lot of room for improvement.
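As an illustration of the data-processing abstraction from the first point, a specification could be as simple as an ordered list of named steps that a small interpreter applies to the feature columns. The sketch below is hypothetical: the step names, the `SPEC` layout, the `apply_spec` helper, and the `-999.0` placeholder for missing values are all assumptions made for the example, not the format used in the project.

```python
from statistics import fmean, pstdev

# Hypothetical specification: an ordered list of named transformations.
# The -999.0 sentinel for missing values is an assumption for this example.
SPEC = [
    {"op": "replace_missing", "missing": -999.0, "value": 0.0},
    {"op": "standardize"},
]


def replace_missing(column, missing, value):
    """Swap a sentinel marking missing measurements for a fixed value."""
    return [value if x == missing else x for x in column]


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# One possible implementation of the spec. Another framework could register
# its own functions under the same step names.
OPS = {"replace_missing": replace_missing, "standardize": standardize}


def apply_spec(spec, columns):
    """Apply each step of the spec, in order, to every feature column."""
    result = {name: list(values) for name, values in columns.items()}
    for step in spec:
        params = {key: val for key, val in step.items() if key != "op"}
        transform = OPS[step["op"]]
        result = {name: transform(values, **params) for name, values in result.items()}
    return result


print(apply_spec(SPEC, {"feature_a": [1.0, 3.0]}))
```

The point of the split is that `SPEC` stays the same whether `OPS` is backed by plain Python, as here, or by a research or production framework.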
One interesting challenge was deciding which model is the best. I'll eventually write a whole essay on the different techniques for evaluating the performance of a model and comparing it to other models, but in summary, I ended up using a combination of failure type analysis and statistical hypothesis testing.
With that said, this step was a significant bottleneck in the pipeline, as each model took a couple of hours to train, so identifying what was wrong and adjusting appropriately became very important.
How best to do this, however, is something I haven't figured out yet, and as not much has been written about it, it would seem that I'm not alone there. (If you know something about this, you're more than welcome to contact me.)
One potential solution, if you have the computing power, might be to use genetic or machine learning algorithms to automatically come up with network architectures. I've yet to look much into this, but it looks interesting.
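For the hypothesis-testing part of the model comparison, one common option for two classifiers evaluated on the same test set is McNemar's test. The post doesn't say which test was actually used, so treat the following as a sketch of one possibility; the function name `mcnemar_exact` and the toy labels are made up for the example.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions from two models.

    Only the events where exactly one model is correct carry information.
    Under the null hypothesis that both models have the same error rate,
    each such event is equally likely to favour either model.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    # b: model A right, model B wrong; c: model A wrong, model B right
    b = sum(1 for t, a, p in rows if a == t and p != t)
    c = sum(1 for t, a, p in rows if a != t and p == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data: model A is always right, model B is wrong on 8 of 10 events.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 8 + [1] * 2
print(mcnemar_exact(y_true, pred_a, pred_b))
```

A small p-value suggests the difference in error rates is unlikely to be due to chance on this test set. It says nothing about why the models differ, which is where the failure type analysis comes in.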
Hidden surprises in the data
I'll talk a lot more about this in the detailed analysis, but if I had to name one thing that surprised me about the data, it'd be the inverse exponential relative importance of the features, as seen on the graph below.
Specifically, it surprised me how well the shallow network was able to encode this information.
These were some of my immediate thoughts regarding the project.
Once the paper has been graded, I'll follow up with a more in-depth analysis of the models and how they may be improved upon.