Machine Learning...and the Apocalypse?

Someone obviously didn't read the Q-Tips warning labels!

Trying to learn everything that goes along with Data Science in the span of 6 months can be a daunting task. Where do you begin? How in depth can you really get? Can I learn it all?

The Data Science master’s program director and lead professor at the University of Dundee work hard to answer these questions as best they can for my classmates and me. Both have worked commercially in the data science field for a number of years and have written numerous books and articles about databases, MDX, programming, deep learning, etc., but they don’t just rely on what they personally feel are the most important things to teach their students. What they actually do is contact directors of data science at companies, CEOs, etc. and simply say: “You are going to want to employ our graduates. What do you expect them to know and be able to do?”

I'm slightly oversimplifying, but essentially that is what they do. They do this on an annual basis to adjust the program and the modules appropriately. I appreciate this approach as it keeps the program relevant and up-to-date, and I’m learning skills that employers are actually looking for.

That’s not to say that some aspects don’t get glossed over. For me, that topic was Machine Learning. Sure, we talked about it, but not in great depth. We even got philosophical, trying to define machine learning and artificial intelligence, and asking whether machines can really “think” on their own. I believe one day we will see computers using their "intelligence" to make unsupervised decisions. We also jokingly talked about how it may lead to the apocalypse when the machines eventually decide to take over the world. Hence we named our short lecture for that day “Machine Learning and the Apocalypse”.

Well, maybe not in my lifetime.

I recently talked to the director; they recognize machine learning's growing importance and are revamping the modules for the next academic year. Which means that I started the program a year too early.

But that’s ok. I mean isn’t one of the aspects of earning a Master’s degree to do independent learning and research? And to dive into those aspects driven by curiosity?

So that’s what I’m doing.

I randomly came across this website called Datacamp. There I found a tutorial about Machine Learning using baseball statistics. I love sports and baseball. I'm completely engaged in these data science concepts...and I suppose you could say my “dream job” once I finish this master’s in January would be to work for a professional sports organization to apply these data science skills. I was immediately drawn to this tutorial and had to try it.

I won’t bore you too much with the coding. But if you are curious, you can read through the tutorial to get the gist of it here

This is my first attempt (I loosely say 'first' because I tried it a bit before with an assignment, but didn't get very far with it as I was still trying to grasp the concept) at doing some machine learning on my own.

I loaded over 100 years of statistics for each baseball team, for each season going back to 1900, from a database into Python. I could then train a model to learn the relationship between all those statistics (runs scored, hits, doubles, triples, HRs, walks, fielding percentage, errors, home runs allowed, strikeouts, etc.) and the overall games won by that team in that year.

However, I altered the original code, because I was curious whether I could input the current statistics as of August 2nd, 2017 to predict the end-of-year wins for each team. The tutorial, as set up, predicts the wins up to the date the statistics were taken: if a team played 162 games, it predicts wins out of those 162; if a team only played 100 games, it predicts the wins from those 100 games. Thinking about it now, I suppose I could have used the winning percentage to extrapolate total wins over 162 games, and tested whether that was more accurate than the modeled results below.
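The winning-percentage alternative mentioned above is simple arithmetic, and could be sketched like this. The function name `project_wins` and the example record are hypothetical, chosen only to show the calculation.

```python
SEASON_GAMES = 162  # length of a full MLB regular season


def project_wins(wins_so_far, games_played, season_games=SEASON_GAMES):
    """Extrapolate end-of-season wins from the current winning percentage."""
    if games_played <= 0:
        raise ValueError("games_played must be positive")
    winning_pct = wins_so_far / games_played
    return winning_pct * season_games


# Hypothetical record: 60 wins after 100 games.
print(round(project_wins(60, 100), 1))
```

This assumes a team keeps winning at exactly the same rate for the rest of the season, which is the baseline the trained model would need to beat.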

Once the model is trained with the modified code, I can input the teams' statistics as of August 2nd for the current season to predict how many wins they will have at the end of the season. There are 2 more months of baseball left until the playoffs, so I'm posting this now so we can come back and see how close I got (and if it ends up being really close, I won't be accused of changing the results in my favor).

There needs to be a method to the madness. Stirring would be a poor choice.

But before inputting the new data, you need to test the model with known results. I mean, what good is a model if it's not accurate? When I tested my model with known results, its predictions were off by ±2.7 wins on average. I'm sure I could tweak it, maybe add more variables, to reduce that error. I also haven't checked the distribution or standard deviations. Had this been an assignment, I obviously would have been more thorough in testing the model. But like I said, this was my first attempt and I was more focused on whether I could get it to work, not necessarily on analyzing the accuracy...but an average of ±2.7 is not bad for one day's attempt....
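If the ±2.7 figure is read as a mean absolute error (my assumption about what "on average" means here), the check could look roughly like this. The win totals are made-up toy numbers, not my actual test results, and the standard deviation of the errors is included as one of the extra checks I skipped.

```python
from statistics import mean, stdev


def mean_absolute_error(actual, predicted):
    """Average size of the prediction miss, ignoring its direction."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Made-up toy numbers for illustration only.
actual_wins = [90, 85, 70, 100]
predicted_wins = [88, 86, 73, 96]

# Signed errors: positive means the model predicted too many wins.
errors = [p - a for a, p in zip(actual_wins, predicted_wins)]

mae = mean_absolute_error(actual_wins, predicted_wins)
spread = stdev(errors)  # sample standard deviation of the signed errors

print(mae, round(spread, 2))
```

A small average error with a large spread would mean the model is badly wrong for some teams even though it looks fine overall.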

Here are the results (from inputting the data from August 2nd, 2017).

So we'll see you in October when we can look at how well this model predicted the number of wins!


Thanks for viewing!

Up Next

Part 2 of the tutorial predicts which current MLB position players will make the HOF. I will use it as a guide to try to predict MLB pitchers' HOF probabilities.

I might also tackle the NFL, and see if college statistics and NFL Combine measurements could predict how well an athlete will perform in the NFL.



