Soccer Predictions using Python (part 1)

in #python, 7 years ago

I've seen many articles online describing how the Poisson distribution can be used to predict soccer scores.  However, I haven't seen much in the way of practical examples.
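For anyone who hasn't met the idea before, here's a rough sketch of what those articles usually describe. It isn't the model this series will end up with. The scoring rates (1.6 and 1.1) are made up, the function names are just for illustration, and treating the two teams' goals as independent is a simplifying assumption.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number of goals is lam."""
    return (lam ** k) * exp(-lam) / factorial(k)


def scoreline_probability(home_goals, away_goals, home_rate, away_rate):
    """Probability of one exact scoreline, assuming the teams score independently."""
    return poisson_pmf(home_goals, home_rate) * poisson_pmf(away_goals, away_rate)


def outcome_probabilities(home_rate, away_rate, max_goals=10):
    """Add up scoreline probabilities into home win, draw and away win."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = scoreline_probability(h, a, home_rate, away_rate)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    return home, draw, away


# made-up scoring rates, purely for illustration
home_win, draw, away_win = outcome_probabilities(1.6, 1.1)
print(f"home {home_win:.3f}, draw {draw:.3f}, away {away_win:.3f}")
```

Scorelines above max_goals are ignored, so the three numbers add up to very slightly less than 1.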

It seems to me that this might be an ideal subject for my first post on Steemit.

Now, before you read any further, I'm a hobbyist, so don't expect high quality code!  I've just recently started learning Python so this will be a learning process for me more than anything.  I taught myself 6502 assembly language in my early teens and wrote a couple of (unpublished) games on the Commodore 64.  I was also involved in the demo scene and coded many a fine starfield.  Later, I moved to the Amiga, 68000 code and some C, before everything went PC and I ended up using Visual Basic.

I drifted away from programming for a long time but now that I'm using Linux, I've got the bug again, so I've been learning Python.  Big shout out to pythonprogramming.net & www.youtube.com/user/sentdex for getting me this far.

But enough about me, let's get to work.

First thing we need is some data, so we're going to use Beautiful Soup to scrape some from the web.

There are plenty of places to find historical soccer results, some with ready-made .CSV files that can be downloaded, but I've decided to scrape raw data from the Soccer Punter website (www.soccerpunter.com).  There are a couple of reasons for this choice, but the main one is that the available .CSVs usually seem to take a few days to be updated with the latest results.

I'm going to assume you all know how to install pandas, beautifulsoup, lxml (the parser the code asks Beautiful Soup to use) and selenium, or are clever enough to find out how elsewhere. ;-)

Here's what I've come up with so far...

import pandas as pd
from bs4 import BeautifulSoup as bs
from selenium import webdriver
import datetime

def scrapeseason(country, comp, season):
   # output what the function is attempting to do.
   print("Scraping:", country, comp, str(season)+"-"+str(season+1))
   baseurl = "http://www.soccerpunter.com/soccer-statistics/"
   scrapeaddress = (baseurl + country + "/" + comp.replace(" ", "-").replace("/", "-") + "-"
                    + str(season) + "-" + str(season + 1) + "/results")
   print("URL:", scrapeaddress)
   print("")

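   # scrape the page and create beautifulsoup object
   # note: newer Selenium releases no longer include the PhantomJS driver,
   # so a headless Firefox or Chrome driver may be needed here instead
   sess = webdriver.PhantomJS()
   sess.get(scrapeaddress)
   page = bs(sess.page_source, "lxml")
   # close the browser session now that we have the page source
   sess.quit()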

   # find the main data table within the page source
   maintable = page.find("table", "competitionRanking")

   # separate the data table into rows
   games = maintable.find_all("tr")

   # create an empty pandas dataframe to store our data
   df = pd.DataFrame(columns=["date", "homeTeam", "homeScore", "awayScore", "awayTeam"])

   idx = 0
   today = datetime.date.today()

   for game in games:

       # these lines filter out any rows not containing game data, some competitions contain extra info.
       try:
           cls = game["class"]
       except KeyError:
           cls = "none"
       if ("titleSpace" not in cls and "compHeading" not in cls and
               "matchEvents" not in cls and "compSubTitle" not in cls and cls != "none"):

           datestr = game.find("a").text
           gamedate = datetime.datetime.strptime(datestr, "%d/%m/%Y").date()

           # filter out "extra time", "penalty shootout" and "neutral ground" markers
           hometeam = game.find("td", "teamHome").text
           hometeam = hometeam.replace("[ET]", "").replace("[PS]", "").replace("[N]", "").strip()
           awayteam = game.find("td", "teamAway").text
           awayteam = awayteam.replace("[ET]", "").replace("[PS]", "").replace("[N]", "").strip()

           # if game was played before today, try and get the score
           if gamedate < today:
               scorestr = game.find("td", "score").text

               # if the string holding the scores doesn't contain " - " then it hasn't yet been updated
               if " - " in scorestr:
                   homescore, awayscore = scorestr.split(" - ")

                   # make sure the game wasn't cancelled, postponed or suspended
                   if homescore != "C" and homescore != "P" and homescore != "S":
                       # store game in dataframe
                       df.loc[idx] = {"date": gamedate.strftime("%Y-%m-%d"),
                                      "homeTeam": hometeam,
                                      "homeScore": int(homescore),
                                      "awayScore": int(awayscore),
                                      "awayTeam": awayteam}
                       # update our index
                       idx += 1
           else:
               # it's a future game, so store it with scores of -1
               df.loc[idx] = {"date": gamedate.strftime("%Y-%m-%d"),
                              "homeTeam": hometeam,
                              "homeScore": -1,
                              "awayScore": -1,
                              "awayTeam": awayteam}
               idx += 1

   # sort our dataframe by date
   df.sort_values(['date', 'homeTeam'], ascending=[True, True], inplace=True)
   df.reset_index(inplace=True, drop=True)
   # add a column containing the season, it'll come in handy later.
   df["season"] = season
   return df

# set which country and competition we want to use
# others to try, "Scotland" & "Premiership" or "Europe" & "UEFA Champions League"
country = "England"
competition = "Premier League"
lastseason = 2016
thisseason = 2017

lastseasondata = scrapeseason(country, competition, lastseason)
thisseasondata = scrapeseason(country, competition, thisseason)

# combine our data to one frame
data = pd.concat([lastseasondata, thisseasondata])
data.reset_index(inplace=True, drop=True)

# save to file so we don't need to scrape multiple times
data.to_csv("data.csv")
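The fiddliest part of the function above is deciding whether a row actually holds a result, so here's that logic pulled out into a small sketch that can be tried without touching the website. The helper names parse_score and normalise_date are my own for illustration, and the sample strings are hand-written guesses at what the score cell might contain, not text copied from the site.

```python
import datetime


def parse_score(scorestr):
    """Return (home, away) as ints, or None if there is no usable result."""
    # no " - " means the score hasn't been filled in yet
    if " - " not in scorestr:
        return None
    homescore, awayscore = scorestr.split(" - ")
    # C, P and S mark cancelled, postponed and suspended games
    if homescore in ("C", "P", "S"):
        return None
    return int(homescore), int(awayscore)


def normalise_date(datestr):
    """Turn a dd/mm/yyyy string into the yyyy-mm-dd form stored in the dataframe."""
    return datetime.datetime.strptime(datestr, "%d/%m/%Y").date().strftime("%Y-%m-%d")


print(parse_score("2 - 1"))
print(parse_score("P - P"))
print(parse_score("vs"))
print(normalise_date("13/08/2016"))
```

Storing dates as yyyy-mm-dd means they sort correctly even as plain text, which is what the sort on the date column relies on.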

Okay, that's enough for now.  If you run this you'll have a file called data.csv.  Load it up in a spreadsheet and confirm it looks OK and I'll be back soon with some code to do something with our new data.
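If you'd rather not open a spreadsheet, a quick check can be done in Python too. This is only a sketch using the standard csv module: summarise_results is a name I've made up, and the sample rows are invented teams in the column layout that to_csv should produce (an unnamed index column first), so it runs without data.csv.

```python
import csv
import io


def summarise_results(csvfile):
    """Count played and upcoming games per season from an open CSV file."""
    summary = {}
    for row in csv.DictReader(csvfile):
        counts = summary.setdefault(row["season"], {"played": 0, "upcoming": 0})
        # the scraper stores future games with a score of -1
        if int(row["homeScore"]) == -1:
            counts["upcoming"] += 1
        else:
            counts["played"] += 1
    return summary


# tiny hand-written sample, invented purely for illustration
sample = io.StringIO(
    ",date,homeTeam,homeScore,awayScore,awayTeam,season\n"
    "0,2016-08-13,Team A,2,1,Team B,2016\n"
    "1,2017-08-12,Team C,0,0,Team D,2017\n"
    "2,2018-05-13,Team E,-1,-1,Team F,2017\n"
)
print(summarise_results(sample))
```

To run it against the real file, open it with open("data.csv", newline="") and pass the file object to summarise_results.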

In the meantime, if anyone has any questions, tips, advice or abuse they'd like to share, please do.

