Kiplagat Seroney
8 min read · Jul 11, 2022

#004 DATA SCIENCE BASICS USING THE 2020/21 BUNDESLIGA SEASON DATASET WITH PYTHON & FRIENDS.

“The Home of the 2020/21 Bundesliga Champions”, Photo by Herr Bohn on Unsplash.

Introduction

Hey there! Happy to be writing once again, and I hope you have been keeping well. As you may recall from my previous blogs, I have started my Data Science journey. In this post I put the skills gained so far to work by walking through my process of Exploratory Data Analysis (EDA) and visualizing some of the insights derived. Enjoy the journey, and hopefully you will learn something and derive your own insights, whatever dataset you work with.

Season Summary

  • The season started on 18th September 2020 and ended on 22nd May 2021.
  • Bayern Munich successfully defended their title, adding a ninth consecutive title to the trophy cabinet. That makes 30 Bundesliga titles for Bayern Munich. Bayern started with an 8–0 win in the first game, which is about 8% of their Goals For in the season. Such a strong start to a season.
  • With COVID-19 still lingering, the five-substitution rule (a response to fixture congestion) and playing with no fans or a restricted number of fans were the new normal. We will explore this using the Minutes metrics.
  • The season saw Robert Lewandowski score 41 goals, including five hat-tricks. Andre Silva came in second with 28 goals. It is important to note that minutes played matter, and we will be visualizing this later in this article.

This was just a quick summary; now let's dive into the Geek(ish) stuff.

ANALYSIS

The tools that facilitated this analysis include:

  • Scraping the JSON using Python, with inspiration from Amos Bastian
  • Python with pandas and seaborn. These helped me with EDA, including descriptive summary statistics and plotting for insights.
  • Tableau Public for visualization and finally;
  • Good ol’ Google.

Getting the data

Using the understat Python package, I was able to quickly get my JSON data. This is a snapshot of the said data.

JSON output from the understat package by Amos Bastian.

Now that we have our data, let's start our analysis.

NB: we can source the data in various ways: we can scrape the data from understat, thanks to McKay Johns, or scrape it using Google Sheets, as shown by Rob Carrol.

Exploratory Data Analysis

The table below shows us the league standings at the end of the season.

END OF THE 2020/21 BUNDESLIGA SEASON

I started by loading the data using pandas. Luckily for us, the scraped data is clean enough for our intended purposes.
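As a minimal sketch of that loading step, the snippet below reads a small JSON sample into a pandas DataFrame. The field names and values are assumptions made up for illustration; they are not the actual scraped file, so adjust them to whatever your export contains.

```python
import json

import pandas as pd

# Hypothetical two-row sample shaped like a player export; the field names
# and values here are assumptions for illustration, not the real dataset.
raw_json = """
[
  {"player_name": "Player A", "team_title": "Team X", "games": "29", "time": "2458", "goals": "12", "xG": "10.4"},
  {"player_name": "Player B", "team_title": "Team Y", "games": "32", "time": "2774", "goals": "3", "xG": "4.1"}
]
"""

records = json.loads(raw_json)
players = pd.DataFrame(records)

# Scraped JSON often stores numbers as strings, so convert them explicitly.
numeric_cols = ["games", "time", "goals", "xG"]
players[numeric_cols] = players[numeric_cols].apply(pd.to_numeric)

print(players.dtypes)
```

If your numbers already arrive as numeric types, the conversion step is harmless and can simply be dropped.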

We proceed to confirm the cleanliness of our data by using the .info() method.

The dataset we are using is small, so it is easy to read the table quickly. As we can see, there are no null values and the datatypes are well applied and presented.
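On a larger dataset, reading the .info() output by eye gets tedious, so here is a small sketch of how the same check could be done programmatically. The frame below is hypothetical and deliberately contains one missing value so the check has something to find.

```python
import pandas as pd

# Illustrative frame with one deliberately missing value (hypothetical data).
teams = pd.DataFrame(
    {
        "team": ["Team X", "Team Y", "Team Z"],
        "wins": [20, 11, None],
        "goals": [70, 45, 30],
    }
)

teams.info()  # prints column names, non-null counts and dtypes

# Count missing values per column; an all-zero result means no nulls.
null_counts = teams.isna().sum()
print(null_counts)

has_nulls = bool(null_counts.any())
print("Contains nulls:", has_nulls)
```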

From here, we describe the summary stats of our metrics using the .describe() method. Below is a sample snapshot. However, keep in mind that there might be outliers in the data.

Descriptive stats of the Bundesliga 2020/21 season.

Insights derived from the table above include:
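To make the outlier caveat concrete, here is a rough sketch that pairs .describe() with the common 1.5 × IQR rule of thumb. The goal tallies are made-up illustrative values, and the IQR rule is my own choice of check here, not something taken from the original analysis.

```python
import pandas as pd

# Hypothetical goals-per-team column; the last value is an intentional outlier.
goals = pd.Series([25, 36, 40, 45, 50, 52, 55, 60, 99], name="goals")

summary = goals.describe()  # count, mean, std, min, quartiles, max
print(summary)

# Flag values outside 1.5 * IQR of the quartiles, a common rule of thumb.
q1, q3 = goals.quantile(0.25), goals.quantile(0.75)
iqr = q3 - q1
outliers = goals[(goals < q1 - 1.5 * iqr) | (goals > q3 + 1.5 * iqr)]
print(outliers.tolist())
```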

  • The most wins (max) in the season was 24, by Bayern Munich. The fewest wins was 3, by the now relegated Schalke 04.
  • Bayern Munich scored the most goals, i.e. 99. In a 34-match season, that is an average of 2.9118 goals per match. The second-highest scoring team was Borussia Dortmund with 75 goals. That's an average of 2.2059 goals per match.
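The per-match averages above are simply season goals divided by the 34 matches each team plays. A tiny sketch of that arithmetic, using the two goal totals quoted in the list:

```python
MATCHES_PER_SEASON = 34  # each Bundesliga team plays 34 league matches

season_goals = {"Bayern Munich": 99, "Borussia Dortmund": 75}

# Goals per match, rounded to four decimal places as in the text.
goals_per_match = {
    team: round(goals / MATCHES_PER_SEASON, 4)
    for team, goals in season_goals.items()
}
print(goals_per_match)
```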

These are just a few ways in which we can use descriptive statistics to derive valuable insight to help cement our analysis.

CORRELATION HAS ENTERED THE CHAT

We will be exploring data from understat.com to produce the correlation analysis.

RAW CORRELATION MATRIX USING DATA FROM UNDERSTAT.COM

Correlation shows the strength of a relationship between two variables and is expressed numerically by the correlation coefficient. The correlation coefficient’s values range between -1.0 and 1.0.

Correlation measures the degree to which two variables move in relation to each other. We can quickly see that expected goals (xG) and Goals are positively correlated with a score of 0.95. This makes logical sense.

Let's further interrogate the data by asking:

  • Are the teams in the Bundesliga making the most of their chances? Or rather,
  • Are assists being converted to goals?
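For anyone curious what sits behind that single number, below is a from-scratch sketch of the Pearson correlation coefficient, which is the default that pandas uses in .corr(). The xG and goals lists are hypothetical values for illustration, so the result will not match the 0.95 from the real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (covariance numerator).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical xG and goals for five teams (illustrative values only).
xg = [30.0, 42.5, 48.0, 55.5, 80.0]
goals = [28, 45, 44, 60, 85]

r = pearson(xg, goals)
print(round(r, 2))
```

A perfectly increasing straight-line relationship gives 1.0 and a perfectly decreasing one gives -1.0, which is where the range quoted earlier comes from.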

We can answer this by looking at the correlation between Assists and Goals, which is 0.53 (a moderate score).

We will explore correlations of features like goals and assists across various leagues in the next article to see what's what in each of them.

NB: 0.0 to 0.3 represents a weak positive correlation, 0.3 to 0.7 a moderate positive correlation and 0.7 to 1.0 a strong positive correlation.
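The bands in the note can be turned into a small helper. This is only a sketch under two assumptions of mine: a value sitting exactly on a boundary (0.3 or 0.7) goes into the higher band, and negative coefficients are labelled with the same bands mirrored, which the note itself does not spell out.

```python
def correlation_strength(r: float) -> str:
    """Label a correlation coefficient using the 0.3 / 0.7 cut-offs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation coefficient must lie in [-1, 1]")
    if r == 0:
        return "no correlation"
    magnitude = abs(r)
    if magnitude < 0.3:
        strength = "weak"
    elif magnitude < 0.7:
        strength = "moderate"
    else:
        strength = "strong"
    # Assumption: negative values reuse the same bands, mirrored.
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction}"


print(correlation_strength(0.95))  # xG vs Goals
print(correlation_strength(0.53))  # Assists vs Goals
```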

Team Analysis and Player Analysis

1. Team Analysis

Correlation

Let us now use our football knowledge with the help of a correlation matrix to gain insights.

From the table above we can pick out the following:

a. Strong positive Correlation Features include:

  • Points with Goals, xG, xPTS
  • xG with Goals, Points , xPTS, Goals per Game etc.
  • Goals with Points, xG, xPTS etc.

b. Strong negative Correlation Features include:

  • Goals with Goals Against: this implies a negative relationship between the two columns.
  • Points with Goals Against: same implication as above.
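Picking the strong pairs out of a correlation matrix by eye works for a handful of columns, but it can also be scripted. The sketch below uses a hypothetical six-team table (illustrative numbers, not the real standings) and my own choice of a 0.7 cut-off to list the strongly correlated pairs, positive or negative.

```python
import pandas as pd

# Hypothetical team table (illustrative numbers, not the real standings).
teams = pd.DataFrame(
    {
        "points": [78, 65, 50, 41, 33, 16],
        "goals": [99, 60, 55, 48, 36, 25],
        "goals_against": [44, 38, 50, 55, 60, 86],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

corr = teams.corr()  # Pearson by default
print(corr.round(2))

# Visit each column pair once and keep those with |r| >= 0.7.
strong_pairs = {}
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r = corr.loc[a, b]
        if abs(r) >= 0.7:
            strong_pairs[(a, b)] = round(float(r), 2)
print(strong_pairs)
```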

Team Insights from data

Points vs Expected Points.

Photo by Tobias Rehbein on Unsplash

RB Leipzig finished second in the 2020/21 season with 65 points, 13 points behind table-topping Bayern Munich.

In football, outcomes are the result of various actions and reactions (remember the correlation above?). 19 wins (57 points) and 8 draws (8 points gained) were enough to see RB Leipzig take second place. However, 8 draws (16 points lost) and 7 losses (21 points lost) were also enough to leave them second last in the metric we are using today: the difference between Points and Expected Points.
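The points arithmetic in that paragraph can be checked in a few lines. This sketch uses only the record quoted above (19 wins, 8 draws, 7 losses) and the standard 3 points for a win and 1 for a draw.

```python
POINTS_WIN, POINTS_DRAW = 3, 1

wins, draws, losses = 19, 8, 7  # RB Leipzig's 2020/21 record quoted above

matches = wins + draws + losses
points_gained = wins * POINTS_WIN + draws * POINTS_DRAW
max_points = matches * POINTS_WIN
points_dropped = max_points - points_gained

# Points dropped split by cause: a draw costs 2 points, a loss costs 3.
dropped_in_draws = draws * (POINTS_WIN - POINTS_DRAW)
dropped_in_losses = losses * POINTS_WIN

print(matches, points_gained, points_dropped)
print(dropped_in_draws, dropped_in_losses)
```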

Difference Between Points and Expected Points

This part of the analysis was heavily inspired by Thomas Whelan, whose platforms have a lot of material and resources for learning Data Science. Check out his websites on Data Science and Machine Learning and on fantasy soccer analysis using Python.

The following viz was done in Python using seaborn. It visualizes the difference between Points and Expected Points. It is interesting to compare the diagram below with the actual points at the end of the season. Kindly take some time to look at the points table alongside the diagram illustrating Points vs Expected Points.

From the above graphic, we can quickly see which teams over/underperformed using the Difference between Points and Expected Points model.
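The numbers behind such a chart come from a single subtraction. Here is a minimal sketch with hypothetical points and xPTS values (the column names and figures are assumptions for illustration); the sorted frame it produces is the kind of input a seaborn bar plot would take.

```python
import pandas as pd

# Hypothetical points and expected points (xPTS); illustrative values only.
table = pd.DataFrame(
    {
        "team": ["Team A", "Team B", "Team C", "Team D"],
        "points": [78, 65, 45, 33],
        "xpts": [72.5, 70.0, 41.0, 39.5],
    }
)

# Positive difference: more points than the chances suggested (overperformed).
table["pts_minus_xpts"] = table["points"] - table["xpts"]
table["verdict"] = table["pts_minus_xpts"].apply(
    lambda d: "overperformed" if d > 0 else "underperformed" if d < 0 else "as expected"
)

# Rank teams from biggest overperformer to biggest underperformer.
ranked = table.sort_values("pts_minus_xpts", ascending=False).reset_index(drop=True)
print(ranked)
```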

2. Player Analysis

The player data is from understat. Let's take a quick look at the summary statistics.

We can summarize the 2020/21 season as:

  • The most games played is 34, the least is 1 and the average is 19.
  • The most goals scored is 41 (Lewandowski), the least is 0 and the average is 2.
  • The highest xG is 32 expected goals (Lewandowski), the lowest is 0 and the average is 2.
  • The most assists is 18 (Müller), the minimum is 0 and the average is 1.
  • The most shots is 135 (Lewandowski), the minimum is 0 and the average is 15.
  • The most yellow cards is 11 (Nicolas Höfler), the minimum is 0 and the average is 2.
  • The most non-penalty goals (NPG) is 33 (Lewandowski), the minimum is 0 and the average is 2.

Player insights from Data.

We can fine-tune the data by removing "noise". I will classify specific groups according to my needs. For example, I will look at the distribution of scoring players and group them according to goals scored. Let's use a graph to illustrate:

We observe the skewness of the data: fewer and fewer players appear as the goal count rises. This makes sense, since there are a whole lot of goalkeepers, defenders, midfielders and even strikers who did not score. We can analyze the information in the graphic and ask:

  • Who is the lowest-scoring striker?
  • Who are the top-scoring defenders? etc.

Now that we have a snapshot of the season's stats, let us start asking specific questions like:
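Grouping players by goals scored can be sketched with pd.cut. The goal tallies and the bin edges below are hypothetical choices of mine for illustration; pick whatever bands suit your own question.

```python
import pandas as pd

# Hypothetical goal tallies for ten players (illustrative values only).
goals = pd.Series([0, 0, 0, 1, 1, 2, 4, 7, 12, 41], name="goals")

# Right-closed bins: (-1, 0], (0, 5], (5, 10], (10, 50]
bins = [-1, 0, 5, 10, 50]
labels = ["0", "1-5", "6-10", "11+"]
groups = pd.cut(goals, bins=bins, labels=labels)

# Count players per band, keeping the bands in their natural order.
counts = groups.value_counts().reindex(labels)
print(counts)

# Positive skew means a long right tail: many low scorers, few high scorers.
print(round(goals.skew(), 2))
```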

1. Who were the Top Scorers?

2. Who bagged the Most Assists?

3. Who played less (fewer than 2200 minutes, or roughly 24 games' worth of 90 minutes) and scored more (over 10 goals)?

4. Shots Converted to Goals rate?
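Each of these four questions maps to a short pandas expression. The sketch below uses a hypothetical five-player frame (the names, column labels and numbers are assumptions for illustration) to show one way of asking them.

```python
import pandas as pd

# Hypothetical player rows (illustrative values, not the real dataset).
players = pd.DataFrame(
    {
        "player": ["P1", "P2", "P3", "P4", "P5"],
        "minutes": [2460, 1800, 2900, 2100, 950],
        "goals": [41, 12, 9, 15, 2],
        "assists": [7, 3, 18, 5, 1],
        "shots": [135, 40, 50, 60, 10],
    }
)

# 1 and 2: top three by goals and by assists.
top_scorers = players.nlargest(3, "goals")[["player", "goals"]]
top_assists = players.nlargest(3, "assists")[["player", "assists"]]

# 3: played under 2200 minutes (about 24.4 full matches) yet scored over 10.
efficient = players[(players["minutes"] < 2200) & (players["goals"] > 10)]

# 4: share of shots that became goals, as a percentage.
players["conversion_pct"] = (players["goals"] / players["shots"] * 100).round(1)

print(top_scorers)
print(top_assists)
print(efficient[["player", "minutes", "goals"]])
print(players[["player", "conversion_pct"]])
```

Note that a player with zero shots would produce a division by zero in the conversion rate, so on real data it may be worth filtering those rows out first.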

We can vary the answers to suit the question asked. You have to appreciate and love the art, science, power and capabilities of technology and information.

CONCLUSION

There you have it. The possibilities and technologies in Data Science help one get creative in querying data. This type of analysis applies to all 'spreadsheet-like' data. Feel free to contact me in case of any questions. Thanks and cheers.
