Initialization

Ever wonder who had the heaviest offensive line in the NFL? Or which team had the most penalties over the past few seasons? What if I told you that you didn't need to be an ESPN statistician to figure this out? What if---after a minimal investment in setup---you could easily query NFL data and start building your own predictive models?

Well I'm here to tell you that it is actually pretty easy to do---even for non-coders. After a few days of tinkering I was able to establish an easy to use, mostly accurate pipeline for analyzing NFL data. This post is directed towards people of all skill levels, but particularly novices.

What you'll need:

  • Willingness to learn new things and to grind out the many small issues that inevitably arise
  • A computer running MacOS (or any other Unix based system)
  • Python
  • nfldb
  • PostgreSQL

Python

Python is a powerful yet easy to use programming language capable of processing and analyzing data. It is widely used throughout the industry and is pre-installed on MacOS (although we will be using Python 3). There are about a million different tutorials you can use to become more familiar with how Python is used; I recommend Python the Hard Way. Interactive tutorials are available at Datacamp.

Although Python is pre-installed on MacOS, I recommend installing it via Anaconda. Anaconda is a convenient bundle of Python and many of the additional libraries necessary to do data analysis. I will be using Python 3, so be sure to download that version (note that the download may take a while).

Python code can be edited and run either by the command line or an interactive development environment (IDE). I highly recommend using an IDE as they are significanlty more useful in working with code. Anaconda comes with Jupyter (what I am using to write and publish this article). Jupyter uses your web browser to allow you to edit and execute your code in chunks. However, I use Spyder to develop my code because it allows me to see variables, dataframes, arrays, etc. that I have generated. Once installed type either jupyter notebook or spyder in terminal and a new session will begin.

SQL

SQL (structured query language) is an ancient method of querying databases. It's super easy to figure out once you see it in practice. The bulk of this post will be a series of queries designed to illustrate the basic functionality of SQL.

PostgreSQL

PostgreSQL is a database management system that allows you to store and query your data using SQL. There are several ways to install PostgreSQL. You can either use the Homebrew method or Postgres App. The Postgres App is the easiest option for MacOS. I am currently using Postico to view my data and test out different queries.

NFLDB

Once you have PostgreSQL setup, you are ready to get the NFL data. nfldb is a module designed to scrape data from the NFL website and put it into a SQL table. The instructions for installing nfldb are overly complicated, so here is a much more simple guide. Basically, all you need to do is create a blank Postgres datbase, run the SQL file, and then make a minor configuration change.

The following is a list of instruction you will need to type into your command line. On Mac, simply open terminal and make sure you are in your ~ (aka your root directory). If you are unsure what the name of your ~ directory is, open finder and look at the name next to the house icon. Your terminal should open to ~ by default.

  1. Open terminal and type psql to launch PostgreSQL
  2. Type CREATE DATABASE nfldb;, then \connect nfldb, and then CREATEUSER nfldb;
  3. Download the SQL file and unzip it to your ~ directory
  4. Make sure terminal is in your ~ directory and type psql -U nfldb nfldb < nfldb.sql. This should create an empty NFL database in PostgreSQL
  5. Now type pip install nfldb to install the Python code necessary to fill in the NFL database
  6. Now type locate config.ini.sample. It should return a path like /usr/local/share/nfldb/config.ini.sample. Highlight this path and copy it
  7. Type mkdir -p $HOME/.config/nfldb and press enter
  8. Type cp <paste the path from step 6 here> $HOME/.config/nfldb/config.ini

And that's it! Everything should work now. The installation guide has a few exercises to make sure everything is up and running.

Querying the Data

First thing we need to do is import the libraries we will be using. Pandas is an incredibly useful library for manipulating data. Here we will use the pandas.read_sql method to query the data and display the results in a nice table format. We will also be using psycopg2 to create a connection to the PostgreSQL database.

If for some reason neither pandas nor psycog2 is not installed, use pip <package name> or conda install <package name> to perform the installation.

In [1]:
# import pandas
import pandas as pd

# import psycopg2
import psycopg2

# provide your credentials; yours should look the same except with you user name
conn=psycopg2.connect("dbname='nfldb' user='kthomas1' host='localhost' password='' port=5432")

# create the connection
cur=conn.cursor()

SELECT, FROM, WHERE, LIMIT

Here we will perform the most basic query. Let's look at the first 10 players for the Carolina Panthers listed in the player table. First we SELECT the variables we want FROM the player table. We use the WHERE command to filter to just the Panthers. Finally, we only want the first 10 observations, so type LIMIT 10 at the end.

In [2]:
pd.read_sql("""SELECT full_name, position, uniform_number, height, weight, college 
FROM player 
WHERE team='CAR' 
LIMIT 10""",con=conn)
Out[2]:
full_name position uniform_number height weight college
0 Robert McClain DB 27 69 195 Connecticut
1 Daryl Williams OT 60 78 335 Oklahoma
2 Jonathan Stewart RB 28 70 240 Oregon
3 Brenton Bersin WR 11 75 210 Wofford
4 Ted Ginn WR 19 71 180 Ohio State
5 James Bradberry CB 24 73 210 Samford
6 Daryl Worley CB 26 73 205 West Virginia
7 Ryan Kalil C 67 74 300 USC
8 Dean Marlowe FS 29 73 210 James Madison
9 Tyler Larsen OG 69 76 335 Utah

And here we see the first limitation of the NFLDB dataset---it is outdated. Several of these players are not on the Carolina roster as of 2017. I have tried updating the database but have been unsuccessful. I will update this post if I can determine a solution.

One of the nice things about Jupyer + pandas are that these dataframes have sortable columns.

COUNT, AVG, SUM, MAX, MIN

SQL allows us to count the number of entries that meet a certain criteria. We can also take the sum or average or find the max/min of a column.

In [3]:
# total number of posessions by Carolina from 2009-2016
cur.execute("""SELECT COUNT(pos_team)
FROM play
WHERE pos_team='CAR'""")

pos,=cur.fetchall()
print("Total Number of Possessions by Carolin from 2009-2016: {:,}".format(int(pos[0])))
Total Number of Possessions by Carolin from 2009-2016: 13,532
In [5]:
# average penalty yards for buffalo
cur.execute("""SELECT AVG(penalty_yds)
FROM play
WHERE pos_team='BUF' AND penalty_yds!=0""")

penalty,=cur.fetchall()
print("Average Penalty Yards for Buffalo from 2009-2016: {:.2f}".format(float(penalty[0])))
Average Penalty Yards for Buffalo from 2009-2016: 8.64
In [6]:
# total weight of offensive line for oakland
cur.execute("""SELECT sum(weight)
FROM player
WHERE team='OAK' AND (position='OL' OR position='OG' OR position='OT' OR position='C')""")

weight,=cur.fetchall()
print("Total Weight of the Oakland offensive line: {:,} lbs".format(int(weight[0])))
Total Weight of the Oakland offensive line: 1,611 lbs
In [7]:
# the shortest player in the database
cur.execute("""SELECT min(height)
FROM player
WHERE status='Active'""")

height,=cur.fetchall()
height_feet = float(height[0])/12
print("Height of the shortest player in the database: {:.1f} ft".format(float(height_feet)))
Height of the shortest player in the database: 5.5 ft

ORDER BY

If we want to sort our queries we can use ORDER BY. Here we get a list of the 10 heaviest players in the database by using a descending order by on weight.

In [8]:
pd.read_sql("""select full_name, position, team, weight
from player
where status='Active'
order by weight DESC
limit 10""",con=conn)
Out[8]:
full_name position team weight
0 John Jenkins DT NO 359
1 Denzelle Good OT IND 355
2 Trent Brown OT SF 355
3 Dan McCullers-Sanders NT PIT 352
4 Alan Branch DT NE 350
5 Damon Harrison DT NYG 350
6 Dontari Poe NT KC 346
7 Cordy Glenn T BUF 345
8 Haloti Ngata DT DET 345
9 Paul Soliai DT CAR 345

DISTINCT, GROUP BY, CHAR_LENGTH

When exploring a database it is often useful to see the unique values in a column. SQL allows us to do that by using the DISTINCT command.

In [9]:
# all the positions from the player db
pd.read_sql("""select DISTINCT position
from player""",con=conn)
Out[9]:
position
0 RB
1 WR
2 G
3 SS
4 TE
5 DE
6 MLB
7 OG
8 NT
9 DB
10 LB
11 SAF
12 OT
13 UNK
14 K
15 FB
16 T
17 OL
18 ILB
19 QB
20 DT
21 LS
22 P
23 OLB
24 DL
25 C
26 FS
27 CB

To figure out who had the most penalty yards from 2009-2016, we'll need to combine SUM, GROUP BY, and then ORDER BY. The query below essentially sums every entry in the penalty_yds column for each team. We also use the AS operator to rename the sum(penalty_yds).

In [10]:
# who had the most penalty yards for 2009-2016?
pd.read_sql("""select pos_team, sum(penalty_yds) as total_penalty_yards
from play
group by pos_team
order by sum(penalty_yds) DESC""",con=conn)
Out[10]:
pos_team total_penalty_yards
0 GB 10443
1 BAL 10375
2 SEA 10017
3 OAK 9909
4 TB 9849
5 PIT 9691
6 DEN 9577
7 IND 9570
8 ARI 9562
9 PHI 9506
10 CIN 9469
11 NE 9414
12 NO 9359
13 WAS 9240
14 NYG 9109
15 BUF 9083
16 TEN 8999
17 ATL 8993
18 DAL 8915
19 DET 8913
20 SF 8802
21 CHI 8782
22 SD 8758
23 MIN 8712
24 CAR 8668
25 JAC 8656
26 NYJ 8437
27 KC 8373
28 MIA 8236
29 CLE 8156
30 STL 7957
31 HOU 7909
32 LA 1147
33 JAX 110
34 LAC 73
35 UNK 0

Every play in the database is accompanied by a description of what happened on that play. If we wanted to find out which play has the longest description we can use the CHAR_LENGTH function along with ORDER BY DESC.

In [11]:
# play that have the longest description

cur.execute("""select description
from play
order by char_length(description) DESC
limit 1""")

cur.fetchall()
Out[11]:
[('(:50) (Shotgun) D.Brees pass short left to M.Colston to CLE 22 for 4 yards (E.Wright). FUMBLES (E.Wright), and recovers at CLE 23. M.Colston to CLE 23 for no gain (E.Wright). After the play, the Saints attempted to rush a field goal attempt, as Cleveland through the CHALLENGE flag, asserting that Colston FUMBLED and the ball was recovered by Cleveland.  This challenge was DENIED, as the play was not reviewable. After the third quarter ended, the Saints then challenged the ruling that Colston was down by contact. New Orleans challenged the runner was down by contact ruling, and the play was REVERSED. (Shotgun) D.Brees pass short left to M.Colston to CLE 21 for 5 yards (E.Wright). After the play, the Saints attempted to rush a field goal attempt, as Cleveland through the CHALLENGE flag, asserting that Colston FUMBLED and the ball was recovered by Cleveland.  This challenge was DENIED, as the play was not reviewable. After the third quarter ended, the Saints then challenged the ruling that Colston was down by contact.  The overturn on review negated the fumble.',)]

I've read through that play description a few times and I still don't know what happened ¯\| ಠ ∧ ಠ |

JOINS

Joining data is one of the most important functions of SQL but it is also the one most likely to lead to confusion. There are many resources on SQL joins, but here we will just give a basic example.

Our queries so far have used the entire dataset. For example, the table on most penalty yards was from all years in the play dataset. If we wanted to filter to just one year we would need to bring in the year from the game dataset. This would require us to join the game dataset with the play dataset.

When joining tables, each variable that is selected needs to be prefaced by the name of the table it is coming from followed by a ".". For example, selecting pos_team from the play dataset would be written as play.pos_team.

This is a very simple join and as such we can match in the tables simply using a WHERE clause that matches the shared game ids (gsis_id) between the two tables. We then filter to the year 2015 and the regular season.

In [12]:
# most penalty yards in 2015; need to join play to game and filter to regular season
pd.read_sql("""select play.pos_team, sum(play.penalty_yds)
from play, game where play.gsis_id=game.gsis_id and game.season_year=2015 and game.season_type='Regular'
group by play.pos_team
order by sum(play.penalty_yds) DESC
limit 10""",con=conn)
Out[12]:
pos_team sum
0 GB 1296
1 TB 1152
2 ARI 1149
3 SD 1104
4 NE 1064
5 PHI 1061
6 BUF 1059
7 STL 1029
8 PIT 1023
9 NYG 1020

WITH SUBQUERIES

This is a slightly more advanced topic. WITH statements are a powerful tool that allows you to create sub tables and then join them together. For example, if wanted additional aggregate statistics for each drive not including in the drives dataset, we would have to aggregate these numbers from the play table to the drive level then join it to the drive table. The WITH statement allows us to do this in one query.

For clarity is an example. Let's say we want the following table:

game_id drive_id yards_gained passing_first_downs
1 1 40 2
1 2 70 3
1 3 20 0

We would have to join the drive table:

game_id drive_id yards_gained
1 1 40
1 2 70
1 3 20

With the play table (here is a truncated example --- imagine the passing_first_down column sums to 2 where drive_id==1):

game_id drive_id play_id passing_first_down
1 1 1 0
1 1 2 1
1 1 3 0

We would take the sum of passing_first_down across game_id and drive_id and then link the drive table and the play table on game_id and drive_id.

In [13]:
pd.read_sql("""with plays as
	(select gsis_id, drive_id, sum(passing_first_down) as pass_first, sum(penalty_first_down) as penalty_first, sum(rushing_first_down) as rushing_first, sum(first_down) as first, sum(third_down_conv) as third
	from play
	group by gsis_id, drive_id),
drives as
	(select gsis_id, drive_id, pos_team, result, start_field, pos_time, penalty_yards, yards_gained, play_count
	from drive)
select * from drives
inner join plays on (drives.gsis_id=plays.gsis_id and drives.drive_id=plays.drive_id)
limit 10""",con=conn)
Out[13]:
gsis_id drive_id pos_team result start_field pos_time penalty_yards yards_gained play_count gsis_id drive_id pass_first penalty_first rushing_first first third
0 2009081350 1 PIT Punt (-30) (239) 4 15 6 2009081350 1 0 0 1 1 1
1 2009081350 3 PIT Punt (-42) (344) -5 33 8 2009081350 3 2 0 0 2 1
2 2009081350 5 PIT Field Goal (-39) (193) -5 62 6 2009081350 5 1 0 0 1 0
3 2009081350 12 PIT Punt (-22) (55) 0 7 3 2009081350 12 0 0 0 0 0
4 2009081350 14 PIT Punt (-40) (92) 0 8 3 2009081350 14 0 0 0 0 0
5 2009081350 15 ARI Interception (-4) (29) -10 0 1 2009081350 15 0 0 0 0 0
6 2009081350 20 PIT Punt (-16) (77) -5 3 3 2009081350 20 0 0 0 0 0
7 2009081350 23 ARI End of Game (-18) (101) 5 55 10 2009081350 23 3 0 1 4 1
8 2009081351 1 PHI Punt (-23) (100) -10 16 3 2009081351 1 0 0 0 0 0
9 2009081351 3 PHI Punt (-28) (54) 0 3 3 2009081351 3 0 0 0 0 0