Over the past several years, there has been extensive debate among NBA fans as to who is the GOAT (Greatest of all time). The two most common answers to this question seem to be Michael Jordan and LeBron James. Here are some articles ranking the best players of all time:
These are the top 4 Google results when looking up "best basketball players of all time". They all agree on who the top 2 are, and that they are incredibly close.
LeBron James' accomplishments include: 10 NBA Finals appearances, 4 NBA championship titles, 4x NBA Finals MVP, 4x League MVP, 17x All Star, 2007-08 Scoring Champ, 2019-20 Assist Champ.
Michael Jordan's accomplishments include: 6x NBA Finalist, 6x NBA champion, 6x NBA Finals MVP, 5x league MVP, 14x All Star, 10x Scoring Champ.
I will attempt to show which player has the more impressive career and truly deserves the title of The GOAT.
These are the libraries that I will use to scrape data, store it, and graph it.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
#included this to get rid of warnings that made seeing outputs difficult
import warnings
warnings.filterwarnings('ignore')
All of the data used in this project comes from basketball-reference.com.
#Getting Lebron's career stats
james_page = requests.get('https://www.basketball-reference.com/players/j/jamesle01.html')
james_soup = BeautifulSoup(james_page.text, 'html.parser')
james_per_game = pd.read_html(james_page.text)
james = pd.DataFrame(james_per_game[0])
#LeBron's playoff stats
james_po = pd.DataFrame(james_per_game[1])
#getting Jordan's career stats
jordan_page = requests.get('https://www.basketball-reference.com/players/j/jordami01.html')
jordan_soup = BeautifulSoup(jordan_page.text, 'html.parser')
jordan_per_game = pd.read_html(jordan_page.text)
jordan = pd.DataFrame(jordan_per_game[0])
#Jordan's Playoff stats
jordan_po = pd.DataFrame(jordan_per_game[1])
Below, I build DataFrames that hold the stats of every player in the league for each season played by LeBron and Jordan. The 1985-86 season is left out because Jordan played only 18 of 82 games that season. The 1994-95 season is omitted for the same reason.
#scrape data for specific year, then add year column
league_data_04 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2004_per_game.html')[0])
league_data_04["year"] = [2004 for k in range(0,league_data_04.GS.count())]
league_data_05 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2005_per_game.html')[0])
league_data_05["year"] = [2005 for k in range(0,league_data_05.GS.count())]
league_data_06 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2006_per_game.html')[0])
league_data_06["year"] = [2006 for k in range(0,league_data_06.GS.count())]
league_data_07 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2007_per_game.html')[0])
league_data_07["year"] = [2007 for k in range(0,league_data_07.GS.count())]
league_data_08 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2008_per_game.html')[0])
league_data_08["year"] = [2008 for k in range(0,league_data_08.GS.count())]
league_data_09 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2009_per_game.html')[0])
league_data_09["year"] = [2009 for k in range(0,league_data_09.GS.count())]
league_data_10 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2010_per_game.html')[0])
league_data_10["year"] = [2010 for k in range(0,league_data_10.GS.count())]
league_data_11 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2011_per_game.html')[0])
league_data_11["year"] = [2011 for k in range(0,league_data_11.GS.count())]
league_data_12 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2012_per_game.html')[0])
league_data_12["year"] = [2012 for k in range(0,league_data_12.GS.count())]
league_data_13 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2013_per_game.html')[0])
league_data_13["year"] = [2013 for k in range(0,league_data_13.GS.count())]
league_data_14 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2014_per_game.html')[0])
league_data_14["year"] = [2014 for k in range(0,league_data_14.GS.count())]
league_data_15 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2015_per_game.html')[0])
league_data_15["year"] = [2015 for k in range(0,league_data_15.GS.count())]
league_data_16 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2016_per_game.html')[0])
league_data_16["year"] = [2016 for k in range(0,league_data_16.GS.count())]
league_data_17 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2017_per_game.html')[0])
league_data_17["year"] = [2017 for k in range(0,league_data_17.GS.count())]
league_data_18 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2018_per_game.html')[0])
league_data_18["year"] = [2018 for k in range(0,league_data_18.GS.count())]
league_data_19 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2019_per_game.html')[0])
league_data_19["year"] = [2019 for k in range(0,league_data_19.GS.count())]
league_data_20 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2020_per_game.html')[0])
league_data_20["year"] = [2020 for k in range(0,league_data_20.GS.count())]
league_data_03 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2003_per_game.html')[0])
league_data_03["year"] = [2003 for k in range(0,league_data_03.GS.count())]
league_data_02 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_2002_per_game.html')[0])
league_data_02["year"] = [2002 for k in range(0,league_data_02.GS.count())]
league_data_98 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1998_per_game.html')[0])
league_data_98["year"] = [1998 for k in range(0,league_data_98.GS.count())]
league_data_97 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1997_per_game.html')[0])
league_data_97["year"] = [1997 for k in range(0,league_data_97.GS.count())]
league_data_96 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1996_per_game.html')[0])
league_data_96["year"] = [1996 for k in range(0,league_data_96.GS.count())]
league_data_93 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1993_per_game.html')[0])
league_data_93["year"] = [1993 for k in range(0,league_data_93.GS.count())]
league_data_92 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1992_per_game.html')[0])
league_data_92["year"] = [1992 for k in range(0,league_data_92.GS.count())]
league_data_91 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1991_per_game.html')[0])
league_data_91["year"] = [1991 for k in range(0,league_data_91.GS.count())]
league_data_90 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1990_per_game.html')[0])
league_data_90["year"] = [1990 for k in range(0,league_data_90.GS.count())]
league_data_89 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1989_per_game.html')[0])
league_data_89["year"] = [1989 for k in range(0,league_data_89.GS.count())]
league_data_88 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1988_per_game.html')[0])
league_data_88["year"] = [1988 for k in range(0,league_data_88.GS.count())]
league_data_87 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1987_per_game.html')[0])
league_data_87["year"] = [1987 for k in range(0,league_data_87.GS.count())]
league_data_85 = pd.DataFrame(pd.read_html('https://www.basketball-reference.com/leagues/NBA_1985_per_game.html')[0])
league_data_85["year"] = [1985 for k in range(0,league_data_85.GS.count())]
league_data_jordan = [league_data_85, league_data_87, league_data_88, league_data_89, league_data_90, league_data_91, league_data_92, league_data_93, league_data_96, league_data_97, league_data_98, league_data_02, league_data_03]
league_data_james = [league_data_04, league_data_05, league_data_06, league_data_07, league_data_08, league_data_09, league_data_10, league_data_11, league_data_12, league_data_13, league_data_14, league_data_15, league_data_16, league_data_17, league_data_18, league_data_19, league_data_20]
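The per-season cells above all follow the same pattern, so they could be collapsed into a loop. The following is only a sketch of that idea, not the code used for this project: `season_url`, `tag_with_year`, and `load_seasons` are hypothetical helper names, and the `fetch` parameter is an assumption I added so the download step can be swapped out.

```python
import pandas as pd

# URL pattern taken from the cells above; only the year changes.
BASE_URL = "https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html"

def season_url(year):
    """Build the per-game stats URL for the season ending in `year`."""
    return BASE_URL.format(year=year)

def tag_with_year(table, year):
    """Return a copy of `table` with a constant `year` column added."""
    return table.assign(year=year)

def load_seasons(years, fetch=lambda url: pd.read_html(url)[0]):
    """Fetch each season's table and tag it with its year (hypothetical helper)."""
    return {year: tag_with_year(fetch(season_url(year)), year) for year in years}

# Seasons used in this project (1986 and 1995 are skipped for Jordan).
jordan_years = [1985, *range(1987, 1994), 1996, 1997, 1998, 2002, 2003]
james_years = list(range(2004, 2021))
```

With this in place, `list(load_seasons(jordan_years).values())` would play the role of `league_data_jordan`, and building the URL from the year avoids copy-paste mistakes in the addresses.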
Make new DataFrame containing career data for regular season and playoffs, remove unnecessary rows (either nan or not enough games), and get rid of extra columns. I chose to just stick with the basic stats of points, assists, rebounds, blocks, steals, and turnovers.
#get row with career stats
jordan_career = jordan.loc[jordan.Season == 'Career']
#get playoff career stats
jordan_po_career = jordan_po.loc[jordan_po.Season == 'Career']
#removing rows that have unnecessary data (not enough games played or no data)
jordan.drop(index=[1,9,10,14,15,16,19,20,21,22], inplace=True)
jordan_po.drop(index=[1,9,13,14,15],inplace=True)
#drop extra columns, only working with Points, assists, rebounds, steals, blocks, turnovers
jordan.drop(columns = ['Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
jordan_career.drop(columns = ['Season','Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
jordan_po.drop(columns = ['Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
jordan_po_career.drop(columns = ['Season','Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
In the next few cells, I convert the data within the columns to numerical values to make future comparisons and manipulation easier.
jordan.AST = pd.to_numeric(jordan.AST)
jordan.STL = pd.to_numeric(jordan.STL)
jordan.BLK = pd.to_numeric(jordan.BLK)
jordan.PTS = pd.to_numeric(jordan.PTS)
jordan.TOV = pd.to_numeric(jordan.TOV)
jordan.TRB = pd.to_numeric(jordan.TRB)
jordan_po.AST = pd.to_numeric(jordan_po.AST)
jordan_po.STL = pd.to_numeric(jordan_po.STL)
jordan_po.BLK = pd.to_numeric(jordan_po.BLK)
jordan_po.PTS = pd.to_numeric(jordan_po.PTS)
jordan_po.TOV = pd.to_numeric(jordan_po.TOV)
jordan_po.TRB = pd.to_numeric(jordan_po.TRB)
jordan_career.AST = pd.to_numeric(jordan_career.AST)
jordan_career.STL = pd.to_numeric(jordan_career.STL)
jordan_career.BLK = pd.to_numeric(jordan_career.BLK)
jordan_career.PTS = pd.to_numeric(jordan_career.PTS)
jordan_career.TOV = pd.to_numeric(jordan_career.TOV)
jordan_career.TRB = pd.to_numeric(jordan_career.TRB)
jordan_po_career.AST = pd.to_numeric(jordan_po_career.AST)
jordan_po_career.STL = pd.to_numeric(jordan_po_career.STL)
jordan_po_career.BLK = pd.to_numeric(jordan_po_career.BLK)
jordan_po_career.PTS = pd.to_numeric(jordan_po_career.PTS)
jordan_po_career.TOV = pd.to_numeric(jordan_po_career.TOV)
jordan_po_career.TRB = pd.to_numeric(jordan_po_career.TRB)
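The column-by-column conversions above can also be written as a single step per DataFrame. This is a minimal sketch under the assumption that every listed column holds strings that parse cleanly as numbers; `to_numeric_columns` and `STAT_COLUMNS` are names I made up for illustration, and the demo rows simply reuse two of Jordan's seasons.

```python
import pandas as pd

# The six stats this project works with (column names as scraped).
STAT_COLUMNS = ["TRB", "AST", "STL", "BLK", "TOV", "PTS"]

def to_numeric_columns(df, columns=STAT_COLUMNS):
    """Return a copy of `df` with the given columns converted to numbers."""
    out = df.copy()
    out[columns] = out[columns].apply(pd.to_numeric)
    return out

# Small stand-in table with string values, as they come out of the scrape.
demo = pd.DataFrame({
    "Season": ["1984-85", "1986-87"],
    "TRB": ["6.5", "5.2"], "AST": ["5.9", "4.6"], "STL": ["2.4", "2.9"],
    "BLK": ["0.8", "1.5"], "TOV": ["3.5", "3.3"], "PTS": ["28.2", "37.1"],
})
converted = to_numeric_columns(demo)
```

Returning a copy instead of assigning in place also sidesteps the chained-assignment warnings that the `warnings.filterwarnings('ignore')` call was hiding.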
Replace season string with integer of year season ended. This step isn't totally necessary. I made this change so that it would match the year column I added to the league data.
jordan.replace({
'Season':{'1984-85':1985,'1986-87':1987,'1987-88':1988,'1988-89':1989,'1989-90':1990,'1990-91':1991,'1991-92':1992,'1992-93':1993,'1995-96':1996,'1996-97':1997,'1997-98':1998,'2001-02':2002,'2002-03':2003}
},inplace=True)
jordan.reset_index(drop=True)
| | Season | Age | Pos | GS | TRB | AST | STL | BLK | TOV | PTS |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1985 | 21.0 | SG | 82 | 6.5 | 5.9 | 2.4 | 0.8 | 3.5 | 28.2 |
1 | 1987 | 23.0 | SG | 82 | 5.2 | 4.6 | 2.9 | 1.5 | 3.3 | 37.1 |
2 | 1988 | 24.0 | SG | 82 | 5.5 | 5.9 | 3.2 | 1.6 | 3.1 | 35.0 |
3 | 1989 | 25.0 | SG | 81 | 8.0 | 8.0 | 2.9 | 0.8 | 3.6 | 32.5 |
4 | 1990 | 26.0 | SG | 82 | 6.9 | 6.3 | 2.8 | 0.7 | 3.0 | 33.6 |
5 | 1991 | 27.0 | SG | 82 | 6.0 | 5.5 | 2.7 | 1.0 | 2.5 | 31.5 |
6 | 1992 | 28.0 | SG | 80 | 6.4 | 6.1 | 2.3 | 0.9 | 2.5 | 30.1 |
7 | 1993 | 29.0 | SG | 78 | 6.7 | 5.5 | 2.8 | 0.8 | 2.7 | 32.6 |
8 | 1996 | 32.0 | SG | 82 | 6.6 | 4.3 | 2.2 | 0.5 | 2.4 | 30.4 |
9 | 1997 | 33.0 | SG | 82 | 5.9 | 4.3 | 1.7 | 0.5 | 2.0 | 29.6 |
10 | 1998 | 34.0 | SG | 82 | 5.8 | 3.5 | 1.7 | 0.5 | 2.3 | 28.7 |
11 | 2002 | 38.0 | SF | 53 | 5.7 | 5.2 | 1.4 | 0.4 | 2.7 | 22.9 |
12 | 2003 | 39.0 | SF | 67 | 6.1 | 3.8 | 1.5 | 0.5 | 2.1 | 20.0 |
jordan_po.replace({
'Season':{'1984-85':1985,'1986-87':1987,'1987-88':1988,'1988-89':1989,'1989-90':1990,'1990-91':1991,'1991-92':1992,'1992-93':1993,'1995-96':1996,'1996-97':1997,'1997-98':1998}
},inplace=True)
jordan_po.reset_index(drop=True)
| | Season | Age | Pos | GS | TRB | AST | STL | BLK | TOV | PTS |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1985 | 21.0 | SG | 4.0 | 5.8 | 8.5 | 2.8 | 1.0 | 3.8 | 29.3 |
1 | 1987 | 23.0 | SG | 3.0 | 7.0 | 6.0 | 2.0 | 2.3 | 2.7 | 35.7 |
2 | 1988 | 24.0 | SG | 10.0 | 7.1 | 4.7 | 2.4 | 1.1 | 3.9 | 36.3 |
3 | 1989 | 25.0 | SG | 17.0 | 7.0 | 7.6 | 2.5 | 0.8 | 4.0 | 34.8 |
4 | 1990 | 26.0 | SG | 16.0 | 7.2 | 6.8 | 2.8 | 0.9 | 3.5 | 36.7 |
5 | 1991 | 27.0 | SG | 17.0 | 6.4 | 8.4 | 2.4 | 1.4 | 2.5 | 31.1 |
6 | 1992 | 28.0 | SG | 22.0 | 6.2 | 5.8 | 2.0 | 0.7 | 3.7 | 34.5 |
7 | 1993 | 29.0 | SG | 19.0 | 6.7 | 6.0 | 2.1 | 0.9 | 2.4 | 35.1 |
8 | 1996 | 32.0 | SG | 18.0 | 4.9 | 4.1 | 1.8 | 0.3 | 2.3 | 30.7 |
9 | 1997 | 33.0 | SG | 19.0 | 7.9 | 4.8 | 1.6 | 0.9 | 2.6 | 31.1 |
10 | 1998 | 34.0 | SG | 21.0 | 5.1 | 3.5 | 1.5 | 0.6 | 2.1 | 32.4 |
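The season-to-year dictionaries above are typed out by hand. A small function could derive the ending year from the season string instead. This is a sketch that assumes every label has the form `'YYYY-YY'`; `season_end_year` is a hypothetical name, not part of pandas.

```python
def season_end_year(season):
    """Convert a label like '1984-85' to the year the season ended (1985)."""
    start_year = int(season[:4])
    # A season always ends in the calendar year after it starts,
    # which also handles century rollovers such as '1999-00'.
    return start_year + 1

examples = {label: season_end_year(label) for label in ["1984-85", "1999-00", "2019-20"]}
```

It could be applied with something like `jordan['Season'].map(season_end_year)`, but only after the non-season rows (such as 'Career') have been dropped, since those labels would fail the integer conversion.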
Repeat the tidying process used on Jordan's data for LeBron's data.
james_career = james.loc[james.Season == 'Career']
james_po_career = james_po.loc[james_po.Season == 'Career']
james.drop(index=[17,18,19,20,21,22],inplace=True)
james_po.drop(index=[14,15,16,17,18],inplace=True)
james.drop(columns = ['Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
james_career.drop(columns = ['Season','Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
james_po.drop(columns = ['Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
james_po_career.drop(columns = ['Season','Team','Lg','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'],inplace=True)
james.replace({
'Season':{'2003-04':2004,'2004-05':2005,'2005-06':2006,'2006-07':2007,'2007-08':2008,'2008-09':2009,'2009-10':2010,'2010-11':2011, '2011-12':2012,'2012-13':2013,'2013-14':2014,'2014-15':2015,'2015-16':2016,'2016-17':2017,'2017-18':2018,'2018-19':2019,'2019-20':2020}
},inplace=True)
james.AST = pd.to_numeric(james.AST)
james.STL = pd.to_numeric(james.STL)
james.BLK = pd.to_numeric(james.BLK)
james.PTS = pd.to_numeric(james.PTS)
james.TOV = pd.to_numeric(james.TOV)
james.TRB = pd.to_numeric(james.TRB)
james_po.replace({
'Season':{'2005-06':2006,'2006-07':2007,'2007-08':2008,'2008-09':2009,'2009-10':2010,'2010-11':2011, '2011-12':2012,'2012-13':2013,'2013-14':2014,'2014-15':2015,'2015-16':2016,'2016-17':2017,'2017-18':2018,'2018-19':2019,'2019-20':2020}
},inplace=True)
james_po.AST = pd.to_numeric(james_po.AST)
james_po.STL = pd.to_numeric(james_po.STL)
james_po.BLK = pd.to_numeric(james_po.BLK)
james_po.PTS = pd.to_numeric(james_po.PTS)
james_po.TOV = pd.to_numeric(james_po.TOV)
james_po.TRB = pd.to_numeric(james_po.TRB)
james_career.AST = pd.to_numeric(james_career.AST)
james_career.STL = pd.to_numeric(james_career.STL)
james_career.BLK = pd.to_numeric(james_career.BLK)
james_career.PTS = pd.to_numeric(james_career.PTS)
james_career.TOV = pd.to_numeric(james_career.TOV)
james_career.TRB = pd.to_numeric(james_career.TRB)
james_po_career.AST = pd.to_numeric(james_po_career.AST)
james_po_career.STL = pd.to_numeric(james_po_career.STL)
james_po_career.BLK = pd.to_numeric(james_po_career.BLK)
james_po_career.PTS = pd.to_numeric(james_po_career.PTS)
james_po_career.TOV = pd.to_numeric(james_po_career.TOV)
james_po_career.TRB = pd.to_numeric(james_po_career.TRB)
jordan.rename(columns={'Season':'year','GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
jordan_po.rename(columns={'Season':'year','GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
jordan_career.rename(columns={'GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
jordan_po_career.rename(columns={'GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
james.rename(columns={'Season':'year','GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
james_po.rename(columns={'Season':'year','GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
james_career.rename(columns={'GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
james_po_career.rename(columns={'GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
Cleaning up data for the entire league. This includes removing unnecessary rows and columns, changing data to be numerical values, and getting rid of players who started fewer than half of the regular season games. This means that the only remaining players are ones who had a large enough sample of games played in any given season.
starters_mj = []
for df in league_data_jordan:
    #remove rows whose values are just repeats of the column names
    #(.copy() so the changes below are made on an independent DataFrame)
    df = df.loc[df.Rk.str.contains('Rk')==False].copy()
    #convert columns of chosen data to numbers
    df.GS = pd.to_numeric(df.GS)
    df.TRB = pd.to_numeric(df.TRB)
    df.AST = pd.to_numeric(df.AST)
    df.STL = pd.to_numeric(df.STL)
    df.BLK = pd.to_numeric(df.BLK)
    df.PTS = pd.to_numeric(df.PTS)
    df.TOV = pd.to_numeric(df.TOV)
    #drop unwanted columns
    df.drop(columns=['Rk','G','Tm','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'], inplace=True)
    #For players who were on multiple teams, I only kept the data from the first instance of their name in the list.
    df.drop_duplicates(subset = 'Player', inplace=True)
    #keep only players who started at least half of the 82 regular season games
    df = df.loc[df.GS >= 41].reset_index(drop=True)
    df.rename(columns={'GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
    starters_mj.append(df)
starters_leb = []
for df in league_data_james:
    #remove rows whose values are just repeats of the column names
    #(.copy() so the changes below are made on an independent DataFrame)
    df = df.loc[df.Rk.str.contains('Rk')==False].copy()
    #convert columns of chosen data to numbers
    df.GS = pd.to_numeric(df.GS)
    df.TRB = pd.to_numeric(df.TRB)
    df.AST = pd.to_numeric(df.AST)
    df.STL = pd.to_numeric(df.STL)
    df.BLK = pd.to_numeric(df.BLK)
    df.PTS = pd.to_numeric(df.PTS)
    df.TOV = pd.to_numeric(df.TOV)
    #drop unwanted columns
    df.drop(columns=['Rk','Tm','G','MP','FG','FGA','3P','3PA','3P%','2P','2PA','2P%','FT','FTA','FT%','ORB','DRB','PF','FG%','eFG%'], inplace=True)
    #For players who were on multiple teams, I only kept the data from the first instance of their name in the list.
    df.drop_duplicates(subset = 'Player', inplace=True, ignore_index=True)
    #keep only players who started at least half of the 82 regular season games
    df = df.loc[df.GS >= 41].reset_index(drop=True)
    df.rename(columns={'GS':'games_started','TRB':'rebounds','AST':'assists','STL':'steals','BLK':'blocks','TOV':'turnovers',"PTS":"points"},inplace=True)
    starters_leb.append(df)
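Since the two loops above do the same work, the cleaning steps could live in one function that both lists pass through. The sketch below is an illustration only: `clean_season`, `STATS`, `KEEP`, and `RENAMES` are hypothetical names, it selects the columns to keep rather than listing the ones to drop, and the demo table is made-up data rather than real league stats.

```python
import pandas as pd

STATS = ["GS", "TRB", "AST", "STL", "BLK", "TOV", "PTS"]
KEEP = ["Player", "Pos", "Age", "year", *STATS]
RENAMES = {"GS": "games_started", "TRB": "rebounds", "AST": "assists",
           "STL": "steals", "BLK": "blocks", "TOV": "turnovers", "PTS": "points"}

def clean_season(df, min_starts=41):
    """Apply the cleaning steps from the loops above to one season's table."""
    # Drop rows that merely repeat the column names.
    df = df.loc[df["Rk"].astype(str) != "Rk"].copy()
    # Convert the chosen stats from strings to numbers.
    df[STATS] = df[STATS].apply(pd.to_numeric)
    # Keep the first row per player, then only the regular starters.
    df = df[KEEP].drop_duplicates(subset="Player")
    df = df.loc[df["GS"] >= min_starts].reset_index(drop=True)
    return df.rename(columns=RENAMES)

# Made-up rows: player A appears twice, row 3 is a repeated header, B rarely starts.
raw = pd.DataFrame({
    "Rk": ["1", "2", "Rk", "3"], "Player": ["A", "A", "Player", "B"],
    "Pos": ["SG", "SG", "Pos", "C"], "Age": ["25", "25", "Age", "30"],
    "GS": ["60", "30", "GS", "10"], "TRB": ["5.0", "4.0", "TRB", "9.0"],
    "AST": ["6.0", "5.0", "AST", "1.0"], "STL": ["1.5", "1.0", "STL", "0.5"],
    "BLK": ["0.5", "0.4", "BLK", "2.0"], "TOV": ["3.0", "2.5", "TOV", "1.0"],
    "PTS": ["25.0", "20.0", "PTS", "8.0"], "year": [2004, 2004, 2004, 2004],
})
cleaned = clean_season(raw)
```

With a helper like this, each list would reduce to something like `starters_mj = [clean_season(df) for df in league_data_jordan]`.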
Add the individual DataFrames for each year to a single one containing information on the entire league.
#pd.concat replaces DataFrame.append, which was removed in pandas 2.0
ld_mj = pd.concat(starters_mj, ignore_index=True)
ld_leb = pd.concat(starters_leb, ignore_index=True)
We now have 4 DataFrames each for LeBron and Jordan: two with their stats for each season (regular season and playoffs), and two with their career stats (regular season and playoffs). In addition, there are 2 more DataFrames that cover the stats of the rest of the league in each season that they played.
All of the season DataFrames contain the following data: year, age, position, games started, rebounds, assists, steals, blocks, turnovers, and points.
The data for the whole league also includes the name of the players, and the career stats don't include year, age, or position.
Now that the data is in an easy to understand format, we can move on to creating charts of the data. These charts can help to provide evidence of which player is better.
In the graphs below, both players will be represented by the colors of the team they spent the most time on. For LeBron, I'm using the maroon and gold of the Cleveland Cavaliers, and for Jordan I'm using the red and black of the Chicago Bulls. When comparing data on both players in the same chart, I will use the gold for LeBron and red for Jordan.
In this project I will consider all stats examined to be relatively equivalent. By this I mean that I will consider a .5% difference in rebounds per game the same as .5% difference in points per game, steals, or any of the other stats.
For a player to be considered the GOAT, they need to play at a much higher level than the rest of the league. So the first charts will compare LeBron and MJ to the average of the starters in each year to see how they compare.
leb_yrs = james.year.values
#get average stats for starters (41 games started)
ld_leb_avg = ld_leb.groupby(['year']).agg({'points':'mean','rebounds':'mean','assists':'mean','steals':'mean','blocks':'mean','turnovers':'mean'})
#create DataFrame for each of the stats -> allows for easier comparing of individual stats to league averages
leb_ppg = pd.DataFrame({'LeBron':james.points.values,'League Avg': ld_leb_avg.points.values}, index=leb_yrs)
leb_apg = pd.DataFrame({'LeBron':james.assists.values, 'League Avg': ld_leb_avg.assists.values}, index = leb_yrs)
leb_rpg = pd.DataFrame({'LeBron':james.rebounds.values, 'League Avg': ld_leb_avg.rebounds.values}, index = leb_yrs)
leb_bpg = pd.DataFrame({'LeBron':james.blocks.values, 'League Avg': ld_leb_avg.blocks.values}, index = leb_yrs)
leb_spg = pd.DataFrame({'LeBron':james.steals.values, 'League Avg': ld_leb_avg.steals.values}, index = leb_yrs)
leb_tpg = pd.DataFrame({'LeBron':james.turnovers.values, 'League Avg': ld_leb_avg.turnovers.values}, index = leb_yrs)
#graphing player's stats against average
ppg_ax = leb_ppg.plot.bar(rot=60,color=['#860038','#FDBB30'],title='Points Per Game')
apg_ax = leb_apg.plot.bar(rot=60,color=['#860038','#FDBB30'],title='Assists Per Game')
rpg_ax = leb_rpg.plot.bar(rot=60,color=['#860038','#FDBB30'],title='Rebounds Per Game')
bpg_ax = leb_bpg.plot.bar(rot=60,color=['#860038','#FDBB30'],title='Blocks Per Game')
spg_ax = leb_spg.plot.bar(rot=60,color=['#860038','#FDBB30'],title='Steals Per Game')
tpg_ax = leb_tpg.plot.bar(rot=60,color=['#860038','#FDBB30'],title='Turnovers Per Game')
Better than Average in:
In this section I used the same exact process as the previous section, just on the data for Michael Jordan
mj_yrs = jordan.year.values
ld_mj_avg = ld_mj.groupby(['year']).agg({'points':'mean','rebounds':'mean','assists':'mean','steals':'mean','blocks':'mean','turnovers':'mean'})
mj_ppg = pd.DataFrame({'Jordan':jordan.points.values,'League Avg': ld_mj_avg.points.values}, index=mj_yrs)
mj_apg = pd.DataFrame({'Jordan':jordan.assists.values, 'League Avg': ld_mj_avg.assists.values}, index = mj_yrs)
mj_rpg = pd.DataFrame({'Jordan':jordan.rebounds.values, 'League Avg': ld_mj_avg.rebounds.values}, index = mj_yrs)
mj_bpg = pd.DataFrame({'Jordan':jordan.blocks.values, 'League Avg': ld_mj_avg.blocks.values}, index = mj_yrs)
mj_spg = pd.DataFrame({'Jordan':jordan.steals.values, 'League Avg': ld_mj_avg.steals.values}, index = mj_yrs)
mj_tpg = pd.DataFrame({'Jordan':jordan.turnovers.values, 'League Avg': ld_mj_avg.turnovers.values}, index = mj_yrs)
mj_ppg_ax = mj_ppg.plot.bar(rot=60,color=['#CE1141','black'],title='Points Per Game')
mj_apg_ax = mj_apg.plot.bar(rot=60,color=['#CE1141','black'],title='Assists Per Game')
mj_rpg_ax = mj_rpg.plot.bar(rot=60,color=['#CE1141','black'],title='Rebounds Per Game')
mj_bpg_ax = mj_bpg.plot.bar(rot=60,color=['#CE1141','black'],title='Blocks Per Game')
mj_spg_ax = mj_spg.plot.bar(rot=60,color=['#CE1141','black'],title='Steals Per Game')
mj_tpg_ax = mj_tpg.plot.bar(rot=60,color=['#CE1141','black'],title='Turnovers Per Game')
Better than Average in:
Based on these results, LeBron James has been slightly more dominant over the competition throughout his career than Michael Jordan.
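One way to put a number on "more dominant" is to count, for each stat, the seasons in which the player beat the league average. The sketch below is not the method used to reach the conclusion above; `seasons_better_than_average` is a hypothetical helper, it assumes both tables are indexed by year with matching stat columns, and the demo numbers are invented.

```python
import pandas as pd

def seasons_better_than_average(player, league_avg, lower_is_better=("turnovers",)):
    """Count, per stat, the seasons in which the player beat the league average."""
    # Keep only the years and stats present in both tables.
    player, league_avg = player.align(league_avg, join="inner")
    better = player > league_avg
    # For stats like turnovers, a smaller number is the better one.
    for stat in lower_is_better:
        if stat in better.columns:
            better[stat] = player[stat] < league_avg[stat]
    return better.sum()

# Invented two-season example.
player_demo = pd.DataFrame({"points": [30.0, 20.0], "turnovers": [3.0, 2.0]},
                           index=[2004, 2005])
league_demo = pd.DataFrame({"points": [15.0, 25.0], "turnovers": [2.5, 2.5]},
                           index=[2004, 2005])
counts = seasons_better_than_average(player_demo, league_demo)
```

Because the two players had careers of different lengths, dividing each count by the number of seasons would make the results easier to compare.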
An important characteristic of the GOAT is stepping up when it matters most: the playoffs. Not only are there much higher stakes, but the level of competition continuously increases, as only the best teams are playing. In order to see how the most important and difficult games impact the performance of these players, we will look at the percentage improvement from regular season to playoff stats.
#create DataFrame with career regular season and playoff data as columns
lj = pd.DataFrame({
'Regular Season':[james_career.points.values[0],james_career.assists.values[0],james_career.rebounds.values[0],james_career.blocks.values[0],james_career.steals.values[0],james_career.turnovers.values[0]],
'Playoffs':[james_po_career.points.values[0],james_po_career.assists.values[0],james_po_career.rebounds.values[0],james_po_career.blocks.values[0],james_po_career.steals.values[0],james_po_career.turnovers.values[0]]
},index = ['Points','Assists','Rebounds','Blocks','Steals','Turnovers'])
lj
| | Regular Season | Playoffs |
---|---|---|
Points | 27.0 | 28.8 |
Assists | 7.4 | 7.2 |
Rebounds | 7.4 | 9.0 |
Blocks | 0.8 | 1.0 |
Steals | 1.6 | 1.7 |
Turnovers | 3.5 | 3.7 |
I don't take the absolute value below because I care about which of the values is greater.
for row in lj.iterrows():
    #get difference
    dif = row[1]['Playoffs'] - row[1]['Regular Season']
    #get average
    avg = (row[1]['Playoffs'] + row[1]['Regular Season'])/2.0
    #a higher number of turnovers is bad, so negate the average to flip the sign of the result
    if row[0] == 'Turnovers':
        avg = avg*(-1)
    lj.at[row[0],'Percentage Improvement'] = 100.0*dif/avg
lj
| | Regular Season | Playoffs | Percentage Improvement |
---|---|---|---|
Points | 27.0 | 28.8 | 6.451613 |
Assists | 7.4 | 7.2 | -2.739726 |
Rebounds | 7.4 | 9.0 | 19.512195 |
Blocks | 0.8 | 1.0 | 22.222222 |
Steals | 1.6 | 1.7 | 6.060606 |
Turnovers | 3.5 | 3.7 | -5.555556 |
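The same calculation can be done on whole columns at once instead of row by row. This is a sketch of that alternative, assuming a table laid out like the one above; `add_percentage_improvement` is a hypothetical name, and the demo values are LeBron's career numbers from the table.

```python
import pandas as pd

def add_percentage_improvement(table, lower_is_better=("Turnovers",)):
    """Return a copy of `table` with a signed 'Percentage Improvement' column."""
    out = table.copy()
    diff = out["Playoffs"] - out["Regular Season"]
    mean = (out["Playoffs"] + out["Regular Season"]) / 2.0
    pct = 100.0 * diff / mean
    # Flip the sign where an increase (e.g. more turnovers) is a bad thing.
    flip = out.index.isin(lower_is_better)
    out["Percentage Improvement"] = pct.where(~flip, -pct)
    return out

lj_demo = pd.DataFrame(
    {"Regular Season": [27.0, 7.4, 7.4, 0.8, 1.6, 3.5],
     "Playoffs": [28.8, 7.2, 9.0, 1.0, 1.7, 3.7]},
    index=["Points", "Assists", "Rebounds", "Blocks", "Steals", "Turnovers"],
)
lj_result = add_percentage_improvement(lj_demo)
```

Dividing by the average of the two values rather than by the regular season value keeps the measure symmetric, which matches the percentage difference used later for the head-to-head comparison.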
lj_ax = lj.plot.bar(y=['Percentage Improvement'],color=['#FDBB30'],rot=0)
This shows that in the playoffs, LeBron does better in 4 of the 6 stats we are looking at. The 2 stats in which he does worse, assists and turnovers, are only slightly worse than his regular season numbers, while the improved stats see a more drastic change.
Percentage Improvement
mj = pd.DataFrame({
'Regular Season':[jordan_career.points.values[0],jordan_career.assists.values[0],jordan_career.rebounds.values[0],jordan_career.blocks.values[0],jordan_career.steals.values[0],jordan_career.turnovers.values[0]],
'Playoffs':[jordan_po_career.points.values[0],jordan_po_career.assists.values[0],jordan_po_career.rebounds.values[0],jordan_po_career.blocks.values[0],jordan_po_career.steals.values[0],jordan_po_career.turnovers.values[0]]
},index = ['Points','Assists','Rebounds','Blocks','Steals','Turnovers'])
mj
| | Regular Season | Playoffs |
---|---|---|
Points | 30.1 | 33.4 |
Assists | 5.3 | 5.7 |
Rebounds | 6.2 | 6.4 |
Blocks | 0.8 | 0.9 |
Steals | 2.3 | 2.1 |
Turnovers | 2.7 | 3.1 |
for row in mj.iterrows():
    dif = row[1]['Playoffs'] - row[1]['Regular Season']
    avg = (row[1]['Playoffs'] + row[1]['Regular Season'])/2.0
    if row[0] == 'Turnovers':
        avg = avg*(-1)
    mj.at[row[0],'Percentage Improvement'] = 100.0*dif/avg
mj
| | Regular Season | Playoffs | Percentage Improvement |
---|---|---|---|
Points | 30.1 | 33.4 | 10.393701 |
Assists | 5.3 | 5.7 | 7.272727 |
Rebounds | 6.2 | 6.4 | 3.174603 |
Blocks | 0.8 | 0.9 | 11.764706 |
Steals | 2.3 | 2.1 | -9.090909 |
Turnovers | 2.7 | 3.1 | -13.793103 |
mj_ax = mj.plot.bar(y=['Percentage Improvement'],color=['#CE1141'],rot=0)
Like LeBron, Jordan also does better in 4 of the 6 stats we are looking at. However the improvements by Jordan aren't as significant as LeBron's, while the negative changes are more pronounced.
Percentage Improvement
Now, let's compare their stats head to head. Because they have played a different number of seasons, we can't compare their performances on a season by season basis. Instead, we'll compare their career stats from both the regular season and the playoffs. We will also compare the differences between their regular season and playoff stats.
First we'll compare their career stats in the regular season.
lj_mj = pd.DataFrame({
'Jordan':[jordan_career.points.values[0],jordan_career.assists.values[0],jordan_career.rebounds.values[0],jordan_career.blocks.values[0],jordan_career.steals.values[0],jordan_career.turnovers.values[0]],
'James':[james_career.points.values[0],james_career.assists.values[0],james_career.rebounds.values[0],james_career.blocks.values[0],james_career.steals.values[0],james_career.turnovers.values[0]]
},index=['Points','Assists','Rebounds','Blocks','Steals','Turnovers'])
lj_mj
| | Jordan | James |
---|---|---|
Points | 30.1 | 27.0 |
Assists | 5.3 | 7.4 |
Rebounds | 6.2 | 7.4 |
Blocks | 0.8 | 0.8 |
Steals | 2.3 | 1.6 |
Turnovers | 2.7 | 3.5 |
Because the values for each category are so different, simply comparing the absolute difference between the two values wouldn't give an accurate picture of the scenario. For example, a difference of 3 in points, where both players are close to 30, is far less significant than a difference of 3 in rebounds or assists, where the totals are much lower. To get around this issue, we can instead look at the percentage difference for each of the stats.
Percentage difference is calculated by taking the absolute value of the difference of the two values, and dividing by the average of the two numbers. An example using the data for Points: |30.1 - 27.0| / [(30.1 + 27.0) / 2] x 100% = 10.86%
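As a quick check of the formula, here it is as a small standalone function. This is just a sketch: `percentage_difference` is a name I chose for illustration, and it assumes the two values are not both zero, since the average would then be zero.

```python
def percentage_difference(a, b):
    """Absolute difference of a and b as a percentage of their average."""
    average = (a + b) / 2.0
    return 100.0 * abs(a - b) / average

# Career regular season points per game: Jordan 30.1, LeBron 27.0.
points_diff = percentage_difference(30.1, 27.0)
# Career regular season blocks per game are identical, so the result is zero.
blocks_diff = percentage_difference(0.8, 0.8)
```

The measure is symmetric, so swapping the two players gives the same result. That is why it cannot say who is ahead in a stat, only how far apart the two values are.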
for row in lj_mj.iterrows():
    dif = abs(row[1].Jordan - row[1].James)
    avg = (row[1].Jordan + row[1].James)/2.0
    lj_mj.at[row[0],'Percentage Difference'] = 100.0*dif/avg
lj_mj
| | Jordan | James | Percentage Difference |
---|---|---|---|
Points | 30.1 | 27.0 | 10.858144 |
Assists | 5.3 | 7.4 | 33.070866 |
Rebounds | 6.2 | 7.4 | 17.647059 |
Blocks | 0.8 | 0.8 | 0.000000 |
Steals | 2.3 | 1.6 | 35.897436 |
Turnovers | 2.7 | 3.5 | 25.806452 |
plt.bar(x = lj_mj.index, height = lj_mj['Percentage Difference'],color = ['#CE1141','#FDBB30','#FDBB30','#CE1141','#CE1141','#CE1141'])
plt.title("Regular Season Percentage Differences")
plt.show()
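The bar colors in the chart above are listed by hand to show who leads each stat. They could instead be derived from the table. The following is a sketch only: `leader_colors` is a hypothetical helper, and using gray for a tie is my own assumption (the chart above colors the tied Blocks bar red).

```python
import pandas as pd

JORDAN_RED = "#CE1141"
JAMES_GOLD = "#FDBB30"
TIE_GRAY = "gray"  # assumption: a neutral color when neither player leads

def leader_colors(table, lower_is_better=("Turnovers",)):
    """Return one color per row of `table`, matching the player who leads that stat."""
    colors = []
    for stat, row in table.iterrows():
        jordan, james = row["Jordan"], row["James"]
        if jordan == james:
            colors.append(TIE_GRAY)
            continue
        # For turnovers the smaller value wins, for everything else the larger.
        jordan_leads = jordan < james if stat in lower_is_better else jordan > james
        colors.append(JORDAN_RED if jordan_leads else JAMES_GOLD)
    return colors

regular_season = pd.DataFrame(
    {"Jordan": [30.1, 5.3, 6.2, 0.8, 2.3, 2.7],
     "James": [27.0, 7.4, 7.4, 0.8, 1.6, 3.5]},
    index=["Points", "Assists", "Rebounds", "Blocks", "Steals", "Turnovers"],
)
bar_colors = leader_colors(regular_season)
```

The resulting list could be passed as the `color` argument of `plt.bar`, so the same function would work unchanged for the playoff comparison.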
Based on the graph above, it seems pretty clear that Michael Jordan has LeBron James beat in terms of regular season stats. Not only does he have better stats in 3 of the 6 categories, the highest percentage difference is in steals, a stat where he has the advantage.
Percentage Difference
Here we will repeat the process used for the regular season using their career playoff stats.
# Career playoff averages for both players, one row per stat
stat_cols = ['points', 'assists', 'rebounds', 'blocks', 'steals', 'turnovers']
lj_mj_po = pd.DataFrame({
    'Jordan': [jordan_po_career[col].values[0] for col in stat_cols],
    'James': [james_po_career[col].values[0] for col in stat_cols],
}, index=['Points', 'Assists', 'Rebounds', 'Blocks', 'Steals', 'Turnovers'])
lj_mj_po
Stat | Jordan | James |
---|---|---|
Points | 33.4 | 28.8 |
Assists | 5.7 | 7.2 |
Rebounds | 6.4 | 9.0 |
Blocks | 0.9 | 1.0 |
Steals | 2.1 | 1.7 |
Turnovers | 3.1 | 3.7 |
# Same percentage-difference calculation, applied to the playoff averages
for stat, row in lj_mj_po.iterrows():
    dif = abs(row.Jordan - row.James)
    avg = (row.Jordan + row.James) / 2.0
    lj_mj_po.at[stat, 'Percentage Difference'] = 100.0 * dif / avg
lj_mj_po
Stat | Jordan | James | Percentage Difference |
---|---|---|---|
Points | 33.4 | 28.8 | 14.790997 |
Assists | 5.7 | 7.2 | 23.255814 |
Rebounds | 6.4 | 9.0 | 33.766234 |
Blocks | 0.9 | 1.0 | 10.526316 |
Steals | 2.1 | 1.7 | 21.052632 |
Turnovers | 3.1 | 3.7 | 17.647059 |
plt.bar(x = lj_mj_po.index, height = lj_mj_po['Percentage Difference'],color = ['#CE1141','#FDBB30','#FDBB30','#FDBB30','#CE1141','#CE1141'])
plt.title("Playoff Percentage Differences")
plt.show()
After shifting the focus from regular season stats to postseason stats, LeBron takes the lead in terms of overall percentage difference. Jordan's advantage in the regular season was larger than LeBron's lead in the playoffs, but I think it is fair to say that the importance of the playoffs at least balances out these results.
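One way to put a number on "overall percentage difference" is to give each stat a sign (positive when Jordan leads, negative when LeBron leads) and add them up. This is only my reading of the comparison, not a calculation the notebook itself performs. It weights every stat equally and treats fewer turnovers as better, both of which are assumptions. The function names are hypothetical.

```python
def signed_difference(jordan_value, james_value, lower_is_better=False):
    """Percentage difference, positive when Jordan leads, negative when James leads."""
    diff = 100.0 * (jordan_value - james_value) / ((jordan_value + james_value) / 2.0)
    return -diff if lower_is_better else diff

# (Jordan, James) career averages from the two tables in this section.
regular = {"Points": (30.1, 27.0), "Assists": (5.3, 7.4), "Rebounds": (6.2, 7.4),
           "Blocks": (0.8, 0.8), "Steals": (2.3, 1.6), "Turnovers": (2.7, 3.5)}
playoffs = {"Points": (33.4, 28.8), "Assists": (5.7, 7.2), "Rebounds": (6.4, 9.0),
            "Blocks": (0.9, 1.0), "Steals": (2.1, 1.7), "Turnovers": (3.1, 3.7)}

def net_advantage(table):
    """Sum the signed differences; turnovers are flipped because fewer is better."""
    return sum(
        signed_difference(mj_val, lj_val, lower_is_better=(stat == "Turnovers"))
        for stat, (mj_val, lj_val) in table.items()
    )

print(round(net_advantage(regular), 1))   # 21.8  (net lead for Jordan)
print(round(net_advantage(playoffs), 1))  # -14.1 (net lead for James)
```

Under these assumptions, Jordan's net regular season lead is larger in magnitude than LeBron's net playoff lead, which is consistent with the comparison made above.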
Here we will compare how much each player's stats improved from the regular season to the playoffs.
# Side-by-side playoff improvement (in percent) for each player
po_dif = pd.DataFrame({
    'Jordan': mj['Percentage Improvement'],
    'James': lj['Percentage Improvement'],
}, index=['Points', 'Assists', 'Rebounds', 'Blocks', 'Steals', 'Turnovers'])
po_dif
Stat | Jordan | James |
---|---|---|
Points | 10.393701 | 6.451613 |
Assists | 7.272727 | -2.739726 |
Rebounds | 3.174603 | 19.512195 |
Blocks | 11.764706 | 22.222222 |
Steals | -9.090909 | 6.060606 |
Turnovers | -13.793103 | -5.555556 |
po_dif_ax = po_dif.plot.bar(rot=0,color=['#CE1141','#FDBB30'])
This graph makes it pretty clear that when it comes to the biggest games and series, LeBron James steps up more than Michael Jordan.
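The code that builds the `Percentage Improvement` columns is not shown in this section, but the values in the table above are consistent with a signed version of the percentage difference: the change from regular season to playoffs divided by the average of the two values, with the sign flipped for turnovers. The sketch below is a reconstruction under that assumption, and `percentage_improvement` is a hypothetical name.

```python
def percentage_improvement(regular, playoff, lower_is_better=False):
    """Signed change from regular season to playoffs, relative to the mean of the two."""
    change = 100.0 * (playoff - regular) / ((playoff + regular) / 2.0)
    return -change if lower_is_better else change

# (regular season, playoffs) career averages for Jordan, from the tables above.
jordan_averages = {"Points": (30.1, 33.4), "Steals": (2.3, 2.1), "Turnovers": (2.7, 3.1)}

for stat, (reg, po) in jordan_averages.items():
    value = percentage_improvement(reg, po, lower_is_better=(stat == "Turnovers"))
    print(stat, round(value, 2))
# Points 10.39
# Steals -9.09
# Turnovers -13.79
```

A negative value means the stat got worse in the playoffs: fewer steals, or more turnovers.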
Throughout their entire careers, both LeBron James and Michael Jordan have put up incredible stats and dominated the game. Although Jordan's raw numbers are better in the regular season, LeBron is above average in more areas, and slightly more often, than Jordan. I think the stats show that Michael Jordan is a scoring specialist, which is to be expected given that he is a 10x scoring champ. While LeBron's scoring is nothing to scoff at, it isn't on the same level as Jordan's. LeBron has, however, been consistently well above average in almost every stat. Looking at the numbers alone, Jordan appears to have the slight edge in the regular season, while LeBron has the advantage in the playoffs. If I had to choose based solely on the data examined here, in a vacuum, I think I would give the slight edge to LeBron James, but the evidence is nowhere near strong enough to make a confident claim.
As with almost everything, though, the statistics don't tell the full story. There are certain intangibles, like how clutch a player is, that are very difficult to represent numerically. There are also myriad other factors that affect the success of any NBA player and his team, such as the strength of the opposition and injuries.
Even though Michael Jordan has more championships than LeBron, I would argue that this doesn't matter, because of the 2015-2016 season. That year the defending champions, the Golden State Warriors, set the NBA record for the best regular season record, 73-9. After being down 3-1 in the best-of-7 Finals series, Cleveland came back and won the title. The biggest reason this was possible was LeBron James. During that series, LeBron pulled off what is likely the most impressive feat in the history of professional basketball. At the end of the series, counting players on both teams, LeBron had the most points, assists, rebounds, steals, and blocks. Completely dominating an entire series against one of the best teams of all time, if not the best, is something that only LeBron James has done. When considered alongside what I would argue is a slight statistical advantage, I feel quite confident in saying that LeBron James is in fact the GOAT.