NBA kaggle analysis
Yuxuan Wu

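This post builds on the Kaggle notebook linked below and extends it to the 2018-2020 seasons. The whole post is written in R Markdown, with ggplot2 as the main plotting tool. The main goal is to explore the relationship between teams and injuries.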

https://www.kaggle.com/jaseziv83/extensive-nba-injuries-deep-dive-eda/comments

Data download

https://www.kaggle.com/ghopkins/nba-injuries-2010-2018

Data cleaning

  • Look at the first few rows to check the format

Note: Acquired and Relinquished are complementary. If Acquired is non-empty, Relinquished is empty.
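The complementary relationship can be checked directly rather than read off a screenshot. The sketch below uses a toy data frame with the column names that appear in this post (Date, Team, Acquired, Relinquished, Notes); the rows themselves are made up for illustration and are not taken from the Kaggle file.

```r
# Toy rows in the same column layout as the post's df (values are invented)
toy <- data.frame(
  Date = c("2010-10-03", "2010-10-06", "2010-10-08"),
  Team = c("Bulls", "Pistons", "Bulls"),
  Acquired = c("", "", "Player A"),
  Relinquished = c("Player A", "Player B", ""),
  Notes = c("sprained left ankle (DTD)", "placed on IL", "returned to lineup"),
  stringsAsFactors = FALSE
)

# A row is complementary when exactly one of the two name columns is filled
exactly_one <- xor(toy$Acquired != "", toy$Relinquished != "")
all_complementary <- all(exactly_one)

# First rows, as a quick format check
first_rows <- head(toy, 2)
print(first_rows)
print(all_complementary)
```

On the real data, any row where `exactly_one` is FALSE would be worth inspecting, since it has either both names or neither.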

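The last record is dated 2020-10-02.

Many fields turn out to be empty, so count them per column. (One pitfall: the missing values here are not NA but the empty string "".)

print(colSums(df == ""))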
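To make the pitfall concrete, here is a small sketch on made-up data showing that `is.na()` does not see empty strings, and one possible way to recode them. Recoding to NA is an optional alternative, not what the post does; `read.csv()` also accepts `na.strings = ""` if you prefer to convert at load time.

```r
toy <- data.frame(
  Team = c("Bulls", "", "Spurs"),
  Acquired = c("", "", "Player C"),
  stringsAsFactors = FALSE
)

# is.na() finds nothing, because "" is a valid (empty) string
na_counts <- colSums(is.na(toy))

# Comparing against "" finds the blanks
blank_counts <- colSums(toy == "")

# Optional: recode blanks as NA so the usual NA tooling applies
toy_na <- toy
toy_na[toy_na == ""] <- NA
recoded_counts <- colSums(is.na(toy_na))

print(blank_counts)
```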

Next, check the overall size of the data

library(dplyr) # provides glimpse(), filter(), count() and %>%
glimpse(df)

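Inspecting the Team column

df %>% dplyr::filter(Team == "")

Both of these rows are players who were injured and then waived by their team.

Glen "Big Baby" Davis is one example.

So we can safely exclude these two rows.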

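Inspecting the Acquired column

# Most common Notes among rows where Acquired is filled in
df %>%
  filter(Acquired != "") %>%
  count(Notes) %>%
  arrange(desc(n))

These turn out to be recovery records. Since the goal is to study teams and injuries, we can simply ignore them. As mentioned earlier, Acquired and Relinquished are complementary, so the data can be processed as follows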

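df_exc <- df %>%
  filter(Acquired == "") %>% # keep only rows where Acquired is empty (drops the recovery records)
  filter(Team != "") %>% # drop rows where Team is empty
  select(-Acquired) # remove the Acquired column entirely
print(colSums(df_exc == "")) # check whether any empty values remain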

Four rows still slip through, so drop them as well

# Exclude the rows where Relinquished is still empty
df <- df_exc %>% filter(Relinquished != "")
print(colSums(df == ""))

At this point all the missing values have been handled.
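The cleaning steps above can be bundled into a single function so they are easy to rerun on a fresh download. This is only a sketch: `clean_injuries` is a hypothetical helper name, and the toy rows are invented to exercise each filter.

```r
library(dplyr)

# Hypothetical helper: applies the three filters from the post in one pass
clean_injuries <- function(raw) {
  raw %>%
    filter(Acquired == "", Team != "", Relinquished != "") %>%
    select(-Acquired)
}

# Invented rows: one valid injury record, one with no team,
# one recovery record, and one with neither player column filled
raw <- data.frame(
  Date = c("2012-01-02", "2012-01-05", "2012-01-09", "2012-01-11"),
  Team = c("Bulls", "", "Bulls", "Spurs"),
  Acquired = c("", "", "Player A", ""),
  Relinquished = c("Player A", "Player B", "", ""),
  Notes = c("sore knee (DTD)", "placed on IL", "returned to lineup", "placed on IL"),
  stringsAsFactors = FALSE
)

cleaned <- clean_injuries(raw)
remaining_blanks <- colSums(cleaned == "")
print(cleaned)
```

Only the first toy row survives, and the final blank count per column is zero, which mirrors the check done with `colSums()` in the post.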

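Processing by time (convert to Date format)

library(lubridate) # provides month()

df$year <- format(as.Date(df$Date), "%Y")
df$yr_mo <- format(as.Date(df$Date), "%Y-%m")
df$month <- month(as.Date(df$Date), label = TRUE) # abbreviated month name as an ordered factor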
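The conversion above relies on the Date column being in year-month-day form. A quick sanity check, sketched here on invented strings, is to pass the format explicitly so that anything that does not match shows up as NA instead of passing silently.

```r
raw_dates <- c("2010-10-03", "2020-10-02", "10/05/2015")

# With an explicit format, strings that do not match become NA
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Collect the offending inputs for inspection
bad <- raw_dates[is.na(parsed)]

# Same year-month key as in the post
yr_mo <- format(parsed, "%Y-%m")
print(bad)
print(yr_mo)
```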

Which team had the most injury events (including rest days) over the decade

library(ggplot2)

# Count events per team and draw a horizontal bar chart, largest at the top
df %>%
  count(Team) %>%
  ggplot(aes(x = reorder(Team, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Team", y = "Number of injury events", title = "Which team has the most injury events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5))

The top three are the Bucks, the Spurs and the Rockets.
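Because these counts mix real injuries with rest days, it can help to flag rest entries separately. The sketch below assumes that rest events are recorded with the word "rest" in the Notes column; that wording is an assumption about the data, and the example notes are invented.

```r
notes <- c(
  "rest (DTD)",
  "sprained left ankle (DTD)",
  "placed on restricted list",
  "Rest",
  "torn ACL (out for season)"
)

# \\b marks a word boundary, so "rest" matches but "restricted" does not
is_rest <- grepl("\\brest\\b", notes, ignore.case = TRUE)
rest_count <- sum(is_rest)
print(table(is_rest))
```

On the real data, the same logical vector could be used inside `filter()` to count rest days and injuries separately before plotting.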

Because the league introduced a new policy in 2018 that fines teams for resting healthy players, the earlier records may be somewhat inflated. So I also plotted the data from 2018 onward

# Keep only events from October 2018 onward (yr_mo is "YYYY-MM", so string comparison works)
df %>%
  filter(yr_mo > "2018-09") %>%
  count(Team) %>%
  ggplot(aes(x = reorder(Team, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Team", y = "Number of injury events", title = "Which team has the most injury events since 2018-2019 season") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5))
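The filter above cuts on calendar months. An explicit season label can make per-season grouping easier. This is a minimal sketch, assuming seasons roll over on 1 October; `season_start` and `season_label` are hypothetical helpers that are not part of the original analysis.

```r
# Hypothetical helper: starting year of the season a date falls in,
# assuming every season begins on 1 October
season_start <- function(date) {
  d <- as.Date(date)
  yr <- as.integer(format(d, "%Y"))
  mo <- as.integer(format(d, "%m"))
  ifelse(mo >= 10, yr, yr - 1L)
}

# Hypothetical helper: label such as "2018-19"
season_label <- function(date) {
  s <- season_start(date)
  sprintf("%d-%02d", s, (s + 1L) %% 100L)
}

dates <- c("2018-09-30", "2018-10-16", "2019-04-10", "2019-10-22")
starts <- season_start(dates)
labels <- season_label(dates)
print(labels)
```

One caveat with the October cutoff: the 2019-20 season was completed in October 2020, so the last records in this dataset (up to 2020-10-02) would be assigned to the wrong season and would need a special case.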

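In the 2018-19 season the Spurs lost guard Tony Parker, who had played for them for seventeen years and signed with the Charlotte Hornets in the offseason. Key role players such as Joffrey Lauvergne, Kyle Anderson and Brandon Paul also left one after another. It was also the first season without Manu Ginóbili since he joined the team in 2002-03; he announced his retirement on August 27, 2018, which marked the definitive end of the Spurs' "Big Three" era. On July 18, 2018, the Spurs front office traded Kawhi Leonard and Danny Green to the Toronto Raptors in exchange for DeMar DeRozan, Jakob Poeltl and a protected 2019 first-round pick.

With new blood and a younger roster overall, the Spurs were almost fully healthy in the 2018-2019 season and were the team least affected by injuries.

In this period the three teams hit hardest by injuries are the Celtics, the Bucks and the Warriors.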

Which player was most affected by injuries (including rest days) over the decade

# Top 20 players by number of events
df %>%
  # filter(yr_mo > "2018-09" & yr_mo < "2019-09") %>% # uncomment to restrict to the 2018-19 season
  count(Relinquished) %>%
  arrange(desc(n)) %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(Relinquished, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Player", y = "Number of injury events on players", title = "Which player has the most injury events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5))

Love, Gordon and Rose were the players most affected by injuries over the past decade (rest days count too, since resting is also injury prevention).
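One detail of the top-20 selection is worth knowing: `top_n()` keeps ties, so it can return more rows than requested. The sketch below, on invented counts, contrasts it with `slice_max()` from dplyr 1.0.0 or later, which can drop ties. Swapping it in is a suggestion, not what the post does.

```r
library(dplyr)

# Invented events: A appears 3 times, B and C twice each, D once
events <- data.frame(
  Relinquished = c("A", "A", "A", "B", "B", "C", "C", "D"),
  stringsAsFactors = FALSE
)

counts <- events %>% count(Relinquished)

# top_n() keeps ties, so asking for 2 players returns 3 here
with_ties <- counts %>% top_n(2, n)

# slice_max() with with_ties = FALSE returns exactly 2 rows
exactly_two <- counts %>% slice_max(n, n = 2, with_ties = FALSE)

print(with_ties)
print(exactly_two)
```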

Number of injury events per NBA player, by year

# One density curve per year of how many events each player had
df %>%
  count(year, Relinquished) %>%
  ggplot(aes(x = n)) +
  geom_density(color = "midnightblue", fill = "white") +
  facet_wrap(~year, scales = "free_y") +
  theme_bw() +
  labs(y = "", x = "Number of events", title = "Number of events per player", subtitle = "Note: 2010 & 2020 not a full year") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"))

The distribution looks roughly the same every year.
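The visual impression that the yearly distributions are similar can be backed up with a few summary numbers per year. A minimal sketch on invented data, using the same `count(year, Relinquished)` step as the plot:

```r
library(dplyr)

# Invented events: two players across two years
toy <- data.frame(
  year = c("2018", "2018", "2018", "2019", "2019", "2019", "2019"),
  Relinquished = c("A", "A", "B", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Events per player per year, then a per-year summary of that distribution
per_player <- toy %>% count(year, Relinquished)

summary_by_year <- per_player %>%
  group_by(year) %>%
  summarise(
    players = n(),              # number of distinct players with an event
    median_events = median(n),  # n here is the count column from count()
    max_events = max(n)
  )

print(summary_by_year)
```

If the medians and maxima on the real data stay close from year to year, that supports the reading of the density plots.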

Total NBA player injury events per year (line chart)

# Events per calendar year; group = 1 joins the points because year is a character column
df %>%
  count(year) %>%
  ggplot(aes(x = year, y = n, group = 1)) +
  ylim(0, 2700) +
  geom_line(size = 1, color = "darkred") +
  geom_point(size = 2, color = "midnightblue") +
  geom_text(color = "midnightblue", aes(label = n, y = n + 100)) +
  theme_classic() +
  labs(title = "Tracking injury events over time", subtitle = "Note: 2010 & 2020 not a full year", x = "Calendar Year", y = "Number events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
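Since the first and last years are only partially covered, raw yearly totals understate them. One rough way to put years on a comparable footing is to divide by the number of months that actually have records. The sketch below uses invented rows with the same `year` and `yr_mo` columns as the post.

```r
library(dplyr)

# Invented events: 2010 covers 2 months, 2011 covers 5 months
toy <- data.frame(
  year = c("2010", "2010", "2010", "2011", "2011", "2011", "2011", "2011", "2011"),
  yr_mo = c("2010-11", "2010-11", "2010-12",
            "2011-01", "2011-02", "2011-03", "2011-03", "2011-11", "2011-12"),
  stringsAsFactors = FALSE
)

rate <- toy %>%
  group_by(year) %>%
  summarise(
    events = n(),
    months_covered = n_distinct(yr_mo),       # months with at least one record
    events_per_month = events / months_covered
  )

print(rate)
```

Note that months with no records at all (for example the offseason) are not counted, so this is a rate per active month rather than per calendar month.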

Analysis by month

# Events per calendar month, pooled over all years
df %>%
  count(month) %>%
  ggplot(aes(x = month, y = n, group = 1)) +
  ylim(0, 4000) +
  geom_line(size = 1, color = "darkred") +
  geom_point(size = 2, color = "midnightblue") +
  geom_text(color = "midnightblue", aes(label = n, y = n + 100)) +
  theme_classic() +
  labs(title = "Tracking injury events by month", subtitle = "Note: 2010 & 2020 not a full year", x = "Month", y = "Number events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))

Note: the playoffs usually start in mid-April, so players who have been resting or nursing minor injuries typically return before then, and the number of injury events drops noticeably. The offseason follows. Once training camp and the preseason begin in late September and the regular season tips off in October, the unavoidable injuries gradually pile up again.
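The seasonal pattern described above is easier to read when the x axis follows the season rather than the calendar. A minimal sketch in base R, assuming a season order that runs from October to September; the dates are invented and `month.abb` is R's built-in vector of English month abbreviations.

```r
dates <- as.Date(c("2019-10-25", "2019-12-01", "2020-01-15", "2020-04-02", "2019-10-30"))

# Month number of each event
mo <- as.integer(format(dates, "%m"))

# Factor levels ordered Oct, Nov, Dec, Jan, ..., Sep
season_levels <- month.abb[c(10:12, 1:9)]
month_season <- factor(month.abb[mo], levels = season_levels)

# Counts come out in season order, ready for plotting
month_counts <- table(month_season)
print(month_counts)
```

Using a factor like `month_season` as the x aesthetic would place October on the left and the offseason months on the right.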

  • Post title:NBA kaggle analysis
  • Post author:Yuxuan Wu
  • Create time:2021-01-31 02:01:35
  • Post link:yuxuanwu17.github.io/2021/01/31/nba伤病分析/
  • Copyright Notice:All articles in this blog are licensed under BY-NC-SA unless stating additionally.