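This post follows the Kaggle notebook linked below and extends its analysis to the 2018-2020 seasons. It was written entirely in R Markdown, with ggplot2 as the main plotting tool. The goal is to explore the relationship between individual teams and injuries.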
https://www.kaggle.com/jaseziv83/extensive-nba-injuries-deep-dive-eda/comments
Data download
https://www.kaggle.com/ghopkins/nba-injuries-2010-2018
Data cleaning
Note: the Acquired and Relinquished columns are complementary. If Acquired is non-empty, Relinquished is empty.
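The complementarity claim can be checked directly rather than taken on trust. The following is a minimal sketch in base R on a made-up toy table; the player names are placeholders, and on the real data `toy` would be replaced by `df`.

```r
# Toy data in the same shape as the injury table (values are made up).
toy <- data.frame(
  Acquired     = c("", "Player A", ""),
  Relinquished = c("Player B", "", "Player C"),
  stringsAsFactors = FALSE
)

# xor() is TRUE when exactly one of the two columns is filled in.
exactly_one <- xor(toy$Acquired != "", toy$Relinquished != "")

# Rows that break the rule (both filled, or both empty) end up here.
violations <- toy[!exactly_one, ]
print(nrow(violations))
```

Any row that appears in `violations` is worth a manual look before deciding how to filter.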


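The last record is dated 2020-10-2.
Many cells are empty, so let's count them per column. (One pitfall: missing values in this file are stored as the empty string "", not as NA.)

```r
print(colSums(df == ""))
```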
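To make the empty-string pitfall concrete, here is a small sketch on a made-up table (the column values are illustrative, not taken from the dataset). It shows that `is.na()` misses the blanks, that comparing against `""` finds them, and one optional way to recode them as NA.

```r
# Toy table with blanks stored as "" (values are made up).
toy <- data.frame(
  Team  = c("Spurs", "", "Bucks"),
  Notes = c("", "", "sprained ankle"),
  stringsAsFactors = FALSE
)

# is.na() finds nothing, because "" is an ordinary string.
na_counts <- colSums(is.na(toy))

# Comparing against "" finds the blanks.
empty_counts <- colSums(toy == "")

# Optional: recode "" as NA so the usual NA tools work afterwards.
toy_na <- toy
toy_na[toy_na == ""] <- NA
recoded_counts <- colSums(is.na(toy_na))
```

Whether to recode or to keep filtering on `""` is a matter of taste; this post keeps the empty strings and filters on them.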

Next, check the overall size of the data.

Inspecting the Team column

```r
library(dplyr)  # provides %>% and filter()
df %>% dplyr::filter(Team == "")
```

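In both of these cases the player was in fact injured and then waived by his team.
Take "Big Baby" as an example.

So we can safely exclude these two rows.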
Inspecting the Acquired column

```r
df %>%
  filter(Acquired != "") %>%
  count(Notes) %>%
  arrange(desc(n))
```

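These turn out to be recovery records, i.e. players returning to the lineup. Since our goal is to relate teams to injuries, we can simply leave them out. As mentioned earlier, Acquired and Relinquished are complementary, so the data can be processed as follows:

```r
df_exc <- df %>%
  filter(Acquired == "") %>%
  filter(Team != "") %>%
  select(-Acquired)  # drop the Acquired variable entirely

print(colSums(df_exc == ""))
```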

Four rows still slip through with an empty Relinquished value; remove those as well.

```r
df <- df_exc %>%
  filter(Relinquished != "")

print(colSums(df == ""))
```

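At this point all missing values have been handled.
Processing by time (converting to Date format)

```r
library(lubridate)  # month() with label = TRUE comes from lubridate

df$year  <- format(as.Date(df$Date), "%Y")
df$yr_mo <- format(as.Date(df$Date), "%Y-%m")
df$month <- month(as.Date(df$Date), label = TRUE)
```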
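Calendar year and NBA season do not line up, so a season label can be handy next to `year` and `yr_mo`. The sketch below is not part of the original analysis: `season_label()` is a hypothetical helper, and the October cut-off (`start_month = 10`) is an assumption that does not hold for disrupted seasons such as 2019-2020.

```r
# Hypothetical helper: map a Date to a season label such as "2018-2019".
# Assumption: a season starts in October; adjust start_month if needed.
season_label <- function(date, start_month = 10) {
  yr <- as.integer(format(date, "%Y"))
  mo <- as.integer(format(date, "%m"))
  # Months from start_month onward belong to the season starting that year,
  # earlier months belong to the season that started the previous year.
  start <- ifelse(mo >= start_month, yr, yr - 1L)
  paste0(start, "-", start + 1L)
}

labels <- season_label(as.Date(c("2018-11-02", "2019-03-10", "2019-10-25")))
print(labels)
```

On the real data this could be applied as `df$season <- season_label(as.Date(df$Date))`, where `season` is again an illustrative column name.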
Let's see which team had the most injury events (rest days included) over these ten years.

```r
library(ggplot2)

df %>%
  count(Team) %>%
  ggplot(aes(x = reorder(Team, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Team", y = "Number of injury events",
       title = "Which team has the most injury events") +
  theme(panel.grid.major.y = element_blank(),
        panel.background = element_rect(fill = "whitesmoke"),
        plot.title = element_text(hjust = 0.5))
```

The top three are the Bucks, the Spurs and the Rockets.
Because the league introduced a new policy in 2018 that fines teams for resting healthy players, the earlier records may be somewhat inflated by rest days. So I also plotted the data from the 2018-2019 season onward.

```r
df %>%
  filter(yr_mo > "2018-09") %>%  # keep October 2018 and later
  count(Team) %>%
  ggplot(aes(x = reorder(Team, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Team", y = "Number of injury events",
       title = "Which team has the most injury events since 2018-2019 season") +
  theme(panel.grid.major.y = element_blank(),
        panel.background = element_rect(fill = "whitesmoke"),
        plot.title = element_text(hjust = 0.5))
```
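The filter above compares `yr_mo` as text. That works because zero-padded "YYYY-MM" strings sort in the same order as the dates they represent. A small sketch with made-up dates illustrates that the string comparison and an explicit Date comparison select the same rows:

```r
# Made-up dates around the cut-off.
dates <- as.Date(c("2018-03-15", "2018-09-30", "2018-10-20", "2019-01-05"))
yr_mo <- format(dates, "%Y-%m")

# Text comparison, as used in the filter.
keep_string <- yr_mo > "2018-09"

# Equivalent comparison on real Date values.
keep_date <- dates >= as.Date("2018-10-01")

print(identical(keep_string, keep_date))
```

The Date version is a little more explicit about where the cut-off lies, while the string version avoids re-parsing the column; either is reasonable here.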

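That season the Spurs lost guard Tony Parker, who had played for them for seventeen years and signed with the Charlotte Hornets in the offseason. Key role players such as Joffrey Lauvergne, Kyle Anderson and Brandon Paul also left. It was the first season since 2002-03 without Manu Ginóbili, who announced his retirement on August 27, 2018, which marked the definitive end of the Spurs' "Big Three" era. On July 18, 2018, Spurs management traded Kawhi Leonard and Danny Green to the Toronto Raptors in exchange for DeMar DeRozan, Jakob Poeltl and a protected 2019 first-round pick.
With new blood and a younger roster overall, the Spurs were almost fully healthy in the 2018-2019 season and were the team least affected by injuries.
Over this period, the three teams hit hardest by injuries were the Celtics, the Bucks and the Warriors.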
Which player was most affected by injuries (rest included) over the ten years

```r
df %>%
  count(Relinquished) %>%
  arrange(desc(n)) %>%
  top_n(20, wt = n) %>%  # name the weight column explicitly
  ggplot(aes(x = reorder(Relinquished, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Player", y = "Number of injury events on players",
       title = "Which player has the most injury events") +
  theme(panel.grid.major.y = element_blank(),
        panel.background = element_rect(fill = "whitesmoke"),
        plot.title = element_text(hjust = 0.5))
```

Love, Gordon and Rose are the players most affected by injuries over the past ten years (resting also counts, since it is a form of injury prevention).
Number of injury events per NBA player, by year

```r
df %>%
  count(year, Relinquished) %>%
  ggplot(aes(x = n)) +
  geom_density(color = "midnightblue", fill = "white") +
  facet_wrap(~year, scales = "free_y") +
  theme_bw() +
  labs(y = "", x = "Number of events",
       title = "Number of events per player",
       subtitle = "Note: 2010 & 2020 not a full year") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        panel.grid.major.y = element_blank(),
        panel.background = element_rect(fill = "whitesmoke"))
```

The shape of the distribution is roughly the same every year.
Total NBA injury events per year (line chart)

```r
df %>%
  count(year) %>%
  ggplot(aes(x = year, y = n, group = 1)) +
  ylim(0, 2700) +
  geom_line(size = 1, color = "darkred") +
  geom_point(size = 2, color = "midnightblue") +
  geom_text(color = "midnightblue", aes(label = n, y = n + 100)) +
  theme_classic() +
  labs(title = "Tracking injury events over time",
       subtitle = "Note: 2010 & 2020 not a full year",  # data now runs to 2020
       x = "Calendar Year", y = "Number events") +
  theme(panel.grid.major.y = element_blank(),
        panel.background = element_rect(fill = "whitesmoke"),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))  # stray trailing comma removed
```

Analysis by month

```r
df %>%
  count(month) %>%
  ggplot(aes(x = month, y = n, group = 1)) +
  ylim(0, 4000) +
  geom_line(size = 1, color = "darkred") +
  geom_point(size = 2, color = "midnightblue") +
  geom_text(color = "midnightblue", aes(label = n, y = n + 100)) +
  theme_classic() +
  labs(title = "Tracking injury events by month",
       subtitle = "Note: 2010 & 2020 not a full year",
       x = "Month", y = "Number events") +  # the x axis is month, not year
  theme(panel.grid.major.y = element_blank(),
        panel.background = element_rect(fill = "whitesmoke"),
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
```

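Note: the playoffs usually begin in mid-April, so players who have been resting or nursing minor injuries tend to return just before then, which shows up as a marked drop in injury events. The offseason follows, and once teams return in September for training camp and preseason, the unavoidable injuries gradually climb again.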