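This post follows the Kaggle notebook linked below and extends its analysis through the 2018-2020 seasons. The whole write-up was done in R Markdown, with ggplot2 as the main plotting tool. The main goal is to explore the relationship between individual teams and injuries.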
https://www.kaggle.com/jaseziv83/extensive-nba-injuries-deep-dive-eda/comments
Data download
https://www.kaggle.com/ghopkins/nba-injuries-2010-2018
Data cleaning
Note: the Acquired and Relinquished columns are complementary. If Acquired is non-empty, Relinquished is empty.
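The complementary relationship is easy to verify before relying on it. The following is a minimal base R sketch on hypothetical toy rows (not taken from the real dataset); it cross-tabulates which of the two columns is filled and pulls out any rows that break the rule.

```r
# Toy rows for illustration only (hypothetical, not from the Kaggle data)
toy <- data.frame(
  Acquired     = c("Player A", "", "", ""),
  Relinquished = c("", "Player B", "Player C", ""),
  stringsAsFactors = FALSE
)

# Cross-tabulate "is Acquired filled?" against "is Relinquished filled?"
fill_table <- table(
  acquired_filled     = toy$Acquired != "",
  relinquished_filled = toy$Relinquished != ""
)
print(fill_table)

# Rows that break the rule: both columns filled, or both empty
violations <- toy[(toy$Acquired != "") == (toy$Relinquished != ""), ]
print(nrow(violations))
```

On real data, any non-zero count in the TRUE/TRUE or FALSE/FALSE cells points at rows that need a closer look before filtering.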


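The last record is dated 2020-10-02.
Many cells turn out to be empty, so let's count them per column. (One pitfall: missing values here are stored as the empty string "", not as NA.)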
```r
# Count empty strings per column (missing values are "", not NA)
print(colSums(df == ""))
```
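To make the pitfall concrete, here is a small base R sketch on a hypothetical toy data frame. It shows that `is.na()` finds nothing when missing values are stored as `""`, and how the empty strings could be converted to `NA` if the usual NA tooling is preferred.

```r
# Toy data for illustration only (hypothetical values)
toy <- data.frame(
  Team  = c("Spurs", "", "Bucks"),
  Notes = c("", "", "placed on IL"),
  stringsAsFactors = FALSE
)

na_counts    <- colSums(is.na(toy))  # all zero: "" is not NA
empty_counts <- colSums(toy == "")   # counts the empty strings

print(na_counts)
print(empty_counts)

# Optional: turn "" into NA so is.na(), complete.cases(), etc. work
toy_na <- toy
toy_na[toy_na == ""] <- NA
print(colSums(is.na(toy_na)))
```

If a column also contains real `NA` values, `df == ""` yields `NA` for those cells, so `colSums(df == "", na.rm = TRUE)` is the safer call in that case.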
 

Next, check the overall size of the data.

Inspecting the Team column
```r
# Rows with no team recorded
df %>% dplyr::filter(Team == "")
```
 

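Both of these cases are players who were injured and then waived by their team.
Take Glen "Big Baby" Davis as an example.

So we can safely exclude these two cases.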
Inspecting the Acquired column
```r
# Most frequent notes on rows where Acquired is filled in
df %>%
  filter(Acquired != "") %>%
  count(Notes) %>%
  arrange(desc(n))
```
 

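These turn out to be recovery records (players returning to the lineup). Since our goal is to look at teams and injuries, we can simply leave them out. As mentioned earlier, Acquired and Relinquished are complementary, so the data can be processed as follows.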
```r
df_exc <- df %>%
  filter(Acquired == "") %>%
  filter(Team != "") %>%
  select(-Acquired)  # drop the Acquired variable entirely

print(colSums(df_exc == ""))
```
 

Four rows still slipped through, so remove those as well.
```r
df <- df_exc %>%
  filter(Relinquished != "")

print(colSums(df == ""))
```
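The separate filtering steps above could also be folded into one reusable function. This is only a sketch: the helper name `clean_injuries` and the toy rows are hypothetical, and it assumes the same column names (`Team`, `Acquired`, `Relinquished`) as the dataset used in this post.

```r
library(dplyr)

# Hypothetical helper (name not from the original post): keep only rows that
# describe a player leaving the lineup, with a team and a player recorded.
clean_injuries <- function(raw) {
  raw %>%
    filter(Acquired == "", Team != "", Relinquished != "") %>%
    select(-Acquired)
}

# Toy rows for illustration only
toy <- data.frame(
  Date         = c("2018-10-16", "2018-10-17", "2018-10-18", "2018-10-19"),
  Team         = c("Spurs", "", "Bucks", "Celtics"),
  Acquired     = c("", "", "Player C", ""),
  Relinquished = c("Player A", "Player B", "", ""),
  Notes        = c("sprained ankle", "waived", "returned to lineup", "unknown"),
  stringsAsFactors = FALSE
)

cleaned <- clean_injuries(toy)
print(cleaned)
print(colSums(cleaned == ""))
```

Keeping the cleaning in one function makes it easier to rerun the same rules when the dataset is refreshed with later seasons.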
 

At this point the missing values have all been handled.
Processing by time (converting to Date format)
```r
library(lubridate)  # month() with label = TRUE comes from lubridate

df$year  <- format(as.Date(df$Date), "%Y")
df$yr_mo <- format(as.Date(df$Date), "%Y-%m")
df$month <- month(as.Date(df$Date), label = TRUE)
```
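For readers who want to see what these three derived columns look like, here is a base R sketch on a few hypothetical dates. Instead of lubridate's `month(label = TRUE)`, it builds an ordered factor from the built-in `month.abb` vector, which is an assumption of this sketch rather than the approach used in the post; the point is that an ordered factor keeps months in calendar order on a plot axis.

```r
# Hypothetical example dates (not taken from the dataset)
dates <- as.Date(c("2010-10-03", "2018-09-30", "2020-10-02"))

year  <- format(dates, "%Y")     # "2010" "2018" "2020"
yr_mo <- format(dates, "%Y-%m")  # zero-padded year-month strings

# Ordered month factor built from the month number, so Jan < Feb < ... < Dec
month_num <- as.integer(format(dates, "%m"))
month     <- factor(month.abb[month_num], levels = month.abb, ordered = TRUE)

print(data.frame(dates, year, yr_mo, month))
```

Note that `year` and `yr_mo` are character strings, not numbers or dates, which matters for any later comparison or axis ordering.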
 
Which team had the most injury events (including rest days) over these ten years
```r
df %>%
  count(Team) %>%
  ggplot(aes(x = reorder(Team, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Team", y = "Number of injury events", title = "Which team has the most injury events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5))
```
 

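The top three are the Bucks, the Spurs, and the Rockets.
Because the league introduced a new policy in 2018 that fines teams for resting healthy players, the earlier records may be somewhat inflated by rest days. So I also plotted the data from the 2018-2019 season onward.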
```r
df %>%
  filter(yr_mo > "2018-09") %>%
  count(Team) %>%
  ggplot(aes(x = reorder(Team, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Team", y = "Number of injury events", title = "Which team has the most injury events since 2018-2019 season") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5))
```
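The filter `yr_mo > "2018-09"` compares strings, not dates. It works because the `"%Y-%m"` format is zero-padded, so alphabetical order matches chronological order. The sketch below, on hypothetical values, checks this against an explicit Date comparison; it assumes the session's string collation orders digits in the usual way.

```r
# Hypothetical year-month strings for illustration
yr_mo <- c("2018-08", "2018-09", "2018-10", "2019-01", "2017-12")

# String comparison, as used in the filter above
keep_string <- yr_mo > "2018-09"

# Equivalent comparison on real Date values (first day of each month)
dates     <- as.Date(paste0(yr_mo, "-01"))
keep_date <- dates >= as.Date("2018-10-01")

print(yr_mo[keep_string])
print(identical(keep_string, keep_date))
```

Comparing on the Date column directly is the more defensive choice if the string format could ever change, for example to a non-padded month.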
 

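That season the Spurs lost guard Tony Parker, who had played for them for seventeen years and chose to sign with the Charlotte Hornets in the offseason. Key role players Joffrey Lauvergne, Kyle Anderson, and Brandon Paul also left one after another. It was also the first season without Manu Ginóbili since he joined the team in 2002-03; he announced his retirement on August 27, 2018, which marked the definitive end of the Spurs' "Big Three" era. On July 18, 2018, the Spurs' front office traded Kawhi Leonard and Danny Green to the Toronto Raptors in exchange for DeMar DeRozan, Jakob Poeltl, and a protected 2019 first-round pick.
With fresh blood and a younger roster overall, the Spurs were almost fully healthy in the 2018-2019 season and were the team least affected by injuries.
Over this period, the three teams most affected by injuries were the Celtics, the Bucks, and the Warriors.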
Which player was most affected by injuries (including rest days) over the ten years
```r
df %>%
  count(Relinquished) %>%
  arrange(desc(n)) %>%
  top_n(n = 20) %>%
  ggplot(aes(x = reorder(Relinquished, n), y = n)) +
  geom_col(fill = "midnightblue", color = "white") +
  coord_flip() +
  theme_bw() +
  labs(x = "Player", y = "Number of injury events on players", title = "Which player has the most injury events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5))
```
 

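Love, Gordon, and Rose were the players most affected by injuries over the past ten years (resting also counts, since it is a form of injury prevention).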
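One detail worth knowing about the top-20 selection: `top_n()` keeps ties, so it can return more than 20 rows, and recent dplyr versions mark it as superseded in favor of `slice_max()`. The sketch below uses hypothetical toy players to show the difference between keeping and dropping ties; it is an alternative, not what the post's plot used.

```r
library(dplyr)

# Toy events for illustration only (hypothetical players)
toy <- data.frame(
  Relinquished = c("A", "A", "A", "B", "B", "C", "C", "D"),
  stringsAsFactors = FALSE
)

counts <- toy %>% count(Relinquished)  # A = 3, B = 2, C = 2, D = 1

# Top 2 by count, ties kept (B and C both have 2 events)
top_with_ties <- counts %>% slice_max(order_by = n, n = 2)

# Exactly 2 rows, ties broken by row order
top_exact <- counts %>% slice_max(order_by = n, n = 2, with_ties = FALSE)

print(top_with_ties)
print(top_exact)
```

For a ranking chart, keeping ties is usually the fairer choice, at the cost of a bar count that may exceed the requested number.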
Number of injury events per NBA player, by year
```r
df %>%
  count(year, Relinquished) %>%
  ggplot(aes(x = n)) +
  geom_density(color = "midnightblue", fill = "white") +
  facet_wrap(~year, scales = "free_y") +
  theme_bw() +
  labs(y = "", x = "Number of events", title = "Number of events per player", subtitle = "Note: 2010 & 2020 not a full year") +
  theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5), panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"))
```
 

The shape of the distribution is roughly the same every year.
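A visual impression like this can be backed up with a few summary numbers per year. The base R sketch below, on hypothetical toy events, counts events per player within each year and then compares the median and maximum across years; on the real data the same idea would apply to the `year` and `Relinquished` columns.

```r
# Toy events for illustration only (hypothetical players and years)
toy <- data.frame(
  year         = c("2018", "2018", "2018", "2019", "2019", "2019"),
  Relinquished = c("A", "A", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Events per player per year, dropping player-year pairs with no events
per_player <- as.data.frame(
  table(year = toy$year, player = toy$Relinquished),
  stringsAsFactors = FALSE
)
per_player <- per_player[per_player$Freq > 0, ]

# Summary statistics of the per-player counts, one value per year
median_by_year <- tapply(per_player$Freq, per_player$year, median)
max_by_year    <- tapply(per_player$Freq, per_player$year, max)

print(median_by_year)
print(max_by_year)
```

If the medians and upper tails stay close from year to year, that supports reading the density plots as having the same shape.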
Total NBA player injury events per year (line chart)
```r
df %>%
  count(year) %>%
  ggplot(aes(x = year, y = n, group = 1)) +
  ylim(0, 2700) +
  geom_line(size = 1, color = "darkred") +
  geom_point(size = 2, color = "midnightblue") +
  geom_text(color = "midnightblue", aes(label = n, y = n + 100)) +
  theme_classic() +
  labs(title = "Tracking injury events over time", subtitle = "Note: 2010 & 2020 not a full year", x = "Calendar Year", y = "Number events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
```
 

Analysis by month
```r
df %>%
  count(month) %>%
  ggplot(aes(x = month, y = n, group = 1)) +
  ylim(0, 4000) +
  geom_line(size = 1, color = "darkred") +
  geom_point(size = 2, color = "midnightblue") +
  geom_text(color = "midnightblue", aes(label = n, y = n + 100)) +
  theme_classic() +
  labs(title = "Tracking injury events by month", subtitle = "Note: 2010 & 2020 not a full year", x = "Month", y = "Number events") +
  theme(panel.grid.major.y = element_blank(), panel.background = element_rect(fill = "whitesmoke"), plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5))
```
 

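Note: the playoffs usually start in mid-April, so players who were being rested or nursing minor injuries tend to return just before then, which produces a clear drop in injury events. After that comes the offseason, and once the league starts up again in September the unavoidable injuries gradually pile up.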