Predicting Behaviors of Basketball Players from First Person Videos

1Shan Su, 2Jung Pyo Hong, 1Jianbo Shi, and 3Hyun Soo Park

1University of Pennsylvania
2Korea Advanced Institute of Science & Technology
3University of Minnesota

Figure 1: We predict basketball players future location and gaze direction up to 5 seconds. The first column and top row: a comparison between the predicted trajectories with gaze directions in blue with ground truth trajectory in red. First column and bottom row: a comparison between the predicted joint attention in green with the ground truth joint attention in orange. Second column: a comparison between a target sequence (odd rows) and the retrieved sequence (even rows). The retrieved sequence has similar social configuration as time evolves. The predicted joint attention are projected onto the target sequence to validate the prediction. The joint attention agrees with scene activities. 


This paper presents a method to predict the future movements (location and gaze direction) of basketball players as a whole from their first person videos. The predicted behaviors reflect an individual physical space that affords to take the next actions while conforming to social behaviors by engaging to joint attention. Our key innovation is to use the 3D reconstruction of multiple first person cameras to automatically annotate each other's visual semantics of social configurations.

We leverage two learning signals uniquely embedded in first person videos. Individually, a first person video records the visual semantics of a spatial and social layout around a person that allows associating with past similar situations. Collectively, first person videos follow joint attention that can link the individuals to a group. 

We learn the egocentric visual semantics of group movements using a Siamese neural network to retrieve future trajectories. We consolidate the retrieved trajectories from all players by maximizing a measure of social compatibility---the gaze alignment towards joint attention predicted by their social formation, where the dynamics of joint attention is learned by a long-term recurrent convolutional network. This allows us to characterize which social configuration is more plausible and predict future group trajectories.


Shan Su, Jung Pyo Hong, Jianbo Shi, and Hyun Soo Park "Predicting  Behaviors  of  Basketball Players from First Person Videos" Conference on Computer Vision and Pattern Recognition (CVPR) (spotlight), 2017 [paper, presentation, video]



Coming soon


Shan Su,