# Summary of the champion method in the ICRA 2020 driverless trajectory prediction competition

Posted Jun 15, 2020

Pedestrian trajectory prediction is an important part of autonomous driving technology and has become a research hotspot in recent years. At ICRA 2020, a top international conference in robotics, the Meituan unmanned delivery team won the pedestrian trajectory prediction competition. This article summarizes some of the prediction methods involved, and we hope it will be helpful or inspiring to readers.

## I. Background

On June 2, ICRA 2020, a top international robotics conference, held the "Second Long-term Human Motion Prediction Workshop." The workshop was organized by Bosch GmbH, Örebro University, the University of Stuttgart, and the Swiss Federal Institute of Technology in Lausanne. A pedestrian trajectory prediction competition was held alongside the workshop and attracted 104 participating teams from around the world. The Meituan unmanned delivery team won first place in the competition with an interactive prediction method based on a "world model".

## II. Introduction to the competition

This competition provides pedestrian trajectory data sets covering ten complex scenes such as streets, entrances, and campuses. Using these data sets, participants are required to predict each pedestrian's trajectory over the next 4.8 seconds from its trajectory over the past 3.6 seconds. The competition ranks algorithms by FDE (final displacement error: the distance between the end point of the predicted trajectory and the end point of the real trajectory).
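As a minimal sketch of the ranking metric, FDE can be computed as the Euclidean distance between the last predicted point and the last ground-truth point. The function names, the list-of-`(x, y)` representation, and the averaging over pedestrians are assumptions for illustration; this is not the competition's official evaluation code.

```python
import math

def final_displacement_error(predicted, ground_truth):
    """Euclidean distance between the final predicted and final true positions.

    Both arguments are sequences of (x, y) points in the same coordinate frame.
    """
    (px, py), (gx, gy) = predicted[-1], ground_truth[-1]
    return math.hypot(px - gx, py - gy)

def mean_fde(predictions, ground_truths):
    """Average FDE over several pedestrians (assumed aggregation)."""
    errors = [final_displacement_error(p, g)
              for p, g in zip(predictions, ground_truths)]
    return sum(errors) / len(errors)
```

Only the last point of each trajectory enters the metric, so two predictions that take different paths to the same end point receive the same FDE.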

The competition data set is mainly derived from real annotated data and simulated synthetic data in various dynamic scenarios. The sampling frequency is 2.5 Hz, so consecutive moments are 0.4 seconds apart. Each pedestrian trajectory is represented as a time series of coordinates in a fixed coordinate system, and the trajectories are classified into categories according to the pedestrian's surroundings, such as static obstacles, linear motion, following motion, obstacle avoidance, and group motion. In this competition, the participating teams need to predict the trajectory at 12 future moments (corresponding to 4.8 seconds) from the trajectory data at 9 historical moments of each obstacle (corresponding to 3.6 seconds).
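The timing above can be checked with a short sketch that splits one track into its observed and future parts. The constant names, the `split_track` helper, and the synthetic track are hypothetical; the real data format is not described in this article.

```python
OBS_LEN = 9      # observed steps (3.6 s)
PRED_LEN = 12    # steps to predict (4.8 s)
DT = 0.4         # seconds between samples at 2.5 Hz

def split_track(track):
    """Split a track of (x, y) points into observed and future parts."""
    if len(track) < OBS_LEN + PRED_LEN:
        raise ValueError("track is too short for one sample")
    return track[:OBS_LEN], track[OBS_LEN:OBS_LEN + PRED_LEN]

# A synthetic pedestrian walking along x at 1.25 m/s (made-up data).
track = [(1.25 * DT * k, 0.0) for k in range(OBS_LEN + PRED_LEN)]
observed, future = split_track(track)
```

With a 0.4-second step, 9 observed points span 3.6 seconds and 12 predicted points span 4.8 seconds, which matches the task definition.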

The competition uses several evaluation metrics, covering both single-modal and multi-modal prediction models. A single-modal model outputs exactly one trajectory for a given historical trajectory, whereas a multi-modal model outputs multiple feasible trajectories (or a distribution over them). The ranking of this competition is based on FDE, one of the single-modal metrics.

## III. Introduction to the method

In fact, Meituan deals with pedestrian trajectory prediction in many of its practical businesses. The difficulty lies in modeling the social behavior of pedestrians in a dynamic, complex environment: in complex scenes, pedestrians interact very frequently, and the outcome of each interaction directly affects their subsequent movement (for example, slowing down to yield, detouring around an obstacle, or speeding up to avoid one).

Building on various interaction data sets, a series of algorithms for interactive obstacle prediction has been proposed. These mainstream models focus on modeling the interaction between people in complex scenes. Commonly used methods include LSTM-based interaction algorithms (SR-LSTM [1], Social GAN [2], SoPhie [3], Peeking into the Future [4], StarNet [5], etc.), Graph/Attention-based interaction algorithms (GRIP [6], Social-STGCNN [7], STGAT [8], VectorNet [9], etc.), and prediction algorithms based on semantic maps or raw data.

Our entry is an improved version of our self-developed algorithm [10] (as shown in Figure 2). The design idea is to build and maintain a global world model from the historical trajectories, tracking information, and scene information of all obstacles in the scene, and to use it to mine the interaction features between obstacles and between obstacles and the environment. The world model is then queried to obtain the interaction features in the neighborhood of each location, which in turn guide the prediction for each obstacle.

In practice, because the data set lacks scene information, we adjusted the model accordingly. In the world model (corresponding to Interaction Net in the above figure), we use only the location information and the tracking information (LSTM hidden states) that the existing data set and model can provide. The resulting model structure is shown in Figure 3 below:

The entire model is based on the Seq2Seq structure and consists of three parts: the historical trajectory encoding module (Encoder), the world model (Interaction Module), and the decoding prediction module (Decoder). The encoder encodes each pedestrian's historical trajectory, mainly to extract the pedestrian's movement pattern in the dynamic environment; the decoder uses the movement pattern features produced by the encoder to predict the distribution of the pedestrian's future trajectory. Note that throughout encoding and decoding, the world model must be updated and queried in real time. The Update operation writes the pedestrians' movement information into the world model as the time series advances; the Query operation uses the global world map and a pedestrian's own position to obtain the environmental features of that pedestrian's current neighborhood.

Figure 4 shows the calculation process of our model in the historical trajectory encoding stage. The encoding stage has 9 steps, corresponding to the 9 historical observation time points, and the same operation is performed at each step. Take time t as an example.

First, the data of all pedestrians at time t (their coordinates and the corresponding LSTM hidden states) is input into the world model to update the map information; this is the Update operation. The Update operation passes this input through modules such as MLP, MaxPooling, and GRU to obtain a global spatio-temporal map feature R. Then, each LSTM (one per pedestrian) uses the coordinates of its current observation time to query the map for the environmental features in that pedestrian's neighborhood; this is the Query operation.
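The Update and Query operations can be illustrated with a rough NumPy sketch, which is not the team's actual network. It assumes a one-layer MLP, max pooling over pedestrians, and a single GRU cell whose recurrent state is R. The layer sizes, the random weights, the form of the query, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
POS, HID, FEAT = 2, 8, 16   # (x, y), LSTM hidden size, map feature size (all assumed)

def dense(n_in, n_out):
    """Random weights standing in for trained parameters."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

W_embed = dense(POS + HID, FEAT)      # per-pedestrian MLP (one layer here)
W_gate = dense(2 * FEAT, 2 * FEAT)    # GRU update and reset gates
W_cand = dense(2 * FEAT, FEAT)        # GRU candidate state
W_query = dense(FEAT + POS, FEAT)     # query MLP

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update(R, positions, hidden):
    """Fold all pedestrians at one time step into the global feature R."""
    x = np.concatenate([positions, hidden], axis=1)      # (N, POS + HID)
    embedded = np.tanh(x @ W_embed[0] + W_embed[1])      # MLP
    pooled = embedded.max(axis=0)                        # MaxPooling over pedestrians
    # GRU cell: pooled is the input, R is the recurrent state.
    gates = sigmoid(np.concatenate([pooled, R]) @ W_gate[0] + W_gate[1])
    z, r = gates[:FEAT], gates[FEAT:]
    cand = np.tanh(np.concatenate([pooled, r * R]) @ W_cand[0] + W_cand[1])
    return (1.0 - z) * R + z * cand

def query(R, position):
    """Return interaction features for one pedestrian from R and its position."""
    return np.tanh(np.concatenate([R, position]) @ W_query[0] + W_query[1])

positions = rng.normal(size=(5, POS))   # 5 pedestrians at time t (made-up data)
hidden = np.zeros((5, HID))             # encoder LSTM states start at zero
R = update(np.zeros(FEAT), positions, hidden)
features = [query(R, p) for p in positions]
```

Because the pedestrians are combined by max pooling, the updated R does not depend on the order in which they are listed, which is a convenient property when the number of pedestrians changes from scene to scene.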

The decoding (prediction) stage is basically the same as the historical trajectory encoding stage, with two subtle differences:

- Difference 1: In the encoding stage, the LSTM hidden state of each pedestrian is initialized to 0; in the decoding stage, it is initialized from the hidden state at the end of the encoding stage plus noise.
- Difference 2: In the encoding stage, the LSTM and the world model of each pedestrian use the pedestrian's historical observed coordinates; in the decoding stage, they use the pedestrian coordinates predicted at the previous time step.
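The two differences can be sketched with a toy rollout. The `step` function below is a deliberately simple stand-in (a smoothed-velocity model), not the LSTM and world model described here. It exists only to show where the state comes from and which coordinates are fed in at each stage. All names and the noise model are assumptions.

```python
import random

def step(state, prev_xy, xy):
    """Toy stand-in for one LSTM + world-model step (not the real network).

    The state is a smoothed velocity; the output is the next position guess.
    """
    vx, vy = state
    vx = 0.5 * vx + 0.5 * (xy[0] - prev_xy[0])
    vy = 0.5 * vy + 0.5 * (xy[1] - prev_xy[1])
    return (vx, vy), (xy[0] + vx, xy[1] + vy)

def predict(observed, pred_len=12, noise_std=0.0, seed=0):
    rng = random.Random(seed)
    # Encoding: the state starts at zero and each step reads an observed point.
    state, prev = (0.0, 0.0), observed[0]
    for xy in observed:
        state, guess = step(state, prev, xy)
        prev = xy
    # Decoding, difference 1: start from the encoder state plus noise.
    state = tuple(s + rng.gauss(0.0, noise_std) for s in state)
    # Decoding, difference 2: each step reads the point predicted one step earlier.
    future = []
    for _ in range(pred_len):
        future.append(guess)
        state, next_guess = step(state, prev, guess)
        prev, guess = guess, next_guess
    return future
```

Feeding predictions back as inputs means that errors can accumulate over the 12 decoding steps, which is one reason the end-point error (FDE) is a demanding metric.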

## IV. Data pre-processing and post-processing

To better understand the data and choose a suitable model, we applied some preprocessing to the training data. First, the data set provides a behavior label for each pedestrian, and these labels are generated by rules. Because we use an interactive prediction method, we want the model to learn the position and velocity relationships between pedestrians and surrounding agents automatically, so we do not directly use the "type" information in the annotations. Second, the competition data consists of pedestrian trajectories collected in different scenes such as roads and campuses. These scenes differ greatly, and the training set and test set do not follow the same data distribution.

Therefore, we visualized the data: we moved the starting point of every trajectory to the origin and divided all trajectories into 4 categories according to the direction of the end point of the historical observation trajectory (the first 9 moments): moving to the top left, top right, bottom left, and bottom right. The resulting distributions are shown in Figure 6. They show that there is a gap between the data distributions of the training set and the test set.
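A minimal sketch of this four-way split follows. It assumes that the y axis points up and that a zero displacement counts as "top" or "right"; the function names are hypothetical.

```python
from collections import Counter

def direction_category(observed):
    """Classify a track by where its observed end point lies relative to its start."""
    x0, y0 = observed[0]
    # Subtracting the start point is the same as moving it to the origin.
    dx, dy = observed[-1][0] - x0, observed[-1][1] - y0
    vertical = "top" if dy >= 0 else "bottom"
    horizontal = "right" if dx >= 0 else "left"
    return f"{vertical}-{horizontal}"

def category_shares(tracks):
    """Fraction of tracks that fall into each category."""
    counts = Counter(direction_category(t) for t in tracks)
    return {name: n / len(tracks) for name, n in counts.items()}
```

Comparing `category_shares` for the training tracks and the test tracks gives a quick numeric view of the kind of distribution gap described above.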

To address this, we applied two preprocessing steps to the training set to make its distribution more consistent with that of the test set:

- Balanced sampling;
- Regularization of scene data (interpolation of missing trajectory points, trajectory centering, and random rotation).
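The two steps can be sketched as below. The article does not specify how balancing or interpolation was done, so sampling with replacement, linear interpolation, centering on the first point, and a uniformly random rotation angle are all assumptions. Missing points are represented as `None`, and the first and last points are assumed to be present.

```python
import math
import random

def interpolate_missing(track):
    """Linearly fill None entries that lie between two known points."""
    filled = list(track)
    known = [i for i, p in enumerate(filled) if p is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            w = (i - a) / (b - a)
            filled[i] = tuple((1 - w) * pa + w * pb
                              for pa, pb in zip(filled[a], filled[b]))
    return filled

def center(track, ref_index=0):
    """Translate the track so that the reference point sits at the origin."""
    rx, ry = track[ref_index]
    return [(x - rx, y - ry) for x, y in track]

def rotate(track, angle):
    """Rotate every point about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in track]

def regularize(track, rng=random):
    """Interpolate, center, then apply a random rotation (training time only)."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return rotate(center(interpolate_missing(track)), angle)

def balanced_sample(tracks_by_category, per_category, rng=random):
    """Draw the same number of tracks from every category, with replacement."""
    return [rng.choice(tracks)
            for tracks in tracks_by_category.values()
            for _ in range(per_category)]
```

Rotation and translation leave distances between points unchanged, so these transforms alter the direction statistics of the training set without distorting walking speeds.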

In addition, we post-processed the prediction results to correct the trajectories, mainly by cropping trajectory points and by selecting trajectories with non-maximum suppression. Figure 7 shows the area in which pedestrians move in two of the scenes; the area has a clear boundary. Trajectories that go beyond the boundary are corrected to keep them plausible.
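Both post-processing ideas can be sketched in a few lines. The article does not give the boundary shape or the suppression criterion, so the axis-aligned rectangle, the end-point distance test, and the function names are assumptions chosen to keep the example small.

```python
import math

def crop_to_area(track, x_min, x_max, y_min, y_max):
    """Clamp every predicted point into an axis-aligned walkable area."""
    return [(min(max(x, x_min), x_max), min(max(y, y_min), y_max))
            for x, y in track]

def select_by_nms(candidates, scores, radius):
    """Greedy non-maximum suppression over candidate trajectories.

    Candidates are visited from highest to lowest score; one is kept only if its
    end point is farther than `radius` from every already-kept end point.
    Returns the indices of the kept candidates.
    """
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(math.dist(candidates[i][-1], candidates[k][-1]) > radius
               for k in kept):
            kept.append(i)
    return kept
```

For example, with end points at x = 1.0, 1.1, and 5.0, scores 0.9, 0.8, and 0.5, and a radius of 0.5, the second candidate is suppressed by the first and the third is kept.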

Finally, during training we used K-Fold Cross Validation and Grid Search for parameter tuning. The resulting model achieved an FDE of 1.24 meters on the test set, while the second-place method achieved an FDE of 1.30 meters.
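Combining the two tuning techniques can be sketched as follows. The helper names and the caller-supplied `evaluate` function are hypothetical, and the article does not list which parameters were tuned or how many folds were used.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

def grid_search(param_grid, n_samples, k, evaluate):
    """Return the parameter combination with the lowest mean validation error.

    `evaluate(params, train_idx, val_idx)` is supplied by the caller: it trains
    on train_idx and returns an error (for example FDE) measured on val_idx.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = [evaluate(params, tr, va)
                  for tr, va in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score < best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Averaging the validation error over all folds makes the choice of parameters less sensitive to any single train/validation split.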

## V. Summary

Pedestrian trajectory prediction is currently a very active research field. As more scholars and research institutions take part, prediction methods keep improving. Meituan's unmanned delivery team looks forward to working with the industry on more and better solutions in this area. Fortunately, the scenes in this competition are similar to those of Meituan's unmanned delivery, so we believe the work can directly benefit the business in the future. So far, we have tested this research work in the competition, which verified the algorithm's performance and provides good support for deploying the algorithm in the business.

## References

- [1] Zhang P, Ouyang W, Zhang P, et al. SR-LSTM: State refinement for LSTM towards pedestrian trajectory prediction[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 12085-12094.
- [2] Gupta A, Johnson J, Fei-Fei L, et al. Social GAN: Socially acceptable trajectories with generative adversarial networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2255-2264.
- [3] Sadeghian A, Kosaraju V, Sadeghian A, et al. SoPhie: An attentive GAN for predicting paths compliant to social and physical constraints[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 1349-1358.
- [4] Liang J, Jiang L, Niebles J C, et al. Peeking into the future: Predicting future person activities and locations in videos[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 5725-5734.
- [5] Zhu Y, Qian D, Ren D, et al. StarNet: Pedestrian trajectory prediction using deep neural network in star topology[C]//Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. 2019: 8075-8080.
- [6] Li X, Ying X, Chuah M C. GRIP: Graph-based interaction-aware trajectory prediction[C]//Proceedings of the IEEE Intelligent Transportation Systems Conference. IEEE, 2019: 3960-3966.
- [7] Mohamed A, Qian K, Elhoseiny M, et al. Social-STGCNN: A social spatio-temporal graph convolutional neural network for human trajectory prediction[J]. arXiv preprint arXiv:2002.11927, 2020.
- [8] Huang Y, Bi H K, Li Z, et al. STGAT: Modeling spatial-temporal interactions for human trajectory prediction[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6272-6281.
- [9] Gao J, Sun C, Zhao H, et al. VectorNet: Encoding HD maps and agent dynamics from vectorized representation[J]. arXiv preprint arXiv:2005.04259, 2020.
- [10] Zhu Y, Ren D, Fan M, et al. Robust trajectory forecasting for multiple intelligent agents in dynamic scene[J]. arXiv preprint arXiv:2005.13133, 2020.

## About the authors

- Yan Liang, algorithm engineer of Meituan Autonomous Vehicle Distribution Center.
- Jiahe, a graduate student at Zhejiang University and an intern at Meituan Autonomous Vehicle Distribution Center.
- Deheng, algorithm engineer of Meituan Autonomous Vehicle Distribution Center.
- Dong Chun, algorithm engineer of Meituan Autonomous Vehicle Distribution Center.
