Evaluation Metrics

  • F1 Score (Code, threshold=0.5)
  • Running time (Code; please limit peak GPU memory to 10 GB and RAM to 28 GB)

The testing images will be evaluated one by one. To compensate for the Docker container startup time, we allow a time tolerance on the running time. Specifically, the time tolerance is 10 s if the image size (height H × width W, in pixels) is no more than 1,000,000. If the image size is more than 1,000,000, the time tolerance is (H × W) / 1,000,000 × 10 s.


In other words, if the Docker container's elapsed time is within the time tolerance for every image, you will get a perfect running-time score. Here is the time tolerance for the validation images.
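The tolerance rule above can be sketched in a few lines of Python. This is only an illustration of the stated formula; the function name `time_tolerance_seconds` is hypothetical and is not part of the official evaluation code.

```python
def time_tolerance_seconds(height: int, width: int) -> float:
    """Return the time tolerance (in seconds) for one image.

    Hypothetical helper that mirrors the rule in the text:
    10 s for images of at most 1,000,000 pixels, otherwise
    (H x W) / 1,000,000 x 10 s.
    """
    pixels = height * width
    if pixels <= 1_000_000:
        return 10.0
    return pixels / 1_000_000 * 10.0


if __name__ == "__main__":
    # A 1000 x 1000 image sits exactly on the boundary: 10 s.
    print(time_tolerance_seconds(1000, 1000))
    # A 2000 x 1000 image has twice as many pixels: 20 s.
    print(time_tolerance_seconds(2000, 1000))
```

Under this rule the tolerance never drops below 10 s and grows linearly with the pixel count for larger images.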

2023.08.13 update: We also present the F1 scores at other thresholds (0.6, 0.7, 0.8, 0.9) on the leaderboard. 
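As a rough illustration of how one F1 score per threshold can be obtained, the sketch below assumes that the threshold is an IoU cutoff for counting a predicted object as a true positive, and that predictions have already been matched one-to-one with ground-truth objects. Both points are assumptions made here for illustration; the official evaluation code linked above is authoritative. The function name and its arguments are hypothetical.

```python
from typing import Dict, Sequence


def f1_at_thresholds(
    matched_ious: Sequence[float],
    n_pred: int,
    n_true: int,
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[float, float]:
    """Compute F1 at several thresholds (assumed to be IoU cutoffs).

    matched_ious: IoU of each one-to-one matched (prediction, ground truth) pair.
    n_pred:       total number of predicted objects.
    n_true:       total number of ground-truth objects.
    """
    scores = {}
    for thr in thresholds:
        # A matched pair counts as a true positive if its IoU reaches the cutoff
        # (using >= rather than > is an assumption of this sketch).
        tp = sum(1 for iou in matched_ious if iou >= thr)
        fp = n_pred - tp  # predictions without a good enough match
        fn = n_true - tp  # ground-truth objects that were missed
        denom = 2 * tp + fp + fn
        # Returning 0.0 when there are no objects at all is a convention chosen here.
        scores[thr] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    for thr, f1 in f1_at_thresholds([0.95, 0.75, 0.55, 0.3], n_pred=5, n_true=4).items():
        print(f"threshold={thr}: F1={f1:.3f}")
```

Because a stricter threshold can only turn true positives into misses, the F1 score in this sketch never increases as the threshold goes up.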

Make Testing Docker Submissions

To avoid overfitting to the testing set, each team is allowed only one successful submission on it. The submission should include a Docker container (teamname.tar.gz), a sanity test video (demo), and a methodology paper.


Evaluation Platform

The submitted Docker containers will be evaluated on an Ubuntu 20.04 desktop. Detailed information is listed as follows:

  • CPU: Intel® Xeon® W-2133 CPU @ 3.60GHz × 12
  • GPU: NVIDIA 2080Ti (10 GB memory available)
  • RAM: 32 GB (28 GB available)
  • Driver Version: 510.60.02
  • CUDA Version: 11.6
  • Docker version 20.10.13

Ranking Scheme

Both F1 score and running time are used in the ranking scheme. However, the two metrics cannot be combined directly because they are measured in different units. Thus, we use a "rank-then-aggregate" scheme for ranking, which consists of the following three steps:

  • Step 1. Computing the two metrics for each testing case and each team;
  • Step 2. Ranking teams on each metric for each of the N testing cases, so that each team obtains N × 2 rankings;
  • Step 3. Computing ranking scores for all teams by averaging all these rankings and then normalizing them by the number of teams.
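The three steps above can be sketched as follows. This is a minimal illustration, not the organizers' implementation: the function names are hypothetical, ties are assumed to share the mean of their positions, and a higher F1 score and a lower running time are assumed to rank better. How the time tolerance enters the running-time ranking is not modeled here.

```python
from typing import Dict, List


def average_ranks(values: List[float], higher_is_better: bool) -> List[float]:
    """Return 1-based ranks (1 = best); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def ranking_scores(
    f1: Dict[str, List[float]], runtime: Dict[str, List[float]]
) -> Dict[str, float]:
    """Step 1 input: per-team lists of per-case F1 scores and running times."""
    teams = list(f1)
    n_cases = len(f1[teams[0]])
    totals = {team: 0.0 for team in teams}
    for case in range(n_cases):
        # Step 2: rank teams on each metric for this case (N x 2 rankings per team).
        f1_ranks = average_ranks([f1[t][case] for t in teams], higher_is_better=True)
        time_ranks = average_ranks([runtime[t][case] for t in teams], higher_is_better=False)
        for team, rank_f1, rank_time in zip(teams, f1_ranks, time_ranks):
            totals[team] += rank_f1 + rank_time
    # Step 3: average the N x 2 rankings, then normalize by the number of teams.
    return {team: totals[team] / (2 * n_cases) / len(teams) for team in teams}


if __name__ == "__main__":
    f1 = {"A": [0.9, 0.8], "B": [0.7, 0.85], "C": [0.8, 0.6]}
    runtime = {"A": [5.0, 12.0], "B": [6.0, 8.0], "C": [5.0, 9.0]}
    for team, score in ranking_scores(f1, runtime).items():
        print(team, round(score, 4))
```

In this sketch a lower ranking score is better, since rank 1 denotes the best team on a given metric and case.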