
Dataset Significance
In this paper, we introduce the FOCAL (Ford-OLIVES Collaboration on Active Learning) dataset, which enables the study of annotation cost within a video active learning setting. Annotation cost refers to the time it takes an annotator to label and quality-assure a given video sequence. A practical motivation for active learning research is to minimize annotation cost by selectively labeling informative samples that maximize performance under a given budget constraint. However, previous work in video active learning lacks real annotation-time labels for accurately assessing cost minimization and instead assumes that annotation cost scales linearly with the amount of data to annotate. This assumption ignores a variety of real-world confounding factors that make cost nonlinear, such as the effect of an assistive labeling tool and the complexity of interactions within a scene: occluded objects, weather, and object motion. FOCAL addresses this discrepancy by providing real annotation-cost labels for 126 video sequences across 69 unique city scenes with a variety of weather, lighting, and seasonal conditions. These videos capture a wide range of interactions of interest to both the infrastructure-assisted autonomy and autonomous vehicle communities. Through a statistical analysis of the FOCAL dataset, we show that cost correlates with a variety of factors beyond the length of a video sequence alone. We also introduce a set of conformal active learning algorithms that take advantage of the sequential structure of video data to achieve a better trade-off between annotation cost and performance while also reducing floating point operations (FLOPs) overhead by at least 77.67%. Through a sequence selection framework, we show how these approaches better reflect how video annotation is done in practice. We further demonstrate the advantage of these approaches by introducing two performance-cost metrics and showing that the best conformal active learning method is cheaper than the best traditional active learning method by 113 hours.
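
As a toy illustration of the budget-constrained, sequence-level selection setting described above, the sketch below greedily picks whole sequences under a labeling-hour budget. The greedy policy and the `score`/`label_hours` field names are hypothetical placeholders for illustration, not the paper's conformal method:

```python
# Minimal sketch of budget-constrained sequence selection, assuming each
# candidate sequence carries an informativeness score from some query
# strategy and a real annotation-cost label in hours (as FOCAL provides).
# The greedy policy and field names are illustrative, not the paper's method.

def select_sequences(candidates, budget_hours):
    """Greedily pick the highest-scoring sequences that fit the budget."""
    chosen, spent = [], 0.0
    for seq in sorted(candidates, key=lambda s: s["score"], reverse=True):
        if spent + seq["label_hours"] <= budget_hours:
            chosen.append(seq["id"])
            spent += seq["label_hours"]
    return chosen, spent

# Example usage with made-up numbers.
pool = [
    {"id": "seq_001", "score": 0.9, "label_hours": 12.5},
    {"id": "seq_002", "score": 0.7, "label_hours": 4.0},
    {"id": "seq_003", "score": 0.4, "label_hours": 2.5},
]
print(select_sequences(pool, budget_hours=8.0))  # -> (['seq_002', 'seq_003'], 6.5)
```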
Novel Analyses on Dataset

From this dataset, we can analyze active learning algorithms from the perspective of both cost and performance. In the figure above, we show the total labeling time in hours for a wide variety of active learning algorithms.
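
One simple way to compare algorithms on both axes is to trace cumulative labeling hours against model performance over active learning rounds. The sketch below computes such a trade-off curve and a trapezoidal area summary; the per-round numbers are made up, and the area summary is an assumption for illustration rather than one of the paper's two performance-cost metrics:

```python
import numpy as np

def cost_performance_tradeoff(round_costs, round_perf):
    """Cumulative labeling hours vs. model performance across AL rounds.

    round_costs: hours spent labeling the sequences queried in each round
    round_perf:  performance (e.g., mAP) measured after each round
    Returns the curve plus its trapezoidal area as a scalar summary.
    """
    cum_cost = np.cumsum(round_costs)
    perf = np.asarray(round_perf, dtype=float)
    area = float(np.sum(np.diff(cum_cost) * (perf[1:] + perf[:-1]) / 2.0))
    return cum_cost, perf, area

# Hypothetical per-round logs for one algorithm.
curve_x, curve_y, area = cost_performance_tradeoff([40, 35, 50], [0.31, 0.38, 0.42])
```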
Dataset Labeling Details



Within this section, we show specific details surrounding the labeling process. After the data is entered into the online annotation platform, the annotator uses the setup shown in Figure 1 to assign bounding boxes to the objects in the scene and to interpolate these objects across the sequence. The annotation tool also includes a system that tracks the amount of time each user spends working on a sequence; Figure 2 shows an example for a specific sequence. The x-axis shows the dates on which work on the sequence was performed, and the y-axis indicates the cumulative number of hours across all users for that specific sequence. This work is divided into annotation and quality-assurance time, and the sum of hours worked across all dates is the final labeling time for the sequence. This can be further understood through the data files available in the dataset. Figure 3 shows how every sequence comes with its own identifier as well as information regarding frames, the number of bounding boxes, a unique scene identifier, and the cumulative number of days and hours it took to label the sequence. Participants who annotated the data and performed quality assurance were compensated appropriately for their services.
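
To make the structure of these data files concrete, the sketch below loads per-sequence metadata of the kind Figure 3 describes and derives a couple of simple cost statistics. The filename and column names are assumptions for illustration; the authoritative schema ships with the dataset itself:

```python
import pandas as pd

# Hypothetical per-sequence metadata file mirroring the fields described above.
# The filename and column names are assumptions; check the dataset's own files.
meta = pd.read_csv("focal_sequence_metadata.csv")

# Expected per-sequence fields (hypothetical names): sequence_id, num_frames,
# num_boxes, scene_id, labeling_days, labeling_hours.
hours_per_box = meta["labeling_hours"] / meta["num_boxes"]
print(meta.groupby("scene_id")["labeling_hours"].sum().describe())
print("mean hours per bounding box:", hours_per_box.mean())
```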
Dataset Statistics Details

- Frame Quantity Distribution: We show the distribution of frame counts for all sequences in Figure 4. The majority of sequences fall within 594 to 864 frames, indicating that the sequences contain a similar number of frames. This avoids biasing queries toward sequence length as a major indicator. Ideally, active learning algorithms should exploit the inherent information in, and labeling difficulty of, the data to achieve high generalization and low annotation cost, rather than querying according to variation in sequence length.
- Scene Diversity: In order to obtain training and test sequences in non-overlapping scenes, we manually group all sequences into 69 unique scenes according to their geographic locations and assign each scene a unique identifier; sequences collected at the same scene share the same scene identifier (see the split sketch after this list). Scene identifier statistics are shown in Figure 5. In addition to location diversity, sequences in FOCAL also span varied environmental conditions. For instance, the sequences were collected across multiple seasons, including winter, spring, and summer, as shown in Figure 6. Overall, 41%, 8%, and 51% of sequences were collected in winter, spring, and summer, respectively. This scene variation provides diverse data conditions for evaluating active learning algorithms.
- Object Diversity: In addition to the variation in scenes, sequences in FOCAL also contain object diversity. Figure 7 illustrates the average number of objects per frame across all sequences. Due to the variation in locations and environmental conditions, the number of object instances varies across sequences. This variation in object density benefits the evaluation of active learning algorithms.
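
Because sequences from the same location share a scene identifier, a non-overlapping train/test split can be made at the scene level rather than the sequence level. Below is a minimal sketch with scikit-learn, assuming hypothetical `sequence_id` and `scene_id` fields:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical metadata: one row per sequence, grouped by scene identifier.
meta = pd.DataFrame({
    "sequence_id": range(10),
    "scene_id":    [0, 0, 1, 1, 2, 2, 3, 3, 4, 4],
})

# Hold out whole scenes so that train and test never share a location.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(meta, groups=meta["scene_id"]))
train_ids = meta.iloc[train_idx]["sequence_id"].tolist()
test_ids = meta.iloc[test_idx]["sequence_id"].tolist()
assert set(meta.iloc[train_idx]["scene_id"]).isdisjoint(meta.iloc[test_idx]["scene_id"])
```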



Smart City Van
- Motivation: The smart infrastructure node (smart IX node) was developed to provide a stationary sensing platform that can be deployed in environments where smart vehicles can use additional information from an infrastructure perspective to aid in navigating the world around them. When we started to develop algorithms to run on these nodes, we lacked much of the standard data widely available for autonomous vehicles, since this was a new area of work. Namely, we needed annotated data representative of an infrastructure perspective, the ability to test and experiment with new sensors at a multitude of public locations, and the ability to experiment with and examine the effects of sensor placement in real time.
- Approach: Since the cost to install these sensors is high, it would have been impossible to install IX nodes in enough locations to cover all possible weather conditions, times of day, and types of objects that could be seen from an IX node. For autonomous vehicles, this is relatively easy because the vehicle can be driven through any environment, weather condition, or location to collect representative training data. For the IX, this is not as easy, since the platform is stationary. Furthermore, collecting large amounts of data from a limited set of IX nodes with fixed points of view could cause models to overfit to the stationary context (e.g., background) captured from these few nodes. We needed something more mobile so that the data could cover various viewpoints, orientations, and locations. To achieve a more mobile data collection process, we could have used a sensor mount similar to the one on the IX and placed it on a tripod that could be moved and set up at any location. This was feasible, but it would not have been easy to move around, set up, power, and collect data from at different locations. The mobile van solution provided the best of both worlds: the same sensor setup as the IX nodes (with the option for testing other sensors as well) along with the mobility of a tripod that is easier to use and collect data from.
- Van Details: The smart city van was custom built with our internal transit team. The sensors are mounted outside on a telescoping mast with pan and tilt controls, providing 360 degrees of horizontal and vertical rotation. The mast extends from 6 ft to 20 ft; combined with the vehicle's height, that provides a sensor height range of 16 ft up to 30 ft. There is also a roof-mounted AC unit used to cool the interior space as well as the onboard electronics. To power all of this, there is a 7.5 kW diesel generator on board (sharing the vehicle's fuel tank) to provide ample power no matter where the vehicle goes, with the option to draw power from a standard 120 V, 15 A or 30 A outlet when testing in a garage. All of these are detailed in Figure 8. The interior of the shuttle, visualized in Figure 9, provides controls for raising, lowering, and adjusting the mast from within the vehicle as well as two workstations for monitoring and collecting data, all without getting out of the van. The workstations have a 4G modem/router providing wired and Wi-Fi connectivity, and there is a server rack for easily connecting and removing hard disks from the onboard computer. Furthermore, an alarm warns if the vehicle is shifted out of park while the mast is still raised, and an electronics panel provides controls for the AC system, DC/battery system, and generator. On top of the mast is the sensor setup with the camera system used for data collection, shown in Figure 10.



Access Code and Dataset
Code: https://github.com/olivesgatech/FOCAL_Dataset
Dataset: https://zenodo.org/records/10391061