
DARai is a multimodal, multi-view dataset that captures human daily activities in home-like environments. It comprises 20 heterogeneous modalities for sensing humans and their surroundings, synchronized with visual data.
The Sensors

The Activities

Our data collection covered four primary scenes: living room, office, kitchen, and dining area. Activities within these settings include office tasks, entertainment, chores, exercise, rest, and cooking. Subjects performed diverse actions to enable cross-domain activity analysis in authentic home-like environments, transitioning naturally between daily tasks without imposed start commands or strict time constraints. Activities were categorized into control and counterfactual sets, with two recording sessions per environment to introduce variability in task execution.
The Annotations


Our hierarchical data structure has multiple levels of annotation, as follows (a hypothetical example of the hierarchy is sketched after the list):
- Level 1, Activities, represent high-level concepts that are generally recognized as independent tasks.
- Level 2, Actions, are recurring patterns found across multiple activities.
- Level 3, Procedures, distinguish between different instances of an action, addressing “how” and “what” questions about the actions.
- Level 4, Interactions, describe relationships involving one or two objects, typically connected and characterized by a verb; annotators provide a free-form language description for each interaction.
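As a rough sketch of how these levels nest, the structure below uses hypothetical Level 2-4 labels chosen only for illustration; the actual label vocabulary comes from the annotation files themselves.

# Hypothetical sketch of the four-level hierarchy for one sample.
# All label strings below are illustrative, not verbatim DARai labels.
annotation_example = {
    "level_1_activity": "Writing",                        # independent, high-level task
    "level_2_actions": [
        {
            "action": "Using a pen",                      # recurring pattern shared across activities
            "level_3_procedure": "Writing on a notepad",  # the "how"/"what" instance of the action
            "level_4_interactions": [
                "person picks up the pen from the desk",  # free-form description by annotators
            ],
        },
    ],
}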
Annotation Format: {Local Annotation Path}/{L1 Activity Folder}/{L1 Activity name}_Sxx_sessionxx_Level_x_Annotations, e.g., Writing_S11_session01_Level_2_Annotations
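A minimal path helper following this pattern might look like the sketch below; the annotation root directory is an assumption, and no file extension is added because the pattern above does not specify one.

import os

def annotation_file(root, activity, subject_id, session_id, level):
    # {root}/{activity}/{activity}_Sxx_sessionxx_Level_x_Annotations,
    # e.g. Writing_S11_session01_Level_2_Annotations
    name = f"{activity}_S{subject_id:02d}_session{session_id:02d}_Level_{level}_Annotations"
    return os.path.join(root, activity, name)

# annotation_file("/data/DARai/annotations", "Writing", 11, 1, 2)
# -> /data/DARai/annotations/Writing/Writing_S11_session01_Level_2_Annotations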
Findings
This experiment evaluates robustness across diverse camera views and examines the impact of different model architectures on performance stability.

Figure 1: Model performance comparison across different camera views, showing significant drops in ResNet under different views, while Swin-t and MViT-s maintain better robustness.
This highlights the challenges of cross-view recognition and the importance of model selection for robust performance. ResNet struggles with different views, whereas transformer-based models like Swin-tiny and MViT-s demonstrate better generalization, especially with depth cameras.
This experiment evaluates how granularity impacts recognition performance.

Figure 2: As task granularity increases from activity to procedure, different camera views capture varying levels of detail. One camera may have a clearer perspective on fine-grained changes, while another might miss them entirely.
These visualizations illustrate the impact of multi-modal data integration on recognition performance. The key focus is on understanding how different sensor modalities contribute to recognition accuracy and how combining them improves robustness.

Figure 3: The impact of multi-modal fusion on accuracy. The best performance is achieved when multiple modalities are combined, demonstrating the importance of sensor integration.
Multi-modal fusion significantly enhances accuracy, particularly when vision-based and wearable sensors are integrated.
These visualizations illustrate the impact of multi-modal data integration on recognition performance across different granularity levels.

Figure 4: Unimodal performance across different sensor modalities, showing varying effectiveness in activity recognition.

Figure 5: Effectiveness of different modalities across granularity levels (Activities → Actions → Procedures). Higher granularity requires more diverse and complementary modalities.
Each modality provides unique but limited information; vision-based data (RGB, Depth) perform well for activities, while wearable sensors contribute to finer-grained recognition.
Folder Structure and File Formats
For easier access, the folder structure is organized with data modalities as the top-level folders. Under each modality, there is one folder per Level 1 activity label.

Naming convention: The identifier for sample files follows the format {2-digit subject id}_{session id}.{file format}, e.g., 01_3.csv. Separate sample files are maintained for each view, activity, and data modality.
Because vision models need frames extracted from video as a preprocessing step, we share pre-extracted frames for faster processing within the community. For visual data, a 5-digit frame number is therefore appended to the sample identifier, e.g., 01_03_00000.jpg.
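For reference, the two naming patterns above can be parsed with a small helper like the sketch below; the regular expression simply encodes the formats described in this section.

import re

# Matches both "01_3.csv" (non-visual samples) and "01_03_00000.jpg" (extracted frames).
SAMPLE_NAME = re.compile(
    r"^(?P<subject>\d{2})_(?P<session>\d+)(?:_(?P<frame>\d{5}))?\.(?P<ext>\w+)$"
)

def parse_sample_name(filename):
    match = SAMPLE_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognized sample name: {filename}")
    return match.groupdict()

# parse_sample_name("01_3.csv")        -> {"subject": "01", "session": "3", "frame": None, "ext": "csv"}
# parse_sample_name("01_03_00000.jpg") -> {"subject": "01", "session": "03", "frame": "00000", "ext": "jpg"}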
Data Modality | RGB | Depth | IR  | Depth Confidence | Audio | Timeseries
File Format   | jpg | png   | png | png              | wav   | csv
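These formats map onto standard Python readers, as in the sketch below; the relative paths are hypothetical placeholders, and the exact nesting of modality, activity, and view folders should be taken from the local copy of the dataset.

from PIL import Image     # jpg/png: RGB, Depth, IR, and Depth Confidence frames
import soundfile as sf    # wav: audio recordings
import pandas as pd       # csv: timeseries sensors

# The paths below are hypothetical placeholders, not the dataset's verbatim layout.
rgb_frame = Image.open("RGB/Writing/01_03_00000.jpg")
audio, sample_rate = sf.read("Audio/Writing/01_3.wav")
timeseries = pd.read_csv("Timeseries/Writing/01_3.csv")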
Ethics and Data Collection Process
We obtained approval from an Institutional Review Board (IRB) to conduct this study and collect data from human subjects.
Our recording setup includes two separate environments designed to mimic home-like living room, home office, kitchen, and dining spaces, with six varying environmental conditions (light, time of day, air conditioning, and background noise).
Each pair of recording sessions is divided into a control session and a counterfactual session, in which subjects perform the same activity with enforced variations, such as moving a heavy box instead of a light box or playing a speed-test game versus a reaction-test game.
Citation and Usage
BibTeX
@data{ecnr-hy49-24,
  doi       = {10.21227/ecnr-hy49},
  url       = {https://dx.doi.org/10.21227/ecnr-hy49},
  author    = {Kaviani, Ghazal and Yarici, Yavuz and Prabhushankar, Mohit and AlRegib, Ghassan and Solh, Mashhour and Patil, Ameya},
  publisher = {IEEE Dataport},
  title     = {DARai: Daily Activity Recordings for AI and ML applications},
  year      = {2024}
}
Contact Us
Acknowledgments
This work is supported by Amazon Lab126.