Hazel Doughty
@hazeldoughty.bsky.social
260 followers 130 following 22 posts
Assistant Professor at Leiden University, NL. Computer Vision, Video Understanding. https://hazeldoughty.github.io
Reposted by Hazel Doughty
Tomorrow, I’ll give a talk about future predictions in egocentric vision at the #CVPR2025 precognition workshop, in room 107A at 4pm.

I’ll retrace some history and show how precognition enables assistive downstream tasks and representation learning for procedural understanding.
Excited to be giving a keynote at the #CVPR2025 Workshop on Interactive Video Search and Exploration (IViSE) tomorrow. I'll be sharing our work towards detailed video understanding.
📅 09:45 Thursday 12th June
📍 208 A
👉 sites.google.com/view/ivise2025
ivise-workshop.github.io
Reposted by Hazel Doughty
Have you heard about HD-EPIC?
Attending #CVPR2025
Multiple opportunities to learn about the most highly-detailed video dataset with a digital twin, long-term object tracks, VQA,…
hd-epic.github.io

1. Find any of the 10 authors attending @cvprconference.bsky.social
– identified by this badge.

🧵
Reposted by Hazel Doughty
Do you want to prove your Video-Language Model understands fine-grained detail, long videos and the 3D world, or anticipates interactions?
Be the 🥇st to win the HD-EPIC VQA challenge
hd-epic.github.io/index#vqa-be...
Deadline: 19 May
Winners announced at the @cvprconference.bsky.social #EgoVis workshop
HD-EPIC: A Highly-Detailed Egocentric Video Dataset
hd-epic.github.io
Reposted by Hazel Doughty
Object masks & tracks for HD-EPIC have been released. This completes our highly-detailed annotations.

Also, the HD-EPIC VQA challenge is open [leaderboard closes 19 May]... can you be its first winner?
codalab.lisn.upsaclay.fr/competitions...

Btw, HD-EPIC was accepted at @cvprconference.bsky.social #CVPR2025
🛑📢
HD-EPIC: A Highly-Detailed Egocentric Video Dataset
hd-epic.github.io
arxiv.org/abs/2502.04144
Newly collected videos
263 annotations/min: recipe, nutrition, actions, sounds, 3D object movement & fixture associations, masks.
26K VQA benchmark to challenge current VLMs
1/N
The HD-EPIC VQA challenge for CVPR 2025 is now live: codalab.lisn.upsaclay.fr/competitions...

See how your model stacks up against Gemini and LLaVA Video on a wide range of video understanding tasks.
Reposted by Hazel Doughty
#CVPR2025 PRO TIP: To get a discount on your registration, join the Computer Vision Foundation (CVF). It’s FREE and makes @wjscheirer smile 😉

CVF: thecvf.com
Reposted by Hazel Doughty
HD-EPIC - hd-epic.github.io
Egocentric videos 👩‍🍳 with very rich annotations: the perfect testbed for many egocentric vision tasks 👌
This was a monumental effort from a large team across Bristol, Leiden, Singapore and Bath.

The VQA benchmark only scratches the surface of what is possible to evaluate with this level of annotation detail.

Check out the website if you want to know more: hd-epic.github.io
VQA Benchmark

Our benchmark tests understanding in recipes, ingredients, nutrition, fine-grained actions, 3D perception, object movement and gaze. Current models have a long way to go, with a best performance of 38% vs. a 90% human baseline.
Scene & Object Movements

We reconstruct participants' kitchens and annotate every time an object is moved.
Fine-grained Actions

Every action has a dense description not only describing what happens in detail, but also how and why it happens.
As well as annotating temporal segments corresponding to each step, we also annotate all the preparation needed to complete each step.
Recipe & Nutrition

We collect details of all the recipes participants chose to prepare over 3 days in their own kitchens, alongside ingredient weights and nutrition information.
📢 Today we're releasing a new highly detailed dataset for video understanding: HD-EPIC

arxiv.org/abs/2502.04144

hd-epic.github.io

What makes the dataset unique is the vast detail of its annotations: 263 annotations per minute over 41 hours of video.
We propose a simple baseline using phrase-level negatives and visual prompting to balance coarse- and fine-grained performance. This can easily be combined with existing approaches. However, there is much potential for future work.
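For illustration only, here is a minimal sketch of how phrase-level negative captions might be folded into a standard retrieval objective (the loss form, margin, temperature and tensor names are assumptions, not the paper's exact method; visual prompting is not shown):

# Illustrative sketch: standard cross-video contrastive loss plus a margin term
# that pushes each video's true caption above a phrase-level negative caption
# (same video, one phrase altered).
import torch
import torch.nn.functional as F

def retrieval_loss_with_phrase_negatives(video_emb, caption_emb, neg_caption_emb,
                                          temperature=0.07, margin=0.2):
    # video_emb:       (B, D) video embeddings
    # caption_emb:     (B, D) matching caption embeddings
    # neg_caption_emb: (B, D) phrase-level negative caption embeddings
    video_emb = F.normalize(video_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)
    neg_caption_emb = F.normalize(neg_caption_emb, dim=-1)

    # Coarse-grained InfoNCE over the batch (other videos' captions as negatives).
    logits = video_emb @ caption_emb.t() / temperature          # (B, B)
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    coarse = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Fine-grained term: the true caption must beat its phrase-level negative by a margin.
    pos_sim = (video_emb * caption_emb).sum(-1)                  # (B,)
    neg_sim = (video_emb * neg_caption_emb).sum(-1)              # (B,)
    fine = F.relu(margin - pos_sim + neg_sim).mean()

    return coarse + fine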
Incorporating fine-grained negatives into training does improve fine-grained performance; however, it comes at the cost of coarse-grained performance.
We use this evaluation to investigate current models and find they lack fine-grained understanding, particularly for adverbs and prepositions.

We also see that good coarse-grained performance does not necessarily indicate good fine-grained performance.
We propose a new fine-grained evaluation approach which analyses a model's sensitivity to individual word variations in different parts of speech.

Our approach automatically creates new fine-grained negative captions and can be applied to any existing dataset.
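To make the idea concrete, here is a minimal sketch of how single-word negative captions could be generated with off-the-shelf POS tagging (spaCy and the substitution pools below are illustrative assumptions, not the actual pipeline):

# Rough sketch: find one word with the target part of speech and swap it for a
# different word of the same type to build a fine-grained negative caption.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical substitution pools per part of speech.
SWAPS = {
    "VERB": ["open", "close", "lift", "pour", "cut"],
    "ADV":  ["quickly", "slowly", "carefully", "firmly"],
    "ADP":  ["into", "onto", "under", "behind"],   # prepositions
}

def make_negative(caption, pos="VERB"):
    # Return a caption identical to the input except for one word of the given POS.
    doc = nlp(caption)
    for token in doc:
        if token.pos_ == pos:
            for candidate in SWAPS.get(pos, []):
                if candidate != token.lemma_ and candidate != token.text.lower():
                    # Rebuild the caption with only this single word changed.
                    return "".join(
                        candidate + t.whitespace_ if t.i == token.i else t.text_with_ws
                        for t in doc
                    )
    return None  # no word of this POS found / no valid swap

print(make_negative("the person slowly pours water into the pan", pos="ADP"))
# -> "the person slowly pours water onto the pan"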
Current video-text retrieval benchmarks target coarse-grained differences, as they focus on distinguishing the correct caption from the captions of other, often irrelevant, videos.

Captions thus rarely differ by a single word or concept.
Our second #ACCV2024 oral: "Beyond Coarse-Grained Matching in Video-Text Retrieval" is also being presented today.

ArXiv: arxiv.org/abs/2410.12407

We go beyond coarse-grained retrieval and explore whether models can discern subtle single-word differences in captions.
Training a model on these video-text pairs results in a representation that is beneficial to motion-focused downstream tasks, particularly when little data is available for finetuning.
Since we know how our synthetic motions have been generated, we can also generate captions to describe them using pre-defined phrases. We then diversify the vocabulary and structure of our descriptions with our verb-variation paraphrasing.
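As a toy illustration of this caption-generation step (the templates, motion parameters and verb pool below are made-up examples, not the actual pipeline):

# Sketch: compose a caption from pre-defined phrases for a known synthetic
# motion, then diversify it by swapping the verb for a variation.
import random

# Hypothetical verb variations used for paraphrasing.
VERB_VARIATIONS = {
    "moves": ["moves", "travels", "shifts", "glides", "slides"],
}

def describe_motion(direction, speed, rng=random):
    # Fill a fixed phrase template, then paraphrase the verb.
    template = "the object {verb} {direction} {speed}"
    verb = rng.choice(VERB_VARIATIONS["moves"])
    return template.format(verb=verb,
                           direction=f"to the {direction}",
                           speed=f"at a {speed} pace")

# Example: the same synthetic motion described with varied vocabulary.
rng = random.Random(0)
for _ in range(3):
    print(describe_motion("left", "slow", rng))
# e.g. "the object glides to the left at a slow pace"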