As advancements in media and technology reshape our world, the scope of cultural heritage preservation now includes videos, from social media clips to comprehensive recordings (Caswell et al., 2017; Pietrobruno, 2013). These visual narratives are crucial for preserving contemporary culture, yet accessing and navigating them poses significant challenges (Edmondson, 2016). The volume and complexity of video archives necessitate innovative solutions for effective management and insight extraction.
Recent innovations in video annotation have introduced more intricate and automated processes, incorporating elements like color, entities, and visual composition (Evi-Colombo et al., 2020). These advancements enable deeper exploration beyond simple text searches. Additionally, video archive interfaces have transformed, moving from traditional search bars to more explorative visualization interfaces, such as clustering items based on visual similarity or color (Piet, 2023; Kenderdine et al., 2021; Masson, 2020; Ohm et al., 2023).
Nevertheless, the current trends in annotation and interface design, although promising, frequently hinge on techniques and methodologies originally devised for still images: visual similarity based on ImageNet features, color analysis, and dimensionality reduction and clustering with methods such as t-SNE or UMAP. This limitation prompts a critical need for approaches tailored specifically to the unique characteristics of video data.
This proposal addresses this gap by delving into both video annotation, the process of datafying videos, and the interface, which serves as the gateway to video archives.
Traditional video archive annotations have focused on metadata like titles, durations, keywords, and key individuals (Williams and Bell, 2021). Some approaches also consider features like average color and composition, but these primarily rely on still frames, overlooking the multimodal and temporal nature of videos. To address this, our work proposes a text-video representation approach, creating joint embeddings for text descriptions and video content. This method, using a dual-encoder mechanism, encodes both video and text into high-dimensional vectors in the same latent space, allowing for holistic annotation and searchability (Rohrbach, 2016).
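A minimal sketch of such a dual-encoder setup is given below; the module names, feature dimensions, and mean-pooling of frame features are illustrative assumptions, not the actual RTS pipeline.

```python
# Sketch of a dual encoder that projects text and video features into one latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, text_dim=768, video_dim=1024, joint_dim=512):
        super().__init__()
        # Projection heads mapping each modality into the shared latent space.
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.video_proj = nn.Linear(video_dim, joint_dim)

    def forward(self, text_feats, video_feats):
        # text_feats: (batch, text_dim) sentence embeddings
        # video_feats: (batch, frames, video_dim) per-frame features
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.video_proj(video_feats.mean(dim=1)), dim=-1)
        return t, v  # unit vectors in the same latent space

# The similarity between a description and a clip is then a simple dot product:
# similarity = (t * v).sum(-1)
```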
This approach, trained with objectives such as a bi-directional max-margin ranking loss and a symmetric cross-entropy loss, enables a single vector to encapsulate various aspects of a video, including color, movement, key individuals, conversations, and environmental sounds. Recognizing the limitations of current datasets like MSR-VTT, which offer only plain visual descriptions, our work aims to design a robust pipeline for generating detailed and holistic video descriptions. By leveraging state-of-the-art multimodal generative tools, we will generate and encode detailed descriptions for videos in the RTS archive, ensuring effective and nuanced representation.
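Both objectives can be sketched directly against the batch similarity matrix of paired text and video embeddings; the margin and temperature values below are illustrative defaults, not tuned settings from our experiments.

```python
# Sketches of the two training objectives, operating on the similarity matrix S
# where S[i, j] is the similarity between text i and video j in a batch.
import torch
import torch.nn.functional as F

def bidirectional_max_margin(S, margin=0.2):
    # Matched pairs sit on the diagonal and should outscore any mismatch by the margin.
    pos = S.diag().unsqueeze(1)                      # (batch, 1)
    cost_t2v = (margin + S - pos).clamp(min=0)       # text -> wrong video
    cost_v2t = (margin + S - pos.t()).clamp(min=0)   # video -> wrong text
    eye = torch.eye(S.size(0), device=S.device, dtype=torch.bool)
    return cost_t2v.masked_fill(eye, 0).mean() + cost_v2t.masked_fill(eye, 0).mean()

def symmetric_cross_entropy(S, temperature=0.07):
    # CLIP-style objective: each text should classify its own video and vice versa.
    logits = S / temperature
    labels = torch.arange(S.size(0), device=S.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```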
This part of our work presents minimum viable products that exemplify diverse trajectories for new interfaces, made possible by the holistic vector representation of videos. It is important to note that these examples aim to illustrate ways for the public to engage more effectively with the massive volume of archival data. The emphasis here is on exploration and discovery, fostering a deeper connection with the archival content, rather than merely searching for a specific video with precision.
Demo: This interface introduces a latent space explorer built on the newly obtained video embeddings. Using UMAP to reduce the embeddings to three dimensions, it enables public exploration of the videos in a 3D space.
Innovation: Unlike previous interfaces that relied on single features like visual similarity or color, this approach uses rich semantic descriptions including plot, characters, and conversations. This allows for a holistic understanding of the archives and serendipitous discoveries.
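A minimal sketch of the projection step behind this explorer, assuming the video embeddings are available as a NumPy array and using the umap-learn library; the file name and parameter values are placeholders.

```python
# Project high-dimensional video embeddings to 3D coordinates for the explorer.
import numpy as np
import umap

embeddings = np.load("video_embeddings.npy")   # (n_videos, joint_dim), placeholder path

reducer = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1, metric="cosine")
coords_3d = reducer.fit_transform(embeddings)  # (n_videos, 3) points for the 3D scene

# coords_3d can then be handed to a WebGL front end, where each point
# links back to its archival video.
```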
Demo: Leveraging the joint embeddings of video and text, this interface allows nuanced retrieval of video content. The same text encoder embeds arbitrary user input, and the system searches for the video clips that best match the provided descriptions. Users can share memorable moments or stories, which the interface segments into sentences and uses to query the archive. The result is an edited video comprising clips that align with the user's narrative.
Innovation: Traditional archive interfaces limit access through biases introduced by the interface, user preferences, and curatorial decisions. This approach lets users discover unexpected content that relates to their own experiences, enhancing public engagement with the archive and maximizing its cultural value.
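A minimal sketch of the story-to-montage loop described in this demo, assuming a callable encode_text exposed by the text side of the dual encoder and pre-computed, L2-normalized clip embeddings; the naive sentence splitting and all names are illustrative.

```python
# Turn a user story into an ordered list of best-matching clips.
import numpy as np

def retrieve_montage(story, clip_embeddings, clip_ids, encode_text):
    # Naive sentence segmentation; a production system would use a proper splitter.
    sentences = [s.strip() for s in story.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    montage = []
    for sentence in sentences:
        q = encode_text(sentence)        # (joint_dim,) unit vector for this sentence
        scores = clip_embeddings @ q     # cosine similarity to every clip in the archive
        best = int(np.argmax(scores))
        montage.append(clip_ids[best])   # best-matching clip for this sentence
    return montage                       # ordered clip list to be edited together
```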
In conclusion, this work aims to advance the practice of datafication in video archives from the perspectives of video annotation and access interfaces. Drawing from real-world archive content provided by our partner RTS, we present innovative examples that transcend traditional methods, offering a holistic and tailored approach to navigate and unlock the immense potential within video archives.