From Videos to Vectors: Holistic Encoding and New Interface for Exploration and Discovery

1. Introduction

As advancements in media and technology reshape our world, the scope of cultural heritage preservation now includes videos, from social media clips to comprehensive recordings (Caswell et al., 2017; Pietrobruno, 2013). These visual narratives are crucial for preserving contemporary culture, yet accessing and navigating them poses significant challenges (Edmondson, 2016). The volume and complexity of video archives necessitate innovative solutions for effective management and insight extraction.

Recent innovations in video annotation have introduced more intricate and automated processes, incorporating elements like color, entities, and visual composition (Evi-Colombo et al., 2020). These advancements enable deeper exploration beyond simple text searches. Additionally, video archive interfaces have transformed, moving from traditional search bars to more explorative visualization interfaces, such as clustering items based on visual similarity or color (Piet, 2023; Kenderdine et al., 2021; Masson, 2020; Ohm et al., 2023).

Nevertheless, the current trends in annotation and interface design, although promising, frequently hinge on techniques and methodologies originally devised for still images: visual similarity based on ImageNet features, color analysis, and dimensionality reduction and clustering with t-SNE or UMAP. This limitation prompts a critical need for approaches tailored specifically to the unique characteristics of video data.

This proposal addresses this gap by delving into both video annotation, the process of datafying videos, and the interface, which serves as the gateway to video archives.

2. Encoding Videos as Videos

Traditional video archive annotations have focused on metadata like titles, durations, keywords, and key individuals (Williams and Bell, 2021). Some approaches also consider features like average color and composition, but these primarily rely on still frames, overlooking the multimodal and temporal nature of videos. To address this, our work proposes a text-video representation approach, creating joint embeddings for text descriptions and video content. This method, using a dual-encoder mechanism, encodes both video and text into high-dimensional vectors in the same latent space, allowing for holistic annotation and searchability (Rohrbach, 2016).
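A minimal sketch of this dual-encoder idea in Python (PyTorch), assuming pretrained backbones that yield pooled video and text feature vectors; the dimensions and linear projection heads below are illustrative assumptions, not the proposal's actual architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DualEncoder(nn.Module):
        """Projects pooled video and text features into one shared latent space."""

        def __init__(self, video_dim=1024, text_dim=768, embed_dim=512):
            super().__init__()
            # Projection heads on top of (frozen or fine-tuned) pretrained backbones.
            self.video_proj = nn.Linear(video_dim, embed_dim)
            self.text_proj = nn.Linear(text_dim, embed_dim)

        def forward(self, video_feats, text_feats):
            # L2-normalize so cosine similarity reduces to a plain dot product.
            v = F.normalize(self.video_proj(video_feats), dim=-1)
            t = F.normalize(self.text_proj(text_feats), dim=-1)
            return v, t

After training, the dot product between a video vector and a text vector measures their semantic agreement, which is what makes the same vectors usable for both annotation and search.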

This approach, utilizing techniques such as a bi-directional max-margin ranking loss (Karpathy et al., 2014) and a symmetric cross-entropy loss (Sohn, 2016), enables a single vector to encapsulate various aspects of a video, including color, movement, key individuals, conversations, and environmental sounds. Recognizing the limitations of current datasets like MSR-VTT, which offer only plain visual descriptions (Xu et al., 2016), our work aims to design a robust pipeline for generating detailed and holistic video descriptions. By leveraging state-of-the-art multimodal generative tools (Seo et al., 2022; Liu et al., 2023), we aim to generate and encode detailed descriptions for videos in the RTS archive, ensuring effective and nuanced representation.
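The two training objectives named above can be sketched as follows, operating on a batch similarity matrix sim in which sim[i, j] compares video i with description j and the diagonal holds the matched pairs; the margin and temperature values are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def max_margin_loss(sim, margin=0.2):
        """Bi-directional max-margin ranking loss over a batch similarity matrix."""
        pos = sim.diag().unsqueeze(1)                   # (B, 1) matched-pair scores
        cost_t = (margin + sim - pos).clamp(min=0)      # rank texts for each video
        cost_v = (margin + sim - pos.t()).clamp(min=0)  # rank videos for each text
        eye = torch.eye(sim.size(0), device=sim.device, dtype=torch.bool)
        # Matched pairs on the diagonal incur no cost.
        return (cost_t.masked_fill(eye, 0) + cost_v.masked_fill(eye, 0)).mean()

    def symmetric_ce_loss(sim, temperature=0.05):
        """Symmetric cross-entropy: each video must identify its own description
        among the batch, and each description its own video."""
        logits = sim / temperature
        labels = torch.arange(sim.size(0), device=sim.device)
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.t(), labels))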

3. Using Vectors for Innovative Interfaces

This part of our work presents minimum viable products that exemplify diverse trajectories for new interfaces, made possible by a holistic vector representation of videos. The examples illustrate ways for the public to engage more effectively with the massive volume of archival data. The emphasis is on exploration and discovery, fostering a deeper connection with the archival content, rather than merely searching for a specific video with precision.

3.1. Semantic Latent Space Explorer

Demo: This interface is a latent space explorer built on the newly acquired vectors. UMAP reduces the vectors to three dimensions, enabling the public to explore the videos in a 3D space.
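A sketch of the reduction step using the umap-learn package; the embedding file and its dimensions are hypothetical stand-ins for the archive's exported vectors:

    import numpy as np
    import umap  # pip install umap-learn

    # Hypothetical export: one row per archive video, from the dual encoder.
    video_vectors = np.load("rts_video_embeddings.npy")  # shape (n_videos, 512)

    reducer = umap.UMAP(
        n_components=3,   # three output dimensions for the 3D explorer
        metric="cosine",  # matches the similarity used in the latent space
    )
    coords_3d = reducer.fit_transform(video_vectors)     # shape (n_videos, 3)
    # coords_3d can then be handed to a WebGL front end for navigation.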

Innovation: Unlike previous interfaces that relied on single features like visual similarity or color, this approach uses rich semantic descriptions including plot, characters, and conversations. This allows for a holistic understanding of the archives and serendipitous discoveries.

3.2. Your Story Elsewhere

Demo: Leveraging the joint embeddings of video and text, this interface allows nuanced retrieval of video content. Using the same encoder on arbitrary text input, the system searches for the video clips that best match user-provided descriptions. Users can share memorable moments or stories, which the interface segments into sentences and uses to query the archive. The result is an edited video comprising clips that align with the user's narrative.
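A sketch of the retrieval loop, assuming precomputed unit-norm clip embeddings and the trained text encoder passed in as a callable; the period-based sentence splitting is a deliberate simplification:

    import numpy as np

    def story_to_clips(story, clip_vectors, clip_ids, embed_text):
        """Match each sentence of a user's story to its closest archive clip.

        clip_vectors: (n_clips, d) array of unit-norm clip embeddings.
        embed_text:   callable mapping a sentence to a unit-norm query vector.
        """
        # Naive segmentation; a production pipeline would use a proper
        # sentence splitter (e.g. spaCy).
        sentences = [s.strip() for s in story.split(".") if s.strip()]
        selected = []
        for sentence in sentences:
            query = embed_text(sentence)       # (d,) query vector
            scores = clip_vectors @ query      # cosine similarities to all clips
            selected.append(clip_ids[int(scores.argmax())])
        return selected  # clip IDs to concatenate into the final edit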

Innovation: Traditional archive interfaces limit access through biases from the interface itself, user preferences, and curatorial decisions. This approach instead lets users stumble upon unexpected content that relates to their own experiences, enhancing public engagement with the archive and maximizing its cultural value.

4. Conclusion

In conclusion, this work aims to advance the practice of datafication in video archives from the perspectives of video annotation and access interfaces. Drawing from real-world archive content provided by our partner RTS, we present innovative examples that transcend traditional methods, offering a holistic and tailored approach to navigate and unlock the immense potential within video archives.

Appendix A

Bibliography
  1. Caswell, M., Harter, C. and Jules, B., 2017. Diversifying the digital historical record: integrating community archives in national strategies for access to digital cultural heritage. D-Lib Magazine, 23(5/6), pp.1-7.
  2. Pietrobruno, S., 2013. YouTube and the social archiving of intangible heritage. New Media & Society, 15(8), pp.1259-1276.
  3. Evi-Colombo, A., Cattaneo, A. and Bétrancourt, M., 2020. Technical and pedagogical affordances of video annotation: A literature review. Journal of Educational Multimedia and Hypermedia, 29(3), pp.193-226.
  4. Piet, N., 2023. Beyond Search. Netherlands Institute for Sound & Vision.
  5. Kenderdine, S., Mason, I. and Hibberd, L., 2021. Computational archives for experimental museology. pp.3-18. Springer.
  6. Masson, E., Olesen, C.G., van Noord, N. and Fossati, G., 2020. Exploring digitised moving image collections: The SEMIA project, visual analysis and the turn to abstraction. DHQ: Digital Humanities Quarterly, 14(4).
  7. Ohm, T., Canet Sola, M., Karjus, A. and Schich, M., 2023. Collection Space Navigator: An interactive visualization interface for multidimensional datasets. In Proceedings of the 16th International Symposium on Visual Information Communication and Interaction (pp. 1-5).
  8. Williams, M. and Bell, J., 2021. The Media Ecology Project: Collaborative DH synergies to produce new research in visual culture history. DHQ: Digital Humanities Quarterly, 15(1).
  9. Karpathy, A., Joulin, A. and Fei-Fei, L., 2014. Deep fragment embeddings for bidirectional image sentence mapping. arXiv preprint. https://doi.org/10.48550/ARXIV.1406.5679
  10. Sohn, K., 2016. Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems, 29. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2016/file/6b180037abbebea991d8b1232f8a8ca9-Paper.pdf
  11. Xu, J., Mei, T., Yao, T. and Rui, Y., 2016. MSR-VTT: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5288-5296. https://doi.org/10.1109/CVPR.2016.571
  12. Seo, P.H., Nagrani, A., Arnab, A. and Schmid, C., 2022. End-to-end generative pretraining for multimodal video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 17959-17968).
  13. Liu, Y., Han, T., Ma, S., Zhang, J., Yang, Y., Tian, J., He, H., Li, A., He, M., Liu, Z. and Wu, Z., 2023. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, p.100017.
Yuchen Yang (yuchen.yang@epfl.ch), EPFL, Switzerland