The State of Video LLMs

Some context: I moved on from the talent space (which I think will face a number of challenges, even though LLMs should help solve some critical problems) and am now working on ReelBank, connecting creators with model trainers who need video data.

While I've been playing with LLM technology for over a year (my last startup pivoted into the LLM space before shutting down), this is my first foray into the Video LLM space, which I think has huge potential.

The below is a rehash of a tweet thread I posted to my Twitter account. Looking forward to sharing more learnings as the space and my understanding develop.

As video becomes the newest and most data-intensive application of AI, it's important to understand its current uses with large language models (LLMs). Here’s an overview of where things stand:

Current Applications of Video with LLMs

  1. Text to Video: Tools like RunwayML and Sora allow you to input text and generate video content. It's an exciting development, although still in its early stages. There are many entrants in this space, both large and small.
  2. Video Avatars: Platforms like HeyGen enable the creation of avatars that can recite text or interact, combining source imagery with models to produce realistic talking-head videos. There are many startups in this space, and most produce convincing results.
  3. Multimodal Applications: Devices like Meta's glasses use video to help models interpret real-world scenes, enhancing their practical utility in both consumer and enterprise spaces. Apple Intelligence will also lean heavily on multimodal models to power understanding of environments for users.
  4. Storytelling: The most nascent application involves constructing narratives across multiple scenes. Someday this will enable generating feature-length films, competing with or augmenting Hollywood.

Where Are We Now?

  • Text to Video: Despite promising developments, most of these models struggle with physics, limiting their ability to produce seamless, realistic videos. However, Sora and Kuaishou's latest models are showing that good-quality 60-second videos with realistic physical interactions are becoming possible.
  • Video Avatars: These are the most mature of the current applications. Smaller models, which I believe leverage methods beyond LLMs, can now create highly realistic avatars, as demonstrated by Tavus and HeyGen.
  • Multimodal Applications: These remain more obscure. While not always immediately apparent, video plays a crucial role in powering models to interpret what the user is experiencing. In the enterprise, companies like Cohere are leading the way, working across industries to improve training and output for employees. On the consumer side, expect tech giants like Google, Apple, Meta, and Snap to release products with broad applications that seamlessly improve certain product experiences.
  • Storytelling and Movie Production: This is the most complex task, requiring the generation of consistent and coherent long scenes. Although challenging, I expect Sora will eventually move into this space given OpenAI's partnership with creatives.

Why is Video Harder than Text or Image?

  1. Character Consistency: While advanced models have shown this is possible, startups often struggle to maintain consistent characters across frames.
  2. Training Costs: Video data is expensive to train on. Each frame equates to roughly 8,100 tokens, and a minute of video translates to about 14 million tokens (see the back-of-the-envelope math after this list). This makes even small, use-case-specific models costly to train.
  3. Physics: Unlike static images, video requires modeling how images change frame by frame, understanding depth and movement in 2.5/3D. Oxen has a great breakdown of some of the approaches that models like Sora are likely taking to better understand physics in video. Most models produced by smaller startups fail to generate videos with interactions consistent with the real world.
  4. Inference Costs: Similar to training, generating video is token-intensive and expensive. Cost scales linearly with clip length, so most models produce only 4-8 seconds of video, partly due to compute limitations and cost.
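To make the scale concrete, here is a quick back-of-the-envelope sketch of the token math. The ~8,100 tokens-per-frame figure comes from the estimate above; the 30 fps frame rate is my own assumption for illustration.

```python
# Rough video token math, using the ~8,100 tokens/frame figure cited above.
# Frame rate is an assumed value for illustration, not a measured one.

TOKENS_PER_FRAME = 8_100   # estimate from the list above
FPS = 30                   # assumed frame rate

def video_tokens(seconds: float, fps: int = FPS) -> int:
    """Approximate token count for a clip of the given length."""
    return int(seconds * fps * TOKENS_PER_FRAME)

if __name__ == "__main__":
    for seconds in (4, 8, 60):
        print(f"{seconds:>3}s clip ≈ {video_tokens(seconds):,} tokens")
    # A 60s clip works out to ~14.6M tokens, in line with the ~14M figure
    # above, and shows why cost grows linearly with clip length.
```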

Potential Solutions

While systemic hardware improvements will eventually make video AI products more viable, there are several key areas to focus on in the meantime:

  1. Data Curation: Maximizing the value of each token means carefully selecting the clips and frames you use. It’s essential to replace subpar clips with better ones and only take in data that adds to the model's understanding (a rough sketch of this kind of filtering follows this list).
  2. Pairings and Labeling: Improved labeling can reduce the volume of video data needed. Recent papers (like this one) highlight the importance of effective video data labeling. Additional context, like storyboards and scripts, can enhance the value of video for a model.
  3. Data Acquisition: Finding the right data is challenging. With an overwhelming amount of content available on platforms like YouTube, curating the perfect clips to train models is critical. There is also the open question of data provenance. While most startups are not deterred today from using YouTube video to train their models (OpenAI has stated they believe this is fair use), this could change as more content creators defend their copyrights.
  4. Mixture of Experts: Video AI will likely depend on numerous specialized models, each excelling in specific areas. These smaller models are easier to re-train, fine-tune and run efficiently.
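To make the curation and labeling points a bit more concrete, here is a minimal sketch of the kind of clip filtering described above. The Clip fields, the quality and novelty scores, and the thresholds are all hypothetical placeholders; in practice these might come from aesthetic or artifact models, motion-consistency checks, and dedup against clips already in the training set.

```python
# Minimal sketch of clip curation: keep only clips that are good enough and
# add new information, and pair each clip with text context (caption/script).
# All scores and thresholds here are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    caption: str     # paired text context (e.g. storyboard or script excerpt)
    quality: float   # e.g. output of an aesthetic/artifact model, 0..1
    novelty: float   # e.g. distance from nearest clip already in the set, 0..1

def curate(clips: list[Clip], quality_floor: float = 0.6,
           novelty_floor: float = 0.3) -> list[Clip]:
    """Drop subpar or redundant clips so every training token earns its cost."""
    keep = [c for c in clips
            if c.quality >= quality_floor and c.novelty >= novelty_floor]
    # Spend tokens on the best, most informative material first.
    return sorted(keep, key=lambda c: (c.quality, c.novelty), reverse=True)

if __name__ == "__main__":
    pool = [
        Clip("drone_city.mp4", "aerial pan over a city at dusk", 0.9, 0.7),
        Clip("blurry_vlog.mp4", "handheld vlog, heavy motion blur", 0.3, 0.8),   # subpar: dropped
        Clip("dup_interview.mp4", "seated interview, static shot", 0.8, 0.1),    # redundant: dropped
    ]
    for c in curate(pool):
        print(c.path, "-", c.caption)
```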

As video continues to evolve as an AI modality, I expect players like OpenAI will continue to deliver models with exceptional capabilities. Competitors with more resource constraints will need to do more with less: be thoughtful about how they acquire and curate their data, and solve architectural problems to better model physics.

If you found this interesting, we're solving the data identification and acquisition problem at ReelBank.  Check us out.

Trent Krupp

Co-Founder of Troveo.ai, connecting creators with the AI economy. Previously, Head of Product at Impact, a market network serving the entertainment industry as well as Head of Revenue at Triplebyte and Hired. Founded an agency in my 20's, sold it to Hired and became employee 5. Recruited for VCs, growth and public companies. Helped the founders of recruitment tech startups Shift.org, Trusted Health, Terminal and Beacon in the early days.