Sora, developed by OpenAI, is a model capable of creating videos up to one minute long from descriptive text prompts. This capability marks a promising advance in the field, enabling the generation of complex scenes involving multiple characters, specific movements, and precise details about subjects and backgrounds. Sora stands out for its language understanding, which allows it to interpret a wide range of instructions and produce captivating videos in which characters express nuanced emotions and varied sequences enrich the narrative. The technology aims to keep visual style and characters consistent throughout a video, reinforcing the impact and coherence of each production.
However, this consistency is not always guaranteed. As with other generative image models, the output may show spatial incoherence, violations of physical laws, or texture problems. Because the content is video, temporal anomalies can also occur, such as a partially eaten biscuit reappearing intact later in the sequence.
OpenAI is not the first to develop models capable of producing video sequences. Other companies are working on similar projects: Google has announced Lumiere, while Meta plans to launch Emu, although these models are not yet available to third parties. Runway also launched Gen-2 in June 2023, another model that generates short video sequences, which highlights the growing competition and innovation in AI-driven video generation. OpenAI's major asset lies in the user interface it provides, similar to ChatGPT, which gives it a head start in putting its model to use.
Dissecting Sora: A Technical Breakdown of Video Generation
Sora represents a significant advance in visual content generation. It relies on training prompt-guided diffusion models on videos and images of varying durations, resolutions, and aspect ratios. The assumption underlying the model's design is that Large Language Models (LLMs), when trained on large volumes of data, can unify multiple modalities of text (different languages, natural language, code, mathematical equations, etc.). The authors hypothesize that visual data can inherit these benefits, not through tokens as in natural language, but through "patches." A patch is a small block or region of an image made up of multiple pixels. Patches break images down efficiently so that the model can process them during training and inference.
A transformer, a type of model also used in LLMs, takes in the user's instructions (text, videos, images, etc.) and processes them to build a coherent framework for the video. This framework, akin to a screenplay, is then passed to a diffusion model. Diffusion models generate realistic images from noisy images (in which all pixels have random values) and text, and are currently the most promising approach to image generation. The role of the diffusion model is to create the patches needed for the video, following the transformer's instructions. All patches are then assembled to produce the final video. In summary, the transformer processes user instructions and constructs a coherent storyline, while the diffusion model creates the images as directed by the transformer.
Text-to-video generation in Sora also benefits from a recaptioning technique similar to the one used for DALL-E 3. A highly descriptive model, GPT in Sora's case, enriches the user prompt with more detail, which improves overall video quality. This approach also enriches Sora's training, enabling it to create content suited to various formats and native resolutions, with great sampling flexibility and significantly better framing and composition.
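To make the idea of "generating from noise" more concrete, here is a minimal sketch of the noising and denoising arithmetic commonly used in diffusion models. It is not Sora's implementation: the function names (`alpha_bar`, `add_noise`, `remove_noise`), the cosine-shaped schedule, and the toy three-value "patch" are all assumptions chosen for illustration. In a real system, a trained network would estimate the noise; here the true noise is passed back in simply to show that the two steps are inverses of each other.

```python
import math
import random


def alpha_bar(t: int, steps: int) -> float:
    """Share of the clean signal kept at step t.

    Assumption: a simple cosine-shaped schedule, picked for illustration only.
    It equals 1.0 at t = 0 (no noise) and approaches 0.0 at t = steps.
    """
    return math.cos(0.5 * math.pi * t / steps) ** 2


def add_noise(x0, t, steps, rng):
    """Forward step: blend a clean patch with Gaussian noise."""
    a = alpha_bar(t, steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


def remove_noise(xt, eps_hat, t, steps):
    """Reverse step: undo the blend given a noise estimate.

    In a trained model, eps_hat would be predicted by a network conditioned
    on the prompt; here the caller supplies it. Requires t < steps.
    """
    a = alpha_bar(t, steps)
    return [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, eps_hat)]


if __name__ == "__main__":
    rng = random.Random(0)
    clean_patch = [0.0, 0.5, 1.0]          # hypothetical pixel values
    noisy, true_noise = add_noise(clean_patch, t=50, steps=100, rng=rng)
    restored = remove_noise(noisy, true_noise, t=50, steps=100)
    print("noisy:   ", noisy)
    print("restored:", restored)           # matches clean_patch up to rounding
```

The point of the sketch is only the division of labor: the quality of the result depends entirely on how good the noise estimate is, and that estimate is what the model learns during training.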
Sora innovates not only in how it processes visual data via "patches" but also by training on images at their native resolution, without resizing. This approach allows content to be better adapted to the characteristics of various devices while improving the quality and relevance of what is generated, opening the door to a wide range of applications.
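The patch decomposition described above can be sketched in a few lines. This is an illustrative example under stated assumptions, not OpenAI's code: the helper names `patchify` and `make_video`, the single-channel frames, and the 2x2x2 patch size are hypothetical, and the sketch cuts raw pixel values for simplicity. It shows why patches suit native-resolution data: videos of different shapes yield a different number of patches, but every patch has the same size.

```python
from typing import List

Frame = List[List[float]]  # H x W grid of pixel values (one channel, for brevity)
Video = List[Frame]        # T frames


def patchify(video: Video, pt: int, ph: int, pw: int) -> List[List[float]]:
    """Cut a video into non-overlapping patches spanning pt frames and
    ph x pw pixels, each flattened into a vector of pt * ph * pw values."""
    t, h, w = len(video), len(video[0]), len(video[0][0])
    if t % pt or h % ph or w % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, t, pt):
        for y0 in range(0, h, ph):
            for x0 in range(0, w, pw):
                patches.append([
                    video[ti][yi][xi]
                    for ti in range(t0, t0 + pt)
                    for yi in range(y0, y0 + ph)
                    for xi in range(x0, x0 + pw)
                ])
    return patches


def make_video(t: int, h: int, w: int) -> Video:
    """Build a dummy video whose pixel values encode their own position."""
    return [[[float(ti * h * w + yi * w + xi) for xi in range(w)]
             for yi in range(h)] for ti in range(t)]


if __name__ == "__main__":
    landscape = patchify(make_video(4, 4, 8), 2, 2, 2)  # 4 frames, 4 x 8 pixels
    portrait = patchify(make_video(2, 8, 4), 2, 2, 2)   # 2 frames, 8 x 4 pixels
    print(len(landscape), len(landscape[0]))  # 16 patches of 8 values
    print(len(portrait), len(portrait[0]))    # 8 patches of 8 values
```

Because both clips end up as lists of equally sized vectors, the same model can in principle consume either one without cropping or resizing, which is the property the paragraph above attributes to training at native resolution.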
Practical Applications of Sora: Impacts on Audiovisual, Education, and Digital Communication
Sora is promising for a range of practical applications. In the audiovisual industry, it could speed up the prototyping of scenes and visual effects, letting film and television studios explore different visual concepts before moving to actual production, with the prospect of lower costs and shorter design timelines. In training and education, the tool could transform the creation of learning materials by generating realistic simulations for a variety of contexts. Applications could include surgical preparation, helping learner drivers understand traffic rules, or enhancing museum visits with historical reconstructions and interactive demonstrations, making it easier for the public to absorb knowledge. In the communication sector, the tool could boost engagement on social networks by producing visually appealing content, extending the reach and effectiveness of digital communication strategies.
Between Innovation and Controversy, the Ethical Challenges of AI Video
Like other generative AIs, Sora has drawbacks that are important to mention. The primary concern is the creation of videos for malicious purposes. A user could generate and disseminate particularly violent videos to shock an audience. Similarly, as misinformation campaigns keep evolving, automatic video generation could become a preferred tool for creating deepfakes aimed at damaging the reputation and credibility of individuals or institutions. As a preventive measure against such abuses, OpenAI has chosen, from launch, to restrict use of the interface to specialists in key areas such as misinformation, hateful content, and bias. This strategy aims to subject the model to rigorous adversarial testing, so that vulnerabilities can be identified and corrected and filters put in place to prevent the model from producing sensitive content.
In addition, like other generative AI tools, Sora could have a significant environmental impact. Training and running the models behind ChatGPT consumes large amounts of energy and therefore emits CO2. Training the model behind Sora could have similar environmental costs, on top of the consumption related to storing and exchanging the generated data, since videos are larger files than images or text.
Sora Under Lock and Key: Restricted Access
Currently, access to Sora is limited and not open to all users. As detailed in the previous section, the tool is still in a testing phase, especially to strengthen its security. Unfortunately, no public opening date has been announced yet. Patience will be required to fully harness the power of this innovative tool.
Authors: Dr Axel Journe and Dr Alexandra Benamar