At Stanford they have developed an algorithm that lets you edit videos by changing the text of the dialogue. For now the algo is limited to "talking head" videos, i.e. clips of people yakking into the camera. The technology edits audio and video at the same time, which makes it an excellent way to change film dialogue in post-production, or to put all kinds of words into the mouths of Karrenbauer, Trump or YouTubers in post-truth, because nothing is real.
Should an actor or performer flub a word or misspeak, the editor can simply edit the transcript and the application will assemble the right word from various words or portions of words spoken elsewhere in the video. It’s the equivalent of rewriting with video, much like a writer retypes a misspelled or unfit word. The algorithm does require at least 40 minutes of original video as input, however, so it won’t yet work with just any video sequence.
As the transcript is edited, the algorithm selects segments from elsewhere in the recorded video with motion that can be stitched to produce the new material. In their raw form these video segments would have jarring jump cuts and other visual flaws.
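To make the stitching idea a bit more tangible, here is a toy sketch in Python. It is not the Stanford implementation: it assumes the recording has already been labelled with one speech unit per position, and it greedily covers the edited word with the longest runs it can find in the source. The function name `find_snippets` and the unit labels are made up for illustration.

```python
from typing import List, Tuple


def find_snippets(source: List[str], target: List[str]) -> List[Tuple[int, int]]:
    """Cover `target` with contiguous runs copied from `source`.

    Returns (start, end) index pairs into `source`, end exclusive.
    Greedy assumption: at each step take the longest matching run,
    preferring the earliest one when several have the same length.
    """
    snippets = []
    pos = 0
    while pos < len(target):
        best_start, best_len = -1, 0
        for start in range(len(source)):
            length = 0
            while (start + length < len(source)
                   and pos + length < len(target)
                   and source[start + length] == target[pos + length]):
                length += 1
            if length > best_len:
                best_start, best_len = start, length
        if best_len == 0:
            raise ValueError(f"unit {target[pos]!r} never occurs in the source")
        snippets.append((best_start, best_start + best_len))
        pos += best_len
    return snippets


# Hypothetical unit labels for a recording of "cat sat map".
source_units = ["k", "ae", "t", "s", "ae", "t", "m", "ae", "p"]
# The edited transcript now needs "mat", which was never spoken as a whole.
target_units = ["m", "ae", "t"]

for start, end in find_snippets(source_units, target_units):
    print(start, end, source_units[start:end])
```

Here "mat" gets assembled from the start of "map" plus a "t" taken from "cat". Each cut between snippets is one of the jump cuts that the later steps have to hide.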
To make the video appear more natural, the algorithm applies intelligent smoothing to the motion parameters and renders a 3D animated version of the desired result. However, that rendered face is still far from realistic. As a final step, a machine learning technique called Neural Rendering converts the low-fidelity digital model into a photorealistic video in perfect lip-synch.
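The quote does not say what the "intelligent smoothing" actually consists of, so the following Python sketch swaps in the simplest stand-in imaginable: a centered moving average over a single per-frame motion parameter. The names `smooth` and `largest_jump`, the `radius` parameter and the sample values are assumptions for illustration only.

```python
from typing import List


def smooth(params: List[float], radius: int = 2) -> List[float]:
    """Centered moving average; the window shrinks at both ends."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    out = []
    for i in range(len(params)):
        lo = max(0, i - radius)
        hi = min(len(params), i + radius + 1)
        window = params[lo:hi]
        out.append(sum(window) / len(window))
    return out


def largest_jump(values: List[float]) -> float:
    """Biggest frame-to-frame change, a crude measure of a visible jump cut."""
    return max((abs(b - a) for a, b in zip(values, values[1:])), default=0.0)


# Hypothetical mouth-opening parameter for two stitched segments:
# the seam sits between frame 2 and frame 3.
stitched = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
smoothed = smooth(stitched, radius=1)

print(largest_jump(stitched))
print(largest_jump(smoothed))
```

The hard step at the seam is spread over several frames, so the largest frame-to-frame change shrinks. A real system would have to smooth many parameters at once without blurring the mouth shapes that make the new word readable.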
Our text-based editing approach lays the foundation for better editing tools for movie post production. Filmed dialogue scenes often require re-timing or editing based on small script changes, which currently requires tedious manual work. Our editing technique also enables easy adaptation of audio-visual video content to specific target audiences: e.g., instruction videos can be fine-tuned to audiences of different backgrounds, or a storyteller video can be adapted to children of different age groups purely based on textual script edits. In short, our work was developed for storytelling purposes.
However, the availability of such technology — at a quality that some might find indistinguishable from source material — also raises important and valid concerns about the potential for misuse. Although methods for image and video manipulation are as old as the media themselves, the risks of abuse are heightened when applied to a mode of communication that is sometimes considered to be authoritative evidence of thoughts and intents. We acknowledge that bad actors might use such technologies to falsify personal statements and slander prominent individuals. We are concerned about such deception and misuse.
Therefore, we believe it is critical that video synthesized using our tool clearly presents itself as synthetic. The fact that the video is synthesized may be obvious by context (e.g. if the audience understands they are watching a fictional movie), directly stated in the video or signaled via watermarking. We also believe that it is essential to obtain permission from the performers for any alteration before sharing a resulting video with a broad audience. Finally, it is important that we as a community continue to develop forensics, fingerprinting and verification techniques (digital and non-digital) to identify manipulated video. Such safeguarding measures would reduce the potential for misuse while allowing creative uses of video editing technologies like ours.