Video Processing

A Disney story is often told through video, whether it’s a movie, a serial, a newscast, or professional sports. This raises a gamut of research challenges with hard-hitting economic impact: for example, automating labor-intensive processes while preserving art directability, avoiding expensive reshoots by adding content-aware flexibility in postproduction, and adapting to a world with increasingly diverse devices. Video technology, from capture to display, is of utmost importance for many Disney business units. The Advanced Video Technology group therefore addresses these needs through leading-edge research. This spans the whole range from conventional 2D video, through 3D and stereoscopic video and beyond, to the most sophisticated free viewpoint video representations. We design camera systems that take signal acquisition to the next level. A particular focus is on all types of subsequent processing, automatic and interactive, real-time and post-production, to create images of the highest quality.


(in alphabetical order)

Automatic Editing of Footage from Multiple Social Cameras
We present an approach that takes multiple videos captured by social cameras, that is, cameras carried or worn by members of the group involved in an activity, and produces a coherent “cut” video of the activity. Footage from social cameras contains an intimate, personalized view that reflects the part of an event that was of importance to the camera operator (or wearer). We leverage the insight that social cameras share the focus of attention of the people carrying them. We use this insight to determine where the important “content” in a scene is taking place, and use it in conjunction with cinematographic guidelines to select which cameras to cut to and to determine the timing of those cuts. A trellis graph formulation is used to optimize an objective function that maximizes coverage of the important content in the scene, while respecting cinematographic guidelines such as the 180-degree rule and avoiding jump cuts. We demonstrate cuts of the videos in various styles and lengths for a number of scenarios, including sports games, street performance, family activities, and social get-togethers. We evaluate our results through an in-depth analysis of the cuts in the resulting videos and through comparison with videos produced by a professional editor and existing commercial solutions.
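As a rough illustration of how a trellis can be optimized, the sketch below runs a Viterbi-style dynamic program over a grid of (time step, camera) nodes. It is a minimal sketch and not the published objective: the per-node `importance` scores are assumed to be given, and the cinematographic guidelines are reduced to a single hypothetical `cut_penalty` charged whenever the selected camera changes. The names `select_cameras`, `importance` and `cut_penalty` are illustrative assumptions.

```python
from typing import List


def select_cameras(importance: List[List[float]], cut_penalty: float) -> List[int]:
    """Pick one camera per time step.

    importance[t][c] is an assumed score for how well camera c covers the
    important content at time step t. The returned path maximizes the summed
    importance minus cut_penalty for every change of camera.
    """
    n_steps = len(importance)
    n_cams = len(importance[0])

    # score[c]: best total achievable for any path that ends at camera c.
    score = list(importance[0])
    back: List[List[int]] = []  # back[t-1][c]: best predecessor of camera c at step t

    for t in range(1, n_steps):
        new_score = []
        pointers = []
        for c in range(n_cams):
            # Staying on the same camera is free; switching costs cut_penalty.
            def arrival(p: int) -> float:
                return score[p] - (0.0 if p == c else cut_penalty)

            best_prev = max(range(n_cams), key=arrival)
            new_score.append(arrival(best_prev) + importance[t][c])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards through the trellis.
    cam = max(range(n_cams), key=lambda c: score[c])
    path = [cam]
    for pointers in reversed(back):
        cam = pointers[cam]
        path.append(cam)
    path.reverse()
    return path


if __name__ == "__main__":
    scores = [[1.0, 0.0], [1.0, 0.0], [0.4, 0.6], [1.0, 0.0]]
    print(select_cameras(scores, cut_penalty=0.5))
    print(select_cameras(scores, cut_penalty=0.0))
```

A larger penalty suppresses short excursions to another camera, which is one simple way to discourage rapid cutting; rules such as the 180-degree rule would need additional transition costs that this sketch does not model.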

Automatic View Synthesis by Image-Domain-Warping
Today, stereoscopic 3D (S3D) cinema is already mainstream, and almost all new display devices for the home support S3D content. S3D distribution infrastructure to the home is partly already established in the form of 3D Blu-ray discs, video on demand services, or television channels. However, the necessity to wear glasses is often considered an obstacle that hinders broader acceptance of this technology in the home. Multiview autostereoscopic displays enable a glasses-free perception of S3D content for several observers simultaneously, and support head motion parallax in a limited range. In order to support multiview autostereoscopic displays in an already established S3D distribution infrastructure, a synthesis of new views from S3D video is needed. In this paper, a view synthesis method based on Image-domain-Warping (IDW) is presented which synthesizes new views directly from S3D video and functions completely automatically. IDW relies on an automatic and robust estimation of sparse disparities and image saliency information, and enforces target disparities in synthesized images using an image warping framework. Two configurations of the view synthesizer in the scope of a transmission and view synthesis framework are analyzed and evaluated. A transmission and view synthesis system that uses IDW was recently submitted to MPEG’s call for proposals on 3D Video Technology, where it was ranked among the four best performing proposals.

Cache-Efficient Graph Cuts on Structured Grids
Minimal cuts on graphs have been studied over decades and have developed into a fundamental solution to problems in various disciplines. In computer vision and graphics, applications range from segmentation, stereo and shape reconstruction, editing and synthesis, fitting and registration to pose estimation and more.

Typically, in those application domains the underlying graph structures can be characterized as regular N-D grids, where all nodes have topologically identical neighborhood systems, i.e., each node is connected in a uniform fashion to all other nodes lying within a given radius. However, computation speed and memory consumption often limit the effective use in applications requiring high resolution grids or interactive response. In particular, memory bandwidth represents one of the major bottlenecks even in today's most efficient implementations. We propose a compact data structure with cache-efficient memory layout for the representation of graph instances that are based on regular N-D grids with topologically identical neighborhood systems. For this common class of graphs, our data structure allows for 3 to 12 times higher grid resolutions and a 3- to 9-fold speedup compared to existing approaches. Our design is agnostic to the underlying algorithm, and hence orthogonal to other optimizations such as parallel and hierarchical processing.

In video-related applications such as segmentation, colorization, and stereo reconstruction, the developed technology enables much more efficient processing of high resolution content.
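To make the idea of a topologically identical neighborhood system concrete, the following sketch stores a 2D grid graph without any per-edge pointers: one shared table of offsets describes every node's neighborhood, and edge capacities live in one flat array per offset. This is only an illustration of implicit addressing on a regular grid. It does not reproduce the cache-efficient memory layout of the proposed data structure, and the class and method names are assumptions.

```python
from typing import List, Optional, Tuple


class GridGraph2D:
    """Grid graph whose neighborhood is defined once, by a shared offset table."""

    def __init__(self, width: int, height: int, offsets: List[Tuple[int, int]]):
        self.width = width
        self.height = height
        self.offsets = offsets
        # capacity[k][node]: capacity of the edge from `node` along offsets[k].
        # No neighbor indices are stored; they are recomputed on demand.
        self.capacity = [[0.0] * (width * height) for _ in offsets]

    def node(self, x: int, y: int) -> int:
        """Row-major index of the node at (x, y)."""
        return y * self.width + x

    def neighbor(self, x: int, y: int, k: int) -> Optional[int]:
        """Index of the k-th neighbor of (x, y), or None outside the grid."""
        dx, dy = self.offsets[k]
        nx, ny = x + dx, y + dy
        if 0 <= nx < self.width and 0 <= ny < self.height:
            return self.node(nx, ny)
        return None

    def set_capacity(self, x: int, y: int, k: int, value: float) -> None:
        self.capacity[k][self.node(x, y)] = value


if __name__ == "__main__":
    four_connected = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    g = GridGraph2D(4, 3, four_connected)
    g.set_capacity(1, 1, 2, 5.0)
    print(g.neighbor(1, 1, 2), g.capacity[2][g.node(1, 1)])
```

Because neighbor addresses follow from arithmetic on the node index, memory per edge shrinks to the capacity value itself, which is the kind of saving that makes higher grid resolutions feasible.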

Color Tracking and Balancing for Augmented Reality
The colors that are ultimately recorded by both consumer-grade and high-end digital cameras depend on a plethora of factors and are influenced by complex hardware and software processing algorithms, making the precise color quality of captured images difficult to predict and control. As a result, post-processing algorithms are required to fix color-related issues. This problem is generally known as color balancing or color grading, and plays a central role in all areas involving capturing, processing and displaying image data. In this project, we investigate various real-time color balancing algorithms that are able to globally adapt the colors of a source image or video to match a desired target look.

This software is available as a Nuke plugin and as standalone software.
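For readers unfamiliar with global color balancing, the sketch below shows one of the simplest possible approaches: shifting and scaling each color channel of the source so that its mean and standard deviation match those of the target. This is a swapped-in textbook technique chosen for illustration, not the algorithm used in the software mentioned above, and the function name and the list-of-tuples pixel representation are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Pixel = Tuple[float, float, float]


def match_color_statistics(source: List[Pixel], target: List[Pixel]) -> List[Pixel]:
    """Globally adapt `source` so each channel matches the target's mean and spread."""
    out_channels = []
    for ch in range(3):
        src = [p[ch] for p in source]
        tgt = [p[ch] for p in target]
        s_mean, t_mean = mean(src), mean(tgt)
        s_std, t_std = pstdev(src), pstdev(tgt)
        # A flat source channel has no spread to rescale, so only shift it.
        gain = t_std / s_std if s_std > 0 else 1.0
        out_channels.append([(v - s_mean) * gain + t_mean for v in src])
    # Re-interleave the three channels into pixels.
    return list(zip(*out_channels))


if __name__ == "__main__":
    src = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]
    tgt = [(10.0, 10.0, 10.0), (14.0, 14.0, 14.0)]
    print(match_color_statistics(src, tgt))
```

The mapping is a single gain and offset per channel, so it is cheap enough for real-time use, but it does not clamp the result to a valid range and cannot correct spatially varying color differences.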

Computational Stereo Camera System with Programmable Control Loop
The entertainment industry is steadily moving towards stereoscopic 3D (S3D) movie production, and the number of movie titles released in S3D is continuously increasing. The production of stereoscopic movies, however, is more demanding than that of traditional movies, as S3D relies on a sensitive illusion created by projecting two different images to the viewer’s eyes. It therefore requires proper attention to achieve a pleasant depth experience. Any imperfections, especially when accumulated over time, can cause wrong depth perception and adverse effects such as eye strain, fatigue, or even motion sickness. The main difficulty of S3D is the complex interplay of human perception, 3D display properties, and content composition. The last of these in particular represents the artistic intent to use depth as an element of storytelling, which often stands in contrast to problems that can arise due to inconsistent depth cues. From a production perspective, this forms a highly complex and non-trivial problem for content creation, which has to satisfy all these technical, perceptual, and artistic aspects.

Content-Adaptive Spatial Scalability for Scalable Video Coding Applications
Today, TV and video services are consumed using various types of display devices, e.g. TV sets, tablets, or smart phones. This results in a heterogeneous environment of display devices, where different display aspect ratios (e.g. 4:3, 16:9) and resolutions (e.g. SDTV, HDTV) are natively supported. Content is usually distributed using single-resolution video coding like H.264/AVC, which does not allow control over how video content is retargeted on the consumer side. SVC, the scalable extension of H.264/AVC, allows several videos with different aspect ratios and resolutions to be transmitted jointly. If SVC is used, a video with the appropriate aspect ratio can be decoded, while inherent dependencies between the transmitted videos are exploited for efficient overall compression. However, SVC supports retargeting only by cropping and linear scaling. Recently, we developed a method for efficient Content-adaptive Video Retargeting, which is considered one of the best performing video retargeting methods currently known in research. But this result leaves open the question of how to efficiently deliver retargeted video content to consumers whose displays have different aspect ratios.

In this project, a scalable video coder is developed that can simultaneously compress several video streams that have different aspect ratios and have been created by content-adaptive retargeting. MPEG's Scalable Video Coder is extended for this purpose. Both the video streams and the image warps are encoded in a scalable way, where the warps are exploited for efficient compression (i.e. exploitation of redundancies between video streams) by using different techniques for inter-layer prediction. With our extension, video content of higher semantic quality can be transmitted in a scalable way at the cost of an average bit rate overhead of 9.3%.

Contrast-Based Visual Saliency Estimation
Saliency estimation has become a valuable tool in image processing. Yet, existing approaches exhibit considerable variation in methodology, and it is often difficult to attribute improvements in result quality to specific algorithm properties. In this work, we reconsider some of the design choices of previous methods and propose a conceptually clear and intuitive algorithm for contrast-based saliency estimation.

Our algorithm consists of four basic steps. First, our method decomposes a given image into compact, perceptually homogeneous elements that abstract unnecessary detail. Based on this abstraction, we compute two measures of contrast that rate the uniqueness and the spatial distribution of these elements. From the element contrast we then derive a saliency measure that produces a pixel-accurate saliency map that uniformly covers the objects of interest and consistently separates foreground and background. We show that the complete contrast and saliency estimation can be formulated in a unified way using high dimensional Gaussian filters. This result contributes to the conceptual simplicity of our method and lends itself to a highly efficient implementation with linear complexity. In a detailed experimental evaluation, we analyze the contribution of each individual feature and show that our method outperforms all state-of-the-art approaches at the time of publication.
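The uniqueness measure can be illustrated with a direct, quadratic-time computation: an element is rated as unique when its color differs strongly from the colors of other elements, with nearby elements weighted more heavily through a Gaussian on their positions. The exact formula, the function names and the parameter `sigma` are assumptions made for this sketch. It covers only the uniqueness term, not the spatial distribution term, and it deliberately omits the Gaussian-filter formulation that gives the method its linear complexity.

```python
import math
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def element_uniqueness(
    colors: List[Sequence[float]],
    positions: List[Sequence[float]],
    sigma: float,
) -> List[float]:
    """Naive O(n^2) uniqueness: Gaussian-weighted mean squared color difference."""
    n = len(colors)
    result = []
    for i in range(n):
        # Spatial weights: elements close to element i contribute more.
        weights = [
            math.exp(-squared_distance(positions[i], positions[j]) / (2.0 * sigma ** 2))
            for j in range(n)
        ]
        norm = sum(weights)  # always >= 1 because the weight for j == i is 1
        contrast = sum(
            w * squared_distance(colors[i], colors[j]) for j, w in enumerate(weights)
        )
        result.append(contrast / norm)
    return result


if __name__ == "__main__":
    colors = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    positions = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0)]
    print(element_uniqueness(colors, positions, sigma=1.0))
```

In this toy input the third element is the only one with a different color, so it receives the highest uniqueness score.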

Feature Flow (Practical Temporal Consistency)
We present an efficient and simple method for introducing temporal consistency to a large class of optimization driven image-based computer graphics problems. Our method extends recent work in edge-aware filtering, approximating costly global regularization with a fast iterative joint filtering operation.

We extend a recent advance in edge-aware filtering called the domain transform, which embeds geodesic distance onto a manifold on which filtering can be performed with a separable Gaussian filter. As a result, the spatiotemporal extension incurs only a linear (instead of quadratic) increase in runtime.

Using this representation, we can achieve tremendous efficiency gains both in terms of memory requirements and running time. This efficiency gain enables us to process entire shots at once, taking advantage of supporting information that exists across far away frames, something that is difficult with existing approaches due to the computational burden of video data.

Our method is able to filter along motion paths using an iterative approach that simultaneously uses and estimates per-pixel optical flow vectors. We demonstrate its utility by creating temporally consistent results for a number of applications including optical flow, disparity estimation, colorization, scribble propagation, sparse data up-sampling, and visual saliency computation.
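The edge-aware building block can be sketched in one dimension. The code below applies a recursive joint filter in a transformed domain where the distance between neighboring samples grows with the difference in a guide signal, so smoothing stops at strong edges of the guide. It is a minimal sketch under stated assumptions: the feedback coefficient and the per-iteration sigma schedule follow the recursive variant of the domain transform as we understand it, the function and parameter names are illustrative, and the spatiotemporal filtering along motion paths described above is not covered.

```python
import math
from typing import List


def domain_transform_filter_1d(
    signal: List[float],
    guide: List[float],
    sigma_s: float,
    sigma_r: float,
    iterations: int = 3,
) -> List[float]:
    """Edge-aware smoothing of `signal`, with edges taken from `guide`."""
    n = len(signal)
    # Distance between neighbors i-1 and i in the transformed domain:
    # one unit of space plus a term that is large across guide edges.
    dist = [
        1.0 + (sigma_s / sigma_r) * abs(guide[i] - guide[i - 1]) for i in range(1, n)
    ]
    out = list(signal)
    for it in range(iterations):
        # Assumed schedule: filter width halves per iteration so that the
        # combined passes approximate a filter of standard deviation sigma_s.
        sigma_h = (
            sigma_s
            * math.sqrt(3.0)
            * 2.0 ** (iterations - it - 1)
            / math.sqrt(4.0 ** iterations - 1.0)
        )
        a = math.exp(-math.sqrt(2.0) / sigma_h)
        # Left-to-right pass: blend each sample with its filtered left neighbor.
        for i in range(1, n):
            w = a ** dist[i - 1]
            out[i] += w * (out[i - 1] - out[i])
        # Right-to-left pass makes the overall response symmetric.
        for i in range(n - 2, -1, -1):
            w = a ** dist[i]
            out[i] += w * (out[i + 1] - out[i])
    return out


if __name__ == "__main__":
    guide = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
    noisy = [0.1, -0.1, 0.1, -0.1, 1.1, 0.9, 1.1, 0.9]
    print(domain_transform_filter_1d(noisy, guide, sigma_s=5.0, sigma_r=0.1))
```

Each pass touches every sample once, so the cost is linear in the number of samples, and adding a further dimension such as time adds further passes rather than a larger neighborhood.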

Lucid Dreams of Gabriel
“Lucid Dreams Of Gabriel,” an experimental short film created by Disney Research in collaboration with ETH Zürich, was shot at 120fps/RAW in Engadin, Switzerland, one of the most spectacular mountain areas in the country. The film demonstrates effects developed as part of our research program in video processing.

Megastereo: Constructing High Resolution Stereo Panoramas
We present a solution for generating high-quality stereo panoramas at megapixel resolutions. While previous approaches introduced the basic principles, we show that those techniques do not generalize well to today's high image resolutions and lead to disturbing visual artifacts. As our first contribution, we describe the necessary correction steps and a compact representation for the input images in order to achieve a highly accurate approximation to the required ray space. Our second contribution is a flow-based upsampling of the available input rays which effectively resolves known aliasing issues like stitching artifacts. The required rays are generated on the fly to perfectly match the desired output resolution, even for small numbers of input images. In addition, the upsampling is real-time and enables direct interactive control over the desired stereoscopic depth effect. In combination, our contributions allow the generation of stereoscopic panoramas at high output resolution that are virtually free of artifacts such as seams, stereo discontinuities, vertical parallax and other mono-/stereoscopic shape distortions. Our process is robust, and other types of multi-perspective panoramas, such as linear panoramas, can also benefit from our contributions. We show various comparisons and high-resolution results.

Multi-Perspective Stereoscopy from Light Fields
Three-dimensional stereoscopic television, movies, and games have been gaining more and more popularity both within the entertainment industry and among consumers. However, the task of creating convincing yet perceptually pleasing stereoscopic content remains difficult because post-processing tools for stereo are still underdeveloped, and one often has to resort to traditional monoscopic tools and workflows, which are generally ill-suited for stereo-specific issues.

The main cue responsible for stereoscopic scene perception is binocular parallax (or binocular disparity) and therefore tools for manipulating binocular parallax are extremely important. One of the most common methods for controlling the amount of binocular parallax is based on setting the baseline of two cameras prior to acquisition. However, the range of admissible baselines is quite limited because most scenes exhibit more disparity than humans can tolerate when viewing the content on a stereoscopic display. Reducing baseline decreases the amount of binocular disparity; but it also causes scene elements to be overly flat. The second, more sophisticated approach to disparity control requires remapping image disparities (or remapping the depth of scene elements), and then re-synthesizing new images. This approach has considerable disadvantages as well; for content captured with stereoscopic camera rigs, it typically requires accurate disparity computation and hole filling of scene elements that become visible in the re-synthesized views.

In this project we propose a novel concept for stereoscopic post-production to resolve these issues. The main contribution is a framework for creating stereoscopic images, with accurate and flexible per-pixel control over the resulting image disparities. Our framework is based on the concept of 3D light fields, assembled from a dense set of perspective images. While each perspective image corresponds to a planar cut through a light field, our approach defines each stereoscopic image pair as general cuts through this data structure, i.e. each image is assembled from potentially many perspective images. We show how such multi-perspective cuts can be employed to compute stereoscopic output images that satisfy an arbitrary set of goal disparities. These goal disparities can be defined either automatically by a disparity remapping operator or manually by the user for artistic control and effects.

In summary, our proposed concept and formulation provide a novel, general framework that leverages the power and flexibility of light fields for stereoscopic content processing and optimization.

Multi-Sensor FusionCam
Traditional cinematographic cameras consist of a single camera, while two-camera cinematographic rigs have also become common with the recent wave of 3D cinema. Taking camera design a step further, this project proposes a system in which a central cinematographic camera is augmented with a clip-on frame of satellite sensors. The satellite devices include compact cameras, a depth sensor, and a thermal camera. The result is a FusionCam that supports more powerful post-production analysis than is possible with a single camera or a two-camera rig, and is able to synthetically generate stereoscopic 3D imagery with specified stereo parameters.

The core research challenge is to produce better depth maps by integrating the high-resolution image from the central cinematographic camera with the information from the satellite sensor modalities. Performing fusion of these different modalities raises questions regarding how the strengths of the modalities can be best exploited, and how the weaknesses of each can best be compensated for. Current work is on analysis of the data at a single time instance, and new work will extend this to temporal analysis of video.

Non-linear Disparity Mapping for Stereoscopic 3D
Stereoscopic 3D creates the illusion of depth. However, extremely careful design is necessary to ensure an excellent user experience, which has to consider display technology, human visual perception and artistic intent. One important functionality in this context is the ability to change the disparity composition (and with that, the depth perception) of the stereo content after capture. So far, no system has supported this satisfactorily. In this ground-breaking work we developed algorithms that provide full control over the disparity of given stereo content.
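A disparity mapping can be thought of as a function applied to every disparity value before new views are synthesized. The sketch below shows one hypothetical non-linear operator: disparities are normalized, passed through a logarithmic curve, and rescaled into a target range. The choice of a logarithmic curve, the function name and the `strength` parameter are assumptions for illustration only, and the view synthesis step that would realize the new disparities in the images is not shown.

```python
import math
from typing import List


def remap_disparities(
    disparities: List[float],
    target_min: float,
    target_max: float,
    strength: float = 1.0,
) -> List[float]:
    """Map input disparities monotonically into [target_min, target_max].

    `strength` (> 0) controls how non-linear the curve is: larger values give
    more of the output range to the low end of the input range and compress
    the high end more strongly.
    """
    d_min, d_max = min(disparities), max(disparities)
    if d_max == d_min:
        # A flat scene has no depth structure to preserve.
        return [0.5 * (target_min + target_max)] * len(disparities)
    norm = math.log1p(strength)
    out = []
    for d in disparities:
        t = (d - d_min) / (d_max - d_min)     # normalized to 0..1
        s = math.log1p(strength * t) / norm   # concave and monotone, still 0..1
        out.append(target_min + s * (target_max - target_min))
    return out


if __name__ == "__main__":
    print(remap_disparities([0.0, 50.0, 100.0], -10.0, 10.0, strength=9.0))
```

Because the curve is monotone, the depth ordering of scene elements is preserved while the overall disparity range is fitted to what a given display can comfortably show.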

Optimizing Stereo-to-Multiview Conversion for Autostereoscopic Displays
We present a novel stereo-to-multiview video conversion method for glasses-free multiview displays. Different from previous stereo-to-multiview approaches, our mapping algorithm utilizes the limited depth range of autostereoscopic displays optimally and strives to preserve the scene’s artistic composition and perceived depth even under strong depth compression. We first present an investigation of how subjective perceived image quality relates to spatial frequency and disparity. The outcome of this study is utilized in a two-step mapping algorithm, where we (i) compress the scene depth using a non-linear global function to the depth range of an autostereoscopic display, and (ii) enhance the depth gradients of salient objects to restore the perceived depth and salient scene structure. Finally, an adapted image domain warping algorithm is proposed to generate the multiview output, which enables overall disparity range extension.

Painting By Feature
In this project, we propose a reinterpretation of the brush and the fill tools for digital image painting. The core idea is to provide intuitive tools that allow a user to paint in the visual style of arbitrary example images. Rather than a static library of colors, brushes, or fill patterns, our method offers users entire images as their palette, allowing them to select arbitrary contours or textures to be used as their brush or fill tool in their own creations. For realizing the brush tool, we propose a randomized graph-traversal algorithm which synthesizes a seamless stroke in real-time from a user-selected contour in another image. The fill tool combines the patch-match algorithm and inpainting techniques to achieve a similar style transfer for textured regions, and ensures seamless transitions between strokes and filled areas. Our tools allow users to intuitively create visually appealing images that preserve the visual richness and naturalness of arbitrary example images. We demonstrate the potential of our tools in various applications including interactive image creation, editing and vector image stylization.

Phase-Based Retiming and Frame Interpolation

Scalable Structure from Motion for Densely Sampled Videos and Light Fields

Stereo to Multi-View Conversion
Content creation for autostereoscopic displays is an unresolved problem. Typical methods rely on view synthesis based on depth image based rendering (DIBR). DIBR also forms the core of a corresponding standardization activity in MPEG, which aims at efficient coding and transmission of multiview video plus depth (MVD) data to support multiview autostereoscopic displays (MADs). However, this approach relies on depth estimation, which is an ill-posed and so far unresolved task. It is highly questionable whether automatic depth estimation can be resolved with sufficient accuracy, reliability and robustness in the near future. Our method applies purely image domain warping instead. Input video is analyzed and information about sparse disparity, vertical edges and saliency is extracted. A constrained energy minimization problem is formulated and efficiently solved. The resulting image warping functions are used to synthesize novel views. Our approach is fully automatic, accurate, and reliable. Disocclusions and related artifacts are avoided due to smooth, saliency-driven warping functions. Our method also works well for extrapolation of views in a limited range, thus supporting multiview creation from stereo input, which is the most relevant use case scenario.

Our method was used as part of a proposal to MPEG for a standard on 3D Video Technology. There it was evaluated in the scope of a subjective testing procedure, which was conducted anonymously, and in which more than 600 subjects participated. Our proposal was always ranked among the four best performing approaches, where the concrete rank depends on the testing category.

Transfusive Image Manipulation
We present a method for consistent automatic transfer of edits applied to one image to many other images of the same object or scene. By introducing novel, content-adaptive weight functions we enhance the non-rigid alignment framework of Lucas-Kanade to robustly handle changes of viewpoint, illumination and non-rigid deformations of the subjects. Our weight functions are content-aware and possess high-order smoothness, making it possible to define high-quality image warping with a low number of parameters using spatially-varying weighted combinations of affine deformations. Optimizing the warp parameters leads to subpixel-accurate alignment while maintaining computational efficiency. Our method allows users to perform precise, localized edits such as simultaneous painting on multiple images in real-time, relieving them from tedious and repetitive manual reapplication to each individual image.

Video Quality Control Toolbox
The goal of this project is to develop a set of tools that will enable automatic and/or semi-automatic quality control of video content. We will start by implementing current state-of-the-art objective quality assessment methods and investigating their performance. We will also invest time in building an intuitive user interface that will allow users to conveniently work with the underlying quality assessment algorithms.

New measurements presented in recent work revealed weaknesses of image quality metrics. The research aspect of the project is to improve the current state of the art in quality assessment, possibly through the use of the multimodal saliency framework developed for the "Gaze Depth Prediction for Stereoscopic Video" project.

Video Retargeting
Video retargeting refers to the process of changing the aspect ratio of a given video signal, e.g. from 16:9 to 4:3 or cinemascope to 16:9. With the growing diversity of terminal devices, such operations remain important in production and distribution. Simple operations like linear scaling, letter boxing or pan-scan cropping are often unacceptable as they cannot make use of the full display or do not handle the content faithfully. Instead, our system applies non-linear warping of the images in a content-adaptive way. Important areas (such as faces, salient objects) remain unchanged while stretching and squeezing is hidden in image areas where it is less noticeable. Automatic image analysis extracts information from the video (e.g. saliency, edges). For offline production this can be combined with interactive user input. A constrained energy minimization results in a non-linear warping function to deform the images to the new aspect ratio in a content-adaptive way. Besides immediate application for video retargeting, this ground-breaking project laid the general foundation of know-how about non-linear, content-adaptive warping of images and video to change certain properties. In the next step, this was extended to stereoscopic 3D.
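The principle of hiding the deformation in less important areas can be shown with a one-dimensional toy problem. Assume each image column has a saliency value and originally has width 1; we then look for new column widths that sum to the target width while penalizing any change of a column's width in proportion to its saliency. For this simplified energy, the sum of saliency times squared width change subject to the widths summing to the target, the minimizer has a closed form, which the sketch below evaluates. The energy, the function name and the per-column formulation are assumptions for illustration and are far simpler than a full 2D warp with edge and temporal constraints.

```python
from typing import List


def retarget_column_widths(saliency: List[float], target_width: float) -> List[float]:
    """Distribute a change of image width across columns, sparing salient ones.

    Minimizes sum_i saliency[i] * (w_i - 1)^2 subject to sum_i w_i = target_width.
    The Lagrange condition gives w_i = 1 - deficit * (1/s_i) / sum_j (1/s_j).
    """
    n = len(saliency)
    # Guard against division by zero for columns with no saliency at all.
    inverse = [1.0 / max(s, 1e-6) for s in saliency]
    total_inverse = sum(inverse)
    deficit = n - target_width  # total width that has to be removed
    return [1.0 - deficit * inv / total_inverse for inv in inverse]


if __name__ == "__main__":
    saliency = [1.0, 1.0, 10.0, 10.0, 1.0, 1.0]
    widths = retarget_column_widths(saliency, target_width=4.0)
    print([round(w, 3) for w in widths])
```

The two salient columns in the example stay close to their original width while the remaining columns absorb most of the squeeze. For very aggressive targets this unconstrained solution can produce negative widths, so a practical solver would add bounds on each width.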

VideoSnapping: Interactive Synchronization of Multiple Videos
Aligning video is a fundamental task in computer graphics and vision, required for a wide range of applications. We present an interactive method for computing optimal nonlinear temporal video alignments of an arbitrary number of videos. We first derive a robust approximation of alignment quality between pairs of clips, computed as a weighted histogram of feature matches. We then find optimal temporal mappings (constituting frame correspondences) using a graph-based approach that allows for very efficient evaluation with artist constraints. This enables an enhancement to the “snapping” interface in video editing tools, where videos in a timeline are now able to snap to one another when dragged by an artist based on their content, rather than simply start-and-end times. The pairwise snapping is then generalized to multiple clips, achieving a globally optimal temporal synchronization that automatically arranges a series of clips filmed at different times into a single consistent time frame. When followed by a simple spatial registration, we achieve high quality spatiotemporal video alignments at a fraction of the computational complexity compared to previous methods. Assisted temporal alignment is a degree of freedom that has been largely unexplored, but is an important task in video editing. Our approach is simple to implement, highly efficient, and very robust to differences in video content, allowing for interactive exploration of the temporal alignment space for multiple real world HD videos.
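Finding a temporal mapping between two clips can be pictured as finding a cheap path through a matrix of frame-to-frame costs. The sketch below uses classic dynamic time warping as a stand-in: it is a swapped-in textbook technique, not the graph-based formulation described above. It assumes that the cost matrix is already given (for example, derived from feature matches), that both clips overlap from their first to their last frame, and that no artist constraints apply. All names are illustrative.

```python
from typing import List, Tuple


def align_frames(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return frame correspondences (i, j) along the cheapest monotone path.

    cost[i][j] is an assumed mismatch between frame i of clip A and frame j
    of clip B; lower means the frames are more likely to show the same moment.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j]: cheapest accumulated cost of any path from (0, 0) to (i, j).
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else inf,                 # advance clip A only
                acc[i][j - 1] if j > 0 else inf,                 # advance clip B only
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,   # advance both
            )
            acc[i][j] = cost[i][j] + best

    # Walk back from the end, always stepping to the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


if __name__ == "__main__":
    clip_a = [0.0, 1.0, 2.0]
    clip_b = [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]  # same content at half speed
    costs = [[abs(a - b) for b in clip_b] for a in clip_a]
    print(align_frames(costs))
```

The path is allowed to stay on a frame of one clip while the other advances, which is what lets it absorb differences in playback speed between the two recordings.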

Warp Coding for 3D Video (MPEG)
In this project, methods for efficient warp coding are investigated. Coded warps are transmitted together with stereo or multi-view video to enable additional functionalities at the receiver-side through image-domain-warping, e.g. depth adaption for stereo displays or support of multi-view autostereoscopic displays.

One coding method which we investigate is performed by partitioning warps using a resolution pyramid and predictively exploiting intra and inter partition dependencies. View synthesis is employed within the coding loop to control the overall coding process, i.e. to evaluate the contribution of coded partitions to the synthesis quality. We have shown that coded warps represent a practically negligible portion of about 3.6% of the overall (video+warp) bit rate. Furthermore, we showed that a transmission of warps leads to a reduction of synthesis time up to a factor of 8 in comparison to a fully automatic receiver-side view synthesis which uses only decoded video as input.

A second method we investigate employs state-of-the-art video coding methods for warp coding. We are actively contributing these types of warp coding methods to MPEG with the goal of extending the syntax of the upcoming video coding standard HEVC to enable a transmission of warps along with video data.