Computer Vision

Guest interaction at theme parks, motion capture for studios, and sports visualization are just a few of the direct applications for our computer-vision research. We also perform research in which computer vision intersects with human-computer interaction, video processing, display technology, and optics: it plays a role in our work on input devices, content-aware video processing, projector-camera systems, and computational cinematography.

Projects

(in alphabetical order)

“How to Get an Open Shot”: Analyzing Team Movement in Basketball using Tracking Data
In this paper, we use ball and player tracking data from STATS SportVU from the 2012-2013 NBA season to analyze the offensive and defensive formations of teams. We move beyond current analysis that uses only play-by-play event-driven statistics (e.g., rebounds, shots) and look at the spatiotemporal changes in a team’s formation. A major challenge, which also provides a clue to unlocking this problem, is the permutation ambiguity caused by the constant movement and interchanging of positions by players. In this paper, we use a method that represents a team via “roles”, a representation that is immune to the problem of permutations. We demonstrate the utility of our approach by analyzing all the plays that resulted in a 3-point shot attempt in the 2012-2013 NBA season. We analyzed close to 20,000 shots and found that when a player is “open” the shooting percentage is around 40%, compared to close to 32% for a “pressured” shot. There is nothing groundbreaking about this finding (i.e., putting more defensive pressure on the shooter reduces shooting percentages), but discovering how teams get shooters open is. Using our method, we show that the number of defensive role swaps is predictive of getting an open shot, and that this measure can be used to gauge the defensive effectiveness of a team. Additionally, our role representation allows for large-scale retrieval of plays by using the tracking data as the input query rather than a text label - this “video Google” approach allows for quick and accurate play retrieval.
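The following sketch illustrates the kind of permutation-invariant role assignment such a representation relies on: per frame, players are matched one-to-one to formation roles with the Hungarian algorithm, and frame-to-frame changes in that matching give a role-swap count. Function names, the prototype-formation input, and the swap-counting heuristic are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: permutation-invariant "role" assignment for player tracking data.
# Player identities swap positions constantly; matching players to formation
# "roles" each frame removes the permutation ambiguity.
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_roles(player_xy, role_means):
    """player_xy, role_means: (10, 2) arrays of court positions.
    Returns perm such that player i occupies role perm[i] this frame."""
    # Cost of assigning each player to each role = squared distance.
    cost = ((player_xy[:, None, :] - role_means[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return cols

def count_role_swaps(frames, role_means):
    """Count frame-to-frame changes in the role permutation -- a crude
    version of the 'role swap' statistic used to predict open shots."""
    prev, swaps = None, 0
    for xy in frames:                          # frames: (T, 10, 2)
        perm = assign_roles(xy, role_means)
        if prev is not None:
            # A pairwise swap changes two players' roles, hence the // 2.
            swaps += int((perm != prev).sum() // 2)
        prev = perm
    return swaps
```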

Assessing Team Strategy Using Spatiotemporal Data
The Moneyball revolution coincided with a shift in the way professional sporting organizations handle and utilize data in their decision-making processes. Due to the demand for better sports analytics and improvements in sensor technology, a plethora of ball and player tracking data is now generated within professional sports for analytical purposes.

Augmenting Physical Avatars Using Projector-Based Illumination
In this project, we propose a complete process for augmenting physical avatars using projector-based illumination, significantly increasing their expressiveness. Given an input animation, the system decomposes the motion into low-frequency motion that can be physically reproduced by the animatronic head and high-frequency details that are added using projected shading. At the core is a spatio-temporal optimization process that compresses the motion in gradient space, ensuring faithful motion replay while respecting the physical limitations of the system. We also propose a complete multi-camera and projection system, including a novel defocused projection and subsurface scattering compensation scheme. The result of our system is a highly expressive physical avatar that features facial details and motion otherwise unattainable due to physical constraints.
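As a rough illustration of the decomposition step, the sketch below splits animated vertex tracks into a low-pass component for the animatronic and a high-frequency residual for projection. The Gaussian low-pass and its cutoff are simplistic stand-ins for the paper's gradient-space spatio-temporal optimization.

```python
# Sketch: splitting an input animation into a low-frequency component the
# animatronic head can physically reproduce and a high-frequency residual
# to be added via projected shading. Sigma is an illustrative choice.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def split_motion(vertex_tracks, sigma=8.0):
    """vertex_tracks: (T, N, 3) animated vertex positions over T frames."""
    # Low-pass along time: slow motion the physical head can follow.
    low = gaussian_filter1d(vertex_tracks, sigma=sigma, axis=0)
    # Residual detail (fine wrinkles, fast motion) left to the projector.
    high = vertex_tracks - low
    return low, high
```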

Bilinear Spatiotemporal Basis Models
In this paper, we present the bilinear spatiotemporal basis as a model that simultaneously exploits spatial and temporal regularity while maintaining the ability to generalize well to new sequences. The model can be interpreted as representing the data as a linear combination of spatiotemporal sequences consisting of shape modes oscillating over time at key frequencies.
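A minimal sketch of fitting such a model, assuming a DCT temporal basis and PCA shape modes (both common choices; the basis sizes here are arbitrary): the data matrix is approximated as S ≈ Θ C Φᵀ, and the coefficients C have a closed form because both bases are orthonormal.

```python
# Sketch: fitting a bilinear spatiotemporal basis to motion data.
import numpy as np
from scipy.fft import dct

def bilinear_fit(S, kt=10, ks=15):
    """S: (T, D) matrix, T frames by D stacked shape coordinates."""
    T, D = S.shape
    # Temporal basis Theta: kt lowest-frequency rows of an orthonormal
    # DCT-II matrix, transposed to (T, kt) column vectors.
    Theta = dct(np.eye(T), norm='ortho', axis=0)[:kt].T
    # Spatial basis Phi: top ks shape modes from an SVD of the data.
    _, _, Vt = np.linalg.svd(S, full_matrices=False)
    Phi = Vt[:ks].T                              # (D, ks)
    # Closed-form coefficients, since both bases are orthonormal.
    C = Theta.T @ S @ Phi                        # (kt, ks)
    S_hat = Theta @ C @ Phi.T                    # reconstruction
    return S_hat, C
```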

Cancelling Ambient Occlusion
Tracking deforming surfaces is an important sub-problem in many scenarios, such as facial performance capture. When a surface is deformed, the changing shape can cause a change in appearance (shading), e.g., when forming wrinkles. These appearance changes pose problems for most existing tracking methods, as they rely on local appearance remaining constant.

We present a general technique for improving space-time reconstructions of deforming surfaces, which are captured in a video-based reconstruction scenario under uniform illumination. Our approach simultaneously improves both the acquired shape and the tracked motion of the deforming surface. The method is based on factoring out surface shading, computed by a fast approximation to global illumination called ambient occlusion. We show that once local surface shading has been removed, optical flow can be computed much more robustly and with higher accuracy. While cancelling the local shading, we also optimize the surface shape to minimize the residual between the ambient occlusion of the 3D geometry and that of the image, yielding more accurate surface details in the reconstruction. Our enhancement is independent of the actual space-time reconstruction algorithm. We experimentally measure the quantitative improvements produced by our algorithm using a synthetic example of deforming skin, where ground-truth shape and motion are available. We further demonstrate our enhancement on a real-world sequence of human face reconstruction.
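A minimal sketch of the shading-cancellation idea, assuming AO maps rendered from the current shape estimate are already available: dividing each frame by its AO map approximately restores brightness constancy, after which any dense flow method applies (Farneback here is a stand-in for the project's tracker).

```python
# Sketch: cancel local shading before computing optical flow.
import numpy as np
import cv2

def robust_flow(frame0, ao0, frame1, ao1):
    """frames: (H, W) uint8 grayscale; ao maps: (H, W) floats in (0, 1],
    rendered from the current shape estimate by any AO approximation."""
    # Dividing by ambient occlusion approximately removes shading changes
    # caused by deformation (e.g., darkening inside forming wrinkles).
    a0 = np.clip(frame0 / np.maximum(ao0, 1e-3), 0, 255).astype(np.uint8)
    a1 = np.clip(frame1 / np.maximum(ao1, 1e-3), 0, 255).astype(np.uint8)
    # Dense flow on the shading-cancelled images.
    return cv2.calcOpticalFlowFarneback(a0, a1, None, 0.5, 4, 21, 3, 7, 1.5, 0)
```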

ChaCra: An Interactive Design System for Rapid Character Crafting
We propose an interactive design system for rapid crafting of planar mechanical characters. Our method combines the simplicity of sketch-based modeling with the ease of defining motion through extreme poses.

Coupled 3D Reconstruction of Sparse Facial Hair and Skin
Although facial hair plays an important role in individual expression, facial-hair reconstruction is not addressed by current face-capture systems. Our research addresses this limitation with an algorithm that treats hair and skin surface capture together in a coupled fashion so that a high-quality representation of hair fibers as well as the underlying skin surface can be reconstructed. We propose a passive, camera-based system that is robust against arbitrary motion since all data is acquired within the time period of a single exposure. Our reconstruction algorithm detects and traces hairs in the captured images and reconstructs them in 3D using a multi-view stereo approach. Our coupled skin-reconstruction algorithm uses information about the detected hairs to deliver a skin surface that lies underneath all hairs irrespective of occlusions. In dense regions like eyebrows, we employ a hair-synthesis method to create hair fibers that plausibly match the image data. We demonstrate our scanning system on a number of individuals and show that it can successfully reconstruct a variety of facial-hair styles together with the underlying skin surface.

High-Quality Passive Facial Performance Capture
We present a new technique for passive and markerless facial performance capture based on anchor frames. Our method starts with high-resolution per-frame geometry acquisition using state-of-the-art stereo reconstruction, and proceeds to establish a single triangle mesh that is propagated through the entire performance. Leveraging the fact that facial performances often contain repetitive subsequences, we identify anchor frames as those which contain facial expressions similar to a manually chosen reference expression. Anchor frames are automatically computed over one or even multiple performances. We introduce a robust image-space tracking method that computes pixel matches directly from the reference frame to all anchor frames, and thereby to the remaining frames in the sequence via sequential matching. This approach allows us to propagate one reconstructed frame to an entire sequence in parallel, in contrast to previous sequential methods. Our anchored reconstruction approach also limits tracker drift and robustly handles occlusions and motion blur. The parallel tracking and mesh propagation offer low computation times. Our technique can even automatically match anchor frames across different sequences captured on different occasions, propagating a single mesh to all performances.
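The sketch below shows one simple way anchor-frame selection could look: frames whose illumination-normalized appearance descriptor is close to the reference frame's are flagged as anchors. The coarse descriptor and the threshold are illustrative assumptions, not the paper's matching metric.

```python
# Sketch: flag anchor frames by appearance similarity to a reference frame.
import numpy as np
import cv2

def find_anchor_frames(frames, ref_idx, thresh=0.5):
    """frames: list of (H, W) grayscale images; ref_idx: reference frame."""
    def descriptor(img):
        # Coarse, illumination-normalized descriptor: downsampled patch,
        # zero-mean, unit norm (so distances lie in [0, 2]).
        small = cv2.resize(img, (32, 32)).astype(np.float32).ravel()
        small -= small.mean()
        return small / (np.linalg.norm(small) + 1e-6)

    ref = descriptor(frames[ref_idx])
    dists = np.array([np.linalg.norm(descriptor(f) - ref) for f in frames])
    return np.where(dists < thresh)[0]   # indices of candidate anchor frames
```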

High-Quality Single-Shot Capture of Facial Geometry
This project develops a passive stereo system for capturing the 3D geometry of a face in a single shot with standard light sources. The system is low-cost and easy to deploy. Results are sub-millimeter accurate and commensurate with those from state-of-the-art systems based on active lighting (e.g., laser scanners), and the models meet the quality requirements of a demanding domain like the movie industry. Recovered models are shown for captures from both high-end cameras in a studio setting and from a consumer binocular-stereo camera, demonstrating scalability across a spectrum of camera deployments, and showing the potential for 3D face modeling to move beyond the professional arena and into the emerging consumer market in stereoscopic photography.

Hybrid Robotic/Virtual PTZ Cameras for Autonomous Event Recording
We present a method to generate aesthetic video from a robotic camera by incorporating a virtual camera operating on a delay, and a hybrid controller which uses feedback from both the robotic and virtual cameras. Our strategy employs a robotic camera to follow a coarse region-of-interest identified by a realtime computer vision system, and then resamples the captured images to synthesize the video that would have been recorded along a smooth, aesthetic camera trajectory. The smooth motion trajectory is obtained by operating the virtual camera on a short delay so that it has perfect knowledge of immediate future events.
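The core trick is that a delayed output makes acausal filtering legitimate, as in this minimal sketch (the Gaussian smoother and the sigma choice stand in for the hybrid controller):

```python
# Sketch: a delayed virtual camera can smooth with future samples.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def smooth_virtual_pan(raw_pan, delay=30):
    """raw_pan: per-frame pan angles tracked by the robotic camera (degrees).
    Because output lags capture by `delay` frames, the value emitted for
    frame t may legitimately use measurements up to frame t + delay."""
    # Keep ~3 sigma of Gaussian support inside the buffered lookahead.
    return gaussian_filter1d(np.asarray(raw_pan, dtype=float), sigma=delay / 3.0)
```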

Image-Based Reconstruction and Synthesis of Dense Foliage
Flora is a common element in computer-generated scenes, but trees, bushes, and plants have complex geometry and appearance and are difficult to model manually. One way to address this problem is to capture models directly from the real world. Existing techniques have focused on extracting macro structure such as the branching structure of trees, or the structure of broad-leaved plants with a relatively small number of surfaces. This project presents a finer-scale technique that demonstrates, for the first time, the processing of densely leaved foliage - computation of 3D structure, plus extraction of statistics for leaf shape and the configuration of neighboring leaves.

Our method starts with a mesh of a single exemplar leaf of the target foliage. Using a small number of images, point cloud data is obtained from multi-view stereo, and the exemplar leaf mesh is fitted non-rigidly to the point cloud over several iterations. Initialization of the fitting is done using RANSAC, making it robust to outliers in the stereo reconstruction and suitable for the chaotic point cloud obtained from foliage. In addition, our method learns a statistical model of leaf shape and appearance during the reconstruction phase, and a model of the transformations between neighboring leaves. This information can subsequently be used to generate a variety of plausible leaves for that plant species in plausible configurations, and is useful in two ways - to augment and increase leaf density in reconstructions of captured foliage, and to synthesize new foliage that conforms to a user-specified layout and density. The result of our technique is a dense set of captured leaves with realistic appearance, and a method for leaf synthesis. Our approach excels at reconstructing plants and bushes that are primarily defined by dense leaves and is demonstrated with multiple examples.
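A hedged sketch of the RANSAC initialization stage, under simplifying assumptions (rigid-only hypotheses from random 3-point correspondences and brute-force inlier counting; the actual system fits non-rigidly over several iterations):

```python
# Sketch: RANSAC initialization of an exemplar leaf against a chaotic point
# cloud. Each hypothesis rigidly aligns three random template points to three
# random cloud points (Kabsch), and is scored by how many cloud points land
# near the transformed template. Sample counts and the radius are illustrative.
import numpy as np

def kabsch(P, Q):
    """Rigid transform (R, t) best mapping points P onto points Q."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def ransac_init(template_pts, cloud_pts, iters=2000, radius=0.005):
    rng = np.random.default_rng(0)
    best, best_inliers = None, -1
    for _ in range(iters):
        P = template_pts[rng.choice(len(template_pts), 3, replace=False)]
        Q = cloud_pts[rng.choice(len(cloud_pts), 3, replace=False)]
        R, t = kabsch(P, Q)
        moved = template_pts @ R.T + t
        # Inliers: cloud points within `radius` of the transformed template
        # (brute force here; a KD-tree would be used in practice).
        d = np.linalg.norm(cloud_pts[:, None, :] - moved[None, :, :], axis=2)
        inliers = int((d.min(axis=1) < radius).sum())
        if inliers > best_inliers:
            best, best_inliers = (R, t), inliers
    return best
```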

The first stage of this project has been successfully completed. The results are expected to be integrated with related projects '3D Reconstruction of Natural Environments using Scanning Robots' and '3D Reconstruction of Trees'.

Modeling and Recognising Team Strategies, Tactics and Tendencies in Sports
We introduce a method to represent and discover adversarial group behavior in a continuous domain. In comparison to other types of behavior, adversarial behavior is heavily structured, as the location of a player (or agent) depends on their teammates and adversaries, in addition to the tactics or strategies of the team.

Monocular Object Detection Using 3D Geometric Primitives
Multiview object detection methods achieve robustness in adverse imaging conditions by exploiting projective consistency across views. We present an algorithm that achieves performance comparable to multiview methods from a single camera by employing geometric primitives as proxies for the true 3D shape of objects, such as pedestrians or vehicles. Our key insight is that for a calibrated camera, geometric primitives produce predetermined location-specific patterns in occupancy maps.
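A minimal sketch of that insight, assuming a calibrated camera with a 3x4 projection matrix P: a pedestrian proxy cylinder placed at a ground location projects to a predictable image region, which can be precomputed per location and compared against a foreground occupancy map at run time. The dimensions and bounding-box approximation are illustrative.

```python
# Sketch: the image pattern a cylinder proxy produces at a ground location.
import numpy as np

def cylinder_bbox(P, X, Y, r=0.3, h=1.8, n=16):
    """P: (3, 4) camera projection matrix; (X, Y): ground-plane location.
    Approximates the projected silhouette by the bounding box of rim points
    at ground level and at head height."""
    ang = np.linspace(0, 2 * np.pi, n, endpoint=False)
    ring = np.stack([X + r * np.cos(ang), Y + r * np.sin(ang)], axis=1)
    pts = []
    for Z in (0.0, h):
        hom = np.column_stack([ring, np.full(n, Z), np.ones(n)])  # (n, 4)
        uvw = hom @ P.T                                           # project
        pts.append(uvw[:, :2] / uvw[:, 2:3])                      # dehomogenize
    pts = np.vstack(pts)
    (u0, v0), (u1, v1) = pts.min(axis=0), pts.max(axis=0)
    return u0, v0, u1, v1

# Precompute one pattern per discretized ground location; at run time the
# detector scores each location against the foreground occupancy map.
```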

Motion Capture from Body-Mounted Cameras
Motion capture technology generally requires that recordings be performed in a laboratory or closed stage setting with controlled lighting. This restriction precludes the capture of motions that require an outdoor setting or the traversal of large areas. In this paper, we present the theory and practice of using body-mounted cameras to reconstruct the motion of a subject. Outward-looking cameras are attached to the limbs of the subject, and the joint angles and root pose are estimated through non-linear optimization. The optimization objective function incorporates terms for image matching error and temporal continuity of motion. Structure-from-motion is used to estimate the skeleton structure and to provide initialization for the non-linear optimization procedure. Global motion is estimated and drift is controlled by matching the captured set of videos to reference imagery. We show results in settings where capture would be difficult or impossible with traditional motion capture systems, including walking outside and swinging on monkey bars. The quality of the motion reconstruction is evaluated by comparing our results against motion capture data produced by a commercially available optical system.
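The sketch below shows the shape such an optimization could take, with an image-matching residual stacked on a temporal-continuity residual; `project` (forward kinematics plus pinhole projection) and the data containers are hypothetical placeholders, not the paper's implementation.

```python
# Sketch: the residual structure of body-mounted-camera motion capture.
import numpy as np
from scipy.optimize import least_squares

def project(pose, cam, pt3d):
    """Hypothetical placeholder: forward kinematics of the limb carrying
    `cam`, followed by pinhole projection of `pt3d` into that camera.
    Returns a length-2 pixel coordinate."""
    raise NotImplementedError

def residuals(pose_seq_flat, n_frames, n_dof, observations, w_smooth=1.0):
    poses = pose_seq_flat.reshape(n_frames, n_dof)  # joint angles + root pose
    r = []
    for t, obs in enumerate(observations):
        for cam, pt3d, uv in obs:       # scene point seen by a limb camera
            r.extend(project(poses[t], cam, pt3d) - uv)   # image matching
    # Temporal continuity: penalize frame-to-frame pose change.
    r.extend(w_smooth * np.diff(poses, axis=0).ravel())
    return np.asarray(r)

# result = least_squares(residuals, x0, args=(T, D, observations)),
# with x0 initialized from structure-from-motion as in the paper.
```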

Multi-Sensor FusionCam
Traditional cinematography uses a single camera, while two-camera cinematographic rigs have also become common with the recent wave of 3D cinema. Taking camera design a step further, this project proposes a system in which a central cinematographic camera is augmented with a clip-on frame of satellite sensors. The satellite devices include compact cameras, a depth sensor, and a thermal camera. The result is a FusionCam that supports more powerful post-production analysis than is possible with a single camera or a two-camera rig, and is able to synthetically generate stereoscopic 3D imagery with specified stereo parameters.

The core research challenge is to produce better depth maps by integrating the high-resolution image from the central cinematographic camera with the information from the satellite sensor modalities. Performing fusion of these different modalities raises questions regarding how the strengths of the modalities can be best exploited, and how the weaknesses of each can best be compensated for. Current work is on analysis of the data at a single time instance, and new work will extend this to temporal analysis of video.
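One plausible ingredient of such fusion, sketched under the assumption that the satellite depth map is already registered to the central camera: joint bilateral upsampling, where the high-resolution film image guides the upsampled depth so that depth edges snap to intensity edges (requires the opencv-contrib `ximgproc` module; filter parameters are illustrative).

```python
# Sketch: upsample a coarse satellite depth map using the high-resolution
# cinematographic image as an edge guide.
import numpy as np
import cv2

def fuse_depth(low_res_depth, hi_res_gray, sigma_space=9, sigma_color=25):
    """low_res_depth: (h, w) float depth; hi_res_gray: (H, W) uint8 guide."""
    H, W = hi_res_gray.shape
    up = cv2.resize(low_res_depth, (W, H), interpolation=cv2.INTER_LINEAR)
    # Guided smoothing: depth edges align with the film camera's edges.
    return cv2.ximgproc.jointBilateralFilter(
        hi_res_gray, up.astype(np.float32), d=-1,
        sigmaColor=sigma_color, sigmaSpace=sigma_space)
```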

Point-less Calibration: Camera Parameters from Gradient-Based Alignment to Edge Images
Point-based targets, such as checkerboards, are often not practical for outdoor camera calibration, as cameras are usually mounted at significant heights, which would require extremely large calibration patterns on the ground. Fortunately, it is possible to make use of existing non-point landmarks in the scene by formulating camera calibration in terms of image alignment. In this paper, we simultaneously estimate the camera's intrinsic, extrinsic, and lens-distortion parameters directly by aligning to a planar schematic of the scene.
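A minimal sketch of calibration-as-alignment, under a simplified pinhole model without the distortion term: schematic edge points are projected through the current parameter estimate and scored against a distance transform of the image's edge map, which a generic optimizer then minimizes.

```python
# Sketch: score camera parameters by how well projected schematic edges
# align with image edges.
import numpy as np
import cv2
from scipy.optimize import minimize

def edge_cost(params, schematic_pts, edge_dist):
    """schematic_pts: (N, 3) float points on the planar schematic (Z = 0);
    edge_dist: distance transform of the image's edge map."""
    fx, fy, cx, cy = params[:4]
    rvec, tvec = params[4:7], params[7:10]
    K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=float)
    uv, _ = cv2.projectPoints(schematic_pts, rvec, tvec, K, None)
    uv = uv.reshape(-1, 2)
    h, w = edge_dist.shape
    u = np.clip(uv[:, 0], 0, w - 1).astype(int)
    v = np.clip(uv[:, 1], 0, h - 1).astype(int)
    # Mean distance of projected schematic edges to the nearest image edge.
    return edge_dist[v, u].mean()

# edges = cv2.Canny(img, 50, 150)
# edge_dist = cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3)
# res = minimize(edge_cost, x0, args=(pts3d, edge_dist), method='Nelder-Mead')
```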

Poselet Key-Framing: A Model for Human Activity Recognition
In this paper, we develop a new model for recognizing human actions. An action is modeled as a very sparse sequence of temporally local discriminative keyframes – collections of partial key-poses of the actor(s), depicting key states in the action sequence. We cast the learning of keyframes in a max-margin discriminative framework, where we treat keyframes as latent variables. This allows us to (jointly) learn a set of most discriminative keyframes while also learning the local temporal context between them. Keyframes are encoded using a spatially localizable poselet-like representation with HoG and BoW components learned from weak annotations; we rely on a structured SVM formulation to align our components and mine hard negatives to boost localization performance. This results in a model that supports spatio-temporal localization and is insensitive to dropped frames or partial observations. We show classification performance that is competitive with the state of the art on the benchmark UT-Interaction dataset and illustrate that our model outperforms prior methods in an online streaming setting.

ProCams Toolbox
Projectors are the ultimate programmable light sources and have a tremendous potential to alter the appearance of objects, and even transform entire environments. Multi-projector systems that combine several overlapping projections onto arbitrary objects and geometry require compensation for both shape and color. The ProCams Toolbox is an applied project that develops a software toolbox for creating a range of applications for different projector deployments with varying requirements. The core of the toolbox is a set of standard software components such as projector calibration, camera calibration, screen calibration, intensity compensation, and color correction. These components are modular and are plugged together in a universal application framework. The toolbox is being grown into a resource that is useful for any projection installation, with research anticipated in projection onto 3D objects and dynamic scenes.
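As a flavor of what an intensity-compensation component does, here is a minimal per-pixel sketch assuming captured camera responses to full-black and full-white projections, registered to projector space; real components also handle color mixing and the projector's non-linear response.

```python
# Sketch: simplest per-pixel intensity compensation in a projector-camera
# system, rescaling each pixel so the *observed* result matches the target.
import numpy as np

def compensate(target, cam_black, cam_white):
    """All inputs: float images in [0, 1], registered to projector space.
    cam_black / cam_white: camera captures of full-black / full-white
    projections (ambient light and per-pixel surface reflectance)."""
    gain = np.maximum(cam_white - cam_black, 1e-3)   # usable dynamic range
    out = (target - cam_black) / gain
    # Clipping marks targets that are physically unreachable on this surface.
    return np.clip(out, 0.0, 1.0)
```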

Representing and Discovering Adversarial Team Behaviors Using Player Roles
In this paper, we describe a method to represent and discover adversarial group behavior in a continuous domain. In comparison to other types of behavior, adversarial behavior is heavily structured, as the location of a player (or agent) depends on their teammates and adversaries, in addition to the tactics or strategies of the team.

Robust Photometric Projector Compensation
Photometric projector compensation is used in various application fields, such as building projections and augmented reality. Its preparation, however, is still a laborious process. While several algorithms have already been presented that generate compensated projections with impressive quality, they all require the devices to be, at least partially, radiometrically calibrated and accurately registered to the surface. This process can be cumbersome and time-consuming, reducing the overall flexibility of the system. Furthermore, it makes the algorithms impractical to deploy, because non-experts might not be able to re-calibrate the system accurately, significantly degrading projection quality. The main goals of this project are: (1) development of a compensation algorithm that does not require any radiometric pre-calibration of the cameras and projectors; (2) research on how to optimize the compensation images such that they preserve high perceptual quality even in the case of slight misregistrations.

The developed non-linear color mapping methods are also able to generate high-quality results in other application fields, such as display gamut mapping and color space mapping for images and videos.
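A loose sketch of how such a non-linear color mapping could be fitted from sparse test projections, using a scattered-data interpolant. Fitting the camera-to-input direction directly, so that desired colors can be queried, is a simplification that assumes the mapping is invertible over the sampled gamut; the kernel and smoothing settings are illustrative.

```python
# Sketch: learn a non-linear color mapping without radiometric
# pre-calibration, from (projector input, camera observation) color pairs.
import numpy as np
from scipy.interpolate import RBFInterpolator

def fit_compensation_map(projected_rgb, observed_rgb):
    """Both arrays: (N, 3) color samples from sparse test projections.
    Fits observed -> input, so querying with a *desired* color returns the
    projector input expected to produce it."""
    return RBFInterpolator(observed_rgb, projected_rgb,
                           kernel='thin_plate_spline', smoothing=1e-3)

# comp = fit_compensation_map(inputs, captures)
# comp_img = comp(desired.reshape(-1, 3)).reshape(desired.shape)
```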

Scene Reconstruction from High Spatio-Angular Resolution Light Fields
This paper describes a method for scene reconstruction of complex, detailed environments from 3D light fields. Densely sampled light fields on the order of 10^9 light rays allow us to capture the real world in unparalleled detail, but efficiently processing this amount of data to generate an equally detailed reconstruction represents a significant challenge to existing algorithms. We propose an algorithm that leverages coherence in massive light fields by breaking with a number of established practices in image-based reconstruction. Our algorithm first computes reliable depth estimates specifically around object boundaries instead of interior regions, by operating on individual light rays instead of image patches. More homogeneous interior regions are then processed in a fine-to-coarse procedure rather than the standard coarse-to-fine approaches. At no point in our method is any form of global optimization performed. This allows our algorithm to retain precise object contours while still ensuring smooth reconstructions in less detailed areas. While the core reconstruction method handles general unstructured input, we also introduce a sparse representation and a propagation scheme for reliable depth estimates which make our algorithm particularly effective for 3D input, enabling fast and memory-efficient processing of "Gigaray light fields" on a standard GPU. We show dense 3D reconstructions of highly detailed scenes, enabling applications such as automatic segmentation and image-based rendering, and provide an extensive evaluation and comparison to existing image-based reconstruction techniques.
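To see why per-ray processing is natural for dense light fields: in an epipolar-plane image (EPI), each scene point traces a line whose slope encodes its depth. The sketch below recovers that slope with a structure tensor, a standard illustration of the geometry; the project's algorithm instead tests color consistency of individual rays, but the underlying relation is the same.

```python
# Sketch: per-pixel line orientation (and hence disparity) in an
# epipolar-plane image, via the classic structure-tensor estimate.
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def epi_disparity(epi, sigma=1.5):
    """epi: (n_views, width) grayscale epipolar-plane image."""
    Is = sobel(epi, axis=0)             # derivative across views
    Iu = sobel(epi, axis=1)             # derivative along the scanline
    # Smoothed structure-tensor entries.
    Jss = gaussian_filter(Is * Is, sigma)
    Juu = gaussian_filter(Iu * Iu, sigma)
    Jsu = gaussian_filter(Is * Iu, sigma)
    # Dominant line orientation per pixel; tan(angle) ~ disparity.
    return 0.5 * np.arctan2(2 * Jsu, Juu - Jss)
```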

Shadowpix: Multiple Images from Self Shadowing
SHADOWPIX are white surfaces that display several prescribed images formed by the self-shadowing of the surface when lit from certain directions. The effect is surprising and not commonly seen in the real world. We present algorithms for constructing SHADOWPIX that allow up to four images to be embedded in a single surface. SHADOWPIX can produce a variety of unusual effects depending on the embedded images: moving the light can animate or relight the object in the image, or three colored lights may be used to produce a single colored image. SHADOWPIX are easy to manufacture using a 3D printer. Additional photographs, videos, and renderings demonstrating these effects can be found in the accompanying paper and video.
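The self-shadowing relation the construction algorithms invert can be simulated in a few lines; this sketch assumes directional light from the left at elevation angle alpha and marks each heightfield column as lit or shadowed by scanning for blockers.

```python
# Sketch: simulate the self-shadowing a SHADOWPIX heightfield produces.
import numpy as np

def shadow_image(height, alpha):
    """height: (H, W) heightfield; light from the left at elevation alpha
    (radians). Returns 1 where lit, 0 where self-shadowed."""
    H, W = height.shape
    lit = np.ones((H, W), dtype=bool)
    slope = np.tan(alpha)               # ray drop per column of travel
    for x in range(1, W):
        dist = (x - np.arange(x))[None, :]          # columns to the left
        # Highest point any left-neighbor's blocking ray reaches at column x.
        ray = (height[:, :x] - slope * dist).max(axis=1)
        lit[:, x] = height[:, x] >= ray
    return lit.astype(float)
```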

Strategy Analysis in Soccer
In comparison to other types of behavior, adversarial behavior is heavily structured, as the location of a player (or agent) depends on their teammates and adversaries, in addition to the tactics or strategies of the team.

Stylized Hair Capture
Recently, we have seen significant growth in the design and fabrication of personalized figurines, created by scanning real people and then physically reproducing miniature statues with 3D printers. The development of novel methods is currently a hot topic both in academia and industry, and the printed figurines are gaining more and more realism, especially with state-of-the-art facial scanning technology improving. However, current systems all contain the same limitation - no previous method is able to suitably capture personalized hair-styles for physical reproduction. Typically, the subject's hair is approximated very coarsely or replaced completely with a template model from a library. In this project, we develop the first method for stylized hair capture, a technique to reconstruct an individual's actual hair-style in a manner suitable for physical reproduction. Inspired by centuries-old artistic sculptures, our method generates hair as a closed-manifold surface, yet contains the structural and color elements stylized in a way that captures the defining characteristics of the hair-style. The key to our approach is a novel multi-view stylization algorithm, which extends feature-preserving color filtering from 2D images to irregular manifolds in 3D, and introduces abstract geometric details that are coherent with the color stylization. The proposed technique fits naturally in existing pipelines for figurine reproduction, and we demonstrate the robustness and versatility of our approach by capturing several subjects with widely varying hair-styles.

“Win at Home and Draw Away”: Automatic Formation Analysis Highlighting the Differences in Home and Away Team Behaviors
In terms of analyzing soccer matches, two of the most important factors to consider are: 1) the formation the team played (e.g., 4-4-2, 4-2-3-1), and 2) the manner in which they executed it (e.g., conservative - sitting deep, or aggressive - pressing high). Despite the existence of ball and player tracking data, no current methods can automatically detect and visualize formations. Using an entire season of Prozone data, consisting of ball and player tracking information from a recent top-tier professional league, we showcase an automatic formation detection method by investigating the “home advantage”. In a recently published paper, using an entire season of ball tracking data, we showed that home teams had significantly more possession in the forward third, which correlated with more shots and goals, while shooting and passing proficiencies were the same. Using our automatic formation analysis, we extend this work and show that teams tend to play the same formation at home as they do away, but the manner in which they execute it is significantly different. Specifically, we show that teams position their formation significantly higher up the field at home than they do away. This conservative approach to away games suggests that coaches aim to win their home games and draw their away games. Additionally, we show that our method can visually summarize a game, giving an indication of dominance and tactics. Beyond enabling new discoveries of team behavior that can enhance analysis, our method is, to our knowledge, the first automatic formation detection method to be developed.