From Object Recognition to Visual Question Answering

After the first of our posts dedicated to the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), I would now like to concentrate on some of the topics discussed during the conference. Later, I will share a list of the best talks I attended at CVPR.

Detecting and recognising objects by automatically extracting information from pictures is one of the classical, extensively studied problems in Computer Vision. Although the results currently achieved are satisfactory in several application areas, there are still many cases for which we need better solutions. For instance, an industrial environment is a difficult context for successful object recognition, because many targets may look highly similar, their background may be cluttered and occlusions may occur frequently.

The KMLE team from the Computer Vision and Augmented Reality research area has been focusing on these kinds of problems over the last year, and some of our results were presented at the Open Domain Action Recognition workshop (ODAR) on 21st July 2017 during the CVPR conference. In order to monitor the tasks performed by an operator using an Augmented Reality (AR) device, we have developed a method for automatically recognising an object’s state during an industrial manufacturing or maintenance process. More specifically, by exploiting a pair of smart glasses that integrates a Holographic Optical Element (HOE), our goal is to help the operator successfully complete complex maintenance or manufacturing activities in an industrial context.

Combining the smart glasses, or any other cyber-physical system, with an object state recognition method makes it possible to inform users when they handle objects or other instruments incorrectly, and to provide relevant information about the task being performed. The use case our team has been focusing on is helping a user correctly position a component board in an electronics rack. For more information about our research, you can download our paper, “Object State Recognition for Automatic AR-Based Maintenance Guidance”.

During the CVPR conference, I also took part in several workshops and tutorials: the Scene Understanding workshop, the Visual Question Answering workshop and the Zero-shot Learning tutorial.

An interesting talk given by Ross Girshick demonstrated how semantic understanding might enable us to move from object detection to instance-level segmentation, where the focus is on segmenting the different instances of an object. He also offered several suggestions for approaching such computer vision problems: using “same” padding, which keeps the size of a layer’s output the same as the layer’s input, and using fully convolutional networks rather than relying on a fully connected layer at the end of the network. He then presented their survey comparing one- and multi-stage detector architectures.
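
To make the “same” padding idea concrete, here is a minimal sketch in plain Python. It is not code from the talk: the helper names `same_padding` and `conv1d_same` are my own, and it assumes a one-dimensional input, stride 1 and zero padding. It simply shows that padding the input with a total of kernel size minus one zeros keeps the output length equal to the input length.

```python
def same_padding(kernel_size):
    """Return (left, right) zero padding so that a stride-1 convolution
    produces an output as long as its input."""
    total = kernel_size - 1          # total padding needed for stride 1
    left = total // 2                # split as evenly as possible
    return left, total - left


def conv1d_same(signal, kernel):
    """Stride-1 1D sliding-window product (cross-correlation, as used in
    most deep learning libraries) with zero "same" padding."""
    left, right = same_padding(len(kernel))
    padded = [0.0] * left + list(signal) + [0.0] * right
    k = len(kernel)
    # One output value per input position.
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
output = conv1d_same(signal, kernel)

print(len(signal), len(output))  # both lengths are 5
print(output)
```

Without the padding, the same kernel would shrink the output by two elements per layer, which is why “same” padding is convenient when the output has to line up with the input pixel by pixel, as in segmentation.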

One of the hot topics of the conference was “Visual Question Answering”, a new and exciting problem that combines Natural Language Processing (NLP) and computer vision techniques. Apart from helping blind people understand their surroundings, I am still thinking about what the most relevant applications for this topic might be, but I believe that Visual Question Answering will keep pushing the boundaries of image understanding.

Finally, Larry Zitnick gave an inspiring presentation about the transition from Scene Understanding to Artificial Intelligence. What differences lie between these two disciplines? What steps are required to move from the former to the latter? His message was that Reasoning, supported by Interaction with the scene, would enable a better understanding of the scene itself and its surroundings. He also explained that many Visual Question Answering methods currently fail under real conditions, even though they achieve good results in test settings, because of the bias in the questions they are asked. For instance, questions starting with “Is there a …?” are usually more likely to have a positive answer. A similar bias applies to video datasets collected from YouTube or Facebook: people usually upload videos that contain interesting, and often unique and atypical, activities. Learning and classifying everyday actions will therefore not be easy with such a dataset of examples.
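
The question bias can be illustrated with a small sketch. This is not a method from the talk: the question and answer pairs below are hypothetical toy data that I invented for illustration, and the helper names `prefix` and `fit_prior` are my own. The “model” never looks at an image; it just memorises the most frequent answer for each question opening, yet it still scores well on a test set that shares the same bias.

```python
from collections import Counter, defaultdict

# Hypothetical toy (question, answer) pairs, invented purely for illustration.
train = [
    ("Is there a dog?", "yes"),
    ("Is there a car?", "yes"),
    ("Is there a person?", "yes"),
    ("Is there a cat?", "no"),
    ("What colour is the car?", "red"),
    ("What colour is the wall?", "white"),
    ("What colour is the door?", "white"),
]
test = [
    ("Is there a tree?", "yes"),
    ("Is there a boat?", "no"),
    ("Is there a bike?", "yes"),
    ("What colour is the roof?", "white"),
]


def prefix(question, n_words=3):
    """Use the first few words of the question as its 'type'."""
    return " ".join(question.lower().split()[:n_words])


def fit_prior(pairs):
    """Map each question prefix to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in pairs:
        counts[prefix(question)][answer] += 1
    return {p: c.most_common(1)[0][0] for p, c in counts.items()}


prior = fit_prior(train)

# Answer every test question from the language prior alone (no image used).
predictions = [prior.get(prefix(q)) for q, _ in test]
accuracy = sum(p == a for p, (_, a) in zip(predictions, test)) / len(test)

print(prior["is there a"])  # yes
print(accuracy)             # 0.75 on this toy data
```

A real system that behaves like this baseline looks accurate on a benchmark with the same answer distribution, but it has learned nothing about the image, which is one way to read the failures under real conditions described above.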

Pavel Dvorak
Research Specialist CV Area