This hybrid AI system can understand causality in controlled environments

Look at the short video below. Can you answer the following questions? Which object caused the ball to change direction? Where will the ball go next? What would happen if you removed the bat from the scene?

You might consider these questions trivial. But interestingly, today’s most advanced artificial intelligence systems would struggle to answer them. Questions such as the ones asked above require the ability to reason about objects and their behaviors and relations over time. This is an integral component of human intelligence, but one that has remained elusive to AI scientists for decades.

A new study presented at ICLR 2020 by researchers at IBM, MIT, Harvard, and DeepMind highlights the shortcomings of current AI systems in dealing with causality in videos. In their paper, the researchers introduce CLEVRER, a new dataset and benchmark to evaluate the capabilities of AI algorithms in reasoning about video sequences, and Neuro-Symbolic Dynamic Reasoning (NS-DR), a hybrid AI system that marks a substantial improvement on causal reasoning in controlled environments.

Why artificial intelligence can’t reason about videos

For us humans, detecting and reasoning about objects in a scene almost go hand in hand. But for current artificial intelligence technology, they’re two fundamentally different disciplines.

In recent years, deep learning has brought great advances to the field of artificial intelligence. Deep neural networks, the main component of deep learning algorithms, can find intricate patterns in large sets of data. This enables them to perform tasks that were previously off-limits or very difficult for computer software, such as detecting objects in images or recognizing speech.

It’s amazing what pattern recognition alone can achieve. Neural networks play an important role in many of the applications we use every day, from finding objects and scenes in Google Images to detecting and blocking inappropriate content on social media. Neural networks have also made some inroads in generating descriptions about videos and images.

But there are also very clear limits to how far you can push pattern recognition. While an important part of human vision, pattern recognition is only one of its many components. When our brain parses the baseball video at the beginning of this article, our knowledge of motion, object permanence, and solidity kicks in. Based on this knowledge, we can predict what will happen next (where the ball will go) and reason about counterfactual situations (what if the bat didn’t hit the ball). This is why even a person who has never seen baseball played before will have a lot to say about this video.

A deep learning algorithm, however, detects the objects in the scene because they are statistically similar to thousands of other objects it has seen during training. It knows nothing about material, gravity, motion, and impact, some of the concepts that allow us to reason about the scene.

Visual reasoning is an active area of research in artificial intelligence. Researchers have developed several datasets that evaluate AI systems’ ability to reason over video segments. Whether deep learning alone can solve the problem is an open question.

Some AI scientists believe that given enough data and compute power, deep learning models will eventually be able to overcome some of these challenges. But so far, progress in fields that require common sense and reasoning has been slow and incremental.

The CLEVRER dataset

The new dataset introduced at ICLR 2020 is named “CoLlision Events for Video REpresentation and Reasoning,” or CLEVRER. It is inspired by CLEVR, a visual question-answering dataset developed at Stanford University in 2017. CLEVR is a set of problems that present still images of solid objects. The AI agent must parse the scene and answer multiple-choice questions about the number of objects, their attributes, and their spatial relationships.
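To make the kind of question CLEVR asks more concrete, here is a minimal Python sketch. It assumes the scene has already been parsed into a list of objects with attributes, which is the hard perception step a real system would have to solve first. The SceneObject class, the attribute names, and the helper functions are hypothetical and chosen for illustration; they are not the dataset's actual format or the researchers' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical symbolic description of one object in a still image."""
    shape: str
    color: str
    material: str
    x: float  # horizontal position; a smaller value means further left


# A made-up scene with three objects (illustrative data only).
scene = [
    SceneObject("cube", "red", "metal", 0.2),
    SceneObject("sphere", "blue", "rubber", 0.5),
    SceneObject("cylinder", "red", "rubber", 0.8),
]


def filter_by(objects, **attrs):
    """Keep the objects whose attributes match every given key/value pair."""
    return [o for o in objects
            if all(getattr(o, key) == value for key, value in attrs.items())]


def left_of(objects, anchor):
    """Keep the objects positioned to the left of the anchor object."""
    return [o for o in objects if o.x < anchor.x]


# "How many red objects are there?" (counting by attribute)
n_red = len(filter_by(scene, color="red"))

# "What shape is the object left of the blue sphere?" (spatial relationship)
blue_sphere = filter_by(scene, shape="sphere", color="blue")[0]
shapes_left = [o.shape for o in left_of(scene, blue_sphere)]

print(n_red)
print(shapes_left)
```

Once a scene is available in this symbolic form, questions about counts, attributes, and spatial relationships reduce to short chains of filter and compare steps. The difficulty the article describes lies in getting from raw pixels to such a representation, and in extending it from still images to events that unfold over time.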
