Researchers at the Massachusetts Institute of Technology (MIT) have developed a new parser — a mechanism for breaking down computer data into smaller elements — that learns a language through observation, much as children do.
In computing, learning a language is the task of syntactic and semantic parsers. These systems are trained on sentences annotated by humans that describe the structure and meaning behind words.
“But gathering the annotation data can be time-consuming and difficult for less common languages,” MIT said in a blog post. “Additionally, humans don't always agree on the annotations, and the annotations themselves may not accurately reflect how people naturally speak.”
The new technique, the researchers claim, greatly improves the parser’s ability to extract meaning from its observations.
To learn the structure of a language, the parser observes captioned videos, with no other information, and associates the words with recorded objects and actions.
Given a new sentence, the parser can then use what it has learned about the structure of the language to accurately predict a sentence's meaning, without the video.
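MIT’s actual system is far more sophisticated, but the underlying idea — pairing whole captions with whatever the video shows, with no word-level labels, and letting correct word meanings accumulate evidence across many examples — can be illustrated with a toy cross-situational learner. Everything below (the class, the captions, the concept labels) is invented for illustration and is not the researchers’ model:

```python
from collections import Counter, defaultdict

class CrossSituationalLearner:
    """Toy learner trained only on (caption, things-seen-in-video) pairs —
    weak supervision, with no word-level annotation."""

    def __init__(self):
        # word -> counts of the concepts it has co-occurred with
        self.cooc = defaultdict(Counter)

    def observe(self, caption, video_concepts):
        # Every word in the caption is credited with every concept the
        # video shows; true meanings accumulate the most evidence
        # across many captioned videos.
        for word in caption.lower().split():
            self.cooc[word].update(video_concepts)

    def meaning(self, word):
        counts = self.cooc.get(word.lower())
        return counts.most_common(1)[0][0] if counts else None

    def parse(self, sentence):
        # After training, predict meanings for a new sentence — no video needed.
        return [self.meaning(w) for w in sentence.lower().split()]

learner = CrossSituationalLearner()
learner.observe("the woman picks up the ball", ["woman", "pick_up", "ball"])
learner.observe("the woman drops the cup", ["woman", "drop", "cup"])
learner.observe("the man picks up the cup", ["man", "pick_up", "cup"])
learner.observe("the man throws the ball", ["man", "throw", "ball"])
learner.observe("the woman throws the cup", ["woman", "throw", "cup"])

print(learner.meaning("ball"))          # prints: ball
print(learner.parse("man throws ball")) # prints: ['man', 'throw', 'ball']
```

No single captioned video pins down what “ball” means, but across several videos the word reliably co-occurs only with the ball itself — which is how the parser narrows meanings down without direct annotation.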
This technique requires only ‘weak supervision’ — limited training data — and mimics how children learn languages by perceiving everything around them, without any explicit directions or context, the post added.
MIT said that parsers are becoming increasingly important for web searches, natural-language database querying, and voice-recognition systems such as Alexa and Siri.
“The approach could expand the types of data and reduce the effort needed for training parsers,” MIT said. “A few directly annotated sentences, for instance, could be combined with many captioned videos, which are easier to come by, to improve performance.”
The findings were presented earlier this week at the Conference on Empirical Methods in Natural Language Processing (EMNLP) in Belgium.
“By assuming that all of the sentences must follow the same rules, that they all come from the same language, and seeing many captioned videos, you can narrow down the meanings further,” said co-author Andrei Barbu, a researcher in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT.
The new parser is the first to be trained using video, said first author Candace Ross, a graduate student in the Department of Electrical Engineering and Computer Science and CSAIL.
Videos are useful in part because they reduce ambiguity, the blog post continued. If the parser is unsure about, say, an action or object in a sentence, it can reference the video to resolve the uncertainty.
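That disambiguation step can be sketched in a few lines. This is a hypothetical illustration, not the MIT system: a sentence admits several candidate readings, and the paired video keeps only the one consistent with what was actually filmed:

```python
def disambiguate(candidate_meanings, video_concepts):
    """Keep only the candidate readings the paired video supports;
    return the reading if exactly one survives, else None."""
    consistent = [m for m in candidate_meanings if m in video_concepts]
    return consistent[0] if len(consistent) == 1 else None

# "bat" could be the animal or the sports equipment; the video shows a game,
# so only the sports reading is consistent with what was observed.
print(disambiguate(["animal_bat", "sports_bat"],
                   {"sports_bat", "ball", "swing"}))  # prints: sports_bat
```

A still image or text alone often cannot settle such cases; the temporal context of video — who is doing what to which object — is what makes the filtering decisive.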
“There are temporal components — objects interacting with each other and with people — and high-level properties you wouldn’t see in a still image or just in language,” Ross said.
In future work, the researchers are interested in modeling interactions, not just passive observations.
“Children interact with the environment as they’re learning. Our idea is to have a model that would also use perception to learn,” added Ross.