Programming with Millions of Examples

Final Report Summary - PRIME (Programming with Millions of Examples)

The PRIME project developed techniques for learning from “big code” and for using the learned models in a wide range of applications. We showed how to leverage a combination of program analysis and machine learning to build expressive models of how real-world software is written. These models are used for program synthesis, program validation, and reverse engineering.

Representations

Leveraging machine learning models for predicting program properties such as variable names, method names, and expression types is a topic of much recent interest. These techniques are based on learning a statistical model from a large amount of code and using the model to make predictions in new programs. A major challenge in these techniques is how to represent instances of the input space to facilitate learning. Designing a program representation that enables effective learning is a critical task that is often done manually for each task and programming language.

In the course of the project, we developed several representations for learning from programs. Our approach uses different path-based abstractions of the program’s abstract syntax tree. This family of path-based representations is natural, general, fully automatic, and works well across different tasks and programming languages.
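To make this concrete, the following minimal sketch (an illustration only, not the project’s actual extraction pipeline) uses Python’s built-in ast module to enumerate leaf-to-leaf path-contexts over a snippet’s abstract syntax tree; the token and path encodings are simplified assumptions.

    import ast
    import itertools

    def token(node):
        # Surface form of a leaf: an identifier, a literal, or the node type.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        return type(node).__name__

    def root_to_leaf(node, prefix=()):
        # Yield the root-to-leaf node sequence for every leaf of the AST,
        # ignoring expression-context markers (Load/Store) for readability.
        children = [c for c in ast.iter_child_nodes(node)
                    if not isinstance(c, ast.expr_context)]
        if not children:
            yield prefix + (node,)
        for child in children:
            yield from root_to_leaf(child, prefix + (node,))

    def path_contexts(source):
        # Yield (token, path, token) triples; the path climbs from one leaf
        # to the lowest common ancestor and descends to the other leaf.
        leaves = list(root_to_leaf(ast.parse(source)))
        for a, b in itertools.combinations(leaves, 2):
            i = 0
            while i < min(len(a), len(b)) and a[i] is b[i]:
                i += 1
            up = [type(n).__name__ for n in reversed(a[i - 1:])]
            down = [type(n).__name__ for n in b[i:]]
            yield token(a[-1]), " ".join(up + down), token(b[-1])

    snippet = "def abs_val(x):\n    return x if x > 0 else -x"
    for ctx in itertools.islice(path_contexts(snippet), 5):
        print(ctx)

Each triple pairs the surface tokens of two leaves with the syntactic path that connects them through their lowest common ancestor; a learner can then treat a whole snippet as a bag of such path-contexts.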

Predicting Program Properties with Neural Networks

We developed a framework for predicting program properties using neural networks. The main idea is a neural network that learns code embeddings - continuous distributed vector representations for code. The code embeddings allow us to model the correspondence between a code snippet and its labels in a natural and effective manner. By learning code embeddings, our long-term goal is to enable the application of neural techniques to a wide range of programming-language tasks.

In the course of the project, we developed several neural models for code, including "code2vec", a model that predicts descriptive labels for a given code snippet, and "code2seq", a model that predicts a natural-language sentence describing a given code snippet.

Live demos of the framework are available at https://code2vec.org and https://code2seq.org.

Our neural network architecture uses a representation of code snippets that leverages the structured nature of source code, and learns to aggregate multiple syntactic paths into a single vector. This ability is fundamental for the application of deep learning to programming languages. By analogy, word embeddings sparked a revolution in the application of deep learning to natural language processing (NLP) tasks.
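The aggregation step can be sketched in a few lines of numpy. In this toy version, every quantity the real network learns end-to-end (the embedded path-contexts, the attention vector, the label embeddings, and the label set itself) is replaced by a random or hypothetical stand-in; only the attention-and-aggregate mechanism is faithful.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 8           # embedding width (illustrative)
    n_contexts = 4  # path-contexts extracted from one snippet
    labels = ["sort", "reverse", "contains"]  # hypothetical target labels

    # In the real model these are learned end-to-end; random stand-ins here.
    context_vecs = rng.normal(size=(n_contexts, d))  # embedded path-contexts
    attn_vec = rng.normal(size=d)                    # global attention vector
    label_vecs = rng.normal(size=(len(labels), d))   # label embeddings

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Soft attention: weight each path-context, then aggregate the weighted
    # sum into a single code vector representing the whole snippet.
    weights = softmax(context_vecs @ attn_vec)
    code_vector = weights @ context_vecs

    # Predict a label by similarity between the code vector and each label.
    probs = softmax(label_vecs @ code_vector)
    print("attention over contexts:", np.round(weights, 3))
    print("predicted label:", labels[int(np.argmax(probs))])

The attention weights computed here are also the basis of the interpretability discussed next.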

Interpreting Neural Networks

Despite the “black-box” reputation of neural networks, our model is partially interpretable thanks to the attention mechanism, which allows us to visualize the distribution of weights over the bag of path-contexts.
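For example, ranking a snippet’s path-contexts by attention weight shows which syntactic paths drove a prediction. A toy rendering of that view, with hypothetical contexts and weights:

    # Hypothetical path-contexts and attention weights for one snippet.
    contexts = [
        ("x", "Name Compare If Return", "0"),
        ("x", "Name IfExp Return", "x"),
        ("0", "Constant Compare IfExp", "x"),
    ]
    weights = [0.62, 0.27, 0.11]

    # Rank the path-contexts by how much attention the model paid to each.
    for w, (left, path, right) in sorted(zip(weights, contexts), reverse=True):
        print(f"{w:.2f}  {left} --[{path}]--> {right}")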

In addition, we developed a technique for extracting finite-state automata from recurrent neural networks (RNNs).
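The general idea can be illustrated by a simple quantization-based sketch (an illustration only, not our actual extraction algorithm): run an RNN on sample strings, discretize its continuous hidden states into abstract states, and record the transitions between them. Here a toy, untrained cell stands in for a trained network.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(1)
    h_dim, alphabet = 3, "ab"

    # A toy, untrained RNN cell; the real target is a trained network.
    W = rng.normal(size=(h_dim, h_dim))
    U = rng.normal(size=(len(alphabet), h_dim))

    def step(h, sym):
        return np.tanh(h @ W + U[alphabet.index(sym)])

    def quantize(h):
        # Crude state abstraction: discretize each hidden unit by its sign.
        return tuple(h > 0)

    # Run the RNN on all short strings and record transitions between
    # abstract states, yielding a finite-state approximation of the RNN.
    start = quantize(np.zeros(h_dim))
    transitions = {}
    words = ("".join(p) for n in range(1, 5) for p in product(alphabet, repeat=n))
    for word in words:
        h, state = np.zeros(h_dim), start
        for sym in word:
            h = step(h, sym)
            nxt = quantize(h)
            transitions[(state, sym)] = nxt
            state = nxt

    states = {s for s, _ in transitions} | set(transitions.values())
    print(len(states), "abstract states,", len(transitions), "transitions")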

Interaction Models for Synthesis

As part of our journey towards "Augmented Programmer Intelligence", we also developed several interaction models that allow a developer to interact with a synthesis engine, and showed that with these models developers solve programming challenges more effectively with the assistance of a synthesizer.
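As a toy illustration of one such interaction model (a hypothetical example-driven loop over a made-up candidate space, not necessarily the models we studied), the sketch below keeps the candidate programs consistent with the developer’s examples and, while several candidates survive, asks for the intended output on an input that distinguishes them.

    # Hypothetical toy DSL: unary integer functions over a few primitives.
    CANDIDATES = [
        ("x + 1", lambda x: x + 1),
        ("x * 2", lambda x: x * 2),
        ("x * x", lambda x: x * x),
        ("abs(x)", lambda x: abs(x)),
    ]

    def synthesize(examples):
        # Return every candidate consistent with the input/output examples.
        return [(src, f) for src, f in CANDIDATES
                if all(f(i) == o for i, o in examples)]

    # Interaction loop: the developer adds examples until one program remains.
    examples = [(2, 4)]  # initial spec: f(2) = 4
    while True:
        survivors = synthesize(examples)
        print("examples:", examples, "->", [src for src, _ in survivors])
        if len(survivors) <= 1:
            break
        # Disambiguate: find an input on which two survivors disagree and
        # ask the developer (scripted here) for the intended output.
        (_, f1), (_, f2) = survivors[0], survivors[1]
        x = next(i for i in range(-5, 6) if f1(i) != f2(i))
        examples.append((x, x * 2))  # the developer intends the doubling program

The loop converges once the examples pin down a single program; here the developer’s answers are scripted, but the same structure supports a live dialogue.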