
Oct 25 2021

CodeMatcher performs fuzzy search

Title: CodeMatcher, Searching Code Based on Sequential Semantics of Important Query Words
Authors: Chao Liu - Xin Xia - David Lo - Zhongxin Liu - Ahmed E. Hassan - Shanping Li

Abstract: To accelerate software development, developers frequently search and reuse existing code snippets from a large-scale codebase, e.g., GitHub. Over the years, researchers proposed many information retrieval (IR)-based models for code search, but they fail to connect the semantic gap between query and code. An early successful deep learning (DL)-based model DeepCS solved this issue by learning the relationship between pairs of code methods and corresponding natural language descriptions. 

Two major advantages of DeepCS are the capability of understanding irrelevant/noisy keywords and capturing sequential relationships between words in query and code. In this article, we proposed an IR-based model CodeMatcher that inherits the advantages of DeepCS (i.e., the capability of understanding the sequential semantics in important query words), while it can leverage the indexing technique in the IR-based model to accelerate the search response time substantially. 
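The idea of matching important query words in their original order can be sketched roughly as follows. This is an illustrative toy, not the CodeMatcher implementation: the stop-word list, the scoring rule and every function name are assumptions, and a plain Python list stands in for the search index.

```python
import re

# Hypothetical stop-word list; deciding which query words are "important"
# would need a richer filter in practice.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "how", "do", "i", "from", "with"}


def important_words(query):
    """Keep the query words that are not stop words, preserving their order."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    return [w for w in words if w not in STOP_WORDS]


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase tokens."""
    parts = re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", name)
    return [p.lower() for p in parts]


def sequence_score(query_words, code_tokens):
    """Fraction of query words found in code_tokens in the same relative order."""
    if not query_words:
        return 0.0
    pos = 0
    matched = 0
    for word in query_words:
        try:
            # Only look to the right of the previous match to respect the order.
            pos = code_tokens.index(word, pos) + 1
            matched += 1
        except ValueError:
            continue  # word is missing or appears out of order
    return matched / len(query_words)


def search(query, method_names):
    """Rank method names by how well they match the ordered important words."""
    words = important_words(query)
    scored = [(sequence_score(words, split_identifier(m)), m) for m in method_names]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [name for score, name in ranked if score > 0]


results = search(
    "how to read a file to string",
    ["writeStringToFile", "readFileToString", "fileRead"],
)
```

Here the method whose name contains "read", "file" and "string" in that order is ranked first, while names containing the same words in a different order score lower.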


Sep 27 2021

Facebook GSLM textless NLP


Facebook recently introduced the Generative Spoken Language Model (GSLM), a model for what it calls textless NLP.
The research team believes that GSLM can be an effective pre-training method for downstream tasks with little labelled or annotated data, such as spoken summarization, information retrieval, and sentiment analysis.
GSLM uses the latest breakthroughs in representation learning, allowing it to work directly from raw audio signals, without any text or labels. According to Facebook, this opens the door to a new era of textless NLP applications for potentially every language spoken on Earth, even those without significant text datasets. In addition, it enables the development of NLP models that incorporate the full range of expressivity of oral language.

Sep 16 2021

NLP-Based Source Code Analysis Tools

Tree Technology, a partner in DECODER Project, posted an article about recent R&D efforts. 

Abstract: We have used Natural Language Processing (NLP) techniques in tools aimed at supporting and improving the software development and software quality processes for the Java and C/C++ languages.

The use of complex models has increased performance in many common NLP tasks, such as named entity recognition, text classification, summarisation and translation among others. Besides, transfer learning has also become an interesting option when not much labelled data is available and knowledge learnt from one problem can be applied to a new but related task. In this context, our two NLP-based source code analysis tools - namely Variable Misuse and Code Summarisation - have been conceived by and for software developers.


Aug 10 2021

Automated Classification of Overfitting Patches with Statically Extracted Code Features


Authors: He Ye; Jian Gu; Matias Martinez; Thomas Durieux; Martin Monperrus
Contact: KTH Royal Institute of Technology School of Computer Science and Communication, 156318 Stockholm, Stockholm, Sweden
Publication: IEEE Transactions on Software Engineering

Automatic program repair (APR) aims to reduce the cost of manually fixing software defects. However, APR suffers from generating a multitude of overfitting patches, those patches that fail to correctly repair the defect beyond making the tests pass. This paper presents a novel overfitting patch detection system called ODS to assess the correctness of APR patches. ODS first statically compares a patched program and a buggy program in order to extract code features at the abstract syntax tree (AST) level. Then, ODS uses supervised learning with the captured code features and patch correctness labels to automatically learn a probabilistic model. The learned ODS model can then finally be applied to classify new and unseen program repair patches. We conduct a large-scale experiment to evaluate the effectiveness of ODS on patch correctness classification based on 10,302 patches from Defects4J, Bugs.jar and Bears benchmarks. The empirical evaluation shows that ODS is able to correctly classify 71.9% of program repair patches from 26 projects, which improves the state-of-the-art. ODS is applicable in practice and can be employed as a post-processing procedure to classify the patches generated by different APR systems.

Ref: H. Ye, J. Gu, M. Martinez, T. Durieux and M. Monperrus, "Automated Classification of Overfitting Patches with Statically Extracted Code Features," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3071750.
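The first step described in the abstract, statically comparing a buggy and a patched program at the AST level, can be illustrated with a small sketch. It is only an analogy under stated assumptions: ODS targets Java patches and defines its own feature set, whereas this example uses Python's standard ast module and a made-up feature (the change in count of each AST node type). The function names are hypothetical.

```python
import ast
from collections import Counter


def node_type_counts(source):
    """Count how often each AST node type occurs in a piece of Python source."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))


def patch_features(buggy_source, patched_source):
    """Per-node-type count difference (patched minus buggy); unchanged types omitted."""
    before = node_type_counts(buggy_source)
    after = node_type_counts(patched_source)
    return {
        name: after[name] - before[name]
        for name in sorted(set(before) | set(after))
        if after[name] != before[name]
    }


buggy = "def safe_div(a, b):\n    return a / b\n"
patched = (
    "def safe_div(a, b):\n"
    "    if b == 0:\n"
    "        return 0\n"
    "    return a / b\n"
)
features = patch_features(buggy, patched)
```

In a supervised setting, feature dictionaries like this one would be paired with correct/overfitting labels and passed to a classifier; that training step is left out here.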

Jul 08 2021

Deep Learning for Code Auto-Completion


Title: Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
Authors: Juan Cruz-Benito, Sanjay Vishwakarma, Francisco Martin-Fernandez and Ismael Faro
Journal: AI, 2021

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.
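To make "different forms of tokenization" concrete, the sketch below contrasts two ways of splitting the same line of Python code before it reaches a language model. Which tokenizers the paper actually compares is not stated in the abstract, so both functions are illustrative assumptions; they rely only on Python's standard tokenize module.

```python
import io
import tokenize


def char_tokens(source):
    """Character-level tokenization: every character is its own token."""
    return list(source)


def lexer_tokens(source):
    """Token-level split using Python's own lexer, dropping layout-only tokens."""
    skip = {
        tokenize.NEWLINE,
        tokenize.NL,
        tokenize.INDENT,
        tokenize.DEDENT,
        tokenize.ENDMARKER,
        tokenize.COMMENT,
    }
    readline = io.StringIO(source).readline
    return [
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]


line = "total = price * 2\n"
as_chars = char_tokens(line)
as_tokens = lexer_tokens(line)
```

The character-level sequence is much longer but needs only a tiny vocabulary, while the lexer-level sequence is short but requires a vocabulary entry for every identifier it meets.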

Jun 03 2021

XWiki Leverages STAMP Software Testing Suite

What is it that characterises XWiki application lifecycle management?

XWiki is an open source project born 16 years ago, which is continuously improving through 40 new versions per year. This enterprise wiki provides a platform for application development and offers 700 extensions. Altogether, it contains more than one million lines of Java and JavaScript code. Every month, nearly 50 contributors work on XWiki (translations included), among them around 10 developers regularly involved in the core of the platform. The Build phase is very tool-intensive. A lot of tests and verifications are carried out to ensure optimal quality and compatibility with previous versions.

Which of the automated testing tools in the STAMP project are used regularly by XWiki?

We perform integration tests, unit tests and functional tests on several browsers, with several versions of Java, several DBMSs and various Servlet engines. The testing tools developed during the STAMP project help us to validate our test coverage and to improve quality. Each new piece of code added to an XWiki module is checked in the Build phase to ensure that the quality of the module will be equal to or better than that of the previous version.


May 12 2021

Botpress NLP Open Source Stack


In a recent podcast with Software Engineering Daily, Sylvain Perron, CEO of Botpress, explains how Botpress is different from other bot platforms.

"The biggest competitors are Google and Microsoft, where they offer natural language as a service. And I think it's very difficult to get something high-quality out of those services. And that's kind of like the Firebase approach versus Postgres, whereas with Firebase, like you don't control really well like all the configuration and options.

And the way Botpress works is, and we're sort of the only one that does that, it's an open source stack. You run it on your computer. You can actually customize everything behind. And so you can really get the extra juice out of the engine. You can really fine tune anything you want. And also, the other advantage is that you can actually host that platform anywhere you want. So if you want to deploy on AWS or on Azure, you can do that, whereas if you go with the major cloud platform, you're actually stuck with that vendor. 

And so it's not very flexible. And so imagine you're a bank or healthcare provider, the idea of streaming all of your customers’ interactions over to Google might be frightening. So for any kind of application, I think developers want this kind of experience where they have control over the stack. And I don't think it feels natural to use just like an HTTP service that does that for you and you have no control. It's like a black box and anything can break at any moment. With Botpress, it's much more natural. It feels like regular software."

Apr 20 2021

Adaptation of Cartesian Genetic Programming for Automatic Repair of Software Regression Faults


Title: CGenProg: Adaptation of cartesian genetic programming with migration and opposite guesses for automatic repair of software regression faults
Authors: Alireza Khalilian, Ahmad Baraani-Dastjerdi, Bahman Zamani
Journal: Expert Systems with Applications
Date: 1 May 2021


  • CGenProg is proposed for automatic repair of software regression faults in Java programs.
  • Cartesian genetic programming is adapted and modified to serve as the core evolutionary algorithm.
  • Biogeography-based optimization (migration) is adapted as the crossover operator.
  • Opposition-based learning (opposite guesses) is adapted as the mutation operator.
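The "opposite guesses" in the last highlight come from opposition-based learning, where the opposite of a value x in an interval [low, high] is low + high - x. The sketch below shows that general idea only. How CGenProg applies it to Cartesian genetic programming genotypes is not described in the highlights, so the integer-vector candidate, the bounds and the fitness function are all hypothetical.

```python
def opposite(value, low, high):
    """Opposite point of value within the interval [low, high]."""
    return low + high - value


def opposite_candidate(candidate, bounds):
    """Mirror every gene of a candidate within its own (low, high) bounds."""
    return [opposite(v, low, high) for v, (low, high) in zip(candidate, bounds)]


def better_of_pair(candidate, bounds, fitness):
    """Return whichever of the candidate and its opposite has the higher fitness."""
    mirrored = opposite_candidate(candidate, bounds)
    return max(candidate, mirrored, key=fitness)


bounds = [(0, 10), (0, 10)]
candidate = [1, 2]
# Toy fitness (assumption): larger gene sums are better.
chosen = better_of_pair(candidate, bounds, fitness=sum)
```

Evaluating a guess together with its opposite gives the search two widely separated samples for each candidate, which is the usual motivation for the technique.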

Mar 15 2021

Towards multi-modal causability with Graph Neural Networks enabling information fusion for explainable AI


Title: Towards multi-modal causability with Graph Neural Networks enabling information fusion for explainable AI
Authors: Andreas Holzinger, Bernd Malle, Anna Saranti, Bastian Pfeifer (2021)
Journal: Information Fusion
Publisher: Elsevier

The authors describe a novel, holistic approach to an automated medical decision pipeline, building on state-of-the-art Machine Learning research, yet integrating the human-in-the-loop via an innovative, interactive & exploration-based explainability technique called counterfactual graphs. They outline the necessity of computing a joint multi-modal representation space in a decentralized fashion, for reasons of scalability and performance as well as ever-evolving data protection regulations. This effort is intended as a motivation for the international research community and a launchpad for further work in the fields of multi-modal embeddings, interactive explainability, counterfactuals, causability, as well as necessary foundations for effective future human–AI interfaces.


Mar 05 2021

Sustainable computational science: the ReScience initiative


Title: Sustainable computational science: the ReScience initiative
Authors: Nicolas Rougier, Konrad Hinsen and others
Journal: PeerJ Computer Science
Publisher: PeerJ Inc.

Computer science offers a large set of tools for prototyping, writing, running, testing, validating, sharing and reproducing results; however, computational science lags behind. In the best case, authors may provide their source code as a compressed archive and they may feel confident their research is reproducible. But this is not exactly true.
Jonathan Buckheit and David Donoho proposed more than two decades ago that an article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code, and data that produced the result. This implies new workflows, in particular in peer review. Existing journals have been slow to adapt: source code is rarely requested and is hardly ever actually executed to check that it produces the results advertised in the article.
ReScience is a peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research can be replicated from its description. To achieve this goal, the whole publishing chain is radically different from other traditional scientific journals. ReScience resides on GitHub where each new implementation of a computational study is made available together with comments, explanations, and software tests.