Readings
Oct 25 2021
CodeMatcher performs fuzzy search
Title: CodeMatcher, Searching Code Based on Sequential Semantics of Important Query Words
Authors: Chao Liu - Xin Xia - David Lo - Zhiwei Liu - Ahmed E. Hassan - Shanping Li
Abstract: To accelerate software development, developers frequently search and reuse existing code snippets from large-scale codebases, e.g., GitHub. Over the years, researchers have proposed many information retrieval (IR)-based models for code search, but they fail to bridge the semantic gap between query and code. An early and successful deep learning (DL)-based model, DeepCS, addressed this issue by learning the relationship between pairs of code methods and corresponding natural language descriptions.
Two major advantages of DeepCS are its capability of understanding irrelevant/noisy keywords and of capturing sequential relationships between words in query and code. In this article, we propose an IR-based model, CodeMatcher, that inherits the advantages of DeepCS (i.e., the capability of understanding the sequential semantics in important query words) while leveraging the indexing techniques of IR-based models to substantially accelerate search response time.
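The speed gain comes from classic IR indexing. Below is a minimal sketch of that idea, assuming whitespace tokenization and bag-of-words ranking; it is an illustration, not CodeMatcher's actual implementation:

```python
# Minimal inverted-index code search: map each token to the methods that
# contain it, then rank candidates by query-word overlap. Tokenization,
# ranking, and all names here are simplifying assumptions.
from collections import defaultdict

def build_index(methods):
    """Map each token to the set of method ids containing it."""
    index = defaultdict(set)
    for mid, source in methods.items():
        for token in source.lower().split():
            index[token].add(mid)
    return index

def search(index, query):
    """Rank methods by how many query words they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for mid in index.get(word, ()):
            scores[mid] += 1
    return sorted(scores, key=scores.get, reverse=True)

methods = {
    "m1": "read file to string",
    "m2": "write string to file",
}
index = build_index(methods)
print(search(index, "read file"))  # ['m1', 'm2'] - m1 matches both words
```

Because the index maps each query word directly to candidate methods, response time stays low even over a very large codebase, which is the trade-off the paper exploits.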
Sep 27 2021
Facebook GSLM textless NLP
Facebook recently introduced a generative spoken language model (GSLM) called textless NLP.
The research team believes that their GSLM can be an effective method of pre-training for downstream tasks that have little labelled or annotated data available, such as spoken summarization, information retrieval, and sentiment analysis.
GSLM uses the latest breakthroughs in representation learning, allowing it to work directly from raw audio signals, without any text or labels. According to Facebook, this opens the door to a new era of textless NLP applications for potentially every language spoken on Earth — even those with limited text datasets or none at all. In addition, it enables the development of NLP models that incorporate the full range of expressivity of oral language.
More
Sep 16 2021
NLP-Based Source Code Analysis Tools
Tree Technology, a partner in DECODER Project, posted an article about recent R&D efforts.
Abstract: We have used Natural Language Processing (NLP) techniques in tools aimed to support and improve the software development and software quality processes for Java and C/C++ languages.
The use of complex models has increased performance in many common NLP tasks, such as named entity recognition, text classification, summarisation and translation, among others. In addition, transfer learning has become an attractive option when little labelled data is available and knowledge learnt from one problem can be applied to a new but related task. In this context, our two NLP-based source code analysis tools - namely Variable Misuse and Code Summarisation - have been conceived by and for software developers.
Aug 10 2021
Automated Classification of Overfitting Patches with Statically Extracted Code
Authors: He Ye; Jian Gu; Matias Martinez; Thomas Durieux; Martin Monperrus
Contact: heye@kth.se - KTH Royal Institute of Technology School of Computer Science and Communication, 156318 Stockholm, Stockholm, Sweden
Publication: IEEE Transactions on Software Engineering - Article link
Abstract:
Automatic program repair (APR) aims to reduce the cost of manually fixing software defects. However, APR suffers from generating a multitude of overfitting patches, those patches that fail to correctly repair the defect beyond making the tests pass. This paper presents a novel overfitting patch detection system called ODS to assess the correctness of APR patches. ODS first statically compares a patched program and a buggy program in order to extract code features at the abstract syntax tree (AST) level. Then, ODS uses supervised learning with the captured code features and patch correctness labels to automatically learn a probabilistic model. The learned ODS model can then finally be applied to classify new and unseen program repair patches. We conduct a large-scale experiment to evaluate the effectiveness of ODS on patch correctness classification based on 10,302 patches from Defects4J, Bugs.jar and Bears benchmarks. The empirical evaluation shows that ODS is able to correctly classify 71.9% of program repair patches from 26 projects, which improves the state-of-the-art. ODS is applicable in practice and can be employed as a post-processing procedure to classify the patches generated by different APR systems.
Ref: H. Ye, J. Gu, M. Martinez, T. Durieux and M. Monperrus, "Automated Classification of Overfitting Patches with Statically Extracted Code Features," in IEEE Transactions on Software Engineering, doi: 10.1109/TSE.2021.3071750.
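At its core, ODS is a supervised classifier over statically extracted feature vectors. A hedged sketch of that learning step follows; the feature meanings and the random-forest choice are illustrative assumptions, not ODS internals:

```python
# Given numeric code features extracted from buggy/patched AST pairs, train
# a supervised classifier to label patches as correct or overfitting.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Each row: static features of one patch (e.g., counts of inserted/deleted
# AST nodes, changed operators); label 1 = correct, 0 = overfitting.
X = [[3, 1, 0], [0, 5, 2], [2, 0, 1], [1, 4, 3]]
y = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(clf.predict(X_test))  # predicted correctness labels for unseen patches
```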
Jul 08 2021
Deep Learning for Code Auto-Completion
Title: Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches
Authors: Juan Cruz-Benito, Sanjay Vishwakarma, Francisco Martin-Fernandez and Ismael Faro
Journal: AI. 2021
Link: https://doi.org/10.3390/ai2010001
Abstract
In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.
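One variable the paper studies is the form of tokenization. Here is a small sketch contrasting two granularities on a Python snippet, using only the standard library; the exact tokenization schemes compared in the paper differ:

```python
# Contrast naive whitespace "word" tokens with proper lexical tokens from
# Python's own tokenizer. The snippet and filtering are illustrative.
import io
import tokenize

code = "def add(a, b):\n    return a + b\n"

word_tokens = code.split()
lexical_tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(code).readline)
    if tok.string.strip()  # drop newline/indent/end-of-file tokens
]

print(word_tokens)     # ['def', 'add(a,', 'b):', 'return', 'a', '+', 'b']
print(lexical_tokens)  # ['def', 'add', '(', 'a', ',', 'b', ')', ':', ...]
```

Finer-grained lexical tokens avoid gluing identifiers to punctuation, which shrinks the vocabulary a language model must learn.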
Jun 03 2021
XWiki Leverages STAMP Software Testing Suite
What is it that characterises XWiki application lifecycle management?
XWiki is an open source project born 16 years ago that keeps improving through some 40 new versions per year. This enterprise wiki provides a platform for application development and offers 700 extensions. Altogether, it contains more than one million lines of Java and JavaScript code. Every month, nearly 50 contributors work on XWiki (including translations), including around 10 developers regularly involved in the core of the platform. The build phase is very tool-intensive: a lot of tests and verifications are carried out to ensure optimal quality and compatibility with previous versions.
Which of the automated testing tools in the STAMP project are used regularly by XWiki?
We perform integration tests, unit tests and functional tests on several browsers, with several versions of Java, several DBMS and various Servlet engines. The testing tools developed during the STAMP project help us validate our test coverage and improve quality. All new code added to an XWiki module is checked during the build phase to ensure that the quality of the module is equal or superior to that of the previous version.
May 12 2021
BotPress NLP Open Source Stack
In a recent podcast with Software Engineering Daily, Sylvain Perron, CEO of BotPress, explains how BotPress is different from other bot platforms.
"The biggest competitors are the Google and Microsoft, where they offer natural language as a service. And I think that's very difficult to get something high-quality out of those service. And that's kind of like the Firebase approach versus Postgres, whereas with Firebase, like you don't control really well like all the configuration and options.
And the way Botpress works is, and we're sort of the only one that does that, it's an open source stack. You run it on your computer. You can actually customize everything behind it. And so you can really get the extra juice out of the engine. You can really fine-tune anything you want. And also, the other advantage is that you can actually host that platform anywhere you want. So if you want to deploy on AWS or on Azure, you can do that, whereas if you go with a major cloud platform, you're actually stuck with that vendor.
And so it's not very flexible. And so imagine you're a bank or healthcare provider, the idea of streaming all of your customers’ interactions over to Google might be frightening. So for any kind of application, I think developers want this kind of experience where they have control over the stack. And I don't think it feels natural to use just an HTTP service that does that for you and you have no control. It's like a black box and anything can break at any moment. With Botpress, it's much more natural. It feels like regular software."
- Listen to the podcast
- Read the full transcript
Apr 20 2021
Adaptation of Cartesian Genetic Programming for Automatic Repair of Software Regression Faults
Title: CGenProg: Adaptation of cartesian genetic programming with migration and opposite guesses for automatic repair of software regression faults
Authors: Alireza Khalilian, Ahmad Baraani-Dastjerdi, Bahman Zamani
Journal: Expert Systems with Applications
Date: 1 May 2021
Read the full paper
Highlights
- CGenProg is proposed for automatic repair of software regression faults in Java programs.
- Cartesian genetic programming as the core evolutionary algorithm was adapted and modified.
- Biogeography-based optimization (migration) as the crossover was adapted.
- Opposition-based learning (opposite guesses) as the mutation was adapted (both operators are sketched below).
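The two adapted operators named in the highlights can be illustrated with their common textbook definitions; this is a hedged sketch, and the paper's exact formulations may differ:

```python
# Migration copies traits from a fitter "habitat" (biogeography-based
# optimization), and an opposite guess reflects a value within its bounds
# (opposition-based learning). Rates and bounds are illustrative.
import random

def opposite(x, lower, upper):
    """Opposition-based learning: reflect x inside [lower, upper]."""
    return lower + upper - x

def migrate(target, donor, rate=0.5):
    """Biogeography-style migration: copy donor genes with given rate."""
    return [d if random.random() < rate else t for t, d in zip(target, donor)]

print(opposite(2.0, 0.0, 10.0))          # 8.0
print(migrate([1, 2, 3], [9, 8, 7]))     # a mixture of both parents
```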
Mar 15 2021
Towards multi-modal causability with Graph Neural Networks enabling information fusion for explainable AI
Title: Towards multi-modal causability with Graph Neural Networks enabling information fusion for explainable AI
Authors: Holzinger Andreas, Malle Bernd, Saranti Anna, Pfeifer Bastian. (2021)
Journal: Information Fusion
Publisher: Elsevier
The authors describe a novel, holistic approach to an automated medical decision pipeline, building on state-of-the-art Machine Learning research yet integrating the human-in-the-loop via an innovative, interactive & exploration-based explainability technique called counterfactual graphs. They outline the necessity of computing a joint multi-modal representation space in a decentralized fashion, for reasons of scalability and performance as well as ever-evolving data protection regulations. This effort is intended as a motivation for the international research community and a launchpad for further work in the fields of multi-modal embeddings, interactive explainability, counterfactuals, causability, as well as necessary foundations for effective future human–AI interfaces.
Mar 05 2021
Sustainable computational science: the ReScience initiative
Title: Sustainable computational science: the ReScience initiative
Authors: Nicolas Rougier, Konrad Hinsen and others
Journal: PeerJ Computer Science
Publisher: PeerJ Inc.
Computer science offers a large set of tools for prototyping, writing, running, testing, validating, sharing and reproducing results; however, computational science lags behind. In the best case, authors may provide their source code as a compressed archive and they may feel confident their research is reproducible. But this is not exactly true.
Jonathan Buckheit and David Donoho proposed more than two decades ago that an article about computational results is advertising, not scholarship. The actual scholarship is the full software environment, code, and data that produced the result. This implies new workflows, in particular in peer review. Existing journals have been slow to adapt: source code is rarely requested and is hardly ever actually executed to check that it produces the results advertised in the article.
ReScience is a peer-reviewed journal that targets computational research and encourages the explicit replication of already published research, promoting new and open-source implementations in order to ensure that the original research can be replicated from its description. To achieve this goal, the whole publishing chain is radically different from other traditional scientific journals. ReScience resides on GitHub where each new implementation of a computational study is made available together with comments, explanations, and software tests.
More: https://www.labri.fr/perso/nrougier/papers/10.7717.peerj-cs.142.pdf
Feb 26 2021
Hardware Versus Software Fault Injection of Modern Undervolted SRAMs
Researchers from the Barcelona Supercomputing Center (Spain) and Abdullah Gul University in Kayseri (Turkey) share an approach that applies real undervolting SRAM fault maps to a simulated system in order to observe the resiliency of applications.
They compare the hardware-guided fault injection approach with a random-guided fault injection approach. Significant differences appear in the coarse categorization of the resiliency of the application, and they become more obvious as the number of faulty bits increases. There are also differences in the quality of the output between the two techniques. This is because, in a realistic system, not all fault locations have the same probability of presenting faults; therefore, from the software perspective, faults can propagate only to a limited number of software structures.
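A hedged sketch of the comparison: flip bits at locations taken from a measured fault map versus at uniformly random locations, then observe how the computation degrades. The fault map below is invented for illustration:

```python
# Inject bit flips into a toy memory array, either hardware-guided
# (fixed, measured locations) or random-guided (uniform locations).
import random

def inject(memory, faulty_bits):
    for word, bit in faulty_bits:
        memory[word] ^= 1 << bit                  # flip one bit in one word
    return memory

data = [100] * 8
fault_map = [(0, 2), (3, 5)]                      # "measured" fixed locations
random_map = [(random.randrange(8), random.randrange(8)) for _ in range(2)]

print(inject(data.copy(), fault_map))             # hardware-guided injection
print(inject(data.copy(), random_map))            # random-guided injection
```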
Feb 09 2021
Corrective Commit Probability Code Quality Metric
An article by Idan Amit and Dror G. Feitelson, from the Department of Computer Science at the Hebrew University of Jerusalem, presents a code quality metric: the Corrective Commit Probability (CCP).
This metric measures the probability that a commit reflects corrective maintenance. The authors argue that this metric agrees with developers’ concept of quality and is informative and stable. Corrective commits are identified by applying a linguistic model to the commit messages. The team computed the CCP of all large active GitHub projects (7,557 projects with 200+ commits in 2019). This leads to the creation of a quality scale, suggesting that the bottom 10% of projects by quality spend at least six times more effort on fixing bugs than the top 10%. Analysis of project attributes shows that lower CCP (higher quality) is associated with smaller files, lower coupling, use of languages like JavaScript and C# as opposed to PHP and C++, fewer developers, lower developer churn, better onboarding, and better productivity. Among other things, these results support the “Quality is Free” claim, and suggest that achieving higher quality need not require higher expenses.
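The core of the metric can be approximated in a few lines; the keyword list below is a simplistic stand-in for the paper's validated linguistic model:

```python
# Estimate the probability that a commit is corrective by matching
# commit messages against bug-fix language.
import re

FIX_PATTERN = re.compile(r"\b(fix|bug|fault|defect|error|repair)\b", re.I)

def corrective_commit_probability(messages):
    hits = sum(1 for m in messages if FIX_PATTERN.search(m))
    return hits / len(messages)

log = ["Fix crash on empty input", "Add CSV export", "Refactor parser",
       "Bug 1234: off-by-one in pagination"]
print(corrective_commit_probability(log))  # 0.5
```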
Jan 18 2021
MongoDB, A Database For Document Stores
A potential acquisition target for Oracle or Microsoft, MongoDB leads the document store market and is now ranked #5 among all DBMS (source: DB-Engines). It is at the heart of the DECODER PKM and also of multiple single-page websites based on the MEAN stack (MongoDB, Express, Angular, Node.js).
In a recent article, Eric Weiss, an analyst at several large banks, sees MongoDB as the clear-cut leader within the high-growth, non-relational database SaaS sector. "MongoDB has been and will continue to be an indirect beneficiary of high-growth megatrends such as AI, Machine Learning, IoT (Internet-of-Things) and digitalization. Each of these trends has sparked an exponential growth in the supply of unstructured data, resulting in an increasing demand for non-relational (NoSQL) database solutions. Such databases can much more efficiently handle this new flow of data workloads compared to more traditional relational, SQL-based solutions".
- Read the article MongoDB, A Database For The New Era
- When it's time to create your first MDB database
- Upgrading to the release 4? Try this quiz on MongoDB 4 new features and database updates: MDB 4 quiz on Techtarget
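For a first hands-on contact, here is a minimal sketch with the official Python driver, pymongo; the connection string and collection names are placeholders for your own deployment:

```python
# Connect, insert one document, and read it back. MongoDB creates
# databases and collections lazily on first write.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["demo"]
doc_id = db["articles"].insert_one(
    {"title": "MongoDB, A Database For The New Era", "year": 2021}
).inserted_id
print(db["articles"].find_one({"_id": doc_id})["title"])
```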
Dec 28 2020
Big Code has a direct impact on the business outcomes
For developers, code releases are "emotional" events. Many feel fear and anxiety at the moment they release code or submit it for review, fearing they will break dependencies.
Indeed, managing large and complex code bases (Big Code) can become laborious, time-consuming and costly. Joe McKendrick's article refers to a 2020 survey of 500 North American professional developers, compiled by Dimensional Research and underwritten by Sourcegraph. The Emergence of Big Code survey highlights dramatic growth in the volume and complexity of software code.
It's almost unanimous: 99% of respondents report that big code has a direct impact on the business outcomes of software development efforts. Challenges include less time for new hires to be productive (62%), code breaking due to a lack of understanding of dependencies (57%), and difficulties managing changes to code (50%).
Read the full article in ZDnet: https://www.zdnet.com/article/low-and-no-code-are-wonderful-but-a-big-code-world-lurks-underneath/
Nov 04 2020
Machine Learning for Cybersecurity
Automated Vulnerability Detection in Source Code Using Minimum Intermediate Representation
Vulnerability is one of the root causes of network intrusion. An effective way to mitigate security threats is to discover and patch vulnerabilities before an attack. Traditional vulnerability detection methods rely on manual participation and incur a high false positive rate. Intelligent vulnerability detection methods suffer from problems of long-term dependency, out-of-vocabulary tokens, coarse detection granularity and a lack of vulnerable samples.
This paper proposes an automated and intelligent vulnerability detection method in source code based on the minimum intermediate representation learning.
More..
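One idea behind such intermediate representations is normalizing user-defined identifiers to symbolic names, so that the model avoids out-of-vocabulary tokens. The following is a hedged sketch of that normalization step only; the paper's minimum intermediate representation is more involved:

```python
# Replace user-defined identifiers with symbolic names (VAR1, VAR2, ...)
# while keeping language keywords and library calls intact.
import re

KEYWORDS = {"int", "char", "if", "return", "strcpy"}

def normalize(tokens):
    mapping, out = {}, []
    for tok in tokens:
        if tok in KEYWORDS or not re.match(r"[A-Za-z_]\w*$", tok):
            out.append(tok)                       # keep keywords and symbols
        else:
            mapping.setdefault(tok, f"VAR{len(mapping) + 1}")
            out.append(mapping[tok])              # symbolic identifier
    return out

tokens = ["char", "buf", "[", "8", "]", ";", "strcpy", "(", "buf", ",", "input", ")"]
print(normalize(tokens))
# ['char', 'VAR1', '[', '8', ']', ';', 'strcpy', '(', 'VAR1', ',', 'VAR2', ')']
```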
Oct 26 2020
MLOps Demystified
As Machine Learning at an organization matures from research to applied enterprise solutions, the need arises for automated Machine Learning operations that can efficiently handle the end-to-end ML lifecycle.
The goal of level 1 MLOps is to perform continuous training of the model by automating the entire machine learning pipeline, which in turn leads to continuous delivery of the prediction service. The underlying concept that empowers continuous model training is the ability to do data version control along with efficient tracking of training/evaluation events.
- Read MLOps Demystified, by Shubham Saboo (7-8 min read)
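A schematic, runnable toy of the level-1 loop described above, where each new data version triggers retraining and conditional promotion; every component here is a stand-in, not any particular MLOps product:

```python
# Each new data version triggers automated training, validation on a
# held-out point, and promotion only if the score does not regress.
def train(data):
    return {"mean": sum(data) / len(data)}        # toy "model": the mean

def evaluate(model, holdout):
    return -abs(model["mean"] - holdout)          # toy score vs held-out point

versions = {1: [1.0, 2.0, 3.0], 2: [2.0, 3.0, 4.0]}  # versioned training data
deployed_score = float("-inf")

for version in sorted(versions):                  # a new version triggers a run
    data = versions[version]
    candidate = train(data[:-1])                  # train on all but last point
    score = evaluate(candidate, data[-1])         # validate on held-out point
    if score >= deployed_score:                   # promote only if no regression
        deployed, deployed_score = candidate, score
        print(f"promoted model trained on data v{version}")
```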
Sep 16 2020
DevOps Market to reach $15 billion by 2026
The global DevOps market size is projected to reach $14,969.6 million by 2026, a compound annual growth rate of 19.1%, according to a Fortune Business Insights report. The report highlights the significance of this increase: roughly a four-fold rise (about +304%) in eight years, as the market was worth only $3,708.1 million in 2018. Containerization, PaaS (Platform as a Service) and hybrid cloud are three major enablers of DevOps growth.
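As a quick sanity check of those figures in Python:

```python
# Verify total growth and CAGR from the report's start/end values.
start, end, years = 3708.1, 14969.6, 8          # USD million, 2018 -> 2026
growth = end / start - 1                        # total growth over the period
cagr = (end / start) ** (1 / years) - 1         # compound annual growth rate
print(f"total growth: {growth:.0%}, CAGR: {cagr:.1%}")  # ~304% and ~19.1%
```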
For more information:
- Read this DevOps market TechRepublic article
- Read another market study on DevOps from Grand View Research expecting $12.85 billion by 2025 with a close CAGR of +18.6%, or this related Medium article
Sep 09 2020
Agile Testing + DevOps = DevTestOps
"DevOps is now really DevTestOps and for teams to be truly agile, test management is the vital link in the success of DevOps. You require TestOps to match the pace of DevOps and testing early and often — breaking the silos."
In fact, the World Quality Report 2019-2020 led by Capgemini shows increased investment in the QA and test function, reported by 90% of US and 69% of Canadian survey participants, over the past four years.
Aug 10 2020
The Global Embedded Systems Market Expected to Grow at a +5% CAGR Through 2024
An embedded system is a combination of software and hardware that together facilitate the accurate functioning of a target device. The embedded system market is expected to grow significantly from 2019 to 2024, owing to increasing consumer spending on smartphones, demand for application-specific integrated circuits and high-speed operating systems, and ongoing technological advancement.
A recent Advance Market Analytics market study segments the market by application (Automotive, Telecommunication, Healthcare, Industrial, Consumer Electronics and Military & Aerospace) and by major geographies with country-level break-up. According to this study, the Global Embedded Systems market is expected to see a growth rate of 5.28% and may reach a market size of USD 536.2 million by 2024.
Jul 06 2020
Distinct AI Techniques Bring Different Business Values
Machine learning and deep learning are often conflated by business decision makers. Machine Learning can involve a wide variety of techniques for building analytics models or decision engines that don't involve neural networks, the mechanism for deep learning. And there is a whole range of AI techniques outside of machine learning as well that can be applied to solve business problems.
Do you leverage these techniques or do you prefer computer vision and natural language processing applications to solve your business problems?
Read George Lawton's article in TechTarget
Jun 29 2020
TESTAR test results extracted while executing MyThaiStar as web system under test
Authors: Fernando Pastor Ricos and Tanja E. Vos from Universitat Politècnica de València
These TESTAR test result datasets were extracted with the TESTAR tool using the MyThaiStar web application as System Under Test (SUT). They were generated as an example of data that can be automatically produced and introduced locally into the DECODER PKM, within the H2020 DECODER Project.
TESTAR is an open source tool for automated testing through the graphical user interface (GUI), currently being developed by the Universitat Politècnica de València and the Open University of the Netherlands.
MyThaiStar is the reference application that Capgemini uses internally to promote best programming practices and the correct use of the latest technologies. It is developed with the Devon Framework, the standard tool for development at the company. More...
May 28 2020
An MLOps approach to bring models to production
Machine Learning Open Studio and Model as a Service (MaaS) from Activeeon help data scientists and IT operations teams work together in an MLOps approach to bring ML models to production. Machine Learning Open Studio includes automatic data drift detection mechanisms and provides traceability and auditing of model performance, so that a model can be retrained when necessary.
Only a small percentage of ML projects make it to production because of deployment complexity, a lack of governance tools and many other reasons. Once in production, ML models often fail to adapt to changes in the environment and its dynamic data, which results in performance degradation.
To maintain the prediction accuracy of ML models in production, active monitoring of model performance is mandatory. This reveals when to retrain a model using the most recent data and the newest implementation techniques, and then redeploy it to production. More...
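One common drift check is a two-sample Kolmogorov-Smirnov test between the training-time and live feature distributions; this is a generic illustration, not necessarily Activeeon's actual mechanism:

```python
# Flag data drift when live data no longer matches the training distribution.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training = rng.normal(0.0, 1.0, 1000)     # distribution seen at training time
live = rng.normal(0.5, 1.0, 1000)         # production data has shifted

stat, p_value = ks_2samp(training, live)
if p_value < 0.01:
    print(f"drift detected (p={p_value:.1e}); consider retraining")
```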
Apr 08 2020
Algorithm and Data Structure Visualization
Visualizations can help us understand how data structures and algorithms work.
The visualgo.net website provides visualizations and animations of advanced algorithms. Most of them are discussed in 'Competitive Programming', co-authored by the brothers Dr Steven Halim and Dr Felix Halim. Today, some of these advanced algorithm visualizations/animations can only be found in VisuAlgo.
An online quiz system has been added that allows students to test their knowledge of basic data structures and algorithms. It generates questions and checks the students' answers automatically.
Mar 12 2020
Covid-19 infection in Italy: when AI provides vital insights
Using mathematical models and predictions, Gianluca Malato, a data scientist, fiction author and software developer, compared logistic and exponential models applied to the Covid-19 infection in Italy. Both models help to better understand the evolution of the infection. The data preparation and Python coding are detailed in an article posted on Towards Data Science on 8 March 2020. At that time, the main projections, now checked regularly by this Covid-19 Italian infection collaborative research, were:
- The expected number of infected people at infection end is 15968 +/- 4174.
- The infection peak is expected around 9 March 2020.
- The expected infection should end on 15 April 2020.
- Read the Covid-19 infection in Italy article in Towards Data Science
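The underlying technique is an ordinary curve fit of a logistic function to cumulative counts. Below is a minimal sketch with scipy, using made-up data points rather than the Italian figures from the article:

```python
# Fit a three-parameter logistic curve to cumulative case counts.
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, a, b, c):
    """c = final epidemic size, b = inflection day, a = growth speed."""
    return c / (1 + np.exp(-(x - b) / a))

days = np.arange(10)
cases = np.array([2, 5, 12, 28, 60, 115, 180, 240, 275, 292], dtype=float)

(a, b, c), _ = curve_fit(logistic, days, cases, p0=[1, 5, 300], maxfev=10000)
print(f"expected final size: {c:.0f}, peak around day {b:.1f}")
```

The fitted parameter c gives the projected total number of infections, which is how estimates like "15968 +/- 4174" above are obtained.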
Feb 18 2020
Clear Linux OS automates the creation of RPM packaging
Designed by Intel and open source contributors, the Clear Linux OS delivers a secure, hardware optimized OS. Its updates ensure that software dependencies remain mutually compatible.
The autospec tool is used to assist with the automated creation and maintenance of RPM packaging in Clear Linux OS. Where a standard RPM build process using rpmbuild requires a tarball and .spec file to start, autospec requires only a tarball and package name to start.
Recent reviews confirm the performance and stability improvements of Clear Linux OS. However, software packaged in other formats for other Linux distributions is not guaranteed to work on Clear Linux OS and may be impacted by Clear Linux OS updates.
Jan 16 2020
The Twelve-Factor App, a Methodology for Building Web Apps
Suggested by the designers of the Heroku PaaS platform, the twelve-factor methodology can be applied to apps written in any programming language and using any combination of backing services (database, queue, memory cache, etc.). It is aimed at building Software-as-a-Service apps that:
- Use declarative formats for setup automation, to minimize time and cost for new developers joining the project;
- Have a clean contract with the underlying operating system, offering maximum portability between execution environments;
- Are suitable for deployment on modern cloud platforms, obviating the need for servers and systems administration;
- Minimize divergence between development and production, enabling continuous deployment for maximum agility;
- And can scale up without significant changes to tooling, architecture, or development practices.
More about the Twelve-Factor App
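As a concrete illustration of the methodology's third factor (store config in the environment), here is a minimal sketch; the variable names and defaults are placeholders:

```python
# Read deploy-specific settings from the environment instead of hard-coding
# them, so the same code runs unchanged in development and production.
import os

DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/dev")
CACHE_URL = os.environ.get("CACHE_URL", "redis://localhost:6379")

print(f"connecting to {DATABASE_URL} with cache {CACHE_URL}")
```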
Jan 07 2020
A New Model-Based Approach for API Testing
Keeping Pace with Agile Development, Visualizing Complex Dependencies, and Orchestrating for Completeness of Testing are three good reasons to select a Model-Based approach for API testing, according to Collin Chau, a DevOps test expert.
"With the proliferation and complexity in microservices development that the Internet of Things brings, development teams are struggling to embrace API testing for more effective QA testing in-sprint. Learn how a model-based testing approach makes the difference in your API tests."
Read Collin Chau's full article in Continuous Testing
Dec 20 2019
NLP Search Paves the Way for Augmented Data Discovery
Combining natural language understanding and natural language generation will result in dynamic, bi-directional human-machine communication that will take several forms: text, voice and images. In text and voice scenarios, the BI or analytics solution can converse with the user to render the desired result - regardless of data-related and query-related search complexity.
Data visualizations also will become more interactive, if not immersive, along the lines of Busby from Oblong Industries. This product focuses on immersive interfaces, not specifically BI or analytics. However, its concepts could have a ripple effect on how people interact with data and thus, augmented data discovery.
"I think the future of BI is no BI. Don't ask me to search and look for things anymore. Give me that piece of information when I need it and if I need it. Come to me when there's something I need to know", foresees Erick Brethenoux, senior director analyst at Gartner.
For more information, read Lisa Morgan's TechTarget article entitled NLP makes augmented data discovery a reality in analytics
Nov 05 2019
Is BERT a Game Changer in NLP?
BERT (Bidirectional Encoder Representations from Transformers) is an open-source NLP pre-training model developed by researchers at Google in 2018. It has inspired multiple NLP architectures, training approaches and language models, including Google’s Transformer-XL, OpenAI’s GPT-2, ERNIE 2.0, XLNet, and RoBERTa.
For instance, BERT is now used by Google Search to provide more relevant results. It can also be used in smarter chatbots for conversational AI applications, predicts Bharat S Raj.
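BERT's masked-language-model pre-training can be exercised directly with the Hugging Face transformers library; a small sketch (the model downloads on first run):

```python
# Ask BERT to fill in a masked word and print its top candidates.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The goal of NLP is to [MASK] human language."):
    print(candidate["token_str"], round(candidate["score"], 3))
```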