Leveraging AI technology to enhance software testing | MIT News

Generative AI is receiving ample attention for its ability to create text and images. But those media represent only a fraction of the data that proliferate in modern society. Data are generated every time a patient goes through a medical procedure, a storm impacts a flight, or a person interacts with a software application.

Using generative AI to create realistic synthetic data around those scenarios can help organizations more effectively treat patients, reroute planes, or improve software platforms, especially in situations where real-world data are scarce or sensitive.

For the past three years, the MIT spinout DataCebo has offered a generative software system called the Synthetic Data Vault to help organizations create synthetic data for purposes such as testing software applications and training machine learning models.

The Synthetic Data Vault, or SDV, has been downloaded more than 1 million times, with more than 10,000 data scientists using the open-source library to generate synthetic tabular data. The founders — Principal Research Scientist Kalyan Veeramachaneni and alumna Neha Patki ’15, SM ’16 — believe the company’s success is due to SDV’s ability to revolutionize software testing.

SDV becomes popular

In 2016, Veeramachaneni’s team in the Data to AI Lab released a suite of open-source generative AI tools to help organizations create synthetic data that matched the statistical properties of real data.

Organizations can use synthetic data instead of sensitive information in programs while still preserving the statistical relationships between data points. Companies can also use synthetic data to run new software through simulations to see how it performs before releasing it to the public.

Veeramachaneni’s team first confronted this problem while working with companies that wanted to share their data for research purposes but could not, because the data were sensitive.

“MIT exposes you to a variety of different use cases,” Patki explains. “You work with finance companies and health care companies, and all of those projects are useful for shaping solutions across industries.”

In 2020, the researchers established DataCebo to develop more SDV features for larger organizations. Since then, the applications have been as remarkable as they have been diverse.

For instance, with DataCebo’s new flight simulator, airlines can plan for rare weather events in a way that would be impossible using historical data alone. In another application, SDV users synthesized medical records to predict health outcomes for patients with cystic fibrosis. A team from Norway recently used SDV to generate synthetic student data to evaluate whether various admissions policies were fair and unbiased.

In 2021, the data science platform Kaggle hosted a competition for data scientists that used SDV to create synthetic data sets, to avoid exposing the sponsoring company’s proprietary data. Roughly 30,000 data scientists participated, building solutions and predicting outcomes based on the realistic, synthesized data.

And as DataCebo expands, it remains committed to its MIT origins: all of the company’s present employees are MIT alumni.

Enhancing software testing

Although its open-source tools are being used in a variety of applications, the company is focused on growing its impact in software testing.

“Software applications need data for testing,” Veeramachaneni says. “Traditionally, developers write scripts to create synthetic data manually. With generative models, created using SDV, you can learn from a sample of collected data and then generate a large volume of synthetic data (which has the same properties as real data), or create specific scenarios and edge cases, and use the data to test your application.”

For example, if a bank wanted to test a program designed to reject transfers from accounts with insufficient funds, it would have to simulate many accounts transacting at the same time. Doing that with manually curated data would take a long time. With DataCebo’s generative models, customers can create any edge case they want to test.
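The bank scenario above can be sketched in a few lines: generate synthetic accounts at scale, deliberately forcing a share of insufficient-funds edge cases, then run the decline rule against them. This is a hypothetical illustration of the workflow, not DataCebo's software; every name, number, and distribution here is invented:

```python
import random

def make_accounts(n, forced_share=0.3, seed=1):
    """Generate synthetic accounts, forcing a share of insufficient-funds cases."""
    rng = random.Random(seed)
    accounts = []
    for i in range(n):
        balance = round(rng.uniform(0, 5000), 2)
        transfer = round(rng.uniform(10, 4000), 2)
        if rng.random() < forced_share:
            # Force the edge case: a transfer strictly larger than the balance.
            transfer = round(balance + rng.uniform(1, 500), 2)
        accounts.append({"id": i, "balance": balance, "transfer": transfer})
    return accounts

def should_decline(account):
    """The rule under test: decline any transfer that exceeds the balance."""
    return account["transfer"] > account["balance"]

accounts = make_accounts(10_000)
declined = sum(should_decline(a) for a in accounts)
```

Because the edge case is injected programmatically, the test exercises thousands of insufficient-funds accounts in milliseconds — the kind of coverage that hand-written fixtures rarely reach.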

“It’s common for industries to have data that is sensitive in some capacity,” Patki says. “Often when you’re in a domain with sensitive data, you’re dealing with regulations, and even when there aren’t legal regulations, it’s in companies’ best interest to be diligent about who gets access to what and when. So synthetic data is always better from a privacy perspective.”

Expanding synthetic data

Veeramachaneni believes DataCebo is advancing the field of what it calls synthetic enterprise data — data generated from user behavior on large companies’ software applications.

“Enterprise data of this kind is complex, and it is not universally available, unlike language data,” Veeramachaneni says. “When people use our publicly available software and report back on whether it works on a certain pattern, we learn about these unique patterns, and it allows us to improve our algorithms. In some sense, we are building a corpus of these complex patterns, which for language and images is readily available.”

DataCebo also recently released features to improve SDV’s usefulness, including tools to assess the “realism” of the generated data, called the SDMetrics library, as well as a way to compare models’ performances called SDGym.
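One common way to measure the “realism” of a synthetic column is to compare its distribution against the real column's. As a toy illustration of the idea — not SDMetrics' actual API — the two-sample Kolmogorov–Smirnov distance can be mapped to a 0-to-1 score; the column names and values below are hypothetical:

```python
import bisect

def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov distance: the largest gap between the
    empirical CDFs of the real and synthetic columns (0 means identical)."""
    real, synthetic = sorted(real), sorted(synthetic)
    d = 0.0
    for v in real + synthetic:  # the max gap occurs at a data point
        cdf_r = bisect.bisect_right(real, v) / len(real)
        cdf_s = bisect.bisect_right(synthetic, v) / len(synthetic)
        d = max(d, abs(cdf_r - cdf_s))
    return d

def realism_score(real, synthetic):
    """Map the KS distance onto a 0-to-1 score; higher means more realistic."""
    return 1.0 - ks_statistic(real, synthetic)

real_col = [1200, 2500, 4000, 7800, 9100]   # hypothetical real balances
close_col = [1300, 2400, 4100, 7600, 9000]  # similar distribution
far_col = [100, 150, 200, 250, 300]         # very different distribution
```

A score near 1 means a downstream consumer would struggle to tell the columns apart statistically; a score near 0 flags synthetic data that has drifted from reality.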

“It’s about ensuring organizations trust this new data,” Veeramachaneni says. “[Our tools provide] programmable synthetic data, which means we allow enterprises to insert their specific insights and intuitions to build more transparent models.”

As companies in every industry rush to adopt AI and other data science tools, DataCebo is ultimately helping them do so in a way that is more transparent and responsible.

“In the coming years, synthetic data from generative models will transform all data work,” Veeramachaneni predicts. “We believe 90 percent of enterprise operations can be done with synthetic data.”
