An MIT spin-off that's making synthetic data a reality.

Joined October 2020
Pinned Tweet
30 Apr 2024
Excellent article in @Forbes today calling #syntheticdata “an all-too-rare example of…genuinely useful” generative AI, for the particular application of software testing. Read @jpwarren's profile of @datacebo and @kveeramac: forbes.com/sites/justinwarre… #bigdata #syntheticdata #generativeai #data #datascience #enterprisedata #tabulardata #predictiveAI #machinelearning #ML #generativemodels #MLmodels #productiontesting #softwaretesting #cybersecurity #hacking #infosec #security
27 Mar 2025
SDV Enterprise v0.24.0 is out 🎉 This release adds features that help you generate higher-quality synthetic data and improve ease of use.
🌟 Model hierarchical relationships in a table. Use the SelfReferentialHierarchy CAG pattern when a column in a table references the same table. This represents a hierarchical relationship between the rows.
📦 Program your synthesizers with bulk updates. Update the data preprocessing for many columns at once using our bulk update function. This is compatible with any of the preprocessing transformers in the RDT library.
📚 Read the full Release Notes here: docs.sdv.dev/sdv-enterprise/…
📚 Learn more about the SDV: sdv.dev/
#syntheticdata #generativeai #machinelearning #ai
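To make the idea of a self-referential column concrete, here is a minimal sketch in plain Python. It does not use the SDV Enterprise API; the table, the column names (`employee_id`, `manager_id`) and the helper `is_valid_hierarchy` are all hypothetical, chosen only to show what "a column that references the same table" looks like and what a valid hierarchy requires.

```python
# Hypothetical table: each row's manager_id points at another row's employee_id.
employees = [
    {"employee_id": 1, "manager_id": None},  # root of the hierarchy
    {"employee_id": 2, "manager_id": 1},
    {"employee_id": 3, "manager_id": 1},
    {"employee_id": 4, "manager_id": 3},
]


def is_valid_hierarchy(rows, key="employee_id", parent="manager_id"):
    """Return True if every parent exists in the table and no row is its own ancestor."""
    parent_of = {row[key]: row[parent] for row in rows}
    for start in parent_of:
        seen = set()
        current = start
        # Walk up the chain of parents until we reach a root (None).
        while current is not None:
            if current in seen or current not in parent_of:
                return False  # a cycle, or a reference to a missing row
            seen.add(current)
            current = parent_of[current]
    return True


print(is_valid_hierarchy(employees))
```

Synthetic rows for such a table are only usable if a check of this kind passes, which is the sort of rule a hierarchy pattern is meant to preserve.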
26 Feb 2025
Today, we’re excited to introduce a powerful new bundle to the @sdv_dev: AI connectors. AI connectors address 2 key challenges that SDV users face when training generative AI models on datasets from enterprise data stores. (Link to the announcement: bit.ly/3EURLCB)

❎ Creating accurate metadata is time-consuming, especially for complex multi-table schemas
Metadata provides a deeper context (semantic and statistical) about your data, and the synthesizers use this context to generate high-quality synthetic data. Without AI connectors, SDV users have to export data from the database, use SDV’s metadata auto-detection feature to establish metadata, and then manually update the metadata to be accurate.

✅ AI connectors automatically generate higher-quality metadata
AI connectors automatically infer higher-quality metadata using the database schema and our own inference engine, without having to read tables into memory from the database. When benchmarked with 55 datasets stored in 4 different database platforms, metadata generated using AI connectors scored 35% higher in quality (average score of 0.98) than metadata generated using the auto-detection approach (average score of 0.73).

❎ Identifying a referentially sound and representative sample for training data is tricky
Training SDV synthesizers requires loading a representative sample of data from your database into memory. In addition, the data needs to have referential integrity for the synthesizers to learn the proper relationships. Approaches to identifying a high-quality, referentially sound sample of data can be tedious and time-consuming to implement.

✅ AI connectors use a built-in algorithm to generate a training data set and guarantee referential integrity
With AI connectors, we created an algorithm called Referential First Search (RFS) that guarantees that the real data used to train the model is a subset with referential integrity. When benchmarked with 7 datasets stored in 5 different databases, training data created using AI connectors achieved, on average, an 18% higher data quality score than the standard approach of random subsampling and then enforcing referential integrity afterward.

Read more about AI connectors and how to access them in our latest product announcement here: bit.ly/3EURLCB
#syntheticdata #generativeai #machinelearning #databases
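The post does not describe how Referential First Search works internally, so the sketch below is not RFS. It only illustrates, in plain Python with hypothetical `users` and `orders` tables, what "a subset with referential integrity" means: parent rows are sampled first, and only child rows that point at a sampled parent are kept.

```python
import random


def sample_with_integrity(parents, children, n_parents, seed=0):
    """Sample parent rows first, then keep only the children that reference them."""
    rng = random.Random(seed)
    kept_parents = rng.sample(parents, n_parents)
    kept_ids = {p["user_id"] for p in kept_parents}
    kept_children = [c for c in children if c["user_id"] in kept_ids]
    return kept_parents, kept_children


def has_referential_integrity(parents, children):
    """True if every child row references a parent row that is present."""
    parent_ids = {p["user_id"] for p in parents}
    return all(c["user_id"] in parent_ids for c in children)


# Hypothetical data: 10 users, each with exactly 4 orders.
users = [{"user_id": i} for i in range(10)]
orders = [{"order_id": i, "user_id": i % 10} for i in range(40)]

sample_users, sample_orders = sample_with_integrity(users, orders, n_parents=3)
print(len(sample_users), len(sample_orders))
print(has_referential_integrity(sample_users, sample_orders))
```

Sampling the two tables independently would instead leave some orders pointing at users that were not selected, which is the problem that "enforcing referential integrity afterward" has to clean up.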
20 Feb 2025
SDV Enterprise v0.23.0 is out 🎉 This release enhances your ability to program your synthesizer to find certain patterns and recreate them, whether through multi-table CAG patterns, single-table constraints, or pre-processing techniques that transform your data.
🏆 Improved CAG patterns. Use CarryOverColumns to specify a column that is repeated across many tables with different relationships. The PrimaryToPrimaryKeySubset pattern now works with missing values. See more about these interesting data patterns SDV Enterprise supports in the slides below.
💡 Experiment with new transformers to improve your synthetic data quality. Try applying the new LogScaler and LogitScaler on data that exhibits exponential properties.
📚 Read the full Release Notes here: bit.ly/4152LVn
📚 Learn more about the SDV: bit.ly/4b858Lu
#syntheticdata #generativeai #machinelearning #ai
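As a rough illustration of why a log transform helps with exponentially growing values, here is a sketch in plain Python. It is not the RDT `LogScaler` implementation or its API; the functions `log_scale` and `inverse_log_scale` and the `constant` parameter are assumptions made for this example.

```python
import math


def log_scale(values, constant=0.0):
    """Map each value x (with x > constant) to log(x - constant)."""
    return [math.log(v - constant) for v in values]


def inverse_log_scale(values, constant=0.0):
    """Undo log_scale so values return to their original units."""
    return [math.exp(v) + constant for v in values]


# Hypothetical column that grows by a factor of 10 at each step.
revenue = [1.0, 10.0, 100.0, 1000.0, 10000.0]

scaled = log_scale(revenue)           # evenly spaced after the transform
restored = inverse_log_scale(scaled)  # back to the original scale

print(scaled)
print(restored)
```

After the transform the values are evenly spaced, which is generally easier for a model to learn; the inverse step maps sampled values back to the original scale.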
20 Dec 2024
Today, we are excited to introduce a very powerful new framework to The Synthetic Data Vault: constraint augmented generation (#CAG for short). CAG addresses the shortcomings of generative models in capturing the context buried in enterprise data stores, with human input. (Link to the announcement: datacebo.com/announcements/i…)

❎ Generative AI models fail to capture deterministic relationships between columns, rows, and tables. We call such relationships database context. Database context describes the hard-and-fast rules under which data is created and stored. What makes this even harder is that this context is usually not explicitly stored within the database schema itself, but data teams know that it exists. Downstream applications process this data based on the context using logic within the application software. When generative models are used to create #syntheticdata, the expectation is that the #syntheticdata will also follow the database context.

✅ When we launched The Synthetic Data Vault, a system to enable enterprises to build generative models for their own #multitable data, we provided the ability to include context via what we called #constraints.

🔥 Over the years, constraints have become one of the most popular features of our SDV Enterprise product.

💪 With CAG we are doubling down on this focus. To use this new and powerful framework, users can just pick the pre-defined pattern that corresponds to their database context and tell SDV where to apply it. It will then augment your synthesizer directly with this information. The result: 100% valid #syntheticdata.

Read more about CAG, what it means for you, and how to access it in our latest product announcement here: datacebo.com/announcements/i…

Happy holidays and enjoy synthesizing! - from all of the DataCebo Team
#syntheticdata #generativeai #data #machinelearning #ml #ai
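To illustrate what "database context" and "valid" mean here, the sketch below checks synthetic rows against one deterministic rule. It is plain Python and does not show how CAG itself works (per the announcement, CAG augments the synthesizer rather than filtering its output); the rule, the column names and the `follows_context` helper are hypothetical.

```python
def follows_context(row):
    """Hypothetical rule: a guest cannot check out before checking in."""
    return row["checkin_day"] <= row["checkout_day"]


# Hypothetical synthetic rows produced without knowledge of the rule.
synthetic_rows = [
    {"checkin_day": 3, "checkout_day": 5},
    {"checkin_day": 9, "checkout_day": 7},    # violates the rule
    {"checkin_day": 12, "checkout_day": 12},
]

valid_rows = [row for row in synthetic_rows if follows_context(row)]
validity = len(valid_rows) / len(synthetic_rows)

print(f"{validity:.0%} of rows follow the database context")
```

A rule like this lives in application logic rather than in the schema, which is why a model trained on the data alone can break it.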
14 Nov 2024
Working with customers all over the world has taught us about one important but often overlooked benefit of using #syntheticdata: increased data diversity. Data diversity refers to the overall variety of data that is accessible for a project. While it's a simple concept, increasing data diversity can deliver enormous value to an organization, allowing for more robust testing, better predictions, and even higher creativity. In our latest blog post (link: lnkd.in/eumZtB38), Neha Patki elaborates on this concept using some concrete examples. In the blog we cover:
✳️ How #generativeai models in #SDV (@sdv_dev) can create #syntheticdata that is diverse
✅ How generative models maintain the delicate balance between creating novel (diverse) data and preserving a high degree of statistical resemblance to real data
❎ Why random data generators, although they can create really diverse data, won’t resemble the real data
Also in this blog are examples of how the data diversity generated by #syntheticdata impacts outcomes. (Links to each of these case studies are in the blog as well.)
Diverse synthetic data enables creation of a more robust product. ING creates synthetic financial transactions with 100x the combinations present in their real data. This allows them to thoroughly test all aspects of a complicated payments service and keep their payments systems working.
Diverse #syntheticdata can help make better predictions. A research team at UCLA (@UCLA) created synthetic credit card fraud events, which combined different characteristics of real fraud events into rarer occurrences. The synthetic data allowed them to better predict future credit card fraud by nearly 20x.
It allows product development teams to navigate new ideas. One promising new direction for research and development teams involves using synthetic data to invent brand-new products by combining attributes of existing ones.
Link to the blog: datacebo.com/blog/synthesize…
13 Nov 2024
🚀🔥 #CTGAN has been downloaded over 2.5 million times. 🔥🚀
Released #thisweek in 2019: version 0.1.0 of #CTGAN as part of The Synthetic Data Vault, a deep learning-based #syntheticdata generator for single-table data that can learn from real data and generate synthetic data with high fidelity. During this time:
🙌 It continues to be the go-to model for many #fortune500 companies who want to create #syntheticdata to train robust #AI models
It has been used for a wide variety of use cases in domains ranging from #energy, #healthcare, #education, and #insurance to many others
🔥 It has been used to create #syntheticdata for data science competitions, to improve the predictive accuracy of healthcare models, and to accurately predict fraud, to name a few
🤝 Data created using #CTGAN has been used by more than 30,000 data science teams
❤️ Thank you to all our users who used it and gave a ton of feedback, which has helped us build it further and further.
With its demand surpassing that of any other generative AI model for tabular data, we will be releasing more features for CTGAN in the near future. Check it out here: docs.sdv.dev/sdv/single-tabl…
#syntheticdata #datascience #dataanalytics #DS #sdv
Happy synthesizing! - The DataCebo Team
6 Nov 2024
By popular demand, we have added the ability to connect to databases to bring data to The Synthetic Data Vault (@sdv_dev). Users can now directly connect #SDV Enterprise to their databases, both to import real data and to export #syntheticdata. We have added #bigquery and #mssql, and many more are in the pipeline. With this new feature users can:
➡️ Import data from a database
✅ Create metadata for #generativeai modeling automatically
✔️ Import an optimized subset for modeling
➡️ Export synthetic data to the database
This feature is in Beta. Try it out and let us know what you think. Link to the documentation: docs.sdv.dev/sdv/multi-table…
2 Nov 2024
#otd in 1998 Yann LeCun (@ylecun) submitted a paper on gradient-based deep learning for document recognition. It took more than a decade before the world finally warmed to neural networks. The paper has since been cited roughly 70,000 times, and in 2018 he won the Turing Award, widely viewed as "the Nobel Prize of computing." The original paper: yann.lecun.com/exdb/publis/p…
9 Oct 2024
๐Ÿ† We are pleased to share that DataCebo has been awarded a contract by the U.S. Department of Homeland Securityโ€™s (@DHSgov ) under the call for a Synthetic Data Generator. With The Synthetic Data Vault (@sdv_dev ) the DHS will be able to build, deploy, and manage sophisticated generative AI models to generate high-quality synthetic data to: โœ… Test new applications and services with synthetic operational data. ๐Ÿ” Simulate impacts on cyber-physical systems without requiring access to the system or live data. ๐Ÿงจ Create training data for ML when real world data is unavailable, restricted, or cost prohibitive. We look forward to contributing to mission critical systems pertaining to national security, and collaborating with the DHS! Link to the press release: dhs.gov/science-and-technoloโ€ฆ #syntheticdata #generativeai #SVIP #dhs #sdv
2 Oct 2024
Born #otd in 1950: the Turing Test. Alan Turing's paper from 74 years ago describes a modified version of the "imitation game" in which a human judge has to determine which of two typing partners is a computer.
June 2024: In related news, one recent study found that human subjects judged GPT-4 to be human more than half the time: livescience.com/technology/a…
August 2024: Consider these two sample tabular datasets. One is generated using @sdv_dev and the other is not. Which one is real and which one is synthetic (top or bottom)?
The original paper from Turing: academic.oup.com/mind/articl… #syntheticdata #generativeai #tabulardata #sdv
17 Sep 2024
One of our users exclaimed "These speedups are insane!" Our multi-table synthesizer in SDV Enterprise, called HSA Synthesizer, finishes in less than 1 minute what takes HMA Synthesizer an hour, across 20 datasets.
❇️ We have been focusing on multi-table synthesizers. A #syntheticdata platform must address the complexity of multi-table enterprise data at scale.
🔥 The 70x speedups fundamentally change how one uses #SDV. If you can model that fast and sample even faster, the need to save and version the model goes away.
✅ What is more interesting is that these speedups have not been achieved by increasing the compute required, but by fundamentally changing the algorithms.
We are continuously evolving, with more to come. You can learn more about the trade-offs in this blog: datacebo.com/blog/multi-tabl… #syntheticdata, #generativeai, #performance -- @sdv_dev

16 Sep 2024
In 1956, storing 5 MB required a hard disk that weighed a ton. In 2024 a #generativeai model can capture the salient properties of terabytes of data in an entire database within a single file and recreate it on demand: what we call #syntheticdata.
#otd in 1956 IBM launched the first commercial hard-disk drive, the Model 350 RAMAC, which weighed a ton and stored the equivalent of roughly 5 MB. In comparison, today's largest commercial hard drive, Seagate's Exos X Mozaic, has 6 million times more space, at 30 TB.
And … in 2024 with generative AI: a generative model with a file size of a few GB can capture the salient properties of the data and recreate, on the fly, 30 TB of #syntheticdata that has the same statistical properties and looks like the real data.
Read more about IBM's first hard disk here: storagenewsletter.com/2011/0…
9 Sep 2024
Happy birthday to the late Dennis Ritchie, inventor of C and co-creator of Unix. C and C++ have played a key role in the big data revolution, having been the origin languages for some of the core components of popular ML libraries, including #PyTorch and #TensorFlow. Multics, the original project, had started in the mid-1960s as a time-sharing operating system. Ken Thompson, Dennis Ritchie, Douglas McIlroy, and Joe Ossanna branched out and decided to reimplement a much simpler version of the project, which became #unix. More about Ritchie's legacy in ZDNET: zd.net/2creeZi

5 Sep 2024
#OTD in 2016 we submitted the final camera-ready version of the Massachusetts Institute of Technology paper ⭐️ The synthetic data vault ⭐️
The paper said: "This synthetic data must meet two requirements:
1️⃣ First, it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists.
2️⃣ Second, it must also formally and structurally resemble the original data, so that any software written on top of it can be reused.
In order to meet these requirements, the data must be statistically modeled in its original form, so that we can sample from and recreate it. In our case and in most cases, that form is the database itself. Thus, modeling must occur before any transformations and aggregations are applied."
Today, #sdv counts millions of downloads and thousands of users, and many additional modules have been added to evaluate #syntheticdata, #benchmark models, and so much more. You can find the original paper here: dai.lids.mit.edu/wp-content/… #syntheticdata, #generativeai, #tabulardata, #ai, #machinelearning, #DataScience
28 Aug 2024
Launched 25 years ago this summer: VMware 1.0, the first commercial product that allowed users to run multiple operating systems as virtual machines on a single x86 machine. Later known as VMware Workstation, it was an influential application that provided a framework for cloud compute instances and other infrastructure resources used in early cloud services. #techhistory #cloudcomputing #bigdata