🧑🏻‍💻 Software Engineer @datacebo - Working on #syntheticdata solutions @sdv_dev

Joined September 2018
45 Photos and videos
Plamen retweeted
We are introducing EU Inc. To make building and growing a business across the EU faster, simpler, and smarter.
🔸 Start a company in less than 48 hours
🔸 No minimum capital requirement
🔸 Fully online and borderless
Plamen retweeted
Jan 21
Replying to @Dexerto
Good thing you can block YouTube Shorts with Brave. 🦁 Here's how to do it in our browser:
Android/iOS:
1) Go to Settings -> Media
2) Enable "Block YouTube Shorts"
Desktop:
1) Go to Settings -> Shields -> Content Filtering
2) Enable "YouTube Anti-Shorts"
Plamen retweeted
A free throwback MIT course breaking down how machine learning techniques can be applied to healthcare: bit.ly/3YyrGj9 (v/@MITOCW) Here, MIT prof. & CSAIL principal investigator David Sontag discusses how AI can help sort thru medical data (Lecture 1).
19 Sep 2025
Wow, 7 years... I think I average 1 tweet per year 😅 #MyXAnniversary
6 Aug 2025
Funny how I can intuitively write scalable software, but building an IKEA cabinet turns me into a caveman discovering tools for the first time.
Plamen retweeted
Last week, we shared a synthetic populations dataset for the United States, but this week we're sharing one published by researchers for the whole world. Marijn Ton et al. released a gigantic synthetic population dataset that represents ~7.33 billion humans, which matches the 2015 human population count, and ~1.99 billion households.
The Motivation
To understand the impact of societal changes like disease, extreme weather, and more, modelers sometimes resort to simplifying assumptions of human behavior. According to the authors: “For example, integrated assessment models of climate change typically assume a representative consumer of a single average global or regional consumer.” By creating a synthetic individuals dataset that's consistent with published demographic statistics at the state / province level (administrative level 1) for most countries, they're hoping to improve the data and assumptions used in global impact simulations.
Their Data Sources
The team primarily used data from 2 databases:
• Luxembourg Income Study, which has very detailed microdata for 50 countries. LIS data especially shines for medium- and high-income countries.
• Demographic and Health Surveys, which has very detailed microdata for 90 countries. DHS data especially shines for low-income countries.
Households and individuals in the remaining countries were generated using regional statistics. A small number of countries that were missing reliable, published statistics were excluded.
This is a great dataset to explore geospatial visualizations or to build regional or global impact models.
📚 Link to the paper: nature.com/articles/s41597-0…
🗄️ Link to the dataset: dataverse.harvard.edu/datase…
#syntheticdata #machinelearning #generativeai
Kudos to researchers who made this happen: Michiel Ingels, Jens de Bruijn, Hans de Moel, Lena Reimann, Wouter Botzen, Jeroen Aerts
Credit to the Nature Magazine and the authors for the image showcasing the population coverage and data source for each country.
Plamen retweeted
In 2024, synthetic data routinely made headlines alongside many AI product launches. Here are our predictions for 2025 🔮
1. The rise of generative AI will result in a number of LLM-based synthetic data generation tools for tabular data. None will deliver on the promise, but this process will help enterprises define requirements.
Researchers have started to use LLMs to generate synthetic tabular data. We predict that these efforts will show promise on toy or single-table datasets but will fall short for complex, enterprise-grade, multi-table databases that contain lots of hidden context. Even though these tools will be tested and will fail to deliver ... it will lead to the development of much more concrete requirements for tabular synthetic data generators.
2. Companies will face a freeze in data asset availability due to regulations and declining customer consent.
Increased privacy and security regulations and increased customer privacy consciousness will make it harder to use customer data to train AI models. This will lead companies to run out of usable data and turn to synthetic data as a viable solution.
3. Every company will, at the very least, experiment with synthetic data in 2025 as part of their broader AI data strategy.
Synthetic data is often better than real data in AI training and can be more freely shared across the organization. AI models simply perform better when trained with upsampled, augmented, and bias-corrected synthetic data as they can identify patterns more efficiently without overfitting. We are already seeing this: the SDV software has been downloaded more than 7 million times, and as many as 10% of global Fortune 500 companies currently experiment with SDV. We predict this number will grow exponentially next year.
4. Synthetic data for training AI agents will become a more pressing need.
Enterprises will need additional data to train more robust AI agents and synthetic data can help fill the gap.
5. Enterprises will gain big from synthetic tabular data and synthetic data to train AI agents.
While big tech focuses on improving LLMs, most enterprises will gain more immediate value from synthetic tabular data to improve data access, train more robust ML models, or train better AI agents.
📖 Read more about our 2025 predictions and our 2024 recap here: datacebo.com/blog/synthetic-… #generativeai #ai #openai #syntheticdata #machinelearning

Plamen retweeted
15 Oct 2024
Searching for files in large projects has never been easier 🚀 Check out the real-time search results for the #vscode project, which has thousands of files (and yes, we use @code to write @code). You can easily switch between fuzzy matching and continuous matching too!
Plamen retweeted
17 Sep 2024
One of our users exclaimed "These speedups are insane!" Our multi-table synthesizer in SDV Enterprise, called HSA Synthesizer, runs in less than 1 minute what takes HMA Synthesizer an hour, across 20 datasets.
❇️ We have been focusing on multi-table synthesizers. A #syntheticdata platform must address the complexity of multi-table enterprise data at scale.
🔥 The 70x speedups fundamentally change how one uses #SDV. If you can model that fast and sample even faster, the need to save a model and version it goes away.
✅ What is more interesting is that these speedups have been achieved not by increasing the compute required, but by fundamentally changing the algorithms.
We are continuously evolving, with more to come. You can learn more about the trade-offs in this blog: datacebo.com/blog/multi-tabl… #syntheticdata, #generativeai, #performance -- @sdv_dev

Plamen retweeted
5 Sep 2024
#OTD in 2016 we submitted the final camera-ready version of the Massachusetts Institute of Technology paper ⭐️ The synthetic data vault ⭐️
The paper said: "This synthetic data must meet two requirements: 1️⃣ First, it must somewhat resemble the original data statistically, to ensure realism and keep problems engaging for data scientists. 2️⃣ Second, it must also formally and structurally resemble the original data, so that any software written on top of it can be reused. In order to meet these requirements, the data must be statistically modeled in its original form, so that we can sample from and recreate it. In our case and in most cases, that form is the database itself. Thus, modeling must occur before any transformations and aggregations are applied."
Today, #sdv counts millions of downloads and thousands of users, and many additional modules have been added to evaluate #syntheticdata, #benchmark models, and so much more.
You can find the original paper here: dai.lids.mit.edu/wp-content/… #syntheticdata, #generativeai, #tabulardata, #ai, #machinelearning, #DataScience
Plamen retweeted
4 Sep 2024

Plamen retweeted
30 Apr 2024
Excellent article in @Forbes today calling #syntheticdata “an all-too-rare example of…genuinely useful” generative AI, for the particular application of software testing. Read @jpwarren's profile of @datacebo and @kveeramac: forbes.com/sites/justinwarre… #bigdata #syntheticdata #generativeai #data #datascience #enterprisedata #tabulardata #predictiveAI #machinelearning #ML #generativemodels #MLmodels #productiontesting #softwaretesting #cybersecurity #hacking #infosec #security
Plamen retweeted
Check out @kveeramac's full interview on the Cloudcast podcast w/@aarondelp & @bgracely, where @kveeramac talks about how @datacebo is leading the #syntheticdata efforts: thecloudcast.net/search/labe…

Plamen retweeted
Congratulations to @XuLeonard, whose paper on generating #syntheticdata using conditional GAN has reached 1000 citations!
3 Apr 2024
Really excited to announce that our NeurIPS 2019 paper on 'Modeling tabular data using conditional GAN' has surpassed 1k citations! It's inspiring to see researchers applying the model innovatively in the era of LLMs. #NeurIPS #GAN #SyntheticData #MIT
Plamen retweeted
7 Dec 2023
DataCebo launches enterprise version of popular open source synthetic data library tcrn.ch/47LGduY by @ron_miller

7 Dec 2023
DataCebo launches enterprise version of popular open source synthetic data library tcrn.ch/3Rf7wqk via @techcrunch

Plamen retweeted
25 Sep 2023
If you like cats, you might like the oneko command!
Plamen retweeted
13 Sep 2023
Tech tycoons with a combined net worth of roughly $550 billion gathered in the same room Wednesday for a Senate forum on the future and regulation of AI bloom.bg/3RlmAV3
Plamen retweeted
Very sad news that Bram Moolenaar, creator of Vim, died on 3 August 2023. The official family message is linked below. Consider donating to ICCF Holland in his memory: trib.al/TtGawNt trib.al/7WPRxem