Nowadays, applications of AI and ML are everywhere, spanning industries from healthcare and finance to insurance, smart homes (IoT), and energy. Training AI models is inherently reliant on extensive datasets. In the energy industry, crucial datasets such as smart meter data play a significant role in the development of consumption forecasts, demand response models, and retail pricing models.
Software companies, utilities, traders, research institutions, and TSOs/DSOs are interested in obtaining realistic datasets for developing such models. Nevertheless, privacy laws such as GDPR impose restrictions on data exchange. Anonymizing data and creating realistic yet artificial datasets for training AI models is a complex and time-consuming task. Moreover, there is a risk of losing correlations if the work is not executed professionally, and immature methodologies may result in a high re-identification rate of individual records from anonymized datasets.
BlueGen’s software presents a solution to this challenging task. The methodology, derived from collaborative research with Delft University of Technology, is scientifically validated. The company employs AI models that learn from real samples and generate synthetic datasets that retain the statistical properties of the original data. These synthetic datasets can be shared across locations and companies without being subject to privacy laws. Additionally, a dataset can be re-generated and data volumes increased with just a mouse click. Because the AI algorithms learn in a decentralized fashion, the original datasets containing private data can remain on-premises and never need to be moved or shared.
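BlueGen's actual generative models are proprietary, but the core idea of learning a distribution from real samples and then drawing fresh, unlinked records from it can be illustrated with a deliberately simple sketch: fitting a multivariate Gaussian to toy daily load profiles and sampling synthetic ones. The data and model here are illustrative assumptions, not BlueGen's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" smart-meter data: 500 daily profiles of 24 hourly readings (kWh),
# shaped by a morning peak and an evening peak plus noise.
hours = np.arange(24)
base = (0.3
        + 0.5 * np.exp(-((hours - 8) ** 2) / 8)
        + 0.9 * np.exp(-((hours - 19) ** 2) / 6))
real = base + 0.1 * rng.standard_normal((500, 24))

# Fit a simple generative model: a multivariate Gaussian over the 24 hours.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic profiles: brand-new rows with no 1:1 link to any real row,
# yet drawn from the same learned distribution. Volumes can be increased at
# will by raising `size`.
synthetic = rng.multivariate_normal(mu, cov, size=500)

# The synthetic sample reproduces the hourly means of the real data.
print(np.abs(synthetic.mean(axis=0) - mu).max())
```

A real generator would use a far richer model (e.g. a deep generative network) to capture non-Gaussian behavior and cross-feature correlations, but the privacy argument is the same: the output rows are sampled, not transformed copies of input rows.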
BlueGen’s platform not only generates synthetic datasets for use in AI applications but also provides reports on the quality of the created data in terms of resemblance, utility, and privacy. In this context, resemblance quantifies how closely the statistical distribution of the synthetic dataset matches that of the original. Utility measures the performance of downstream ML models when working with synthetic instead of real data. The privacy measure shows the degree of protection the data has against leakage attacks.
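The first two metrics can be sketched concretely. The snippet below, an illustrative assumption rather than BlueGen's scoring method, computes a resemblance score from a two-sample Kolmogorov-Smirnov distance and a utility score via the common "train on synthetic, test on real" (TSTR) protocol with a simple linear model; the toy data generator stands in for a real synthesizer.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, rng):
    # Toy tabular data: hour-of-day and temperature drive consumption (kWh).
    x = rng.uniform([0.0, -5.0], [23.0, 30.0], size=(n, 2))
    y = 1.5 + 0.08 * x[:, 0] - 0.04 * x[:, 1] + 0.2 * rng.standard_normal(n)
    return x, y

x_real, y_real = make_data(1000, rng)
x_syn, y_syn = make_data(1000, rng)  # stand-in for a generator's output

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.abs(cdf_a - cdf_b).max())

# Resemblance: 1 - worst per-feature KS distance (1.0 = identical marginals).
resemblance = 1.0 - max(ks_distance(x_real[:, j], x_syn[:, j]) for j in range(2))

# Utility (TSTR): fit a linear model on synthetic data, evaluate on real data,
# and compare against the train-real/test-real baseline.
def fit(x, y):
    a = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(a, y, rcond=None)
    return coef

def rmse(coef, x, y):
    a = np.column_stack([np.ones(len(x)), x])
    return float(np.sqrt(np.mean((a @ coef - y) ** 2)))

tstr = rmse(fit(x_syn, y_syn), x_real, y_real)    # train synthetic, test real
trtr = rmse(fit(x_real, y_real), x_real, y_real)  # baseline on real data
print(resemblance, tstr, trtr)
```

Good synthetic data should yield a resemblance score near 1 and a TSTR error close to the real-data baseline; a privacy score would additionally check that no synthetic row sits suspiciously close to a real one.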
Classical anonymization techniques, which modify existing data, usually preserve a 1:1 relationship between the modified and original data rows. This inherent re-identification possibility is no longer deemed acceptable under contemporary data privacy standards. The criteria for scoring the quality of synthetic data follow the guidelines of the Article 29 Data Protection Working Party, ensuring compliance with privacy standards. According to BlueGen’s Chief Product Officer Vincent Campfens, BlueGen demonstrates high scores on all of these components. Additionally, the company’s involvement in IEEE reflects a commitment to standards and practices endorsed by the Institute of Electrical and Electronics Engineers.
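Why the 1:1 relationship is dangerous can be shown with a toy linkage attack: if a release merely drops identifiers and lightly perturbs values, an attacker who knows one true value (say, from a bill) can match it back to its released row. This is a simplified illustration under assumed toy data, not an analysis of any specific anonymization product.

```python
import numpy as np

rng = np.random.default_rng(2)

# Private data: annual consumption (kWh equivalent) for 1000 customers.
consumption = rng.gamma(shape=4.0, scale=2.5, size=1000)

# Classical 1:1 "anonymization": drop the customer ID, add small noise.
# Each released row still corresponds to exactly one original customer.
released = consumption + rng.normal(0.0, 1e-4, size=1000)

# Linkage attack: for each customer whose true value the attacker knows,
# pick the closest released row. With a 1:1 release this usually succeeds.
diffs = np.abs(consumption[:, None] - released[None, :])
matches = diffs.argmin(axis=1)
rate = float((matches == np.arange(1000)).mean())
print(f"re-identification rate: {rate:.0%}")
```

Synthetic rows sampled from a learned distribution break this attack by construction: there is no original row for a synthetic row to point back to, so the argmin match carries no information about any individual.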
I had the opportunity to connect with Vincent Campfens at ENLIT, where he shared some energy-related business cases. For instance, a large European utility improved the accuracy of its consumption forecast using synthetic datasets created by BlueGen: real data could only be stored for two historical years due to privacy regulations, which was insufficient for a reliable forecast. By creating synthetic data year by year, the company can preserve the statistical behavior of a long data history and use it for forecasting without breaching privacy regulations.
Another crucial use case involves using smart meter data for demand response programs and grid reliability applications. Privacy-safe synthetic datasets enable DSOs to share data with other market participants and with software companies that use it for model development. This opens up an opportunity for DSOs to establish a new business line by providing high-quality synthetic meter data to technology companies and academic researchers. Realizing this potential hinges on these institutions proactively articulating the demand and capitalizing on the data offering.