
ashishprmodi

Prompt2Data: Synthetic Data Generator Agent

The Problem: Getting Realistic Data Shouldn't Be Hard

Every data engineer, analyst, or application developer has been in this spot: you need a realistic dataset to prototype a dashboard, test a pipeline, train a model, or demo a feature, but the real data is locked behind compliance reviews, access requests, and privacy constraints.

You could write a quick Python script with `random.randint()`, but the result looks nothing like production data: no correlations between columns, no seasonal trends, no referential integrity across tables, no realistic distributions.

What if you could just describe the data you need in plain English and have a fully normalized, multi-table dataset generated for you in seconds, with visualizations, statistical summaries, and CSV exports included?

That is exactly what Prompt2Data does.

 

What Is Prompt2Data?

Prompt2Data is a custom GitHub Copilot agent built using VS Code's agent customization framework. It takes a natural-language prompt and produces synthetic data. What makes it powerful is how simple the experience is: you just describe your requirement in plain English and the agent takes care of everything end-to-end, producing:

  • A dedicated project folder with a descriptive name derived from your prompt
  • A fully executed Jupyter notebook containing all data generation code
  • Multiple normalized CSV files — one per domain entity — with proper primary/foreign key relationships
  • Visualizations (bar charts, box plots, scatter plots, heatmaps) using matplotlib and seaborn
  • Summary statistics and referential integrity validation proving the data is consistent
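As an illustration of the referential integrity check in the last bullet, here is a minimal sketch (the table and column names are hypothetical, not the agent's actual output) of how such validation can be done with pandas:

```python
import pandas as pd

# Hypothetical tables standing in for the agent's generated CSVs.
employees = pd.DataFrame({"employee_id": [1, 2, 3], "dept_id": [10, 10, 20]})
departments = pd.DataFrame({"dept_id": [10, 20], "name": ["Engineering", "HR"]})

def validate_fk(child: pd.DataFrame, child_col: str,
                parent: pd.DataFrame, parent_col: str) -> bool:
    """Return True if every foreign key in the child table exists in the parent."""
    return bool(child[child_col].isin(parent[parent_col]).all())

assert validate_fk(employees, "dept_id", departments, "dept_id")
```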

All from a single command like:

/prompt2data Generate a dataset for an HR system that captures organizational insights, compensation details, iterative changes, and location-based relationships.

 

Fine-Grained Prompts with Specific Constraints

You can include precise statistical and business constraints directly in your prompt. The agent will honor these instructions when generating data.

/prompt2data Generate an employee attrition dataset where 50% to 70% of employees are active and the rest have exit dates. Salaries should range from $40,000 to $180,000 with a median around $72,000. Ensure at least 15% of employees are in engineering and no department has fewer than 5 people.
/prompt2data Generate IoT sensor data for a manufacturing plant. 80% to 90% of readings should be within normal operating range (temperature 60–80°F, humidity 30–50%). Include 5% to 10% anomalous readings that exceed thresholds, and 2% to 5% missing values to simulate sensor failures.

These fine-grained instructions give you precise control over distributions, thresholds, and data quality characteristics without writing any code.
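To see what honoring such constraints looks like in generated code, here is a minimal numpy sketch along the lines of the IoT prompt above (the exact proportions and ranges are illustrative, not the agent's actual output):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

# ~85% normal readings, ~10% anomalies, ~5% missing values,
# mirroring the percentages requested in the IoT prompt.
labels = rng.choice(["normal", "anomaly", "missing"], size=n, p=[0.85, 0.10, 0.05])

temperature = np.where(
    labels == "normal",
    rng.uniform(60, 80, n),    # in-range readings (60-80 °F)
    rng.uniform(85, 120, n),   # anomalous readings exceed the threshold
)
# Simulate sensor failures as missing values.
temperature = np.where(labels == "missing", np.nan, temperature)
```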

 

Conversational Refinement (Updating Entities and Attributes)

Prompt2Data isn't limited to one-shot generation. You can have a multi-turn conversation to iteratively refine the generated dataset. Here's a full example.

Turn 1 — Initial generation

/prompt2data Generate a dataset for an HR system that captures organizational insights, compensation details, iterative changes, and location-based relationships.

The agent generates tables: Locations, Departments, Job_Titles, Employees, Compensation, Org_Changes.

Turn 2 — Add a new entity

I also need a Benefits table that tracks health plan enrollments per employee, including plan_type, coverage_level, monthly_premium, and enrollment_date.

The agent adds a Benefits entity linked to Employees via employee_id, generates realistic plan data (HMO/PPO/HDHP distributions), and re-validates referential integrity across all tables.
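A sketch of what the added Benefits table might look like in code (all column values, proportions, and the employee count are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
employee_ids = np.arange(1, 501)  # assumes an existing Employees table with 500 rows

benefits = pd.DataFrame({
    "benefit_id": np.arange(1, len(employee_ids) + 1),
    "employee_id": employee_ids,  # foreign key into Employees
    # Illustrative HMO/PPO/HDHP mix; the agent picks realistic proportions.
    "plan_type": rng.choice(["HMO", "PPO", "HDHP"], size=len(employee_ids),
                            p=[0.45, 0.40, 0.15]),
    "coverage_level": rng.choice(["individual", "family"], size=len(employee_ids)),
    "monthly_premium": rng.uniform(150, 900, len(employee_ids)).round(2),
})
```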

 

Turn 3 — Modify an existing attribute

Update the Employees table to include a performance_rating column with values 1 through 5, where most employees cluster around 3–4 (normal distribution).

The agent modifies the employee generation function, adds the new column with a realistic bell-curve distribution, re-runs the affected cells, and updates the CSV export.
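One way such a clustered rating could be implemented, sketched here with numpy (the mean and spread are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw from a normal centred between 3 and 4, then round and clip to the
# 1-5 scale, so most employees cluster around ratings 3 and 4.
ratings = np.clip(np.rint(rng.normal(loc=3.5, scale=0.8, size=2000)), 1, 5).astype(int)
```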

Turn 4 — Adjust data volumes

Increase the employee count to 2000 and make sure compensation records scale proportionally.

The agent updates row counts in the schema definition, regenerates all dependent tables, and re-exports CSVs.

 

Architecture Overview

Prompt2Data is not a standalone application. It is a combination of user and system prompts that leverages VS Code's GitHub Copilot agent extensibility. The architecture has four layers.

[Image: Prompt2Data architecture diagram]

 

 

End-to-End Walkthrough

Let's trace what happens when you run:

/prompt2data Generate a dataset for an HR system that captures organizational insights, compensation details, iterative changes, and location-based relationships.

  • Prompt Analysis (Domain Decomposition)

The agent's first task is to analyze your natural-language input and identify the domain entities. For the HR prompt above, it identifies:

[Image: identified HR domain entities]

 

The agent classifies each entity as either a lookup/dimension table (relatively static reference data) or a fact/transaction table (event-driven records). This classification determines the generation order: parent tables must be created before child tables to ensure valid foreign keys.
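The parent-before-child ordering can be sketched as follows: a child table draws its foreign keys only from keys that already exist in its parent, so integrity holds by construction (table names and row counts here are illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# 1. Lookup/dimension tables first (parents).
departments = pd.DataFrame({"dept_id": [10, 20, 30],
                            "dept_name": ["Engineering", "Sales", "HR"]})

# 2. Fact tables second (children): foreign keys are sampled from the
#    already-generated parent keys, so no orphan rows can occur.
employees = pd.DataFrame({
    "employee_id": np.arange(1, 101),
    "dept_id": rng.choice(departments["dept_id"], size=100),
})
```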

  • Permission Requests

Before creating files or installing packages, the agent prompts for permission. VS Code shows an "Allow / Skip" dialog for each action:

  • Creating the project directory
  • Installing Python packages (`pandas`, `numpy`, `matplotlib`, `seaborn`, `scipy`)
  • Creating and executing notebook cells

This permission model ensures the agent never takes destructive actions without user consent.

  • Environment Setup

The agent configures three things:

  • Python environment — detects or sets up the active Python interpreter
  • Notebook environment — configures Jupyter kernel settings
  • Package installation — installs required dependencies

  • What Gets Generated 

[Image: generated project contents]

 

Every run produces a notebook with a consistent structure.

Data Quality, Compliance, and Security

Realistic Data Patterns

Prompt2Data doesn't just generate random numbers. The agent is instructed to produce data with:

  • Appropriate statistical distributions: salaries follow log-normal distributions, not uniform random
  • Correlations between variables: tenure correlates with compensation level, seniority correlates with department size
  • Temporal patterns: seasonal hiring spikes, quarterly compensation reviews
  • Natural noise and outliers: just like real data, not everything is perfectly clean
  • Domain constraints: no negative salaries, no future hire dates, no employees reporting to themselves
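A minimal sketch of a few of these patterns with hypothetical parameters: a skewed tenure distribution, log-normal salaries whose location shifts with tenure (giving a positive tenure/compensation correlation), and a domain constraint check:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Right-skewed tenure, as in real workforces.
tenure_years = rng.gamma(shape=2.0, scale=3.0, size=n)

# Log-normal salaries: the log-mean rises with tenure, so longer-tenured
# employees earn more on average rather than salaries being uniform noise.
salaries = np.exp(rng.normal(loc=11.0 + 0.03 * tenure_years, scale=0.25, size=n))

# Domain constraint: no negative salaries by construction.
assert (salaries > 0).all()
```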

 

Try It Yourself

Prompt2Data is available as open source, and all the instructions are provided in the README file. Please go through it and give it a try yourself.

[Image: project repository README]

 

Why a Prompt2Data Agent and Not a Script?

A traditional Python script for synthetic data generation is rigid: you hardcode schemas, column names, and distributions. Every new domain requires a new script. Prompt2Data takes a fundamentally different approach.

[Image: script vs. agent comparison]

 

The agent leverages the LLM's world knowledge to understand what realistic data looks like for any domain, and the notebook-based output brings additional benefits:

  • Reproducibility — Anyone can re-run the notebook to regenerate data with different parameters
  • Documentation built-in — Markdown cells explain the schema, relationships, and design decisions inline with the code
  • Visualizations included — Charts render directly in the notebook without additional setup
  • Iterative exploration — Users can modify individual cells to adjust distributions, add columns, or change row counts
  • Cross-platform compatibility — The generated notebooks use standard Python libraries (pandas, numpy, matplotlib, seaborn, scipy) with no VS Code–specific dependencies. This means they can be executed in Azure Databricks, Azure Synapse Analytics, Microsoft Fabric, Google Colab, JupyterHub, or any other environment that supports Jupyter notebooks. Simply upload the .ipynb file and run — no modifications needed. 

 

Conclusion

Prompt2Data demonstrates how VS Code's Copilot agent customization framework can solve real workflow problems with nothing more than well-crafted Markdown files. No custom extensions. No backend services. No API integrations. Just a prompt file that teaches Copilot how to be a data scientist.

The result: a single sentence in Copilot Chat becomes a fully normalized, validated, visualized synthetic dataset ready for dashboards, pipeline testing, ML training, or demos.

The entire project is open source. Clone the repo, try the example prompts, and build your own agents.
