
icassiem
Post Prodigy

Architecture Help for Internal Reporting Using Power BI

Good day,

 

I need help with a reporting strategy for internal users of our product service data. We have two AWS regions, and I was asked to stay as AWS-native as possible in terms of tools.

 

So I have the issue that I can't just set up a small reporting database in one region with Glue ETL, etc.; it needs to be replicated for the other region.

Thinking out loud here:

1. Could I use the ETL to copy data to Power BI or Azure as a final dataset, meaning only the ETL resides in both regions/AWS accounts?

2. But that would not be an AWS solution, and I think Azure is ruled out.

3. Could I push it straight to Power BI? But that would raise a security concern about data residing in the Power BI service.

 

I just don't know how I would set up an environment where I can offer a self-service platform to my internal clients without duplicating data across multiple regions, which would make data management complex.

 

Does anybody have ideas I can explore? I thought of pushing a case for the data to reside in Azure, as it is best partnered with Power BI and we wouldn't need an EC2 gateway or the AWS VPC security issues. But I have a strong feeling Azure will be a no-go, as AWS is the future plan of the org, with Azure only for IT management.

But can I build a case around the data being under 1 TB, or is there an AWS architecture + Power BI solution?

I will be managing and developing the space for champion users to do self-service; getting from AWS to Azure is the blocker.

 

Please help, any ideas?

 

Regards

1 ACCEPTED SOLUTION

Hi @icassiem 

Here is the clarification regarding your questions:


1) Multi-Region Centralization

This is not required as a separate component but as a design choice. If MWAA in the Analytics account centrally pulls from both regions into the home region’s S3 silver layer, there is no need for additional S3 Cross-Region Replication. Simply choose one mechanism per source type: CRR or MWAA pulls for files, read replicas/snapshots/DMS or secure connections for RDS, and direct API calls from MWAA for APIs.
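For illustration, here is a minimal sketch of the kind of central pull task MWAA could run from the Analytics account; bucket names, prefixes, and regions are placeholders, not your actual layout:

```python
# Sketch: copy new objects from the second region's raw bucket into
# the home region's silver-layer bucket, as run by an MWAA task.
import boto3

def pull_remote_region(prefix: str) -> None:
    # Home-region client performs a cross-region CopyObject; no
    # intermediate download is needed.
    dst = boto3.client("s3", region_name="eu-central-1")   # home region
    src = boto3.client("s3", region_name="eu-west-1")      # remote region

    paginator = src.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="raw-euw1", Prefix=prefix):
        for obj in page.get("Contents", []):
            dst.copy(
                CopySource={"Bucket": "raw-euw1", "Key": obj["Key"]},
                Bucket="silver-euc1",
                Key=obj["Key"],
            )

if __name__ == "__main__":
    pull_remote_region("product_service/2025/")
```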
 

2) Staging & Data Layout

Yes, silver and gold represent the first and second transformation stages, but the bronze layer should not be skipped. Bronze holds immutable, append-only raw data for audit and replay. Silver stores cleaned and typed Parquet data created from bronze, and gold is optional for analytics extracts or Athena access, with Aurora serving as the main semantic layer. MWAA orchestrates all transitions (bronze → silver → gold).
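A minimal DAG skeleton for those transitions might look like the following; task bodies are stubs, and the schedule and IDs are assumptions:

```python
# Sketch: MWAA DAG orchestrating the bronze -> silver -> gold flow.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def bronze_to_silver(**context):
    """Read raw files from the bronze prefix, clean/type them,
    and write Parquet to the silver prefix (e.g. pandas/pyarrow)."""
    ...

def silver_to_gold(**context):
    """Build analytics extracts or Athena-facing tables from silver."""
    ...

with DAG(
    dag_id="medallion_refresh",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    to_silver = PythonOperator(task_id="bronze_to_silver",
                               python_callable=bronze_to_silver)
    to_gold = PythonOperator(task_id="silver_to_gold",
                             python_callable=silver_to_gold)
    to_silver >> to_gold
```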
 

3) Aurora Auto-Read from S3

Aurora does not auto-ingest from S3. Loads must be scheduled through Airflow using the aws_s3.table_import_from_s3 function or COPY/psycopg2 tasks. The DAG should include partition discovery and idempotent markers to only ingest new partitions reliably.
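A rough sketch of that idempotent load step, as it might run inside an Airflow task: the aws_s3.table_import_from_s3 function is the documented Aurora PostgreSQL S3-import entry point (the aws_s3 extension must be installed and the cluster needs an IAM role allowing s3:GetObject), while the table names, connection details, and the etl.load_log marker table are assumptions:

```python
# Sketch: load one S3 partition into Aurora, skipping already-loaded keys.
import psycopg2

SQL_IMPORT = """
SELECT aws_s3.table_import_from_s3(
    'silver.product_events',          -- target table (placeholder)
    '',                               -- column list ('' = all columns)
    '(format csv, header true)',
    aws_commons.create_s3_uri(%s, %s, %s)
);
"""

def load_partition(bucket: str, key: str, region: str) -> None:
    with psycopg2.connect(host="aurora-endpoint", dbname="analytics",
                          user="loader", password="***") as conn:
        with conn.cursor() as cur:
            # Idempotency marker: skip partitions already recorded.
            cur.execute("SELECT 1 FROM etl.load_log WHERE s3_key = %s", (key,))
            if cur.fetchone():
                return
            cur.execute(SQL_IMPORT, (bucket, key, region))
            cur.execute("INSERT INTO etl.load_log (s3_key) VALUES (%s)", (key,))
```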
 

4) Aurora vs RDS for Semantic Layer

Both integrate with Power BI, but Aurora PostgreSQL Serverless v2 is preferred for auto-scaling, faster failover, and reduced operational overhead. RDS PostgreSQL is a good fit only if workloads are predictable and consistently low-cost. In either case, Airflow manages refreshes and transformations.
 

5) EC2 Gateway for Power BI

A Windows Server–based EC2 is required for the On-Premises Data Gateway. A t3a.large (2 vCPUs, 8 GiB RAM) is a safe default; smaller workloads may use t3a.medium (4 GiB). Place the instance in a private subnet, allow only outbound internet through NAT, use gp3 EBS (50–100 GB), and consider clustering for high availability.
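For reference, a minimal boto3 sketch of provisioning such a host; the AMI, subnet, and security group IDs are placeholders, and in practice this would more likely live in CloudFormation or Terraform:

```python
# Sketch: launch the gateway host in a private subnet with a gp3 volume.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",    # Windows Server AMI (placeholder)
    InstanceType="t3a.large",           # 2 vCPU / 8 GiB per the sizing above
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0aaa1111bbb22222c",        # private subnet, egress via NAT
    SecurityGroupIds=["sg-0aaa1111bbb22222c"],  # outbound-only rules
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",
        "Ebs": {"VolumeSize": 100, "VolumeType": "gp3"},
    }],
)
```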
 

6) dbt vs Python

dbt Core is recommended if you want a SQL-first approach with lineage, testing, and documentation. It runs well within Airflow but is not AWS-native. If transformations are lightweight and Python-based, Airflow with Python/SQL alone is sufficient. For a GUI or low-code option, AWS Glue DataBrew can complement the stack and be triggered from Airflow.
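If you do take the dbt Core route, a simple way to keep it under Airflow's scheduling and alerting is a BashOperator pair; the project and profiles paths below are assumptions:

```python
# Sketch: run dbt Core models and tests from the MWAA environment.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_DIR = "/usr/local/airflow/dags/dbt"  # placeholder project location

with DAG(dag_id="dbt_transformations", start_date=datetime(2025, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_DIR} --profiles-dir {DBT_DIR}",
    )
    dbt_run >> dbt_test
```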

Regards,
Karpurapu D.




26 REPLIES
icassiem
Post Prodigy

@rohit1991 @jaineshp Wow, thank you!

Let me complete my scorecard this week; I have a few things to implement. I did not expect such good help so quickly 🙂

 

Please help me build this business case, where I present all three options with pros and cons and then make a recommendation from among them. Can you help with this in the meantime, please?

 

So my current environment is a data lake using Glue off S3 and Athena, but the cost is heavy for a small footprint, and self-service is not great for users, as I need to transform the data. I don't like Glue, and I am alone, so I don't want heavy coding solutions with job management, alerts, etc.

I am starting full swing with this on Monday, please bear with me.

Hi @icassiem ,

 

Yes, you can keep everything AWS-native without Azure as the final data layer:

  • Keep your reporting data in S3 (not Azure) and use AWS Glue/ETL to update only what’s needed.

  • Connect Power BI directly to S3 using Athena (ODBC or the built-in connector). Power BI just reads from AWS; there is no need to move data into Azure.

  • Security is handled at the AWS level (IAM/Lake Formation). Only metadata goes to Power BI.

This avoids duplicating data across clouds, keeps everything compliant, and still allows for self-service reporting in Power BI.
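If you want to sanity-check that path before wiring up Power BI, a quick query from Python exercises the same Athena endpoint the connector will read. This sketch uses the pyathena library as one option (any ODBC/JDBC client works equally), and the database, workgroup, and staging bucket are placeholders:

```python
# Sketch: smoke-test the Athena table Power BI will query.
from pyathena import connect

conn = connect(
    s3_staging_dir="s3://athena-results-bucket/",  # query-results bucket
    region_name="eu-central-1",
    work_group="primary",
)

cur = conn.cursor()
cur.execute(
    "SELECT client_id, COUNT(*) AS events "
    "FROM reporting.product_events GROUP BY client_id LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```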


Did it work? ✔ Give a Kudo • Mark as Solution – help others too!

But this is very similar to the current environment (one region only): Glue to S3 and Athena connecting to Power BI. I am trying not only to consolidate the two regions' data but also to improve the design, as the Glue setup was code-heavy with little data manipulation (the Lambda API call loops, etc.). I am also using a third party that we need to move away from, meaning we are moving on from a five-year-old design that is expensive, and I need to prove why we are changing for the better. Being a SQL guy, I would build a semantic layer of views with business logic off a DW for users to drag and drop. So I need something similar here, not just data dropped into Glue; I want a layer that is user-ready.

rohit1991
Super User

Hi @icassiem ,

 

Option 1: AWS-Native with Power BI Gateway (Most AWS-Native). Keep your reporting DB and ETL in your primary AWS region. Use the Power BI on-premises data gateway (can be hosted on an EC2 instance) to connect Power BI to your AWS database. For the second AWS region, only replicate the critical data using AWS DMS or cross-region ETL, rather than duplicating everything. This keeps most of your stack AWS-native and minimizes data management headaches.

Option 2: Central Data Lake (S3). Use an S3 bucket as your “single source of truth” with cross-region replication (a rough boto3 sketch follows after Option 3). Have both regions’ ETL jobs push data to the same S3 structure. Power BI can connect directly to S3 via connectors (ODBC, Athena, or third-party connectors), so you don’t need a separate database per region. Makes self-service much easier, especially if you automate permissions and folder structure.

Option 3: Azure Hybrid (Best for Tight Power BI Integration). If you want the easiest Power BI experience and don’t mind being a little less AWS-pure, you can have your ETL jobs push the data to Azure Data Lake or Azure SQL DB. Power BI “just works” with Azure data sources; no gateway or VPC fuss. For under 1 TB, Azure storage is pretty cheap, and you eliminate a ton of maintenance (no EC2, no gateway, no cross-region VPC headaches). Great for organizations where reporting is more important than strict cloud purity.
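As a concrete starting point for the replication in Option 2, here is a rough boto3 sketch; bucket names and the IAM role ARN are placeholders, and versioning must already be enabled on both buckets:

```python
# Sketch: enable S3 Cross-Region Replication on the source bucket.
import boto3

s3 = boto3.client("s3", region_name="eu-west-1")

s3.put_bucket_replication(
    Bucket="reporting-data-euw1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-crr-role",
        "Rules": [{
            "ID": "replicate-reporting",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": ""},  # replicate the whole bucket
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::reporting-data-euc1"},
        }],
    },
)
```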


Did it work? ✔ Give a Kudo • Mark as Solution – help others too!
jaineshp
Memorable Member

Hey @icassiem,

Looking at your multi-region AWS setup with PowerBI requirements, here are some practical approaches to consider:

Option 1: AWS-Native with PowerBI Gateway Set up your reporting database and ETL in your primary AWS region, then use PowerBI's on-premises data gateway (can run on EC2) to connect.

For the secondary region, replicate only critical data using AWS DMS or cross-region replication. This keeps most infrastructure AWS-native while enabling PowerBI connectivity.


Option 2: Centralized Data Lake Approach
Use S3 as your central data repository with cross-region replication. Build your ETL processes in both regions but have them write to the same S3 bucket structure.


PowerBI can connect directly to S3 via appropriate connectors, reducing the need for separate databases in each region.


Option 3: Hybrid Push Strategy
Since you mentioned under 1TB of data, consider using AWS Glue to transform and push final datasets to PowerBI's cloud service via API. This eliminates the gateway requirement and reduces VPC complexity.


Set up the push process in both regions targeting the same PowerBI workspace.
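A rough sketch of what that push looks like against the Power BI REST API's push-dataset rows endpoint: the dataset and table must be created as a push dataset first, and the bearer token (e.g. obtained via MSAL with a service principal), IDs, and row shape below are all placeholders:

```python
# Sketch: append rows to a Power BI push dataset from an ETL job.
import requests

TOKEN = "<aad-bearer-token>"          # placeholder; acquire via MSAL
DATASET_ID = "<push-dataset-id>"      # placeholder push-dataset ID
URL = (f"https://api.powerbi.com/v1.0/myorg/datasets/{DATASET_ID}"
       "/tables/ProductMetrics/rows")

rows = [{"client_id": "c001", "events": 42, "load_date": "2025-08-01"}]

resp = requests.post(
    URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"rows": rows},
)
resp.raise_for_status()  # surfaces throttling/auth errors for alerting
```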


Building Your Azure Case:
For executive buy-in, emphasize that PowerBI's strength lies in its integration with Azure services. Point out that keeping data in Azure eliminates:

  • EC2 gateway maintenance overhead
  • VPC security complexities
  • Cross-region data sync challenges
  • Licensing costs for gateway infrastructure

The 1TB limit actually works in your favor - it's small enough that Azure storage costs are minimal compared to the operational complexity savings.

Would any of these directions align with your organizational constraints?

Did it work? ✔ Give a Kudo • Mark as Solution – help others too!

@v-karpurapud @jaineshp @rohit1991 Thank you so much for sharing your knowledge.

I am about to start with the strategy, and I want to present three options.

Azure & data warehouse environments like Redshift, Snowflake are not an option, our organisation spent a lot of money migrating to aws over the past 2 years

I need something better than what I currently have (job schedules > Glue > Lambda > PySpark API loop > S3 > Athena data lake, with no views or manipulation), as I need to prove why we should move to a centralised self-service space where users access only the managed data views for writing Power BI reports.

 

If Athena remains the best option, how do I expand and improve it? I need job management and exception handling/notifications. I was hoping there was something better than GUI Glue, and I also don't see the Athena data-view environment. Also note that I know the data is under 1 TB at the moment, but what if in two years it grows to 2 TB for the Hybrid Push Strategy?

 

Please guide me: provide the three best options, in the order you most advise, with some detail on why, the benefits, links, etc., and even rough cost brackets. The reason I struggle with Glue is that we use Lambdas, which I don't want, or PySpark scripting, which means depending on a vendor (as we do now); my space is small and doesn't need that dependency. I want to configure an API call and loop per table, for example to get client API data for all clients, and then a data-management layer to manipulate the data. So do I go RDS PostgreSQL, MySQL, or something else? I would like Snowflake or Redshift, but my data size does not justify that cost, as the purpose is purely internal management reporting on the product, with roughly 20 clients. The data is small, but I need to be able to build business-logic datasets with SQL, or what else?

 

Please help; I want to place the three options in a table to compare them and provide my recommended priority, so that I can be open-minded with my CTO when presenting the most robust solution. My sources are RDS and a product API per client (loop); I need alerts, and I need views to manipulate data and track failures.

 

Sorry, I know I am asking a lot, but by searching the net on my own I will come nowhere close to what's best, and I really want to add value in my new organisation.
