icassiem
Post Prodigy

Architecture Help for Internal Reporting Using Power BI

Good day,

 

I need help with a reporting strategy for internal users of our product service data. We have two AWS regions, and I have been asked to stay as AWS-native as possible in terms of tools.

 

So I have the issue that I can't simply set up a small reporting database in one region with Glue ETL etc.; it would need to be replicated for the other region.

Thinking out loud here:

1. Could I use the ETL to copy data to Power BI or Azure as a final dataset, meaning only the ETL resides in both regions/AWS accounts?

2. But this would not be an AWS solution, as I think Azure is ruled out.

3. Could I push it to Power BI directly? But that raises a security concern about data residing in the Power BI service.

 

I just don't know how I would set up an environment where I can offer a self-service platform to my internal clients without duplicating it across multiple regions, which would make the data management complex.

 

Does anybody have ideas I can explore? I thought about pushing a case for the data to reside in Azure, since it pairs best with Power BI and we wouldn't need an EC2 gateway or the AWS VPC security issues, but I have a strong feeling Azure will be a no-go, as AWS is the org's future plan, with Azure used only for IT management.

But can I build a case for data under 1TB, or is there an AWS architecture / Power BI solution?

I will be managing and developing the space for champion users to do self-service; having to use Azure from AWS is a blocker.

 

Please help, any ideas?

 

Regards

19 REPLIES
jaineshp
Impactful Individual

Hey @icassiem,

Thanks for the detailed follow-up — it's great to see how committed you are to designing a future-proof and self-service reporting solution. Based on your constraints (AWS-only, under 1TB currently, Glue/S3/Athena as baseline, avoidance of Lambda, preference for job control and managed views), here are three practical options, prioritized by suitability.

Option 1: Athena-Based Architecture with Step Functions and Enhanced Monitoring (Recommended)

  • Build on your current Glue → S3 → Athena structure, but introduce Step Functions to orchestrate ETL, API loops per client, and manage job state.

  • Replace Lambda with Glue Python Shell Jobs, giving you more control and flexibility.

  • Use Athena Views to define and expose only managed, consumable data sets for PowerBI reporting.

  • Set up SNS and CloudWatch Alarms for failure alerts, retries, and visibility.

  • Enables a fully serverless, cost-efficient architecture with strong observability.

  • Approximate Cost: Low — primarily based on data scanned in Athena and Glue job runtime.
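
To make this option concrete, here is a minimal sketch (my own illustration, not part of the original suggestion) of what the Step Functions definition could look like in Python/boto3: a Map state loops over the client list, each iteration runs a hypothetical Glue Python Shell job, and any failure publishes to a hypothetical SNS topic. Job, topic, role and account identifiers are placeholders.

```python
import json
import boto3

# Hypothetical names: replace with your own Glue job, SNS topic and IAM role.
definition = {
    "StartAt": "PerClientLoop",
    "States": {
        "PerClientLoop": {
            "Type": "Map",
            "ItemsPath": "$.clients",        # e.g. {"clients": [{"id": "client-a"}, ...]}
            "MaxConcurrency": 5,
            "Iterator": {
                "StartAt": "RunClientEtl",
                "States": {
                    "RunClientEtl": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::glue:startJobRun.sync",
                        "Parameters": {
                            "JobName": "client-api-etl",           # Glue Python Shell job
                            "Arguments": {"--client_id.$": "$.id"}
                        },
                        "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
                        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                        "End": True
                    },
                    "NotifyFailure": {
                        "Type": "Task",
                        "Resource": "arn:aws:states:::sns:publish",
                        "Parameters": {
                            "TopicArn": "arn:aws:sns:eu-west-1:123456789012:etl-alerts",
                            "Message.$": "$"
                        },
                        "End": True
                    }
                }
            },
            "End": True
        }
    }
}

boto3.client("stepfunctions").create_state_machine(
    name="internal-reporting-etl",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-etl-role",
)
```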

 Option 2: Introduce RDS PostgreSQL as a Curated SQL Layer

  • Use RDS (PostgreSQL) as a lightweight, centralized database to hold curated datasets after transformation.

  • PowerBI connects via on-premises data gateway (can run on EC2 or container).

  • Step Functions or Glue handle ingestion and transformation, then write to RDS.

  • Ideal for building re-usable views and joining multiple source tables (e.g., client-level API data).

  • Approximate Cost: Moderate — small RDS instance (~$80/month) + gateway maintenance.
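
As a rough sketch only (the pandas/SQLAlchemy stack, schema and connection details below are assumptions, not something prescribed in this thread), the curated RDS layer could be loaded and exposed to Power BI roughly like this:

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string for the curated RDS PostgreSQL instance;
# assumes a 'curated' schema already exists.
engine = create_engine("postgresql+psycopg2://etl_user:***@reporting-rds.internal:5432/reporting")

# Stand-ins for DataFrames produced by the Glue / Step Functions transform step.
clients = pd.DataFrame({"client_id": [1, 2], "client_name": ["Acme", "Globex"]})
usage = pd.DataFrame({"client_id": [1, 2], "month": ["2025-07", "2025-07"], "calls": [120, 80]})

clients.to_sql("dim_client", engine, schema="curated", if_exists="replace", index=False)
usage.to_sql("fact_usage", engine, schema="curated", if_exists="replace", index=False)

# A managed, re-usable view that Power BI users query via the gateway.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE OR REPLACE VIEW curated.v_client_usage AS
        SELECT c.client_name, u.month, u.calls
        FROM curated.fact_usage u
        JOIN curated.dim_client c USING (client_id);
    """))
```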

  Option 3: Hybrid Push to PowerBI Service via API (Best for Controlled Volumes)

  • Push final, transformed datasets directly into PowerBI via the PowerBI REST API — skipping the need for a gateway.

  • Glue or Step Functions can perform transformations and trigger API calls.

  • Suitable for small, predefined models and eliminates VPC/networking complexity.

  • Less ideal for audit logging, recovery, or user-driven exploration.

  • Approximate Cost: Very low infra cost — pay per execution/API call only.
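
A hedged sketch of the push approach, assuming an Azure AD access token and an already-created Power BI push dataset (the token, IDs and table name are placeholders); a Glue job or Step Functions task could post the final rows like this:

```python
import requests

# Placeholders: obtain an Azure AD access token for the Power BI REST API
# (e.g. via a service principal) and use the IDs of an existing push dataset.
access_token = "<AAD_ACCESS_TOKEN>"
dataset_id = "<PUSH_DATASET_ID>"
table_name = "ClientUsage"

rows = [
    {"client_id": 1, "month": "2025-07", "calls": 120},
    {"client_id": 2, "month": "2025-07", "calls": 80},
]

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/datasets/{dataset_id}/tables/{table_name}/rows",
    headers={"Authorization": f"Bearer {access_token}"},
    json={"rows": rows},
    timeout=30,
)
resp.raise_for_status()  # push datasets have row/volume limits, so keep batches small
```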

You're clearly thinking strategically, and this approach will definitely bring long-term benefits to your data consumers.



Best Regards,
Jainesh Poojara | Power BI Developer

@v-karpurapud, @jaineshp, @rohit1991 Thank you so much for sharing your knowledge.

I am about to start with the strategy, and I want to present three options.

Azure and data warehouse environments like Redshift or Snowflake are not an option; our organisation has spent a lot of money migrating to AWS over the past two years.

I need something better than what I currently have: job schedules > Glue > Lambda > PySpark API loop > S3 > Athena data lake > (no views or manipulation). I need to prove why we should move to a centralised self-service space where users access only the managed data views for writing Power BI reports.

 

If Athena remains the best option, how do I expand and improve it? I need job management and exception handling/notification, and I was hoping there was something better than GUI Glue; I also don't see an Athena data view environment. Also note that I know it is under 1TB at the moment, but what if in two years it grows to 2TB for the Hybrid Push Strategy?

 

Please guide me: provide the three best options, in the order you would advise, with some detail on why (benefits, links, even rough cost brackets). The reason I struggle with Glue is that we use Lambdas, which I don't want, or PySpark scripting, which means relying on a vendor as we do now; my space is small and I don't need that dependency. I want to configure an API and loop per table (for example, to get client API data for all clients) and then a data-management layer to manipulate the data. So do I go RDS PostgreSQL, or MySQL, or something else? I would like Snowflake or Redshift, but my data size does not justify that cost, as the purpose is purely internal management reporting on the product with roughly 20 clients. The data is small, but I need to be able to build business-logic datasets with SQL or similar.

 

Please help. I want to lay the three options out in a table to compare them and provide my recommended priority, so that I can show my CTO I am being open-minded and presenting the most robust solution. My sources are RDS and a product API per client (loop); I need alerts, and I need views to manipulate the data and track failures.

 

Sorry, I know I am asking a lot, but by searching the net alone I will come nowhere close to what's best, and I really want to add value in my new organization.

jaineshp
Impactful Individual

Hey @icassiem,

Thank you for the kind recognition - always happy to contribute to our community's success!

Best Regards,
Jainesh Poojara | Power BI Developer

@jaineshp would your earlier reply be your response to my latest reply/request (30 min ago)? Please advise, brainstorm and scrutinise my thinking, as I will start putting the three options in place. I'd prefer links to back my case and show that my homework has been done.

jaineshp
Impactful Individual

Hey @icassiem,

Thank you for expanding on the requirements—it really helps clarify the situation you're working with. Here's how I'd refine and validate the three options further, based on your latest inputs:

Confirmation: Yes — My Previous Reply Stands, With Some Refinements

Your thinking is very sound — especially your push for:

  • Central SQL-based data views

  • Alerting and monitoring

  • Avoiding PySpark/Lambda-heavy dependencies

  • Building a scalable, centralized data layer

That said, below is a revised breakdown with some scrutiny, suggestions, and extra links to strengthen your case when pitching to leadership.

Option 1: AWS Step Functions + RDS (PostgreSQL) + S3 + Athena

Still my top recommendation based on cost, flexibility, and simplicity


What works well:

  • RDS PostgreSQL gives you clean, view-based modeling with SQL (great for BI self-service)

  • Step Functions offers job orchestration without coding everything from scratch

  • You can still archive to S3 and query via Athena when needed

  • Very low Lambda dependency — just use them if absolutely necessary as small wrappers

  • SNS + CloudWatch for failure alerts and visibility

 Scrutiny:

  • Step Functions state machine debugging can get tricky for large pipelines — use visual workflows & logging

  • RDS scaling should be monitored, especially with concurrent Power BI users (add read replicas if needed)

 Resources to back it up:

Step Functions Intro

Using RDS PostgreSQL for BI

Athena CTAS & Views 
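
For illustration only, a managed Athena view of the kind described above could be created from Python/boto3 like this; the database, table and results-bucket names are hypothetical:

```python
import boto3

athena = boto3.client("athena", region_name="eu-west-1")  # placeholder region

create_view_sql = """
CREATE OR REPLACE VIEW reporting.v_client_usage AS
SELECT client_id, date_trunc('month', event_date) AS month, count(*) AS calls
FROM raw.client_events            -- Glue-catalogued table over S3
GROUP BY client_id, date_trunc('month', event_date)
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "reporting"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder bucket
)
```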


Option 2: MWAA (Managed Airflow) + RDS + S3

Use if orchestration complexity will increase soon (e.g. parallel APIs, retries, conditional workflows)

 Pros:

  • Airflow gives you granular DAG-level visibility, retries, logging

  • You can modularize your logic and control retries per task

  • Can be extended later to trigger ML models or external systems

  • Still SQL-centric with PostgreSQL as core transformation layer

 Watch out for:

  • MWAA is more expensive to run even for small loads

  • Setup time is more than Step Functions, and operational overhead is real

  • Might feel like "overkill" unless workflows become more complex than they are now

 

Resources:

MWAA on AWS

Airflow DAG Concepts 

Airflow + RDS use case 
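
For reference, a minimal Airflow DAG sketch of the kind MWAA would run for this option; the client list, callables and schedule are placeholders, and the task bodies are left as stubs:

```python
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

CLIENTS = ["client_a", "client_b"]  # hypothetical client list / config table

def fetch_client_api(client_id, **_):
    # would call the product API for one client and land the result in the staging S3 bucket
    ...

def load_to_postgres(**_):
    # would transform the staged files and write curated tables/views to RDS PostgreSQL
    ...

with DAG(
    dag_id="internal_reporting_etl",
    start_date=datetime(2025, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    load = PythonOperator(task_id="load_to_postgres", python_callable=load_to_postgres)
    for client in CLIENTS:
        PythonOperator(
            task_id=f"fetch_{client}",
            python_callable=fetch_client_api,
            op_kwargs={"client_id": client},
        ) >> load
```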

Option 3: Athena-Only with Better Scheduling + Metadata Layer

Viable if budget is the main blocker, and you don’t anticipate high data transformation needs

 Strengths:

  • Zero infra — Athena on S3 is very cheap and scalable

  • You can define business views using SQL (via CTAS or Views)

  • Simple scheduling with EventBridge or Step Functions + Lambda/Python jobs

 Weaknesses:

  • No central orchestration: managing failure, retries, dependencies becomes painful

  • Not ideal for layered business logic across tables (joins can get slow at >2TB)

  • Lack of state or metadata tracking without extra tooling (e.g., no “which client failed last?” out of the box)
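
As an example of the "simple scheduling" piece of this option, a hedged boto3 sketch (rule name, ARNs and schedule are placeholders) that has EventBridge start a nightly Step Functions execution to refresh the Athena views:

```python
import json
import boto3

events = boto3.client("events", region_name="eu-west-1")  # placeholder region

events.put_rule(
    Name="nightly-athena-refresh",
    ScheduleExpression="cron(0 2 * * ? *)",   # 02:00 UTC every day
    State="ENABLED",
)

events.put_targets(
    Rule="nightly-athena-refresh",
    Targets=[{
        "Id": "refresh-state-machine",
        "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:athena-view-refresh",
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",
        "Input": json.dumps({"views": ["reporting.v_client_usage"]}),
    }],
)
```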

 



Best Regards,
Jainesh Poojara | Power BI Developer

@jaineshp, @v-karpurapud, @rohit1991, apologies for a simple question: are there no additional perspectives on my updated response and @jaineshp's answer yesterday?

I want to compile the best three options and be 200% confident that everything has been covered and the best solutions provided. I'm putting my neck out here, and I would be implementing the architecture too, so I really need to be sure I build a futuristic and modern architecture, one that I can be proud of, that I'm comfortable meets the objective, and that delivers visible change/improvements for stakeholders.

@jaineshp Option 1 (AWS Step Functions + RDS (PostgreSQL) + S3 + Athena) is very similar to the current legacy of AWS Step Functions > Glue > Lambda > crawlers > S3 > Athena > EC2 gateway.
The semantic layer of views would presumably sit in PostgreSQL, but job management similar to SQL Server Agent is missing, where all jobs and tasks are managed with alerts etc. I miss SSIS/Data Factory 🙂 I thought there was a tool integrated in AWS (dbt, for example, I don't know) that could manage the whole space of scheduling/orchestration > ETL > data management > building the semantic layer > Power BI connection.

Option 1 also misses the key factor of the two regions that we want to centralize, denormalizing the data from the client API into various DB stages for self-service, so that we don't duplicate the environment per region.
Sorry for asking again.

Hi @icassiem 


Thank you for getting back to us. To design a modern, scalable, and AWS-native internal reporting platform that eliminates current architectural pain points and supports centralized, self-service BI with Power BI, the most suitable approach is to build a centralized, SQL-driven architecture with simplified orchestration, data modeling, and governance.

Option 1: MWAA + RDS (PostgreSQL) + S3 + Athena — Future-Ready Architecture


This architecture presents a fully centralized, SQL-driven, and scalable solution designed for long-term growth and operational efficiency. AWS Managed Workflows for Apache Airflow (MWAA) is leveraged to orchestrate API ingestion, data transformation, retries, and alerting, offering complete visibility and control through DAGs. Raw data is stored in Amazon S3, while business logic is modeled using PostgreSQL views, which can be directly consumed by Power BI for reporting and analytics.

This approach eliminates the need for complex Lambda-based logic, supports multi-region data handling via S3 replication, and ensures centralized job management and failure tracking. It is particularly well-suited for organizations anticipating data growth beyond 2TB and seeking robust orchestration, SQL-based modeling, and future scalability.
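
To illustrate the multi-region handling mentioned above, a minimal sketch (bucket names, account ID and role are placeholders, and both buckets must already have versioning enabled) of configuring S3 cross-region replication from a secondary-region bucket into the central reporting bucket:

```python
import boto3

s3 = boto3.client("s3", region_name="eu-central-1")  # secondary/source region (placeholder)

s3.put_bucket_replication(
    Bucket="product-data-eu-central-1",               # source bucket in the second region
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
        "Rules": [{
            "ID": "replicate-to-central-reporting",
            "Status": "Enabled",
            "Priority": 1,
            "Filter": {"Prefix": "raw/"},
            "DeleteMarkerReplication": {"Status": "Disabled"},
            "Destination": {"Bucket": "arn:aws:s3:::central-reporting-eu-west-1"},
        }],
    },
)
```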

 
Option 2: Step Functions + RDS (PostgreSQL) + S3 + Athena — Lightweight & Cost-Effective Architecture

 

This architecture offers a streamlined and cost-efficient solution for orchestrating data pipelines using AWS Step Functions. It enables lightweight orchestration of client API loops, data transformations, and alerting through CloudWatch and SNS, while storing raw data in Amazon S3. Business logic is modeled using PostgreSQL views, which can be directly accessed by Power BI for reporting and analytics.

The solution provides visual workflows, built-in error handling, and scheduling capabilities with significantly lower operational overhead compared to MWAA, making it ideal for smaller teams or simpler workloads. It supports centralized, SQL-driven modeling without the complexity of Airflow, and is well-suited for organizations prioritizing simplicity, low-code orchestration, and cost efficiency.
Estimated Monthly Cost: Approximately $150–$250

 Option 3: Athena + Glue DataBrew + EventBridge — Minimal & Budget-Friendly Architecture

 

This architecture is designed for teams managing small datasets with minimal orchestration requirements, offering a highly cost-effective and simplified solution. AWS Glue DataBrew enables no-code data preparation, while SQL-based modeling is achieved through Athena Views or CTAS (Create Table As Select). Scheduling is handled via EventBridge or optionally Step Functions, depending on workflow complexity. Although this setup does not provide full orchestration or built-in job monitoring, it is well-suited for use cases where simplicity, low maintenance, and budget constraints are key considerations. Manual intervention may be required for failure tracking and metadata management, but the overall operational overhead remains minimal.
Estimated Monthly Cost: Less than $100

I hope this information is useful. If you need more assistance, please provide additional details.

Regards,
Karpurapu D,
Microsoft Fabric Community Support Team.



 

@v-karpurapud, @jaineshp, @rohit1991 Thank you very much, and apologies for the silly questions here; I need to ensure I highlight the pros/cons when presenting.

 

@v-karpurapud 

Option 1: MWAA + RDS (PostgreSQL) + S3 + Athena — Future-Ready Architecture

Option 2: Step Functions + RDS (PostgreSQL) + S3 + Athena — Lightweight & Cost-Effective Architecture

Option 3: Athena + Glue DataBrew + EventBridge — Minimal & Budget-Friendly Architecture
Questions

A. Is MWAA both the orchestrator and the ETL that extracts from the sources (RDS MySQL/PostgreSQL, API, S3, MongoDB)? How would it handle the data manipulation we used PySpark/Python for (denormalise, API loop, manipulate, etc.), and does this eliminate Glue?

B. The raw data is stored in S3 and the logic in PostgreSQL views: does this mean that PostgreSQL, and not Athena, connects directly to S3? What about unstructured data etc.? I thought PostgreSQL only worked on relational data structures.

C. If the data is stored in S3, I would not be able to store snapshot data over time, correct, meaning it will be driven/refreshed from source only?

@jaineshp I looked into MWAA some time back; the reason is that I was looking for a tool with many options packed in. As @v-karpurapud mentions, it can do API loops/data manipulation etc., which is why I mentioned dbt as an example only ("I don't know"): then I wouldn't need different places to track ETL (Glue), Step Functions for orchestration, and still PySpark/Python to transform/manipulate/loop the Lambda API calls, while the job scheduling stays very simple and basic.

 

Also, the elements we are missing are the centralisation of the two regions and the gateway: we don't want to replicate the environment in both regions. Plus, is the EC2/gateway within the VPC needed? The prod product resides in the same VPC, hence security was not keen.

 

Apologies, please help and provide a bit more detail. I received some more clarity yesterday: it must be a native AWS self-service environment, possibly on a transformed DB/S3 storage for data snapshots (or not, a con if not), with a semantic layer of views, because the objective might in future expand to sharing data with clients via API and product reporting. The current objective is a draft strategy for internal PBI reporting, but I know the stakeholders' expectations, and there is already misalignment. I want to ensure it is simple (to develop/manage, with alerts etc.), cost-effective (+/- $1,000 monthly target), modern (design, data flow, tools etc.) and yet expandable, like the data storage of PostgreSQL and the MWAA orchestration/ETL. I think the cost is roughly $150-200 monthly, but if this is the key orchestration + ETL + data manipulation, it is essentially half the data flow, so I think I can justify it; it must, however, make my life much easier as I implement, further develop and maintain it.


Me thinking out loud here 🙂
I want to narrow it down to the two best recommended options: one with S3 storage for internal reporting on current data only, with overwrites (no data snapshots, tracking over time or benchmarking data), and the other a more upscale one with a DB store etc.?

 

I think the challenge, outside of the tools etc., is the simplicity of development/management combined with data centralization.

@jaineshp, @v-karpurapud, @rohit1991 Hi, any update on my last reply above, please?

For security, do I set up a separate account within the one region, but outside of the product VPC, so that I can run MWAA/Glue across regions into the one database? That would be easier and lower risk, and it would answer the security officers' push-back about exposing the DB: users could connect to the DB and the semantic layer of views from their PCs, without needing to connect to an EC2 instance.

Also, can MWAA/Glue do the ingestion/access, since it's mostly API/RDS across the regions and accounts, into the new account in the one region, outside of the prod/source data?

 

If I keep it in the one account, the same as the prod product, and in one region, I can't expose the DB and have to let users connect via the EC2 instance, which is not really ideal for a self-service centralised environment.

Please help.

Hi @icassiem 

 

The cleanest way to meet your security requirements and still centralize reporting is to run everything in a dedicated AWS account set up specifically for analytics. This account would sit in one region and host all of your reporting infrastructure: MWAA or Fargate to handle orchestration and API ingestion, a central S3 bucket for raw and transformed data, and an Aurora or RDS instance to hold your curated datasets and semantic views. Keeping this separate from your production accounts isolates any reporting activity from operational systems, reduces security risk, and gives you a clear, compliant boundary for data access.

Data from both regions can flow into this account without replicating the entire environment. In a push model, small extract jobs in each source account write directly into the reporting S3 bucket using a cross-account IAM role that only allows that one action. In a pull model, MWAA or Fargate in the reporting account uses STS to assume a read-only role in the source accounts and fetch from APIs, RDS replicas, or S3. Either way, all transformation, denormalization, and business logic happen in the reporting account, so you’re no longer relying on Lambda/PySpark or juggling jobs across multiple regions.
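
A hedged sketch of the pull model described above (the role ARN, bucket names and keys are placeholders): the job in the reporting account assumes a read-only role in a source account and copies an extract into the central staging bucket.

```python
import boto3

# Assume a read-only role that the source (production) account has trusted
# the reporting account to use.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::111111111111:role/reporting-readonly",  # role in the source account
    RoleSessionName="reporting-pull",
)["Credentials"]

source_s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
    region_name="eu-central-1",
)

# Copy one extract from the production bucket into the reporting account's staging bucket.
obj = source_s3.get_object(Bucket="prod-extracts-eu-central-1", Key="exports/usage.parquet")
boto3.client("s3").put_object(
    Bucket="analytics-staging-bucket",          # bucket owned by the reporting account
    Key="raw/usage.parquet",
    Body=obj["Body"].read(),
)
```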

Analysts would connect to the curated database in a way that satisfies security without forcing an EC2 gateway. The easiest path is AWS Client VPN, which gives them private network access to the reporting VPC and supports corporate authentication, or PrivateLink if you prefer to route through your existing corporate network. The database stays private, fully inside AWS, and only authorized users can reach it. This keeps the design AWS-native, simple to manage, cost-controlled, and flexible enough to expand later with features like historical snapshots or client-facing APIs when the business is ready.

Regards,
Karpurapu D,
Microsoft Fabric Community Support Team.

 

@v-karpurapud Thank you for your patience

Please check whether I understand it correctly, as I have a few questions. The proposed architecture flow is:

 

1. A separate "Analytics" AWS account, apart from the Product account, in the one main region

2. Set up read access from the Analytics account to prod in both regions (RDS, API, S3)

3. MWAA or Fargate for both job orchestration and ingestion, with no need for Glue (cross-region RDS, API, S3)?

       > Which do you recommend for the simplest/best GUI development?

       > Can this do loops of API calls with parameters?

4. Destination of the ETL from the sources = staging S3 bucket

5. Data transformation into (Aurora or RDS)

      > What is your preference on the DB for the denormalised/user-access layer that works best for PBI?

      > How would I automate/schedule the transformation from the S3 staging area to the semantic layer of (Aurora or RDS) tables? Could I schedule SQL scripts/procs on (Aurora or RDS) to read the staging S3 data and then write to a table, or schedule a tool/dbt etc.? (See the sketch after this list.)

     > I think Aurora and RDS can both be serverless; I'm just thinking which one offers the best SQL transformation jobs, because my senior does not want views/a DW etc.

6. User connectivity and report development directly against the DB, not via EC2

       > No EC2 connection for DB connectivity
       > Set up security in the Analytics account with AWS Client VPN or PrivateLink

       > Which one do you recommend for users connecting from their desktops to RDS/Aurora?

7. Don't we need an on-premises gateway on an EC2 instance?
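
Regarding question 5, a minimal sketch of one way this could work, assuming RDS/Aurora PostgreSQL with the aws_s3 extension; connection details, schemas and the S3 location are placeholders, and the import would be scheduled from the orchestrator (MWAA/Step Functions):

```python
# Hypothetical sketch: run this on a schedule to pull a curated CSV from the staging
# bucket into an RDS/Aurora PostgreSQL table, then refresh a reporting view.
import psycopg2

conn = psycopg2.connect(host="analytics-db.example.internal", dbname="reporting",
                        user="etl_user", password="***")  # placeholder connection details
with conn, conn.cursor() as cur:
    # One-off setup, needs appropriate privileges (typically done once by an admin).
    cur.execute("CREATE EXTENSION IF NOT EXISTS aws_s3 CASCADE;")
    cur.execute("""
        SELECT aws_s3.table_import_from_s3(
            'staging.client_usage',                          -- target table
            'client_id, metric_date, usage_count',           -- column list
            '(format csv, header true)',
            aws_commons.create_s3_uri('analytics-staging-bucket',
                                      'curated/client_usage.csv',
                                      'eu-west-1'));
    """)
    cur.execute("""
        CREATE OR REPLACE VIEW semantic.v_client_usage AS
        SELECT client_id, metric_date, SUM(usage_count) AS usage_count
        FROM staging.client_usage
        GROUP BY client_id, metric_date;
    """)
```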

 

Regards

v-karpurapud
Community Support

Hi @icassiem 

Thank you for reaching out to the Microsoft Fabric Community Forum.


As recommended by @rohit1991 and @jaineshp, the optimal solution for a secure, scalable, AWS-native internal reporting architecture is to implement a Centralized S3 Data Lake governed by AWS Lake Formation. By enabling S3 Cross-Region Replication, data from multiple AWS regions can be consolidated into a single source of truth, eliminating manual duplication and simplifying data management.
 

AWS Lake Formation will provide fine-grained access controls at the database, table, and column levels, ensuring secure, governed data access for Power BI users while reducing IAM complexity. To simplify current ETL pipelines, AWS Glue DataBrew offers a no-code, visual interface for data transformation and cleansing, making ETL processes manageable for a single developer with automated, repeatable workflows.
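
For illustration, a minimal hedged example (the principal ARN, database, table and column names are placeholders) of the kind of column-level grant Lake Formation enables for a Power BI analysts' role:

```python
import boto3

lf = boto3.client("lakeformation", region_name="eu-west-1")  # placeholder region

# Grant the BI analysts' role SELECT on only the columns they need in one curated table.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/powerbi-analysts"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "reporting",
            "Name": "client_usage",
            "ColumnNames": ["client_id", "month", "calls"],
        }
    },
    Permissions=["SELECT"],
)
```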
 

For business logic modeling, Athena SQL Views can serve as a semantic layer for immediate needs. If performance and concurrency demands grow, this can be scaled to Amazon Redshift Serverless for a high-performance semantic model without infrastructure overhead.
 

Power BI will connect directly to AWS data sources via Athena ODBC or Redshift connectors, eliminating the need for EC2-hosted gateways or data movement into Azure. Given your data volume of under 1TB, Import Mode in Power BI remains a viable option to enhance report performance while ensuring data remains within AWS.
 

This architecture simplifies ETL, strengthens data governance, enables secure self-service BI, and aligns with your organization’s AWS-first strategy, providing a future-proof, cost-effective solution for scalable internal reporting.

Regards,
Karpurapu D,
Microsoft Fabric Community Support Team.


Hey @v-karpurapud,

Thank you for the kind recognition - always happy to contribute to our community's success!

Best Regards,
Jainesh Poojara | Power BI Developer

icassiem
Post Prodigy

@rohit1991 @jaineshp  Wow, Thank You

Let me complete my scorecard this week, have a few things to implement, I did not expect such good help so quickly 🙂 

 

Please help me build this business case, where I present all three options with pros and cons and then recommend one of them. Can you help with this in the meantime, please?

 

So my current environment is a data lake using Glue off S3 and Athena, but the cost is heavy for a small footprint, and self-service is not great for users as I need to transform the data. I don't like Glue, and I am alone, so I don't want heavy-coding solutions; I need job management and alerts etc.

I am starting on this in full swing on Monday, please bear with me.

Hi @icassiem ,

 

Yes, you can keep everything AWS-native without Azure as the final data layer:

  • Keep your reporting data in S3 (not Azure) and use AWS Glue/ETL to update only what’s needed.

  • Connect Power BI directly to S3 using Athena (ODBC or built-in connector). Power BI just reads from AWS; there is no need to move data into Azure.

  • Security is handled at the AWS level (IAM/Lake Formation). Only metadata goes to Power BI.

This avoids duplicating data across clouds, keeps everything compliant, and still allows for self-service reporting in Power BI.



But this is very similar to the current environment (one region only) of Glue to S3 and Athena connected to Power BI. I am trying not only to consolidate the two regions' data but also to improve on it: Glue was code-heavy with little data manipulation, and for the Lambda API call loops etc. I am using a third party, which we need to move away from. In other words, we are moving on from a five-year-old design that is expensive, and I need to prove why we are changing for the better. Being a SQL guy, I would have a semantic layer of views with business logic off a DW for users to drag and drop, so I need something similar, not just dropping data into Glue; I want a layer that is user-ready.

rohit1991
Super User

Hi @icassiem ,

 

Option 1: AWS-Native with Power BI Gateway (Most AWS-Native). Keep your reporting DB and ETL in your primary AWS region. Use the Power BI on-premises data gateway (it can be hosted on an EC2 instance) to connect Power BI to your AWS database. For the second AWS region, only replicate the critical data using AWS DMS or cross-region ETL, rather than duplicating everything. This keeps most of your stack AWS-native and minimizes data management headaches.

Option 2: Central Data Lake (S3). Use an S3 bucket as your “single source of truth” with cross-region replication. Have both regions’ ETL jobs push data to the same S3 structure. Power BI can connect directly to it via connectors (ODBC, Athena, or third-party connectors), so you don’t need a separate database per region. This makes self-service much easier, especially if you automate permissions and folder structure.

Option 3: Azure Hybrid (Best for Tight Power BI Integration). If you want the easiest Power BI experience and don’t mind being a little less AWS-pure, you can have your ETL jobs push the data to Azure Data Lake or Azure SQL DB. Power BI “just works” with Azure data sources: no gateway or VPC fuss. For under 1TB, Azure storage is pretty cheap, and you eliminate a ton of maintenance (no EC2, no gateway, no cross-region VPC headaches). Great for organizations where reporting is more important than strict cloud purity.


jaineshp
Impactful Individual

Hey @icassiem,

Looking at your multi-region AWS setup with PowerBI requirements, here are some practical approaches to consider:

Option 1: AWS-Native with PowerBI Gateway Set up your reporting database and ETL in your primary AWS region, then use PowerBI's on-premises data gateway (can run on EC2) to connect.

For the secondary region, replicate only critical data using AWS DMS or cross-region replication. This keeps most infrastructure AWS-native while enabling PowerBI connectivity.


Option 2: Centralized Data Lake Approach
Use S3 as your central data repository with cross-region replication. Build your ETL processes in both regions but have them write to the same S3 bucket structure.


PowerBI can connect directly to S3 via appropriate connectors, reducing the need for separate databases in each region.


Option 3: Hybrid Push Strategy
Since you mentioned under 1TB of data, consider using AWS Glue to transform and push final datasets to PowerBI's cloud service via API. This eliminates the gateway requirement and reduces VPC complexity.


Set up the push process in both regions targeting the same PowerBI workspace.


Building Your Azure Case:
For executive buy-in, emphasize that PowerBI's strength lies in its integration with Azure services. Point out that keeping data in Azure eliminates:

  • EC2 gateway maintenance overhead
  • VPC security complexities
  • Cross-region data sync challenges
  • Licensing costs for gateway infrastructure

The 1TB limit actually works in your favor - it's small enough that Azure storage costs are minimal compared to the operational complexity savings.

Would any of these directions align with your organizational constraints?


