What you would like to accomplish: Introduce catalog-level “join groups” in Fabric/SQL to enable multiple columns across different tables (e.g., fact and dimension tables) to share a consistent dictionary encoding for join keys. This would ensure that identical string or UUID values are assigned the same internal integer code, making joins and group-bys on these keys significantly faster and more memory-efficient. DuckDB Feature Request I made: Shared / Common Dictionary Encoding for Join Keys includes more details on the motivation and potential approaches. BigQuery Feature Request I made https://issuetracker.google.com/issues/442600668 One important detail to highlight is that typical dictionary encoding is local to a column or file, so even if two tables both use dictionary-encoded columns (e.g., customer_id as a string), the integer codes won't match between them. This forces the engine to fall back to comparing the original string values during joins, completely negating the performance benefits of encoding. That’s the core issue this proposal aims to address. How this might work: Allow explicit declaration of “join groups” at the dataset or project level, so that specific columns across tables share the same dictionary domain. Example of a possible SQL API: -- Define a join group to align dictionary codes across tables
CREATE JOIN GROUP customer_id_group (
fact_customer_daily.customer_id,
fact_customer_orders_daily.customer_id
);
-- Optionally rebuild or refresh the dictionary
ALTER JOIN GROUP customer_id_group REBUILD; Internally, Fabric/SQL would persist and reuse a common dictionary for the grouped columns, ensuring that equivalent string/UUID values map to identical encoded integer IDs across tables and partitions. If applicable, reasons why alternative solutions are not sufficient: Manual mapping tables: Today, teams often maintain surrogate key lookup tables to replace string/UUID identifiers with integer IDs. This adds complexity, operational overhead, and data engineering effort. Per-table dictionary encoding (as in Parquet/ORC): While these formats support dictionaries, they are scoped to row groups or files, and don’t provide cross-table consistency. Other information (workarounds you have tried, documentation consulted, etc): Current practice in many warehouses (includingFabric/SQL) is to maintain mapping tables for surrogate integer keys derived from string or UUID identifiers. While effective, this approach requires extra ETL complexity and maintenance. Implementing catalog-level shared dictionaries inFabric/SQL could provide operational simplicity and significant performance improvements for users with large, join-heavy analytic workloads. That said, I also want to acknowledge that it's entirely possible systems like Fabric/SQL may already be implementing something along these lines internally, and as an external user, I wouldn’t necessarily have visibility into whether or how that’s happening.
... View more