Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 

Get Fabric certified for FREE! Don't miss your chance! Learn more

Feature Request: Shared / Common Key Dictionary Encoding for Join Keys

What you would like to accomplish:

Introduce catalog-level “join groups” in Fabric/SQL to enable multiple columns across different tables (e.g., fact and dimension tables) to share a consistent dictionary encoding for join keys. This would ensure that identical string or UUID values are assigned the same internal integer code, making joins and group-bys on these keys significantly faster and more memory-efficient.

DuckDB Feature Request I made: Shared / Common Dictionary Encoding for Join Keys includes more details on the motivation and potent...https://issuetracker.google.com/issues/442600668 

One important detail to highlight is that typical dictionary encoding is local to a column or file, ...

How this might work:

  • Allow explicit declaration of “join groups” at the dataset or project level, so that specific columns across tables share the same dictionary domain.
  • Example of a possible SQL API:
    -- Define a join group to align dictionary codes across tables
    CREATE JOIN GROUP customer_id_group (
      fact_customer_daily.customer_id,
      fact_customer_orders_daily.customer_id
    );
    
    -- Optionally rebuild or refresh the dictionary
    ALTER JOIN GROUP customer_id_group REBUILD;

    Internally, Fabric/SQL would persist and reuse a common dictionary for the grouped columns, ensuring that equivalent string/UUID values map to identical encoded integer IDs across tables and partitions.

    If applicable, reasons why alternative solutions are not sufficient:

    • Manual mapping tables: Today, teams often maintain surrogate key lookup tables to replace string/UUID identifiers with integer IDs. This adds complexity, operational overhead, and data engineering effort.
    • Per-table dictionary encoding (as in Parquet/ORC): While these formats support dictionaries, they are scoped to row groups or files, and don’t provide cross-table consistency.

      Other information (workarounds you have tried, documentation consulted, etc):

      • Current practice in many warehouses (includingFabric/SQL) is to maintain mapping tables for surrogate integer keys derived from string or UUID identifiers. While effective, this approach requires extra ETL complexity and maintenance.
      • Implementing catalog-level shared dictionaries inFabric/SQL could provide operational simplicity and significant performance improvements for users with large, join-heavy analytic workloads.
      • That said, I also want to acknowledge that it's entirely possible systems like Fabric/SQL may already be implementing something along these lines internally, and as an external user, I wouldn’t necessarily have visibility into whether or how that’s happening.
Status: New