Hi all,
I got stuck with a simple task.
I need to read from PostgreSQL that sits on-prem behind a gateway.
The gateway works for workflows and the copy activity, but is not available in notebooks.
Postgres has some huge tables, so I need to read only the delta.
I need to look up the max recorded timestamp in the target table, read the records after that timestamp from the source table, and merge.
I'd prefer to merge the data into a Lakehouse table; that will be my bronze layer.
Sadly, I can't find how to achieve this.
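The watermark-and-merge pattern described above can be sketched in plain Python (the `id` and `lastmodified` column names and the sample rows are assumptions for illustration, not taken from the actual source database):

```python
# Target (bronze) table, keyed by id, simulated as a dict of rows.
target = {1: {"id": 1, "lastmodified": "2024-01-01"},
          2: {"id": 2, "lastmodified": "2024-01-05"}}

# 1) Look up the max recorded timestamp in the target table.
watermark = max(r["lastmodified"] for r in target.values())

# 2) Read only the source records newer than the watermark.
source = [{"id": 2, "lastmodified": "2024-02-01"},
          {"id": 3, "lastmodified": "2024-02-02"}]
delta = [r for r in source if r["lastmodified"] > watermark]

# 3) Merge (upsert by id) the delta into the target.
for row in delta:
    target[row["id"]] = row
```

In the real pipeline, step 1 maps to a Lookup activity, step 2 to the Copy activity's source query, and step 3 to the merge into the Lakehouse table.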
I tried a pipeline.
First, I need to look up the latest timestamp.
I tried the Script activity. It does not apply to a Lakehouse source (regardless of it having a SQL endpoint!?).
So I employed a Lookup activity.
I tried to play with "Add dynamic content" and apply the max() function.
This is from the documentation:
Is this use case meant for people who can't figure out which of two hardcoded values is bigger and need to employ a function over them?
I can't figure out how to apply this to a column.
If anyone knows how to do that, please help!
Also, I'm very concerned about performance. Is it going to load the terabyte table into memory to calculate the max of the column?
Next, Spark.
Spark can't read from gatewayed sources.
Finally, I tried a Dataflow Gen2.
The first query selects the max state.
The second uses it as a filter.
So far, no luck.
--2nd step--
Please help me figure out how to use the result of the previous query in the filter of the next one!
Hi @silly_bird,
To get the MAX value of a column, you could use the Lookup activity in combination with a SQL Server connection. As I can see in your screenshots, you have a Lakehouse connection defined. Instead of defining a Lakehouse connection, you should set up a SQL Server connection and use the SQL endpoint of the Lakehouse in the settings. With a SQL Server connection, you are able to query the table in the Lookup activity, and in that query you can take the MAX value directly in your SELECT statement. In the next steps of the data pipeline, you can use the queried max value.
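To make the two-step pattern concrete, here is a plain-Python sketch. The table and column names (`data.stockmaster`, `lastmodified`) are assumptions borrowed from the expression later in this thread, and the Lookup result is simulated with a dict, since the pipeline exposes it as `output.firstRow`:

```python
# Step 1: the query the Lookup activity runs against the Lakehouse SQL endpoint.
lookup_query = "SELECT MAX(lastmodified) AS lastmodified FROM data.stockmaster"

# Simulated Lookup activity output (the pipeline exposes it as output.firstRow).
lookup_output = {"firstRow": {"lastmodified": "2024-05-01T12:00:00Z"}}

# Step 2: use the queried max value as a watermark in the Copy activity's query.
watermark = lookup_output["firstRow"]["lastmodified"]
copy_query = f"SELECT * FROM data.stockmaster WHERE lastmodified > '{watermark}'"
```

Because the MAX is computed inside the SELECT on the SQL endpoint side, only a single row comes back to the pipeline; the table itself is never pulled into memory.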
Yes, I also found this, though by manually defining the Lakehouse as an Azure SQL Server connection.
Then my pipeline activities look like this:
Parametrising the query in the Copy activity was very difficult to figure out, but through searching Stack Overflow and the like I came up with the following formula:
@concat('SELECT * FROM data.stockmaster',
if(or(equals(activity('stockmaster timestamp').status, 'Failed'),
equals(activity('stockmaster timestamp').output.firstRow.lastmodified, null)),
'',
concat(' WHERE lastmodified > ', '''', string(activity('stockmaster timestamp').output.firstRow.lastmodified), '''')))
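For readers less familiar with the pipeline expression language, the same fallback logic can be rendered in plain Python (same assumed table and column names as in the expression above):

```python
def build_copy_query(lookup_status, last_modified):
    """Mirror of the pipeline expression: fall back to a full load when the
    Lookup activity failed or returned NULL, otherwise load only the delta."""
    base = "SELECT * FROM data.stockmaster"
    if lookup_status == "Failed" or last_modified is None:
        return base  # full load: no WHERE clause appended
    return base + f" WHERE lastmodified > '{last_modified}'"
```

The four consecutive single quotes (`''''`) in the pipeline expression are an escaped single quote, which is why the Python version wraps the timestamp in `'...'`.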