  1. CDC -- Pipeline 1's job is to load data (from a list of tables) based on timestamp columns (`creation_date`, `updation_date`) from a replica DB (RDS) to S3 (landing zone).

  2. If I create an RDS connection in Glue, would I still need to use JDBC, since I need to pass tables and filter conditions? Or do I have to load the data first and filter afterwards?

    connection_options = {"connectionName": "myconnection", "database": "dbname"}

    for table in tables_list:
        connection_options["dbtable"] = table
        # how to pass a query if we can only filter here?

        datasource = glueContext.create_dynamic_frame.from_options(
            connection_type="postgresql",
            connection_options=connection_options,
        )
        output_path = ""
        glueContext.write_dynamic_frame.from_options(
            frame=datasource,
            connection_type="s3",
            connection_options={"path": output_path},
            format="csv",
        )
    

Using JDBC I am able to pass a query and get the data, but the connection is configured inside a VPC, so without a VPN it cannot connect.

Are there any solutions to filter at the source, or is it acceptable to load the data first and filter afterwards?
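One possible approach (a sketch, not a confirmed answer): the Glue JDBC reader documents a `sampleQuery` connection option that can push a custom SQL statement down to the database while still going through the catalog connection (and therefore its VPC configuration); whether it applies depends on your Glue version and connector. If pushdown is not available, the fallback is to load the table and filter in Spark before writing. The helper `build_incremental_query`, the `last_run_ts` argument, and the column names are assumptions taken from the question:

```python
def build_incremental_query(table, last_run_ts):
    """Build a pushdown query selecting only rows created or updated
    since the last run (hypothetical helper; column names assumed)."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE creation_date >= '{last_run_ts}' "
        f"OR updation_date >= '{last_run_ts}'"
    )


def load_incremental(glueContext, table, last_run_ts):
    """Sketch: read one table through the VPC-attached catalog connection,
    pushing the timestamp filter down via sampleQuery (support may vary
    by Glue version / connector -- verify against the Glue docs)."""
    datasource = glueContext.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={
            "connectionName": "myconnection",  # reuses the VPC connection
            "database": "dbname",
            "dbtable": table,
            "sampleQuery": build_incremental_query(table, last_run_ts),
        },
    )
    # Fallback if sampleQuery is not honoured: filter after loading.
    # df = datasource.toDF()
    # df = df.filter((df["creation_date"] >= last_run_ts)
    #                | (df["updation_date"] >= last_run_ts))
    return datasource
```

Filtering after loading is functionally fine for moderate table sizes, but it pulls every row over the network on each run, which is what the pushdown query avoids.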

  • Are the filter conditions based only on the timestamp columns, or is there further filtering after that? You can create a Spark DataFrame, run a query to filter the data, and lastly convert the result back to a DynamicFrame. Commented Nov 12, 2024 at 19:04
  • In pipeline 1 there are no transformations, as we are loading raw data based on the timestamp columns (daily data). So if we don't filter at the DB level (using a query), we need to load all the data and filter afterwards -- what should we do? Commented Nov 13, 2024 at 2:56
  • Have you considered using Glue job bookmarks? Here are the docs: docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html . In summary, you specify a key (or a combination of keys) to track loaded data; when you pull from the DB, it will only pull new data (based on that key). For example, if a table has 1 million rows today, the first run pulls the 1M rows, and tomorrow it pulls only the new rows, say 10K. Let me know if this would be useful, and I can give you a more concrete example as a response. Commented Nov 13, 2024 at 9:21
  • We need to load historical data as well, based on the timestamp column; for future data we can work with bookmarks. Right now the only option I see is to load everything and then filter. I would have used the JDBC option, but the connection is secured and it's not allowing me to connect (I need to work through the connection name only). Commented Nov 13, 2024 at 14:46
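Building on the bookmark suggestion in the comments, a hedged sketch of what the incremental runs could look like. This assumes the tables have been crawled into the Glue Data Catalog (bookmark keys via `additional_options` are documented for `from_catalog` reads), that the job runs with `--job-bookmark-option job-bookmark-enable`, and that `updation_date` only ever increases for changed rows; the database and table names are placeholders:

```python
def load_with_bookmark(glueContext, catalog_db, table):
    """Sketch: incremental read driven by Glue job bookmarks.
    On the first run this pulls the full table (covering the historical
    load); later runs pull only rows beyond the bookmarked key value."""
    return glueContext.create_dynamic_frame.from_catalog(
        database=catalog_db,   # Glue Data Catalog database (assumed name)
        table_name=table,
        additional_options={
            "jobBookmarkKeys": ["updation_date"],
            "jobBookmarkKeysSortOrder": "asc",
        },
        # Bookmark state is stored per transformation_ctx, so keep it
        # stable and unique per table across job runs.
        transformation_ctx=f"read_{table}",
    )
```

Note the caveat from the comment above: a bookmark on `updation_date` handles new and updated rows going forward, but the very first bookmarked run effectively is the historical load, so a separate backfill may not be needed.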

