  1. CDC -- Pipeline 1's job is to load data (from a list of tables) based on timestamp columns (`creation_date`, `updation_date`) from a replica DB (RDS) to S3 (landing zone).

  2. If I create an RDS connection in Glue, would I still need to use JDBC, since I need to pass tables and filter conditions? Or do I have to load the data first and filter afterwards?

    connection_options = {"connectionName": "myconnection", "database": "dbname"}

    for table in tables_list:
        connection_options["dbtable"] = table
        # how to pass a query if we can only filter here?

        datasource = glueContext.create_dynamic_frame.from_options(
            connection_type="postgresql",
            connection_options=connection_options,
        )
        output_path = ""
        glueContext.write_dynamic_frame.from_options(
            frame=datasource,
            connection_type="s3",
            connection_options={"path": output_path},
            format="csv",
        )
    

Using JDBC I am able to pass a query and get the data, but the connection is configured inside a VPC, so without a VPN it cannot connect.

Are there any solutions to filter at the source, or is it acceptable to load the data first and filter afterwards?
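One possible approach (a sketch, not a confirmed answer): the Glue JDBC reader documents a `sampleQuery` connection option that can push a custom SQL statement down to the database while still going through the catalog connection (and therefore its VPC configuration); whether it applies depends on your Glue version and connector. If pushdown is not available, the fallback is to load the table and filter in Spark before writing. The helper `build_incremental_query`, the `last_run_ts` argument, and the column names are assumptions taken from the question:

```python
def build_incremental_query(table, last_run_ts):
    """Build a pushdown query selecting only rows created or updated
    since the last run (hypothetical helper; column names assumed)."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE creation_date >= '{last_run_ts}' "
        f"OR updation_date >= '{last_run_ts}'"
    )


def load_incremental(glueContext, table, last_run_ts):
    """Sketch: read one table through the VPC-attached catalog connection,
    pushing the timestamp filter down via sampleQuery (support may vary
    by Glue version / connector -- verify against the Glue docs)."""
    datasource = glueContext.create_dynamic_frame.from_options(
        connection_type="postgresql",
        connection_options={
            "connectionName": "myconnection",  # reuses the VPC connection
            "database": "dbname",
            "dbtable": table,
            "sampleQuery": build_incremental_query(table, last_run_ts),
        },
    )
    # Fallback if sampleQuery is not honoured: filter after loading.
    # df = datasource.toDF()
    # df = df.filter((df["creation_date"] >= last_run_ts)
    #                | (df["updation_date"] >= last_run_ts))
    return datasource
```

Filtering after loading is functionally fine for moderate table sizes, but it pulls every row over the network on each run, which is what the pushdown query avoids.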

  • Are the filter conditions based only on the timestamp columns, or is there further filtering after that? You can create a Spark DataFrame, run a query to filter the data, and lastly convert the result back to a DynamicFrame. Commented Nov 12, 2024 at 19:04
  • In pipeline 1 there are no transformations, as we are loading raw data based on the timestamp columns (daily data). So if we don't filter at the DB level (using a query), we need to load all the data and filter afterwards -- what should we do? Commented Nov 13, 2024 at 2:56
  • Have you considered using Glue job bookmarks? Here are the docs: docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html . In summary, you specify a key (or a combination of keys) to track loaded data; when you pull from the DB, it will only pull new data (based on that key). For example, if a table has 1 million rows today, the first run pulls the 1M rows, and tomorrow it pulls only the new rows, say 10K. Let me know if this would be useful, and I can give you a more concrete example as a response. Commented Nov 13, 2024 at 9:21
  • We need to load historical data as well, based on the timestamp column; for future data we can work with bookmarks. Right now the only option I see is to load everything and then filter. I would have used the JDBC option, but the connection is secured and it's not allowing me to connect (I need to work through the connection name only). Commented Nov 13, 2024 at 14:46
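Building on the bookmark suggestion in the comments, a hedged sketch of what the incremental runs could look like. This assumes the tables have been crawled into the Glue Data Catalog (bookmark keys via `additional_options` are documented for `from_catalog` reads), that the job runs with `--job-bookmark-option job-bookmark-enable`, and that `updation_date` only ever increases for changed rows; the database and table names are placeholders:

```python
def load_with_bookmark(glueContext, catalog_db, table):
    """Sketch: incremental read driven by Glue job bookmarks.
    On the first run this pulls the full table (covering the historical
    load); later runs pull only rows beyond the bookmarked key value."""
    return glueContext.create_dynamic_frame.from_catalog(
        database=catalog_db,   # Glue Data Catalog database (assumed name)
        table_name=table,
        additional_options={
            "jobBookmarkKeys": ["updation_date"],
            "jobBookmarkKeysSortOrder": "asc",
        },
        # Bookmark state is stored per transformation_ctx, so keep it
        # stable and unique per table across job runs.
        transformation_ctx=f"read_{table}",
    )
```

Note the caveat from the comment above: a bookmark on `updation_date` handles new and updated rows going forward, but the very first bookmarked run effectively is the historical load, so a separate backfill may not be needed.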

