There are some questions to ask first:
- Does the external data have a timestamp PK or an incremental PK that would let you know which data items you have already processed?
- Do you need to process only new data, or also old data that has been modified?
If the external data has a timestamp PK or incremental PK, the state you need is simply the last timestamp or incremental key value processed. You can store it in a small table in the destination database.
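For example, a minimal sketch of such a state table and its helpers, using sqlite3 only as a stand-in for whatever destination database you actually use; the "etl_state" table and the "external_import" job name are hypothetical, not anything prescribed here:

```python
import sqlite3

def get_last_processed_key(conn):
    """Return the last timestamp/incremental key processed, or 0 if none yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS etl_state (job TEXT PRIMARY KEY, last_key INTEGER)"
    )
    row = conn.execute(
        "SELECT last_key FROM etl_state WHERE job = 'external_import'"
    ).fetchone()
    return row[0] if row else 0

def save_last_processed_key(conn, key):
    """Persist the new state after a batch has been processed."""
    conn.execute(
        "INSERT OR REPLACE INTO etl_state (job, last_key) VALUES ('external_import', ?)",
        (key,),
    )
    conn.commit()
```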
I would suggest the following approach:
- A cron job A reads the external database (the consumer should be responsible for pulling the data) and populates a temporary (staging) table.
- Another cron job B reads that data, which is already in your database, and processes it, generating output data in the final tables. Then it saves the "state" in a table (the last timestamp or incremental key of the external database that was processed).
- Another cron job C deletes from the temporary table any data that has already been processed (you can infer which rows to delete by comparing against the "state" value saved by job B). A minimal sketch of the three jobs follows this list.
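Here is one way jobs A, B and C could look. It reuses the get_last_processed_key / save_last_processed_key helpers from the sketch above, assumes an incremental integer key named id, and uses hypothetical table names (source_rows in the external database, staging_rows and final_rows in the destination); the "transformation" is just a placeholder, so adapt everything to your real schema and logic:

```python
def job_a_pull(external_conn, dest_conn):
    """Job A (cron): copy rows newer than the last processed key into staging."""
    dest_conn.execute(
        "CREATE TABLE IF NOT EXISTS staging_rows (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    last_key = get_last_processed_key(dest_conn)
    rows = external_conn.execute(
        "SELECT id, payload FROM source_rows WHERE id > ?", (last_key,)
    ).fetchall()
    dest_conn.executemany(
        "INSERT OR IGNORE INTO staging_rows (id, payload) VALUES (?, ?)", rows
    )
    dest_conn.commit()

def job_b_process(dest_conn):
    """Job B (cron): transform staged rows into the final table, then save the state."""
    dest_conn.execute(
        "CREATE TABLE IF NOT EXISTS final_rows (id INTEGER PRIMARY KEY, processed TEXT)"
    )
    last_key = get_last_processed_key(dest_conn)
    rows = dest_conn.execute(
        "SELECT id, payload FROM staging_rows WHERE id > ? ORDER BY id", (last_key,)
    ).fetchall()
    for row_id, payload in rows:
        # Placeholder transformation: replace with whatever your processing actually is.
        dest_conn.execute(
            "INSERT OR REPLACE INTO final_rows (id, processed) VALUES (?, ?)",
            (row_id, payload.upper()),
        )
    if rows:
        save_last_processed_key(dest_conn, rows[-1][0])
    dest_conn.commit()

def job_c_cleanup(dest_conn):
    """Job C (cron): purge staging rows that job B has already processed."""
    last_key = get_last_processed_key(dest_conn)
    dest_conn.execute("DELETE FROM staging_rows WHERE id <= ?", (last_key,))
    dest_conn.commit()
```

Each function would be called from a small wrapper script that opens the two connections, and each wrapper gets its own crontab entry.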
This approach works for new data only. If you need to process old data (data that was previously processed but has since changed at the origin), then things get complicated.
Note that this approach differs from your third one (data warehouse) in that the data is only stored temporarily.