
I'm looking for a much faster way to do this. I have a large (232 GB) mongo backup file, and I want to extract only the April 24th lines (or any other date of my choosing) into a new file. The grep command below works, but it takes about 1.5 hours. The pipeline starts with cat because I run other commands between the cat and the grep. Can anyone suggest a better way to accomplish this? The mongo file has one log entry per line, so grepping for the specific string works. The command runs on a machine with 56 cores and 500 GB of RAM, but old spinning disks. Sadly, I don't have access to the single-day backups once the monthly file has been built.

cat /mnt/backup/mongoexport/logs_2025-04.json | grep -E '.*"entrytimestamp":{"\$date":"2025-02-24' >> /tmp/logs_2025-04-24.json

File:

{"_id":{"$oid":"1"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-05T00:00:00.000Z"}}
{"_id":{"$oid":"2"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-07T00:00:00.000Z"}}
{"_id":{"$oid":"3"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-27T00:00:00.000Z"}}
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"6"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-06T00:00:00.000Z"}}
{"_id":{"$oid":"7"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-08T00:00:00.000Z"}}
{"_id":{"$oid":"8"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-29T00:00:00.000Z"}}
{"_id":{"$oid":"9"},"SEID":"2302","entrytimestamp":{"$date":"2025-05-07T00:00:00.000Z"}}
{"_id":{"$oid":"10"},"SEID":"2302","entrytimestamp":{"$date":"2025-05-07T00:00:00.000Z"}}

Expected output file:

{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
  • Quantify "takes a long time": 5 mins? 30 mins? 3 hrs? 4 days? Commented May 4 at 0:32
  • You've probably already considered these, but since the question doesn't mention them I'm going to throw them out there: would it make more sense to run a mongo extract for just the day in question? Would it be possible to run the grep on the host where the data file resides? Commented May 4 at 0:42
  • If you know the data is sorted by date, you could use something like awk (or perl, python, etc.) to halt processing once you're 'past' the date of interest. Obviously the biggest time savings would come with dates early in the month; for dates later in the month you could try tac file | awk '{scan_for_24th; exit_on_seeing_23rd}' | tac Commented May 4 at 1:05
  • Why do you use both cat and grep? Simply use grep '"entrytimestamp":{"\$date":"2025-02-24' /mnt/backup/mongoexport/logs_2025-04.json Commented May 4 at 5:25
  • If you need to run these 'specific date' searches on a regular basis, then it may make a lot of sense not only to perform backups on a daily basis (as opposed to the current monthly basis) but also to break the monthly backup files out into daily files. The last thing you want is to find yourself repeatedly scanning a given monthly backup file for different days (i.e., a single scan of a monthly backup file could be used to generate a full set of daily files). Commented May 4 at 19:46
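
The single-scan idea in the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name split_by_day, the output directory argument, and the logs_YYYY-MM-DD.json file naming are hypothetical, and it assumes every line carries the "entrytimestamp":{"$date":"YYYY-MM-DD... key exactly as in the sample data.

```shell
# Hypothetical helper: split a monthly export into one file per day in a single pass.
# Usage: split_by_day /mnt/backup/mongoexport/logs_2025-04.json /tmp/daily
split_by_day() {
    in=$1
    outdir=$2
    mkdir -p "$outdir"
    # LC_ALL=C keeps awk in byte mode, which usually avoids multibyte overhead.
    LC_ALL=C awk -v key='"entrytimestamp":{"$date":"' -v dir="$outdir" '
        {
            p = index($0, key)                      # plain substring search, no regexp
            if (p == 0) next                        # skip lines without the key
            day = substr($0, p + length(key), 10)   # the YYYY-MM-DD part
            out = dir "/logs_" day ".json"
            # Keep only one output file open at a time; ">>" appends on reopen,
            # so unsorted input still works (rerunning will append duplicates).
            if (out != prev) { if (prev != "") close(prev); prev = out }
            print >> out
        }' "$in"
}
```

After one pass, any later request for a specific day is just a read of the matching daily file instead of another scan of the monthly file.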

1 Answer

Try these:

$ grep -F '"entrytimestamp":{"$date":"2025-02-24' file
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}

$ awk -v s='"entrytimestamp":{"$date":"2025-02-24' 'index($0,s){print; f=1; next} f{exit}' file
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}

The first speeds up grep by doing string matching instead of regexp matching. The second may be faster still, depending on where in the input the matching lines occur, since it exits as soon as it hits the first line after the matching lines (this relies on the matching lines being contiguous, as they are in date-sorted input). It may not be faster, though, since awk generally does more work per input line than grep.
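
For an arbitrary date, the fixed-string grep above could be wrapped along these lines. This is a minimal sketch, not a measured result: the function name extract_day and its argument order are hypothetical, and the LC_ALL=C setting is an assumption that often helps grep in UTF-8 locales but will not help if the job is limited by the spinning disks.

```shell
# Hypothetical helper: copy all lines for one day into a new file.
# Usage: extract_day 2025-02-24 /mnt/backup/mongoexport/logs_2025-04.json /tmp/day.json
extract_day() {
    day=$1
    in=$2
    out=$3
    # -F: fixed-string match, so "$" and "{" need no regexp escaping.
    # grep reads the file directly, with no cat in front.
    # Note: grep exits with status 1 when no line matches.
    LC_ALL=C grep -F "\"entrytimestamp\":{\"\$date\":\"$day" "$in" > "$out"
}
```

Unlike the >> in the question, this uses > so a rerun overwrites the output file instead of appending to it.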
