Suggestions for making a file from a bigger file with grep or?

Question

I'm looking for a much faster way to do this. I have a large (232GB) mongo backup file. I want to extract only the April 24th lines (or the lines for any date of my choosing) into a new file. The grep statement below works with "cat", but it takes a long time: about 1.5 hours. I pipe from cat so that I can run other commands after the cat and before the grep. Can anyone suggest a better way to accomplish this? The mongo file has one log entry per line, so grepping for the specific string works. The command runs on a machine with 56 cores and 500 GB of RAM, but old spinning disks. Sadly, once the monthly file is built I no longer have access to the single-day backups.

cat /mnt/backup/mongoexport/logs_2025-04.json | grep -E '.*"entrytimestamp":{"\$date":"2025-02-24' >> /tmp/logs_2025-04-24.json

File:

{"_id":{"$oid":"1"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-05T00:00:00.000Z"}}
{"_id":{"$oid":"2"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-07T00:00:00.000Z"}}
{"_id":{"$oid":"3"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-27T00:00:00.000Z"}}
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"6"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-06T00:00:00.000Z"}}
{"_id":{"$oid":"7"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-08T00:00:00.000Z"}}
{"_id":{"$oid":"8"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-29T00:00:00.000Z"}}
{"_id":{"$oid":"9"},"SEID":"2302","entrytimestamp":{"$date":"2025-05-07T00:00:00.000Z"}}
{"_id":{"$oid":"10"},"SEID":"2302","entrytimestamp":{"$date":"2025-05-07T00:00:00.000Z"}}

Expected output file:

{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}

Please quantify "takes a long time": 5 mins? 30 mins? 3 hrs? 4 days?
– markp-fuso, Commented May 4 at 0:32
You've probably already considered these, but since there's no mention in the question I'm going to throw them out there: would it make more sense to run a mongo extract for just the day in question? Would it be possible to run the grep on the host where the data file resides?
– markp-fuso, Commented May 4 at 0:42
If you know the data is sorted by date, you could use something like awk (or perl, python, etc.) to halt processing once you're 'past' the date of interest. Obviously the biggest time savings would come with dates early in the month; for dates later in the month you could try tac file | awk '{scan_for_24th; exit_on_seeing_23rd}' | tac
– markp-fuso, Commented May 4 at 1:05
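The tac idea in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: the file is sorted by date, GNU coreutils `tac` is available, and the temporary paths and the `needle` variable are made up for the demo. The sample rows are taken from the question.

```shell
#!/usr/bin/env bash
# No "pipefail" here: awk exits early on purpose, so the upstream tac may be
# terminated by SIGPIPE, and pipefail would report that as a failure.
set -eu

workdir=$(mktemp -d)                      # hypothetical scratch location
infile="$workdir/logs_sample.json"
outfile="$workdir/logs_2025-05-07.json"

# Sample rows copied from the question, already sorted by date.
cat > "$infile" <<'EOF'
{"_id":{"$oid":"7"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-08T00:00:00.000Z"}}
{"_id":{"$oid":"8"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-29T00:00:00.000Z"}}
{"_id":{"$oid":"9"},"SEID":"2302","entrytimestamp":{"$date":"2025-05-07T00:00:00.000Z"}}
{"_id":{"$oid":"10"},"SEID":"2302","entrytimestamp":{"$date":"2025-05-07T00:00:00.000Z"}}
EOF

needle='"entrytimestamp":{"$date":"2025-05-07'

# Read the file backwards, print the matching block, stop at the first
# non-matching line after it, then reverse again to restore the original order.
tac "$infile" \
  | awk -v s="$needle" 'index($0, s) {print; found = 1; next} found {exit}' \
  | tac > "$outfile"

cat "$outfile"
```

Whether this beats a plain forward scan depends on how close the wanted date is to the end of the file, so it is worth timing both on the real data.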
Why do you use both cat and grep? Simply use grep '"entrytimestamp":{"\$date":"2025-02-24' /mnt/backup/mongoexport/logs_2025-04.json
– Wernfried Domscheit, Commented May 4 at 5:25
If you need to run these 'specific date' searches on a regular basis, then it may make a lot of sense not only to perform backups on a daily basis (as opposed to the current monthly basis) but also to look at breaking the monthly backup files out into daily files. The last thing you want is to find yourself repeatedly scanning a given monthly backup file for different days (i.e., a single scan of a monthly backup file could be used to generate a full set of daily files).
– markp-fuso, Commented May 4 at 19:46
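The single-scan split suggested in the last comment might look something like the sketch below. It assumes every line carries the `"entrytimestamp":{"$date":"YYYY-MM-DD...` field in the form shown in the question. The `logs_<day>.json` output naming and the temporary directory are hypothetical choices for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)                      # hypothetical scratch location
infile="$workdir/logs_sample.json"

# Sample rows copied from the question.
cat > "$infile" <<'EOF'
{"_id":{"$oid":"3"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-27T00:00:00.000Z"}}
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"6"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-06T00:00:00.000Z"}}
EOF

# One pass over the input: take the 10 characters after the key (YYYY-MM-DD)
# and append the line to the output file for that day.
awk -v key='"entrytimestamp":{"$date":"' -v outdir="$workdir" '
{
    pos = index($0, key)
    if (pos == 0) next                    # skip lines without the field
    day = substr($0, pos + length(key), 10)
    print > (outdir "/logs_" day ".json")
}' "$infile"

ls "$workdir"
```

A real monthly file would produce at most 31 output files, which should stay within the number of files awk can keep open at once on typical systems.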

Ed Morton · Accepted Answer · 2025-05-14 22:43:00Z

Try these:

$ grep -F '"entrytimestamp":{"$date":"2025-02-24' file
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}

$ awk -v s='"entrytimestamp":{"$date":"2025-02-24' 'index($0,s){print; f=1; next} f{exit}' file
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}

The first speeds up grep by doing string matching instead of regexp matching. The second may be faster still, depending on where in the input the matching lines occur, since it exits as soon as it hits the first line after the matching block. This relies on the matching lines being contiguous, as they are when the file is sorted by date. It may not be faster, though, since awk generally does more work per input line than grep.
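To pick "any date of my choosing", as the question asks, the fixed-string grep can be wrapped in a small function. This is a sketch only: `extract_day` is a hypothetical helper name, the paths are temporary demo paths, and the `LC_ALL=C` setting is an addition not in the answer above (it is commonly used to make grep compare bytes rather than locale-aware characters, which may or may not help on this data).

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print every line whose entrytimestamp starts with the
# given day (YYYY-MM-DD). Reads the file directly, with no cat in front.
extract_day() {
    local day=$1 infile=$2
    # -F: fixed-string match instead of a regexp.
    # LC_ALL=C: byte-wise comparison (an assumption, see the note above).
    # Note: grep exits with status 1 when nothing matches, which aborts the
    # script under "set -e" unless the caller handles it.
    LC_ALL=C grep -F "\"entrytimestamp\":{\"\$date\":\"${day}" "$infile"
}

workdir=$(mktemp -d)                      # hypothetical scratch location
infile="$workdir/logs_sample.json"
outfile="$workdir/logs_2025-02-24.json"

# Sample rows copied from the question.
cat > "$infile" <<'EOF'
{"_id":{"$oid":"3"},"SEID":"bf2abd4c","entrytimestamp":{"$date":"2025-01-27T00:00:00.000Z"}}
{"_id":{"$oid":"4"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"5"},"SEID":"613200b325f2","entrytimestamp":{"$date":"2025-02-24T00:00:00.000Z"}}
{"_id":{"$oid":"6"},"SEID":"83ba","entrytimestamp":{"$date":"2025-03-06T00:00:00.000Z"}}
EOF

extract_day 2025-02-24 "$infile" > "$outfile"
cat "$outfile"
```

On the real file, prefixing the call with `time` gives a direct comparison against the original cat-and-grep pipeline.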

edited May 14 at 22:43

answered May 13 at 13:37

