I have an XML file containing a number of HTTP responses, including their HTTP headers. I want to write each response out to its own file with just the content, not the headers. I am struggling to remove the HTTP headers at the start of each response without disturbing the rest of the data.

#!/usr/bin/perl
use XML::Simple;
use MIME::Base64;
use URI::Escape;

#CheckArgs
....
my $input = $ARGV[0];

# Parse XML
my $xml = XML::Simple->new;
my $data = $xml->XMLin($input);

# Iterate through the file
for (my $i=0; $i < @{$data->{item}}; $i++){ 
    my $status = $data->{item}[$i]->{status};
    my $path = $data->{item}[$i]->{path};
    if ($status != 200) {
        print "Skipping $path due to status of $status\n";
        next;
    }
    print "$status $path\n";
    my $filename = uri_escape($path);
    # The Content is Base64 Encoded
    my $encoded = $data->{item}[$i]->{response}->{content};
    my $decoded = decode_base64($encoded);

    # Remove HTTP headers
    $decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//gm; 
    open(IMGFILE, '>', $filename) or die("Can't open $filename: $!");
    binmode IMGFILE;
    print IMGFILE $decoded;
    close IMGFILE;
}

$decoded looks like this before the search and replace:

HTTP/1.1 200 OK
Server: nginx
Date: Thu, 12 Nov 2025 20:79:99 GMT
Content-Type: application/pdf
Content-Length: 88151
Last-Modified: Mon, 14 Sep 2025 20:79:99 GMT
Connection: keep-alive
ETag: "123123-123546"
Expires: Thu, 19 Nov 2025 20:79:99 GMT
Cache-Control: max-age=123456
Accept-Ranges: bytes


%PDF-1.6
%âãÏÓ
54 0 obj
<< 
/Linearized 1 
/O 56 
/H [ 720 305 ] 
/L 45164 
/E 7644 
/N 10 
/T 43966 
>> 
endobj
[Lots more binary and text]

So I am trying to match from the start of the file to the first instance of two new lines with the following line:

$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m;
# s => Search Replace
# ^ => Start of file
# (.*?) => Non-greedy match anything including \r and \n
# ((\r\n)|\n|\r){2} => two new lines 
# // => Replace with empty string
# m multiline to allow . to match \r\n

After a fair amount of playing with the regex I am still failing to get the result I want. From the example above, I want my new file to start with the characters %PDF-1.6; those characters and everything after them should be unaltered. Please note the PDF file is just an example; there are a lot of other file types I want this to work with.

EDIT 1

$decoded =~ s/^(.*?)((\r\n)|\n|\r){2}//m; 
# matches \r\n due to or. So Try
$decoded =~ s/^(.*?)((\r\n)|([^\r]\n)|(\r[^\n])){2}//m;
  • ((\r\n)|\n|\r){2} is wrong since it can match a single newline \r\n, change it to (?:\n\n|\r\n?\r\n?) Commented Nov 12, 2015 at 23:17
  • How about s/^.*?\R{2,}//s Commented Nov 12, 2015 at 23:26
  • @Borodin if you put that in as an answer I would give you internet points! (It worked, thanks heaps!) Commented Nov 12, 2015 at 23:32

1 Answer

m multiline to allow . to match \r\n

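The /m modifier affects only the ^ and $ anchors, letting them match at the start and end of every line rather than only at the start and end of the string. You need /s, which allows . to match LF (\n) as well.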
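To make the difference between the two modifiers concrete, here is a small sketch. The sample string and the variable names are made up for illustration and are not taken from your script.

```perl
use strict;
use warnings;

# Made-up two-line sample.
my $text = "line one\nline two\n";

# No modifier: "." stops at the newline, so only the first line is captured.
my ($plain) = $text =~ /^(.*)/;

# /m alone: "." still stops at the newline; /m only changes ^ and $.
my ($with_m) = $text =~ /^(.*)/m;

# /s: "." also matches the newline, so the capture runs to the end of the string.
my ($with_s) = $text =~ /^(.*)/s;

# What /m does change: ^ and $ can now match around the second line.
my ($second) = $text =~ /^(line two)$/m;
```

Here $plain and $with_m both hold only the first line, while $with_s holds the whole string.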

((\r\n)|\n|\r){2} => two new lines

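There is an escape sequence that does this already: \R, available since Perl 5.10. It matches any generic line break and treats \r\n as a single, inseparable one.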
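As a rough illustration of how \R behaves with the different line-ending conventions, the following sketch uses made-up header strings (the %samples hash and its keys are hypothetical). It uses \R{2} to show that exactly two line breaks are required.

```perl
use strict;
use warnings;

# Hypothetical inputs: one header line, a blank line, then "body",
# written with each of the three line-ending conventions.
my %samples = (
    crlf => "A: 1\r\n\r\nbody",
    lf   => "A: 1\n\nbody",
    cr   => "A: 1\r\rbody",
);

my %rest;
for my $name (sort keys %samples) {
    # Copy the sample, then strip everything up to and including
    # the first pair of consecutive line breaks.
    ($rest{$name} = $samples{$name}) =~ s/^.*?\R{2}//s;
}

# A single CRLF must not be counted as two line breaks.
my $single  = "A: 1\r\nbody";
my $matched = ($single =~ /^.*?\R{2}/s) ? 1 : 0;
```

Because \R does not backtrack into the middle of a \r\n pair, the single-CRLF string is left alone, which is exactly where the ((\r\n)|\n|\r){2} alternation goes wrong.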

I suggest that something like

$decoded =~ s/^.*?\R{2,}//s;

will do what you want
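Putting it together, a minimal sketch of the substitution wrapped in a helper might look like this. The sub name strip_http_headers and the sample response are assumptions for illustration; only the substitution itself comes from the suggestion above.

```perl
use strict;
use warnings;

# Hypothetical helper: remove everything up to and including the first
# run of two or more line breaks (the header/body separator).
sub strip_http_headers {
    my ($response) = @_;
    # No /g and no /m: ^ anchors only at the start of the string, and /s
    # lets "." cross line breaks, so only the header block is removed.
    (my $body = $response) =~ s/^.*?\R{2,}//s;
    return $body;
}

# Made-up response: CRLF-terminated headers, then a body that contains
# raw bytes and its own blank line, which must survive untouched.
my $response = join "\r\n",
    "HTTP/1.1 200 OK",
    "Content-Type: application/pdf",
    "",
    "%PDF-1.6\n%\xE2\xE3\xCF\xD3\n\n54 0 obj\n";

my $body = strip_http_headers($response);
```

One design note: \R{2,} also swallows any extra line breaks that directly follow the separator. If a body could legitimately begin with line breaks, \R{2} would remove only the separator itself.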


2 Comments

Thank you very much, here are some internet points! But it looks like you already have quite a few :)
@DavidWaters: I'm pleased to have helped you. Ignore the numbers, I'm just a regular guy who happens to know some stuff about programming!
