Regex for matching markup in PHPish markup?

Question

I am creating a project and need a regex (or something else, if that is preferable) for the following.

Basically, I need to convert a PHPish markup code page so that the "non-code" is converted into "code." For instance:

Original:

<?code
  echo '<html>';
?>
<head>
</head>
<body>
</body>
<?code
  echo '</html>';
?>

Converted:

<?code
  echo '<html>';
  echo '
<head>
</head>
<body>
</body>';
  echo '</html>';
?>

How could this work while also taking quotes into account (like <?code $var='<?code stuff ?>';?>)?

Also, it would help if someone could provide something to detect included files, so that each include can be replaced with something that first preprocesses the file and then includes it (the includes are similar to PHP's).

Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really being quite dumb about what the markup is.

Also, this project (the preprocessor, that is) will actually be implemented in Ruby, so if Ruby has something that would aid in this, have at it.

I know the code looks very similar to PHP, and that's because it is. However, it will not be implemented in PHP and the "code" used won't actually be PHP; it will only use a <? type mechanism for containing code in markup.

Edit: also note that the language inside the markup can for all practical purposes be Ruby. So it can contain quotes and comments that have the closing code tag.

How would you go about writing a fairly fast parser to do it then? Surely regex can help?
– Earlz, Commented Feb 14, 2010 at 19:16
Echoing markup looks suspicious to me. In the end, that's what <?php and ?> are for. Are you sure you need this? Did you think about output buffering?
– ax., Commented Feb 14, 2010 at 19:23
This is not actually related to PHP, but it is the easiest way I could explain it. There will not actually be any PHP being transformed; it is for writing something very similar to how PHP does its markup.
– Earlz, Commented Feb 14, 2010 at 19:25
Okay, but you are trying to convert PHP (with HTML embedded) source files, right? Only not using PHP but Ruby, correct?
– Bart Kiers, Commented Feb 14, 2010 at 19:35

troelskn · Accepted Answer · 2010-02-14 19:45:18Z

3

You can use token_get_all to get a stream of parser tokens. Loop through them and echo them out; when you come upon a T_INLINE_HTML token, rewrite it to an echo statement instead.

Edit - Just saw you say you're using Ruby. Obviously, you can't use PHP's tokeniser from within Ruby. Maybe you can call php over the command line?

Edit 2:

Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really being quite dumb about what the markup is.

It's parsing, all right. You can use regexp to split your input into tokens (aka tokenization). Since most languages are contextual (the same characters mean different things inside a string literal or a comment), you will then have to feed the tokens to a state machine, which can parse the code into an internal representation (an AST). This can then be transformed into your target output. It sounds elaborate and scary, but it's really quite simple once you have tried it a couple of times. I suggest that you work through it with the help of Wikipedia and Google.
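As a rough illustration of the tokenization step only, here is a minimal Ruby sketch (Ruby being the language the question targets). The `RULES` table and the `tokenize` helper are hypothetical names made up for this example; the `<?code ... ?>` tag spelling is taken from the question, and `#` line comments are an assumption about the embedded language.

```ruby
require 'strscan'

# Hypothetical token rules, tried in order at each position. The <?code ... ?>
# tag spelling comes from the question; the rule names are made up here.
RULES = [
  [:open,    /<\?code/],
  [:close,   /\?>/],
  [:escape,  /\\./m],             # a backslash plus the character it escapes
  [:quote,   /['"]/],
  [:comment, /\#/],               # start of a Ruby-style line comment (assumed)
  [:newline, /\n/],
  [:text,    /[^<?\\'"\#\n]+|./m] # anything else; the lone "." guarantees progress
].freeze

# Returns an array of [type, text] pairs that covers the input exactly.
def tokenize(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    RULES.each do |type, pattern|
      text = scanner.scan(pattern)
      next unless text

      tokens << [type, text]
      break
    end
  end
  tokens
end

# Print each token with its type for a small input.
tokens = tokenize("<?code echo 'a ?>'; ?>")
tokens.each { |type, text| puts format('%-8s %p', type, text) }
```

Note that the ?> inside the string literal is still reported as a :close token here. Deciding that it does not end the code block is the job of the state machine that consumes the tokens, not of the regexes.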

edited Feb 14, 2010 at 19:45

answered Feb 14, 2010 at 19:30

troelskn


3 Comments

Earlz Over a year ago

Nah, that's not what I'm going for (and the actual code in the markup won't be PHP). Sorry, changed my question to better reflect my intentions.

Earlz Over a year ago

Well, not what I was wanting.. but guess it's the answer :( (leave the question open a bit longer just in case though)

troelskn Over a year ago

Keep in mind that you don't need to write a parser that recognises the entire language. It's enough to tokenise into the parts that have context relevant to what you're looking to manipulate. E.g., split by comment delimiters, string-literal delimiters, backslashes and the actual markers that you are searching for. That makes for a fairly simple state machine.
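The state machine described in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: `DELIMS` and `markup_to_code` are hypothetical names, the embedded language is assumed to have '...' and "..." strings with backslash escapes and `#` line comments, and constructs such as heredocs, %q() literals or regex literals are not handled. Whitespace-only markup between blocks is dropped, which is a design choice of this sketch, not something the question specifies.

```ruby
# Split on the few delimiters that matter; the capture group keeps them
# in the result. Tag spelling follows the question's <?code ... ?> example.
DELIMS = /(<\?code|\?>|\\.|['"]|\#|\n)/m

# Hypothetical helper: turns markup outside <?code ... ?> into echo
# statements and merges everything into a single code block.
def markup_to_code(source)
  state = :markup   # one of :markup, :code, :string, :comment
  quote = nil       # which quote character opened the current string
  markup = +''      # markup collected since the last code block
  body = +''        # statements for the single output code block

  # Emit pending markup as one echo statement (skipped if only whitespace).
  flush = lambda do
    unless markup.strip.empty?
      escaped = markup.gsub(/[\\']/) { |c| "\\#{c}" }
      body << "  echo '#{escaped}';\n"
    end
    markup.clear
  end

  source.split(DELIMS).each do |tok|
    next if tok.empty?

    case state
    when :markup
      if tok == '<?code'
        flush.call
        state = :code
      else
        markup << tok
      end
    when :code
      if tok == '?>'
        state = :markup
        next
      end
      body << tok
      if ["'", '"'].include?(tok)
        quote = tok
        state = :string
      elsif tok == '#'
        state = :comment
      end
    when :string
      body << tok
      state = :code if tok == quote   # an escaped quote is a 2-char token
    when :comment
      body << tok
      state = :code if tok == "\n"    # line comments end at the newline
    end
  end
  flush.call
  "<?code\n#{body}?>\n"
end

example = <<~'SRC'
  <?code
    echo '<html>';
  ?>
  <head>
  </head>
  <?code
    echo '</html>';
  ?>
SRC
puts markup_to_code(example)
```

Because a closing tag is only honoured in the :code state, a ?> inside a string or after a # passes through untouched, which is the case the question's edit worries about.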

Jake Worrell · Answer · 2010-02-14 19:22:16Z

0

More a couple of ideas than an answer:

I would suggest you try to find some regex that can find the blocks of PHP and then wrap everything else in your echo statements, instead of the other way round.

Another option may be to look at the PHP tokenizer, but I'm afraid I'm not sure how it deals with sections of HTML outside of the tags.

answered Feb 14, 2010 at 19:22

Jake Worrell


4 Comments

Bart Kiers Over a year ago

How about capturing this PHP block: <?php echo 'no closing tag: ?>'; /* also no closing tag ?> */ ?>
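To make the problem concrete, here is a small Ruby check of what a naive non-greedy pattern does with that block. `NAIVE_BLOCK` is a hypothetical pattern written for this illustration; it is not taken from the answer.

```ruby
# A naive "find the code block" pattern: opening tag, anything, first ?>.
NAIVE_BLOCK = /<\?php.*?\?>/m

source = %q(<?php echo 'no closing tag: ?>'; /* also no closing tag ?> */ ?>)

# String#[] with a regex returns the first match (or nil).
first = source[NAIVE_BLOCK]

puts first
puts "matched the whole block? #{first == source}"
```

The match stops at the ?> inside the string literal, so the rest of the statement would be treated as markup. Making the pattern greedy does not help in general either, since it would then swallow any markup lying between two separate code blocks.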

Earlz Over a year ago

Hmm.. good point.. I guess it'll just have to be a hybrid parser.. Replacing all the markup appropriately and parsing everything in <?php to catch tricks like this.

Jake Worrell Over a year ago

Fair point, perhaps the tokenizer might be worth looking into then.

Bart Kiers Over a year ago

Indeed, troelskn's answer is the way to go in my opinion.
