0

I am creating a project and need to use a regex (or something else, if that is preferable).

Basically, I need to convert a page of PHP-like markup so that the "non-code" is converted into "code". For instance:

Original:

<?code
  echo '<html>';
?>
<head>
</head>
<body>
</body>
<?code
  echo '</html>';
?>

Converted:

<?code
  echo '<html>';
  echo '
<head>
</head>
<body>
</body>';
  echo '</html>';
?>

How could this work while also taking quotes into account (like <?code $var='<?code stuff ?>';?>)?

Also, it would help if someone could provide a way to detect included files, so that each include can be replaced with something that first preprocesses the file and then includes it. (The includes are similar to PHP's.)

Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really quite dumb about the markup and everything.

Also, the preprocessor will actually be implemented in Ruby, so if Ruby has something that would aid in this, have at it.

I know the code looks very similar to PHP, and that is deliberate, but it will not be implemented in PHP and the "code" used won't actually be PHP. It will just use a <? type mechanism for containing code in markup.

Edit: also note that the language inside the markup can, for all practical purposes, be Ruby. So it can contain quotes and comments that include the closing code tag.

8
  • No, regex is not able to make such a replacement. Commented Feb 14, 2010 at 19:15
  • How would you go about writing a fairly fast parser to do it, then? Surely regex can help? Commented Feb 14, 2010 at 19:16
  • Echoing markup looks suspicious to me. In the end, that's what <?php and ?> are for. Are you sure you need this? Did you think about output buffering? Commented Feb 14, 2010 at 19:23
  • This is not actually related to PHP, but that is the easiest way I could explain it. There will not actually be any PHP being transformed; it is for writing something very similar to how PHP does its markup. Commented Feb 14, 2010 at 19:25
  • Okay, but you are trying to convert PHP (with HTML embedded) source files, right? Only not using PHP but Ruby, correct? Commented Feb 14, 2010 at 19:35

2 Answers

3

You can use token_get_all to get a stream of parser tokens. Loop through them and echo them out; when you come upon a T_INLINE_HTML token, rewrite it as an echo statement instead.

Edit: I just saw that you're using Ruby. Obviously, you can't use PHP's tokeniser from within Ruby. Maybe you can call php from the command line?

Edit 2:

"Is this even possible with regex? I know you're not supposed to try to parse HTML with regex, but this isn't trying to parse it; it's really quite dumb about the markup and everything."

It's parsing, all right. You can use regular expressions to split your input into tokens (tokenization). Since most languages are contextual, you then have to feed the tokens to a state machine, which can parse the code into an internal representation (an AST). This can then be transformed into your target output. It sounds elaborate and scary, but it's really quite simple once you have tried it a couple of times. I suggest that you work through it with the help of Wikipedia and Google.
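
A minimal sketch of that idea in Ruby, using the standard library's StringScanner as the tokenizer and two states (markup and code). The names echo_literal, scan_code and preprocess are made up for illustration, and the sketch assumes the embedded language has single- and double-quoted strings with backslash escapes, # line comments and /* */ block comments; adjust the patterns to whatever your language really uses.

```ruby
require 'strscan'

# Hypothetical helper: turn raw markup into an echo statement, escaping
# backslashes and single quotes (assumes single-quoted string literals
# with backslash escapes in the target language).
def echo_literal(text)
  "echo '" + text.gsub(/[\\']/) { |c| "\\#{c}" } + "';"
end

# "Code" state: consume up to the real closing tag, skipping over string
# literals and comments so that a ?> inside them does not end the block.
def scan_code(scanner)
  code = +''
  until scanner.eos?
    return code if scanner.scan(/\?>/)

    token = scanner.scan(/'(?:\\.|[^'\\])*'/m) ||  # single-quoted string
            scanner.scan(/"(?:\\.|[^"\\])*"/m) ||  # double-quoted string
            scanner.scan(%r{/\*.*?\*/}m)       ||  # block comment (assumed syntax)
            scanner.scan(/#[^\n]*/)            ||  # line comment (assumed syntax)
            scanner.scan(/[^'"#\/?]+/)         ||  # run of ordinary characters
            scanner.getch                          # lone ', ", / or ?
    code << token
  end
  raise ArgumentError, 'unterminated <?code block'
end

# "Markup" state: everything up to the next opening tag becomes an echo
# statement; then control passes to the code state.
def preprocess(source)
  scanner = StringScanner.new(source)
  statements = []
  until scanner.eos?
    chunk = scanner.scan_until(/<\?code/)
    if chunk.nil?
      # No further opening tag: the rest of the input is markup.
      markup = scanner.rest
      scanner.terminate
    else
      # scan_until includes the tag itself, so cut it off again.
      markup = chunk[0, chunk.length - scanner.matched_size]
    end
    statements << echo_literal(markup) unless markup.empty?
    statements << scan_code(scanner).strip unless chunk.nil?
  end
  "<?code\n" + statements.join("\n") + "\n?>"
end
```

Calling preprocess on a page returns a single <?code ... ?> block in which each run of markup has become one echo statement. Regex literals, heredocs and %q-style strings are not handled here; each would need its own branch in scan_code.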


3 Comments

Nah, that's not what I'm going for (and the actual code in the markup won't be PHP). Sorry, I changed my question to better reflect my intentions.
Well, not what I was wanting, but I guess it's the answer :( (I'll leave the question open a bit longer just in case, though.)
Keep in mind that you don't need to write a parser that recognises the entire language. It's enough to tokenise into the parts that have context relevant to what you're looking to manipulate. E.g., split by comment delimiters, string-literal delimiters, backslashes and the actual markers that you are searching for. That makes for a fairly simple state machine.
0

More a couple of ideas than an answer:

I would suggest you try to find a regex that matches the blocks of PHP and then wrap everything else in echo statements, instead of the other way round.

Another option may be to look at the PHP tokenizer, but I'm afraid I'm not sure how it deals with sections of HTML outside of the tags.
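
A rough sketch of the first idea in Ruby, assuming the <?code ... ?> tags from the question; naive_preprocess is a made-up name. It relies on String#split keeping captured delimiters in its result, so the pieces alternate between markup and code blocks. Because the pattern stops at the first ?> it sees, a closing tag inside a string or comment will split the block in the wrong place, so treat this only as a starting point.

```ruby
# Matches one code block, non-greedily, across newlines.
CODE_BLOCK = /(<\?code.*?\?>)/m

def naive_preprocess(source)
  # With one capture group, split returns markup at even indices and the
  # captured code blocks at odd indices.
  statements = source.split(CODE_BLOCK).each_with_index.map do |piece, i|
    if i.odd?
      piece[6...-2].strip                 # drop "<?code" and "?>"
    elsif !piece.empty?
      # Escape backslashes and single quotes for a single-quoted literal.
      "echo '" + piece.gsub(/[\\']/) { |c| "\\#{c}" } + "';"
    end
  end
  "<?code\n" + statements.compact.join("\n") + "\n?>"
end
```

For input such as <?code echo 1; ?><b>hi</b> this yields one block containing echo 1; followed by echo '<b>hi</b>';.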

4 Comments

How about capturing this PHP block: <?php echo 'no closing tag: ?>'; /* also no closing tag ?> */ ?>
Hmm, good point. I guess it'll just have to be a hybrid parser: replacing all the markup appropriately and parsing everything inside <?php to catch tricks like this.
Fair point, perhaps the tokenizer might be worth looking into then.
Indeed, troelskn's answer is the way to go in my opinion.
