Enumerate regular expressions via UglifyJS

Question

I have some JavaScript code, and I need to find the start and end indexes of every regular expression literal in it.

How can such information be extracted from UglifyJS?

var uglify = require('uglify-js');
var code = "func(1/2, /hello/);";
var parsed = uglify.parse(code);

The structure I get back in the variable parsed is very complex, and all I need is an array of the form [{startIdx, endIdx}, {startIdx, endIdx}], with one entry per regular expression literal.

P.S. If somebody thinks that the same task can be accomplished in a way that's better than via UglifyJS, you are welcome to suggest!

UPDATE

I know that if I dig deeper into the parsed structure, then for every regular expression I can find an object like this:

AST_Token {
     raw: '/hello/',
     file: null,
     comments_before: [],
     nlb: false,
     endpos: 17,
     endcol: 17,
     endline: 1,
     pos: 10,
     col: 10,
     line: 1,
     value: /hello/,
     type: 'regexp'
}

I need to figure out how to pull all such objects from the parsed structure, so I can compile the list of position indexes.
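For illustration, the position fields in the dump above appear to be zero-based offsets into the source, with endpos exclusive; on a single-line input, col and endcol coincide with pos and endpos. A minimal sketch of that reading, using a plain object that mirrors the dump (not a real AST_Token) and a hypothetical helper named tokenText:

```javascript
// Plain object copying the position fields from the token dump above.
// It is a stand-in for illustration, not an actual UglifyJS AST_Token.
var code = "func(1/2, /hello/);";
var token = {pos: 10, endpos: 17, line: 1, col: 10, endcol: 17};

// Hypothetical helper: cut the token's text out of the source,
// assuming pos is inclusive and endpos is exclusive.
function tokenText(source, tok) {
    return source.slice(tok.pos, tok.endpos);
}

console.log(tokenText(code, token)); // prints "/hello/"
```

If that reading holds, pos and endpos are the fields to use for absolute indexes, while col and endcol give positions within a single line.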

Are you ever going to tell us what you are going to do with these extracted regexps or strings or their indices once you've extracted them?
– user663031, Commented Dec 30, 2015 at 7:57
@torazaburo it is part of another parser that's already finished, except for supporting regular expressions properly. I've managed to isolate about 99% of all cases, but the last 1% seems impossible without full-expression evaluation. I just need to know where regular expressions are located within any given line of code.
– vitaly-t, Commented Dec 30, 2015 at 8:00

vitaly-t · Accepted Answer · 2016-01-01 12:49:05Z

I was pointed to an extremely useful blog post by the UglifyJS author, which set me in the right direction. Based on that post, I was able to reduce my enumeration code to the following:

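// `uglify` is require('uglify-js') and `parsed` is the result of
// uglify.parse(code), as in the question.
function enumRegEx(parsed) {
    var result = [];
    // TreeWalker calls the visitor for every node in the tree.
    parsed.walk(new uglify.TreeWalker(function (obj) {
        if (obj instanceof uglify.AST_RegExp) {
            // col/endcol are column offsets within the line; the token
            // also carries pos/endpos if absolute offsets are needed.
            result.push({
                startIdx: obj.end.col,
                endIdx: obj.end.endcol
            });
        }
    }));
    return result;
}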

Not only is this shorter while producing the same result, it also runs almost instantly, within 10ms, which puts the previous result (430ms) to shame.

Now that is the result I was looking for! :)

UPDATE: In the end though, I found out that for this particular task esprima is a much better choice. It is much faster and has full ES6 support, unlike UglifyJS.

The very same task done via esprima, thanks to the excellent support from Ariya Hidayat:

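var esprima = require('esprima');

function parseRegEx(originalCode) {
    var result = [];
    // The third argument is a callback invoked for every token.
    esprima.tokenize(originalCode, {loc: true, range: true}, function (obj) {
        if (obj.type === 'RegularExpression') {
            // range holds the token's offsets into originalCode.
            result.push({
                startIdx: obj.range[0],
                endIdx: obj.range[1]
            });
        }
    });
    return result;
}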

As you can see, with esprima you do not even need to parse the code: you pass in the original source, which esprima only tokenizes, and that is much faster.
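One difference worth noting: range reports offsets into the whole source, whereas col and endcol in the UglifyJS version are positions within a line. If per-line positions are what you need, a small conversion can be sketched as below. The helper names offsetToLineCol and toLineRanges are hypothetical (not part of esprima), and the sketch assumes "\n" line endings and relies on a regular expression literal never spanning lines.

```javascript
// Hypothetical helper: convert an offset into the source into a
// 1-based line number and a 0-based column. Assumes "\n" line endings.
function offsetToLineCol(source, offset) {
    var lines = source.slice(0, offset).split('\n');
    return {line: lines.length, col: lines[lines.length - 1].length};
}

// Hypothetical helper: map [{startIdx, endIdx}] (absolute offsets)
// to per-line positions. A regex literal cannot contain a line break,
// so the end column is the start column plus the literal's length.
function toLineRanges(source, ranges) {
    return ranges.map(function (r) {
        var start = offsetToLineCol(source, r.startIdx);
        return {
            line: start.line,
            startCol: start.col,
            endCol: start.col + (r.endIdx - r.startIdx)
        };
    });
}

// Usage with a hand-written range for the literal on the second line.
var sample = "var a = 1;\nfunc(1/2, /hello/);";
console.log(toLineRanges(sample, [{startIdx: 21, endIdx: 28}]));
```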

edited Jan 1, 2016 at 12:49

answered Dec 30, 2015 at 16:15

vitaly-t


5 Comments

vitaly-t Over a year ago

@HenriqueBarcelos StackOverflow doesn't allow accepting own answers for 2 days after answering.

user663031 Over a year ago

But esprima was explicitly suggested as the best solution in a related question you posted, and IIRC you dismissed it as being overkill.

vitaly-t Over a year ago

@torazaburo that suggestion in a related question didn't help at all, it wasn't specific enough, not even close. But thanks for your immediate downvote on this question here.

user663031 Over a year ago

Please don't try to rewrite history. Suggestions in comments are not required to be full-blown solutions. Esprima is not rocket science, and once given the idea of using a parser, figuring out how to do what you wanted to with it is quite simple. You rejected the idea of using it, if you will recall, not because the suggestion was not specific, but because you did not want to use a parser, characterizing it as "overkill". In any case, I'm glad you finally figured out that you do need a parser.

vitaly-t Over a year ago

I left a comment on this subject moments ago on your answer to the other question, which is more appropriate, I think ;)

vitaly-t · Accepted Answer · 2015-12-30 14:30:34Z

Since nobody has answered yet, I have managed to come up with a brute-force solution that works, though it is perhaps not the best one.

function enumRegEx(parsed) {
    var result = [];

    function loop(obj) {
        if (!obj || typeof obj !== 'object') {
            return;
        }
        // The tree contains mutual references, so mark visited objects
        // to avoid infinite recursion. Note: this mutates the parsed tree.
        if (obj.used) {
            return;
        }
        obj.used = true;

        if (obj instanceof Array) {
            obj.forEach(function (d) {
                loop(d);
            });
        } else if (obj instanceof uglify.AST_Node) {
            // Descend into every property of an AST node.
            for (var v in obj) {
                loop(obj[v]);
            }
        } else if (obj instanceof uglify.AST_Token && obj.type === 'regexp') {
            result.push({
                startIdx: obj.col,
                endIdx: obj.endcol
            });
        }
    }

    loop(parsed);
    return result;
}

The things I don't like about this approach:

1. I'm using it against a huge, 30,000-line JavaScript file, which UglifyJS parses in 240ms, and then my algorithm takes another 430ms just to enumerate the regular expressions. This seems quite inefficient.
2. I have to modify the original objects by adding the property used, because the parsed structure contains mutual references, which would otherwise cause infinite recursion and exhaust the call stack. I'm not very worried about that, though, since I'm not using the parsed data for anything else.

If you know a better approach, please share it! At this point I'm mostly interested in improving the performance of my enumeration, which is currently quite slow compared to the actual parsing.
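On the second point, one way to avoid tagging the parsed objects is to remember visited objects in a Set. The following is only a sketch of that idea on a hand-built object graph, not on a real UglifyJS tree: isRegExpToken and collectRegExpTokens are hypothetical names, and the token check is duck-typed rather than using instanceof uglify.AST_Token. I have not measured whether it is any faster.

```javascript
// Hypothetical predicate: duck-typed stand-in for
// "obj instanceof uglify.AST_Token && obj.type === 'regexp'".
function isRegExpToken(obj) {
    return obj.type === 'regexp' &&
        typeof obj.col === 'number' &&
        typeof obj.endcol === 'number';
}

// Walk an object graph that may contain cycles, tracking visited
// objects in a Set so the graph itself is never modified.
function collectRegExpTokens(root) {
    var result = [];
    var seen = new Set();  // objects already visited
    var stack = [root];    // explicit stack instead of recursion

    while (stack.length > 0) {
        var obj = stack.pop();
        if (!obj || typeof obj !== 'object' || seen.has(obj)) {
            continue;
        }
        seen.add(obj);
        if (isRegExpToken(obj)) {
            result.push({startIdx: obj.col, endIdx: obj.endcol});
            continue;
        }
        // Arrays and objects alike: queue every own enumerable value.
        Object.keys(obj).forEach(function (key) {
            stack.push(obj[key]);
        });
    }
    return result;
}

// Hand-built stand-in for a parsed tree, with a deliberate cycle
// and the same token referenced twice.
var regexToken = {type: 'regexp', col: 10, endcol: 17};
var fakeNode = {start: regexToken, end: regexToken, body: []};
fakeNode.body.push(fakeNode);
console.log(collectRegExpTokens(fakeNode));
```

Because the walk uses a stack, results are not guaranteed to come out in source order; sort them afterwards if order matters.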
