Subset String Array based on length

Question

I have a vector with > 30000 words. I want to create a subset of this vector which contains only those words whose length is greater than 5. What is the best way to achieve this?

Basically, df contains multiple sentences.

So,

wordlist = df2;
# strip leading and trailing whitespace from every word
wordlist = [strip(w) for w in wordlist];

Now, I need to subset wordlist so that it contains only those words whose length is greater than 5.

Comment (jub0bs, Sep 29, 2015 at 8:53): Please edit your question and add some example code and what you've tried so far.

Reza Afzalan · Accepted Answer · 2015-09-29 12:22:26Z

1

 sub(A,find(x->length(x)>5,A)) # => creates a view (the selected elements are not copied)

EDIT: getindex() returns a copy of the desired elements:

getindex(A,find(x->length(x)>5,A)) # => makes a copy
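The two calls above use the Julia 0.4 names. As far as I know, later releases replaced sub with view and find with findall, so a rough equivalent for Julia 1.x might look like the sketch below. The vector words and its contents are made up for illustration and are not from the question.

```julia
# Assumes Julia 1.x; `words` is a made-up sample, not the asker's data.
words = ["apple", "banana", "kiwi", "cherries", "plum"]

# Indices of the words with more than 5 characters.
idx = findall(w -> length(w) > 5, words)

# view: the elements are not copied; it refers back to `words`.
long_view = view(words, idx)

# Plain indexing (getindex): an independent copy of the selected elements.
long_copy = words[idx]
```

A view is useful when the subset is only read or when changes should write through to the original vector; the copy is safer if the original may be modified later.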

edited Sep 29, 2015 at 12:22

answered Sep 29, 2015 at 9:06


3 Comments

Saurabh Wardhane Over a year ago

It returns an error: sub has no method matching sub(::Array{Any,1},::Array{Int64,1}). I used the following syntax: wordlist2 = sub(wordlist,find(x -> length(x)>4,wordlist));

Saurabh Wardhane Over a year ago

Got it! Used getindex instead of sub.

Reza Afzalan Over a year ago

getindex() gives you a copy, but sub() creates a view; both have the same syntax. The code above works for me (VERSION # => v"0.4.0-rc2").

Dan Getz · Accepted Answer · 2015-09-29 20:35:46Z

1

You can use filter

wordlist = filter(x->islenatleast(x,6),wordlist)

and combine it with a fast condition such as islenatleast defined as:

function islenatleast(s,l)
    if sizeof(s)<l return false end
    # assumes each char takes at least a byte
    l==0 && return true
    p=1
    i=0
    while i<l
        if p>sizeof(s) return false end
        p = nextind(s,p)
        i += 1
    end
    return true
end

According to my timings, islenatleast is faster than calculating the whole length under some conditions, since it can stop as soon as it has counted l characters. It also shows a strength of Julia: a user-defined primitive can be competitive with the core function length.

But doing:

wordlist = filter(x->length(x)>5,wordlist)

will also do.
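A small sketch of the filter approach on sample data, assuming Julia 1.x. The vector words is hypothetical. Note that length counts characters while sizeof counts bytes, which matters for non-ASCII words such as "héllo" (5 characters, 6 bytes).

```julia
# Assumes Julia 1.x; `words` is a made-up sample, not the asker's data.
words = ["apple", "banana", "héllo", "cherries"]

# filter returns a new vector and leaves `words` untouched.
long_words = filter(w -> length(w) > 5, words)

# length counts characters, sizeof counts bytes (UTF-8).
n_chars = length("héllo")
n_bytes = sizeof("héllo")

# filter! removes the non-matching elements from `words` in place.
filter!(w -> length(w) > 5, words)
```

filter! avoids allocating a second vector, which may be preferable for a list of 30000+ words if the original is no longer needed.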

edited Sep 29, 2015 at 20:35

answered Sep 29, 2015 at 20:25
