68

I need my Java program to take a string like:

"This is a sample sentence."

and turn it into a string array like:

{"this","is","a","sample","sentence"}

No periods, or punctuation (preferably). By the way, the string input is always one sentence.

Is there an easy way to do this that I'm not seeing? Or do we really have to search for spaces a lot and create new strings from the areas between the spaces (which are words)?

1

19 Answers 19

87

String.split() will do most of what you want. You may then need to loop over the words to pull out any punctuation.

For example:

String s = "This is a sample sentence.";
String[] words = s.split("\\s+");
for (int i = 0; i < words.length; i++) {
    // You may want to check for a non-word character before blindly
    // performing a replacement
    // It may also be necessary to adjust the character class
    words[i] = words[i].replaceAll("[^\\w]", "");
}
Sign up to request clarification or add additional context in comments.

3 Comments

Could you add explanation about regular expression that you used?
@Marek 1. \\s means space, \\s+ means multiple spaces 2. .replaceAll("[^\\w]", ""); and .replaceAll("\\W", ""); Both of them will replace the characters apart from [a-zA-Z0-9_]. If you want to replace the underscore too, then use: [\\W_]
It works fine, though i have upvote, but this regular expression removes any if there is any special character present !!! Kindly update if there is any normal regular expression which doesn't remove any character
34

Now, this can be accomplished just with split as it takes regex:

String s = "This is a sample sentence with []s.";
String[] words = s.split("\\W+");

this will give words as: {"this","is","a","sample","sentence", "s"}

The \\W+ will match all non-alphabetic characters occurring one or more times. So there is no need to replace. You can check other patterns also.

1 Comment

You may want to start the regex with (?U) to enable Unicode character class, otherwise it will work only with English alphabet.
15

You can use BreakIterator.getWordInstance to find all words in a string.

public static List<String> getWords(String text) {
    List<String> words = new ArrayList<String>();
    BreakIterator breakIterator = BreakIterator.getWordInstance();
    breakIterator.setText(text);
    int lastIndex = breakIterator.first();
    while (BreakIterator.DONE != lastIndex) {
        int firstIndex = lastIndex;
        lastIndex = breakIterator.next();
        if (lastIndex != BreakIterator.DONE && Character.isLetterOrDigit(text.charAt(firstIndex))) {
            words.add(text.substring(firstIndex, lastIndex));
        }
    }

    return words;
}

Test:

public static void main(String[] args) {
    System.out.println(getWords("A PT CR M0RT BOUSG SABN NTE TR/GB/(G) = RAND(MIN(XXX, YY + ABC))"));
}

Ouput:

[A, PT, CR, M0RT, BOUSG, SABN, NTE, TR, GB, G, RAND, MIN, XXX, YY, ABC]

2 Comments

it doesn't split x.y , i.e. "funny.Does it split", returns funny.Does as 1 word
And it probably shouldn't. In English--the code, regrettably, doesn't specify a locale--words are not split by periods.
12

You can also use BreakIterator.getWordInstance.

1 Comment

Wow. The documentation for that looked really nice. An easy way to find the words in the string.
10

Try using the following:

String str = "This is a simple sentence";
String[] strgs = str.split(" ");

That will create a substring at each index of the array of strings using the space as a split point.

Comments

7

You can just split your string like that using this regular expression

String l = "sofia, malgré tout aimait : la laitue et le choux !" <br/>
l.split("[[ ]*|[,]*|[\\.]*|[:]*|[/]*|[!]*|[?]*|[+]*]+");

1 Comment

Good for french. You may add a few things like : "[[ ]*|[,]*|[;]*|[:]*|[']*|[’]*|[\\.]*|[:]*|[/]*|[!]*|[?]*|[+]*]+"
5

The easiest and best answer I can think of is to use the following method defined on the java string -

String[] split(String regex)

And just do "This is a sample sentence".split(" "). Because it takes a regex, you can do more complicated splits as well, which can include removing unwanted punctuation and other such characters.

1 Comment

Guys this is the simplest solution if a sentence does not have punctuation.
4

Use string.replace(".", "").replace(",", "").replace("?", "").replace("!","").split(' ') to split your code into an array with no periods, commas, question marks, or exclamation marks. You can add/remove as many replace calls as you want.

1 Comment

Rather than calling replace 4 times, it would be better to just call it once with a regex that captures any of the 4 items.
4

Try this:

String[] stringArray = Pattern.compile("ian").split(
"This is a sample sentence"
.replaceAll("[^\\p{Alnum}]+", "") //this will remove all non alpha numeric chars
);

for (int j=0; i<stringArray .length; j++) {
  System.out.println(i + " \"" + stringArray [j] + "\"");
}

Comments

2

I already did post this answer somewhere, i will do it here again. This version doesn't use any major inbuilt method. You got the char array, convert it into a String. Hope it helps!

import java.util.Scanner;

public class SentenceToWord 
{
    public static int getNumberOfWords(String sentence)
    {
        int counter=0;
        for(int i=0;i<sentence.length();i++)
        {
            if(sentence.charAt(i)==' ')
            counter++;
        }
        return counter+1;
    }

    public static char[] getSubString(String sentence,int start,int end) //method to give substring, replacement of String.substring() 
    {
        int counter=0;
        char charArrayToReturn[]=new char[end-start];
        for(int i=start;i<end;i++)
        {
            charArrayToReturn[counter++]=sentence.charAt(i);
        }
        return charArrayToReturn;
    }

    public static char[][] getWordsFromString(String sentence)
    {
        int wordsCounter=0;
        int spaceIndex=0;
        int length=sentence.length();
        char wordsArray[][]=new char[getNumberOfWords(sentence)][]; 
        for(int i=0;i<length;i++)
        {
            if(sentence.charAt(i)==' ' || i+1==length)
            {
            wordsArray[wordsCounter++]=getSubString(sentence, spaceIndex,i+1); //get each word as substring
            spaceIndex=i+1; //increment space index
            }
        }
        return  wordsArray; //return the 2 dimensional char array
    }


    public static void main(String[] args) 
    {
    System.out.println("Please enter the String");
    Scanner input=new Scanner(System.in);
    String userInput=input.nextLine().trim();
    int numOfWords=getNumberOfWords(userInput);
    char words[][]=new char[numOfWords+1][];
    words=getWordsFromString(userInput);
    System.out.println("Total number of words found in the String is "+(numOfWords));
    for(int i=0;i<numOfWords;i++)
    {
        System.out.println(" ");
        for(int j=0;j<words[i].length;j++)
        {
        System.out.print(words[i][j]);//print out each char one by one
        }
    }
    }

}

Comments

1

string.replaceAll() doesn't correctly work with locale different from predefined. At least in jdk7u10.

This example creates a word dictionary from textfile with windows cyrillic charset CP1251

    public static void main (String[] args) {
    String fileName = "Tolstoy_VoinaMir.txt";
    try {
        List<String> lines = Files.readAllLines(Paths.get(fileName),
                                                Charset.forName("CP1251"));
        Set<String> words = new TreeSet<>();
        for (String s: lines ) {
            for (String w : s.split("\\s+")) {
                w = w.replaceAll("\\p{Punct}","");
                words.add(w);
            }
        }
        for (String w: words) {
            System.out.println(w);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }

Comments

1

Following is a code snippet which splits a sentense to word and give its count too.

 import java.util.HashMap;
 import java.util.Iterator;
 import java.util.Map;

 public class StringToword {
public static void main(String[] args) {
    String s="a a a A A";
    String[] splitedString=s.split(" ");
    Map m=new HashMap();
    int count=1;
    for(String s1 :splitedString){
         count=m.containsKey(s1)?count+1:1;
          m.put(s1, count);
        }
    Iterator<StringToword> itr=m.entrySet().iterator();
    while(itr.hasNext()){
        System.out.println(itr.next());         
    }
    }

}

Comments

1

Another way to do that is StringTokenizer. ex:-

 public static void main(String[] args) {

    String str = "This is a sample string";
    StringTokenizer st = new StringTokenizer(str," ");
    String starr[]=new String[st.countTokens()];
    while (st.hasMoreElements()) {
        starr[i++]=st.nextElement();
    }
}

Comments

1

Most of the answers here convert String to String Array as the question asked. But Generally we use List , so more useful will be -

String dummy = "This is a sample sentence.";
List<String> wordList= Arrays.asList(dummy.split(" "));

Comments

1

This should help,

 String s = "This is a sample sentence";
 String[] words = s.split(" ");

this will make an array with elements as the string separated by " ".

Comments

0

You can use simple following code

String str= "This is a sample sentence.";
String[] words = str.split("[[ ]*|[//.]]");
for(int i=0;i<words.length;i++)
System.out.print(words[i]+" ");

Comments

0

Here is a solution in plain and simple C++ code with no fancy function, use DMA to allocate a dynamic string array, and put data in array till you find a open space. please refer code below with comments. I hope it helps.

#include<bits/stdc++.h>
using namespace std;

int main()
{

string data="hello there how are you"; // a_size=5, char count =23
//getline(cin,data); 
int count=0; // initialize a count to count total number of spaces in string.
int len=data.length();
for (int i = 0; i < (int)data.length(); ++i)
{
    if(data[i]==' ')
    {
        ++count;
    }
}
//declare a string array +1 greater than the size 
// num of space in string.
string* str = new string[count+1];

int i, start=0;
for (int index=0; index<count+1; ++index) // index array to increment index of string array and feed data.
{   string temp="";
    for ( i = start; i <len; ++i)
    {   
        if(data[i]!=' ') //increment temp stored word till you find a space.
        {
            temp=temp+data[i];
        }else{
            start=i+1; // increment i counter to next to the space
            break;
        }
    }str[index]=temp;
}


//print data 
for (int i = 0; i < count+1; ++i)
{
    cout<<str[i]<<" ";
}

    return 0;
}

Comments

0

TRY THIS....

import java.util.Scanner;

public class test {
    public static void main(String[] args) {

        Scanner t = new Scanner(System.in);
        String x = t.nextLine();

        System.out.println(x);

        String[] starr = x.split(" ");

        System.out.println("reg no: "+ starr[0]);
        System.out.println("name: "+ starr[1]);
        System.out.println("district: "+ starr[2]);

    }
}

Comments

0

Be careful with leading or trailing spaces that would introduce empty tokens, as well as repeated whitespace characters.

I suggest trimming before splitting:

String[] words = s.trim().split("\\s+");

// Input -> " Hi i am om "
// Output -> ["Hi", "i", "am", "om"]

// trim() removes leading and trailing spaces.
// One-line solution to split into words.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.