I am parsing a web page and have run into a problem. The page contains many elements like these:

<div class="tweet">
    <a href="https://twitter.com/Sweden" target="_blank" class="tweet__link">@sweden</a>
    <span class="tweet__timestamp"><a href="https://twitter.com/sweden/status/694285861026926594" target="_blank" class="tweet__permalink">Feb. 1, 2016, 11:27 p.m.</a></span>
    <p class="tweet__content"><a href='http://twitter.com/UnbatedFlunky' target='_blank'>@UnbatedFlunky</a> Good to know. :)</p>
</div>

<div class="tweet">
    <a href="https://twitter.com/Sweden" target="_blank" class="tweet__link">@sweden</a>
    <span class="tweet__timestamp"><a href="https://twitter.com/sweden/status/694285696140513280" target="_blank" class="tweet__permalink">Feb. 1, 2016, 11:26 p.m.</a></span>
    <p class="tweet__content">RT <a href='http://twitter.com/UnbatedFlunky' target='_blank'>@UnbatedFlunky</a>: .<a href='http://twitter.com/sweden' target='_blank'>@sweden</a> exactly the kind of content I'd want representing my country. 10/10</p>
</div>

I want to store the content of each element with the class tweet in a separate string. So far I have this code:

Document doc = Jsoup.connect("http://curatorsofsweden.com/curator/aleksandra-boscanin/").get();
Element e = doc.select("div").first();
String text = doc.getElementsByClass("tweet").text();

This stores all of the content in one single string. How can I store each tweet separately, for example in a String array? Maybe it's a stupid question, but I could not make it work.

  • Have you considered creating a Model first, then trying to serialize it? Commented Feb 2, 2016 at 11:26
  • What do you mean by model? Commented Feb 2, 2016 at 11:33
  • It's an entity (in OOP) that mirrors the HTML (DOM) structure. You should read up on serialization and deserialization. Apache Xerces would be an easy framework to use for deserializing HTML into classes: xerces.apache.org/xerces2-j/faq-dom.html#faq-3 Commented Feb 2, 2016 at 12:11

1 Answer

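doc.getElementsByClass("tweet") returns an Elements collection (a list of Element objects), not a single element. Calling text() on the whole collection gives the combined text of every match, which is why you end up with one string. Iterate over the collection instead and add one entry per tweet element, for example: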

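// Collect the text of each tweet element as a separate list entry.
List<String> stringList = new ArrayList<>();
// getElementsByClass returns Elements, which is a List<Element>.
List<Element> tweets = doc.getElementsByClass("tweet");
for (Element tweet : tweets) {
    stringList.add(tweet.text());
}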

Each tweet's text will then be a separate entry in stringList.
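The question mentions a String array, so if a plain array is really needed rather than a list, the collected texts can be copied into one afterwards. The following is only a minimal sketch using the standard library: the class name TweetTextArray, the helper toArray, and the sample strings are made up for illustration, and in the real program the list would be filled from tweet.text() as in the loop above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of jsoup.
class TweetTextArray {

    // Copies the collected tweet texts into a plain String[].
    static String[] toArray(List<String> texts) {
        // Passing an empty array lets the list allocate one of the right size.
        return texts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Stand-in values (assumption); real entries would come from tweet.text().
        List<String> stringList = new ArrayList<>();
        stringList.add("text of the first tweet");
        stringList.add("text of the second tweet");

        String[] tweetTexts = toArray(stringList);
        System.out.println(tweetTexts.length);
        System.out.println(Arrays.toString(tweetTexts));
    }
}
```

In most cases the List<String> is easier to work with, so the conversion is only worth doing when some other API insists on an array.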
