How do I parse an html file without using Jsoup?

Question

I need to parse an HTML file for a homework project, and therefore I can't use Jsoup.

I have tried crawling through the file, but I don't know how to save what I'm looking for.

This is what I have:

    FileInputStream fis = new FileInputStream(filename);
    InputStreamReader inStream = new InputStreamReader(fis);
    BufferedReader reader = new BufferedReader(inStream);

    String fileLine;
    while((fileLine = reader.readLine()) != null){

        String tag = fileLine.substring(fileLine.indexOf("<") + 1, fileLine.indexOf(">"));
    }

I need to find the information inside the <title> tags, but I can't figure out how to get that information without also getting tags I don't need, or how to handle cases where there are no tags.

I want to take the information in the title tag and turn it into a string that I can use.

How is the actual HTML file formatted? Do you need to read it line by line? Posting the actual HTML file might help.
– BlackPearl, Commented Apr 8, 2019 at 18:11

Mike de Groot · Accepted Answer · 2019-04-08 18:36:14Z


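    // Read the whole file into one string so the tags can sit on different lines.
    String fileDataString = Files.readAllLines(Paths.get(fileName), Charset.forName("UTF-8")).stream().collect(Collectors.joining("\n"));

    // StringUtils is from Apache Commons Lang; substringBetween returns null when no match is found.
    String title = StringUtils.substringBetween(fileDataString, "<title>", "</title>");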

This should get the text between <title> and </title>.

EDIT: Thanks to BlackPearl for suggesting stream().collect(Collectors.joining("\n")).
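The exact-match search above only finds a lowercase <title> tag with no attributes. If the file might contain <TITLE> or a tag with attributes, a regular expression from the JDK is one possible variation. This is only a sketch: the class name RegexTitleExtractor and the method extractTitle are made up for illustration, and a regex is not a full HTML parser, so it can be fooled by unusual markup such as a title tag inside a comment.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class RegexTitleExtractor {

    // CASE_INSENSITIVE accepts <TITLE>; DOTALL lets '.' match newlines,
    // so the title text may span several lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title\\b[^>]*>(.*?)</title\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed text of the first title element, or empty if none is found.
    static Optional<String> extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }
}
```

Returning an Optional instead of null makes the "no title" case explicit for the caller.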

edited Apr 8, 2019 at 18:36

answered Apr 8, 2019 at 18:16


4 Comments

BlackPearl Over a year ago

This approach will work only if the opening and closing title tags are on the same line.

Mike de Groot Over a year ago

Changed it so that it first reads the whole file, then looks for the title tags and gets the string in between.

BlackPearl Over a year ago

or better, stream().collect(Collectors.joining("\n"))

Line Over a year ago

Still not pure Java: StringUtils belongs to Apache Commons.
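For a version that avoids Apache Commons, the same idea can be sketched with only the JDK, using String.indexOf and substring. The class name TitleExtractor, the method extractTitle, and the command-line argument are assumptions made for this example, not something from the thread. It looks for the exact lowercase tags, as the answer does, and returns an empty Optional when either tag is missing, which covers the "no tags" case from the question.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper using only the JDK.
class TitleExtractor {

    // Returns the trimmed text between the first <title> and the next </title>,
    // or Optional.empty() if either tag is missing.
    static Optional<String> extractTitle(String html) {
        String open = "<title>";
        String close = "</title>";
        int start = html.indexOf(open);
        if (start < 0) {
            return Optional.empty(); // no opening tag
        }
        start += open.length();
        int end = html.indexOf(close, start);
        if (end < 0) {
            return Optional.empty(); // no closing tag after the opening one
        }
        return Optional.of(html.substring(start, end).trim());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java TitleExtractor <file.html>");
            return;
        }
        // Read the whole file first so tags on different lines still match.
        String html = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(extractTitle(html).orElse("(no title found)"));
    }
}
```

Checking each indexOf result before calling substring is what prevents the StringIndexOutOfBoundsException that the line-by-line attempt in the question would hit on lines without tags.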
