I am working with a huge data set of about 10,500 lines that need to be split up into separate parts that include title, date, rating, and length. Here is how the data is formatted: Ghost Blues: The Story of Rory Gallagher (2010) | 3.8 stars, 1hr 21m

I have already figured out how to split each line in half using .split, but I am not sure how to split the first half into the title and the date when the title itself contains parentheses, such as: Dhobi Ghat (Mumbai Diaries) (2010) | 3.6 stars, 1hr 42m.

There are also instances in which some of these fields can be empty, so no rating, date or length, and those are also causing me some issues. Can anyone point me in the right direction? Any help would be appreciated!

EDIT: I forgot to mention (sorry) that I need the dates and ratings as numbers rather than strings, because later I will need to apply filters, such as finding all entries with rating > 3.5 or movies released after 1998. That throws another wrench into this, and I am still working on it. Thank you for all the help so far!

  • 2001 (1968) - yeah I can see how that might be difficult. Good luck! Commented Nov 27, 2018 at 0:18
  • I think the best thing to do is to never use data from wherever it is you are getting it, because any decent person would store their data in a well-known format, such as CSV, JSON etc. That is, if they actually wanted someone else to use it... Commented Nov 27, 2018 at 0:25
  • Can you add a sample of data covering all the edge cases you want? Commented Nov 27, 2018 at 0:26
  • I guess it would be the last pair of brackets? Commented Nov 27, 2018 at 0:29
  • Can you also add the way you tried splitting it please Commented Nov 27, 2018 at 0:29

2 Answers


Try this. It was tested against a couple of edge cases, which are shown as commented-out lines in the code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String s = "Ghost Blues: The Story of Rory Gallagher (2010) |   3.8 stars, 1hr 21m";
        //String s = "Ghost Blues: The Story of Rory Gallagher |   3.8 stars, 1hr 21m"; //no year
        //String s = "Ghost Blues: The Story of Rory Gallagher (2010) |   3.8 stars"; //no length
        // Group 1 = title, group 4 = year (optional), group 5 = rating, group 8 = length (optional)
        Pattern p = Pattern.compile("(.*?)( (\\((\\d{4})\\)))? \\|\\s+(\\d(\\.\\d)?) stars(, (\\dhr( \\d{1,2}m)?))?");
        Matcher m = p.matcher(s);
        if (m.find()) {
            System.out.println(m.group(1)); //title
            System.out.println(m.group(4)); //year, null if absent
            System.out.println(m.group(5)); //rating
            System.out.println(m.group(8)); //length, null if absent
        }
    }
}

Output

Ghost Blues: The Story of Rory Gallagher
2010
3.8
1hr 21m

This can be improved further if you provide more examples of edge cases.
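Regarding the edit in the question about needing numbers for filtering: here is a minimal sketch of one way to get typed values, with null for any missing field. It swaps the single regex above for a split on the first `|` followed by a few small patterns, so it is a different technique, not a drop-in extension. The names `MovieLineParser`, `Entry` and `parse` are hypothetical, and it assumes the year is always four digits in parentheses at the very end of the title half, and that the length looks like `1hr 21m`, `2hr` or `45m`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the answer above: parses one line into typed fields.
class MovieLineParser {

    // One parsed line; a field is null when the line omits it.
    static final class Entry {
        final String title;
        final Integer year;     // e.g. 2010
        final Double rating;    // e.g. 3.8
        final Integer minutes;  // total running time, e.g. 81 for "1hr 21m"

        Entry(String title, Integer year, Double rating, Integer minutes) {
            this.title = title;
            this.year = year;
            this.rating = rating;
            this.minutes = minutes;
        }
    }

    // A four-digit year in parentheses at the very end of the title half.
    private static final Pattern YEAR = Pattern.compile("(.*\\S)\\s*\\((\\d{4})\\)");
    private static final Pattern RATING = Pattern.compile("(\\d+(?:\\.\\d+)?)\\s*stars?");
    private static final Pattern HOURS = Pattern.compile("(\\d+)\\s*hr");
    private static final Pattern MINUTES = Pattern.compile("(\\d+)\\s*m\\b");

    static Entry parse(String line) {
        // Assumption: the first '|' separates the title half from the rating/length half.
        String[] halves = line.split("\\|", 2);
        String title = halves[0].trim();

        Integer year = null;
        Matcher y = YEAR.matcher(title);
        if (y.matches()) {              // only the trailing "(dddd)" counts as the year
            title = y.group(1);
            year = Integer.valueOf(y.group(2));
        }

        Double rating = null;
        Integer minutes = null;
        if (halves.length > 1) {
            String rest = halves[1];
            Matcher r = RATING.matcher(rest);
            if (r.find()) {
                rating = Double.valueOf(r.group(1));
            }
            Matcher h = HOURS.matcher(rest);
            Matcher m = MINUTES.matcher(rest);
            boolean hasHours = h.find();
            boolean hasMinutes = m.find();
            if (hasHours || hasMinutes) {
                int total = hasHours ? Integer.parseInt(h.group(1)) * 60 : 0;
                total += hasMinutes ? Integer.parseInt(m.group(1)) : 0;
                minutes = total;
            }
        }
        return new Entry(title, year, rating, minutes);
    }

    public static void main(String[] args) {
        Entry e = parse("Dhobi Ghat (Mumbai Diaries) (2010) |   3.6 stars, 1hr 42m");
        System.out.println(e.title + " / " + e.year + " / " + e.rating + " / " + e.minutes);
    }
}
```

Storing the length as total minutes is a design choice of this sketch; it makes comparisons such as "shorter than 90 minutes" a plain integer test. Since missing fields are null, check for null before comparing.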



Here's a solution:

public class Title {
    private String title;
    private String year;
    private String rating;
    private String length;
    public Title(String input) {
        String[] leftRight = input.split("\\|");
        title = leftRight[0].trim();
        int lastParen = title.lastIndexOf("(");
        // Treat the last parenthesized group as the year only if it closes the
        // title and holds exactly four digits, e.g. "(2010)".
        if (lastParen > 0 && title.endsWith(")")) {
            String candidate = title.substring(lastParen + 1, title.length() - 1);
            if (candidate.matches("\\d{4}")) {
                year = candidate;
                title = title.substring(0, lastParen).trim();
            }
        }
        if (leftRight.length>1) {
            String[] fields = leftRight[1].split(",");
            for (int i = 0; i < fields.length; i++) {
                if (fields[i].contains("stars")) {
                    rating = fields[i].trim();
                } else {
                    length = fields[i].trim();
                }
            }
        }
    }
    @Override
    public String toString() {
        return "Title{" + "title=" + title + ", year=" + year + ", rating=" + rating + ", length=" + length + '}';
    }

    public static void main(String[] args) {
        String[] data = {
            "Ghost Blues: The Story of Rory Gallagher (2010) |   3.8 stars, 1hr 21m",
            "Dhobi Ghat (Mumbai Diaries) (2010) |   3.6 stars, 1hr 42m",
            "just a title",
            "title and rating only | 3.2 stars",
            "title and length only | 1hr 30m"
        };
        for (String titleString : data) {
            Title t = new Title(titleString);
            System.out.println(t);
        }
    }
}

And here's the output from the test data:

Title{title=Ghost Blues: The Story of Rory Gallagher, year=2010, rating=3.8 stars, length=1hr 21m}
Title{title=Dhobi Ghat (Mumbai Diaries), year=2010, rating=3.6 stars, length=1hr 42m}
Title{title=just a title, year=null, rating=null, length=null}
Title{title=title and rating only, year=null, rating=3.2 stars, length=null}
Title{title=title and length only, year=null, rating=null, length=1hr 30m}
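The fields here are kept as strings (the rating still carries its " stars" suffix), so the filters mentioned in the question's edit need a conversion step first. A rough sketch of that step and of two filters follows. `TitleFilters`, `RatedTitle`, `ratedAbove` and `releasedAfter` are hypothetical names, and the sketch assumes the rating string starts with the number and that the year string contains the year's digits and no others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: numeric copies of the string fields, plus two example filters.
class TitleFilters {

    static final class RatedTitle {
        final String title;
        final Integer year;    // null when unknown
        final Double rating;   // null when unknown

        RatedTitle(String title, String year, String rating) {
            this.title = title;
            // Keep only the digits, so stray punctuation around the year is tolerated.
            String digits = (year == null) ? "" : year.replaceAll("\\D", "");
            this.year = digits.isEmpty() ? null : Integer.valueOf(digits);
            // "3.8 stars" -> 3.8: take the text before the first whitespace.
            this.rating = (rating == null) ? null
                    : Double.valueOf(rating.trim().split("\\s+")[0]);
        }
    }

    // Titles whose rating is known and strictly greater than min.
    static List<String> ratedAbove(List<RatedTitle> all, double min) {
        return all.stream()
                .filter(t -> t.rating != null && t.rating > min)
                .map(t -> t.title)
                .collect(Collectors.toList());
    }

    // Titles whose year is known and strictly later than the given year.
    static List<String> releasedAfter(List<RatedTitle> all, int year) {
        return all.stream()
                .filter(t -> t.year != null && t.year > year)
                .map(t -> t.title)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<RatedTitle> all = new ArrayList<>();
        all.add(new RatedTitle("Ghost Blues: The Story of Rory Gallagher", "2010", "3.8 stars"));
        all.add(new RatedTitle("Dhobi Ghat (Mumbai Diaries)", "2010", "3.6 stars"));
        all.add(new RatedTitle("just a title", null, null));
        all.add(new RatedTitle("title and rating only", null, "3.2 stars"));
        System.out.println(ratedAbove(all, 3.5));
        System.out.println(releasedAfter(all, 1998));
    }
}
```

Entries with a missing rating or year are simply skipped by the filters rather than treated as zero, which seemed the safer default for this data.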
