Java patterns: engineering data flows for data mining tasks

Question

I am a data miner, and as such I spend a lot of time transforming raw data in various ways so that predictive models can consume it: for instance, reading a file in a certain format, tokenizing, gram-ifying, and projecting into some numeric representation. Over the years I have developed a rich set of methods for most of the data processing tasks I can think of, but I don't have a nice way of configuring these components beyond the most rudimentary: typically I write a lot of calls to specific methods in source code that depends on a specific task. I'm now trying to refactor my libraries into something much nicer, but I'm not sure what that should look like.

My current thinking is to have a list of function objects, each defining some method (say, operate( ... )), that are called in sequence, each either processing the contents of some data flow by reference or consuming the output of the previous function object. This is close to what I want, but because the input and output types vary from stage to stage, using generics becomes very difficult. To use my example above, I'd like to pass data through a "pipeline" that processes it like this:

input: string filename
filename -> collection of strings
collection<string> -> (stemming, stopword removal) -> collection of strings
collection<string> -> (tokenize) -> collection of string arrays
collection<string[]> -> (gram-ify) -> augment individual token strings with n-grams -> collection of string arrays
collection<string[]> -> projection into numeric vectors -> collection< double[] >

This is a simple example, but imagine I have hundreds of such components that I'd like to add to some data flow. This meets my easy-to-configure requirement: I could easily build a pipeline factory that reads some YAML file and builds the pipeline out. However, the design of the components has been stumping me for a while. What do the appropriate interfaces look like? It seems the only easy way to do this is to pass plain Objects around, essentially doing away with type checking (or to pass some context object that has an Object as a member variable), then check for compatibility at input and throw runtime exceptions. Both options seem equally bad. However, I feel like I'm close to a really nice and flexible system here. Can you help me push this over the fence?
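For concreteness, the runtime-checked option described above might look roughly like the sketch below. Every name in it (Stage, Pipeline, the stage helper) is hypothetical and made up for illustration; it is not taken from any library. Each stage declares the Class it accepts and produces, and the pipeline rejects a mismatched neighbour when the stage is added rather than when data flows through.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical names throughout; a sketch of the "pass Objects, check at runtime" idea.
class RuntimeCheckedPipelineDemo {

    /** One processing step that declares the types it accepts and produces. */
    interface Stage {
        Class<?> inputType();
        Class<?> outputType();
        Object operate(Object input);
    }

    /** Wraps a typed function as an untyped Stage, keeping the Class tokens for checking. */
    static <I, O> Stage stage(Class<I> in, Class<O> out, Function<I, O> fn) {
        return new Stage() {
            public Class<?> inputType() { return in; }
            public Class<?> outputType() { return out; }
            public Object operate(Object input) { return fn.apply(in.cast(input)); }
        };
    }

    /** Runs stages in order; neighbouring stages are checked when they are added. */
    static final class Pipeline {
        private final List<Stage> stages = new ArrayList<>();

        Pipeline add(Stage next) {
            if (!stages.isEmpty()) {
                Stage prev = stages.get(stages.size() - 1);
                if (!next.inputType().isAssignableFrom(prev.outputType())) {
                    throw new IllegalArgumentException(
                        "Stage expects " + next.inputType().getName()
                        + " but previous stage produces " + prev.outputType().getName());
                }
            }
            stages.add(next);
            return this;
        }

        Object run(Object input) {
            Object current = input;
            for (Stage stage : stages) {
                // Guards against a wrong initial input or a stage that lies about its output.
                if (!stage.inputType().isInstance(current)) {
                    throw new IllegalArgumentException(
                        "Expected " + stage.inputType().getName() + " but got " + current);
                }
                current = stage.operate(current);
            }
            return current;
        }
    }

    public static void main(String[] args) {
        Pipeline p = new Pipeline()
            .add(stage(String.class, String[].class, s -> s.toLowerCase().split("\\s+")))
            .add(stage(String[].class, Integer.class, tokens -> tokens.length));
        System.out.println(p.run("Read a File")); // prints 3
    }
}
```

One limitation of this sketch: because Java erases generic type parameters, a Class token cannot tell a collection of strings from a collection of string arrays, so collection-valued stages would need some extra descriptor to be checked properly.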

dMb · Accepted Answer · 2011-11-10 02:14:20Z

1

The Apache Software Foundation has a project called Pipeline in the Commons sandbox: https://commons.apache.org/sandbox/pipeline/. Perhaps it can be of use. I thought there were more pipeline-based projects there, so it might be useful to browse around that site.

answered Nov 10, 2011 at 2:14


1 Comment

dMb Over a year ago

You can download the source code, which appears to come with a Maven project file, and compile it into a jar yourself. You can also just look at the design of the interfaces to get an idea of what they are doing. There is also another Apache project, Cocoon, that uses a pipeline approach. You may want to look at that: cocoon.apache.org

BillRobertson42 · Accepted Answer · 2011-11-10 02:23:00Z

1

I think a more nimble tool to tie your library together would be a good approach; for example, one of the newer dynamic languages would be great for that.

Clojure would be a great fit, with tools like map, pmap, reduce, filter, etc. all built in. Clojure's collections all implement the java.util collection interfaces, so you can apply higher-level Clojure functions to your existing Java code, or pass Clojure data structures directly to your Java code (as long as the Java code does not expect to modify them).

The lightweight and dynamic nature of the language makes it easy to put things together quickly without a lot of overhead too.

answered Nov 10, 2011 at 2:23


2 Comments

downer Over a year ago

It's probably a bit much to learn a new language though, no?

BillRobertson42 Over a year ago

That's up to you, really. It's not that hard to get started, especially if you're using it as a glue language. You've already got stuff that works, so if you take some time to start learning Clojure and begin to find little ways to apply it, you will benefit. If you decide you don't like it, you can move on.

toto2 · Accepted Answer · 2011-11-10 03:17:14Z

1

I might be reading your example too literally; meaning that this solution might not be applicable to your real problem.

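import java.util.List;

// Stage that maps a list of strings to a list of strings (e.g. stemming, stopword removal).
public interface Interface1 {
  List<String> operate(List<String> list);
}

// Bridge stage that changes the shape of the data (e.g. tokenizing each string).
public interface InterfaceBridge {
  List<List<String>> operate(List<String> list);
}

// Stage that works on the tokenized form (e.g. adding n-grams).
public interface Interface2 {
  List<List<String>> operate(List<List<String>> list);
}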

You should obviously pick better interface names. You can then compose them with:

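import java.util.ArrayList;
import java.util.List;

// Chains several Interface1 stages and is itself an Interface1, so composites can nest.
public class Interface1Composite implements Interface1 {
  private final List<Interface1> components = new ArrayList<>();

  public Interface1Composite(Interface1... components) {
    for (Interface1 i1 : components)
      this.components.add(i1);
  }

  @Override
  public List<String> operate(List<String> list) {
    // Feed each stage's output into the next one.
    for (Interface1 i1 : components)
      list = i1.operate(list);
    return list;
  }
}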

I guess it's pretty much what you are already doing. I just simplified by having 3 types of interfaces instead of trying to use generics. But as I said earlier, I don't know if you can apply that to your problem.
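To show how the composite above might be used, here is a minimal, self-contained sketch. The interface and composite are restated inside one class so the example compiles on its own; the two stages (LOWERCASE and removeStopwords) are hypothetical stand-ins, not part of the asker's library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CompositeUsageDemo {

    // Same shape as Interface1 above: strings in, strings out.
    interface Interface1 {
        List<String> operate(List<String> list);
    }

    // Same idea as Interface1Composite above, restated so this example is self-contained.
    static class Interface1Composite implements Interface1 {
        private final List<Interface1> components;

        Interface1Composite(Interface1... components) {
            this.components = Arrays.asList(components);
        }

        @Override
        public List<String> operate(List<String> list) {
            for (Interface1 component : components) {
                list = component.operate(list);
            }
            return list;
        }
    }

    // Hypothetical stage: lower-cases every string.
    static final Interface1 LOWERCASE = list -> {
        List<String> out = new ArrayList<>();
        for (String s : list) {
            out.add(s.toLowerCase(Locale.ROOT));
        }
        return out;
    };

    // Hypothetical stage: drops every string found in the given stopword set.
    static Interface1 removeStopwords(Set<String> stopwords) {
        return list -> {
            List<String> out = new ArrayList<>();
            for (String s : list) {
                if (!stopwords.contains(s)) {
                    out.add(s);
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "a"));
        Interface1 normalize = new Interface1Composite(LOWERCASE, removeStopwords(stopwords));
        System.out.println(normalize.operate(Arrays.asList("The", "quick", "fox"))); // prints [quick, fox]
    }
}
```

Because the composite implements the same interface as its parts, a composite such as normalize can itself be passed to another Interface1Composite.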

edited Nov 10, 2011 at 3:17

answered Nov 10, 2011 at 3:05


1 Comment

downer Over a year ago

So that works for the string examples, but what about arbitrary data types, particularly mixed types, such as a map of some identifier to the data structure itself? Additionally, I didn't make this clear initially: sometimes I'd like to pass through single examples (e.g., when used as a component in some web service handling individual requests) and other times collections, when used in batch mode. A one-item list is OK, but a bit ugly, I'd say. A further restriction: some components require all relevant data to be present before processing and passing output, while others don't have this restriction.
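One possible way to cover arbitrary types and both single-item and batch use is a generic stage interface, sketched below under stated assumptions. Stage, then, and batch are hypothetical names invented for this sketch, and the "projection" is a toy placeholder (token lengths), not a real feature mapping. The compiler checks that adjacent stages line up, and batch() lifts a per-item stage to work on a whole list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GenericStageDemo {

    /** A typed step from I to O; I and O can be any types, including maps or arrays. */
    interface Stage<I, O> {
        O apply(I input);

        /** Chains this stage with the next one; a type mismatch is a compile error. */
        default <R> Stage<I, R> then(Stage<? super O, ? extends R> next) {
            return input -> next.apply(apply(input));
        }

        /** Lifts a per-item stage so it runs over a whole list (batch mode). */
        default Stage<List<I>, List<O>> batch() {
            return items -> {
                List<O> out = new ArrayList<>(items.size());
                for (I item : items) {
                    out.add(apply(item));
                }
                return out;
            };
        }
    }

    public static void main(String[] args) {
        Stage<String, String[]> tokenize = s -> s.toLowerCase().split("\\s+");

        // Toy placeholder projection: one feature per token, equal to its length.
        Stage<String[], double[]> project = tokens -> {
            double[] vector = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                vector[i] = tokens[i].length();
            }
            return vector;
        };

        // Single-item use, e.g. inside a request handler.
        Stage<String, double[]> single = tokenize.then(project);
        System.out.println(Arrays.toString(single.apply("Data mining"))); // prints [4.0, 6.0]

        // Batch use of the very same stages.
        Stage<List<String>, List<double[]>> batch = single.batch();
        System.out.println(batch.apply(Arrays.asList("a bc", "def")).size()); // prints 2
    }
}
```

A component that needs all of the data before it can emit anything would, in this sketch, simply be written directly as a Stage<List<I>, List<O>> instead of being lifted with batch(). The trade-off is that a pipeline assembled from a configuration file at runtime cannot benefit from these compile-time checks.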
