I am a data miner, and as such, I spend a lot of time transforming raw data in various ways to enable consumption by predictive models: for instance, reading a file in a certain format, tokenizing, gram-ifying, and projecting into some numeric representation. Over the years I have developed a rich set of methods for most of the data processing tasks I can think of, but I don't have a nice way of configuring these components in anything but the most rudimentary ways. Typically I make a lot of calls to specific methods in source code that depends on the specific task. I'm now trying to refactor my libraries into something much nicer, but I'm not sure what that should look like.
My current thinking is to have a list of function objects, each defining some method (say, operate( ... )), that are called in sequence, each either processing the contents of some data flow by reference or consuming the output of the previous function object. This is close to what I want, but because the types of data being input and output vary from stage to stage, using generics becomes very difficult. To use my example above, I'd like to pass something through this "pipeline" that processes data like:
input: string filename
filename -> collection of strings
collection<string> -> (stemming, stopword removal) -> collection of strings
collection<string> -> (tokenize) -> collection of string arrays
collection<string[]> -> (gram-ify: augment individual token strings with n-grams) -> collection of string arrays
collection<string[]> -> (projection into numeric vectors) -> collection<double[]>
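For illustration, the flow above could be sketched in Java as a chain of typed stages composed with java.util.function.Function.andThen. This is only a sketch under assumptions: the stage names (NORMALIZE, TOKENIZE, GRAMIFY, PROJECT) and their toy bodies are hypothetical placeholders, not the real components, and the file-reading step is left out so the chain starts at the collection of strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class TypedPipelineSketch {

    // Hypothetical stand-in for stemming / stopword removal: lower-case and trim each line.
    static final Function<List<String>, List<String>> NORMALIZE = lines -> {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.toLowerCase().trim());
        }
        return out;
    };

    // Hypothetical tokenizer: split each line on whitespace.
    static final Function<List<String>, List<String[]>> TOKENIZE = lines -> {
        List<String[]> out = new ArrayList<>();
        for (String line : lines) {
            out.add(line.split("\\s+"));
        }
        return out;
    };

    // Hypothetical gram-ifier: append bigrams ("a_b") after the original tokens.
    static final Function<List<String[]>, List<String[]>> GRAMIFY = docs -> {
        List<String[]> out = new ArrayList<>();
        for (String[] tokens : docs) {
            List<String> augmented = new ArrayList<>(Arrays.asList(tokens));
            for (int i = 0; i + 1 < tokens.length; i++) {
                augmented.add(tokens[i] + "_" + tokens[i + 1]);
            }
            out.add(augmented.toArray(new String[0]));
        }
        return out;
    };

    // Hypothetical projection: [number of tokens, total character count].
    static final Function<List<String[]>, List<double[]>> PROJECT = docs -> {
        List<double[]> out = new ArrayList<>();
        for (String[] tokens : docs) {
            double chars = 0;
            for (String token : tokens) {
                chars += token.length();
            }
            out.add(new double[] {tokens.length, chars});
        }
        return out;
    };

    // The compiler verifies that each stage's output type matches the next stage's input type.
    static final Function<List<String>, List<double[]>> PIPELINE =
            NORMALIZE.andThen(TOKENIZE).andThen(GRAMIFY).andThen(PROJECT);

    public static void main(String[] args) {
        List<double[]> vectors = PIPELINE.apply(Arrays.asList("The quick fox", "Hello world"));
        for (double[] v : vectors) {
            System.out.println(Arrays.toString(v));
        }
    }
}
```

Swapping two incompatible stages here (say, PROJECT before TOKENIZE) fails at compile time. That guarantee only holds while the chain is written out in code, which is exactly what is lost once the stage list comes from a configuration file.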
This is a simple example, but imagine I have hundreds of such components and I'd like to add them to some data flow. This meets my easy-to-configure requirement: I could easily build a pipeline factory that reads some YAML file and builds this out. However, the design of the components has been stumping me for a while. What do the appropriate interfaces look like? It seems like the only easy way to do things here is to pass Objects around, essentially doing away with generics (or to pass some context object that has an Object as a member variable), then check for compatibility at input and throw runtime exceptions. Both options seem equally bad. However, I feel like I'm close to a really nice and flexible system here. Can you guys help me push this over the fence?
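To make the Object-passing option concrete, here is a rough Java sketch in which every stage declares its input and output Class, so a pipeline assembled at runtime (for example by a factory reading a config file) can be validated once at construction time instead of failing halfway through a data run. Everything here is hypothetical: Stage, Pipeline, tokenize() and lengths() are illustrative names, not an existing API. One known limitation is that Class tokens cannot tell List<String> from List<String[]> because of type erasure, so the example uses array types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class CheckedPipelineSketch {

    /** Hypothetical stage that carries its input and output types as Class tokens. */
    static final class Stage<I, O> {
        final String name;
        final Class<I> inputType;
        final Class<O> outputType;
        final Function<I, O> body;

        Stage(String name, Class<I> inputType, Class<O> outputType, Function<I, O> body) {
            this.name = name;
            this.inputType = inputType;
            this.outputType = outputType;
            this.body = body;
        }

        // Class.cast throws ClassCastException if the input has the wrong runtime type.
        Object run(Object input) {
            return body.apply(inputType.cast(input));
        }
    }

    /** Hypothetical pipeline that checks adjacent stages for compatibility when it is built. */
    static final class Pipeline {
        private final List<Stage<?, ?>> stages;

        Pipeline(List<Stage<?, ?>> stages) {
            for (int i = 0; i + 1 < stages.size(); i++) {
                Stage<?, ?> from = stages.get(i);
                Stage<?, ?> to = stages.get(i + 1);
                if (!to.inputType.isAssignableFrom(from.outputType)) {
                    throw new IllegalArgumentException(
                            "Stage '" + from.name + "' produces " + from.outputType.getSimpleName()
                            + " but stage '" + to.name + "' expects " + to.inputType.getSimpleName());
                }
            }
            this.stages = new ArrayList<>(stages);
        }

        Object run(Object input) {
            Object current = input;
            for (Stage<?, ?> stage : stages) {
                current = stage.run(current);
            }
            return current;
        }
    }

    // Toy stage: String -> String[] by splitting on whitespace.
    static Stage<String, String[]> tokenize() {
        return new Stage<>("tokenize", String.class, String[].class, s -> s.trim().split("\\s+"));
    }

    // Toy stage: String[] -> double[] holding the length of each token.
    static Stage<String[], double[]> lengths() {
        return new Stage<>("lengths", String[].class, double[].class, tokens -> {
            double[] out = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                out[i] = tokens[i].length();
            }
            return out;
        });
    }

    public static void main(String[] args) {
        List<Stage<?, ?>> stages = new ArrayList<>();
        stages.add(tokenize());
        stages.add(lengths());
        Object result = new Pipeline(stages).run("a bb ccc");
        System.out.println(Arrays.toString((double[]) result));
    }
}
```

The type errors are still runtime errors, but they surface when the pipeline is built from its configuration rather than in the middle of processing, which may make the Object-passing option less unpleasant than it first appears.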