
I need to query a parquet file where the column names are completely inconsistent. To remedy this and ensure that my model gets exactly the data it expects, I need to 'prefetch' the column list and then apply some regex patterns to qualify which columns I should retrieve. In pseudocode:

PrefetchList = sqlContext.read.parquet(my_parquet_file).schema.fields
# Insert variable statements to check/qualify the columns against rules here
dfQualified = SELECT [PrefetchList] from parquet;

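Concretely, here is roughly what I'm picturing in PySpark (the file path and the regex pattern are just placeholders, and I'm assuming a standard SparkSession; I'm not sure whether passing the filtered list back into select like this is valid):

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# prefetch the column names from the parquet file
df = spark.read.parquet("my_parquet_file")          # placeholder path
prefetch_list = [f.name for f in df.schema.fields]  # or simply df.columns

# qualify the columns against whatever regex rules apply (placeholder pattern)
qualified = [name for name in prefetch_list if re.match(r"^user.+", name)]

# select only the qualified columns
df_qualified = df.select(*qualified)
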
I've searched around to see if this is achievable but not had any success. If this is syntactically correct (or close) or if someone has other suggestions I am open to it.

Thanks

  • I see the pyspark tag and I am assuming you are using Python, but if you are fine with using Scala, a case class can solve your problem effectively. Commented Nov 20, 2017 at 22:37

1 Answer


You can use the schema method, but the .columns method works as well. Notice that the select method in Spark is a little odd: it's defined as def select(col: String, cols: String*), so you can't pass the list back with select(fields:_*); you'd have to use df.select(fields.head, fields.tail:_*), which is kind of ugly. Luckily there's selectExpr(exprs: String*) as an alternative. So the snippet below will work; it takes only the columns that begin with 'user':

val fields = df.columns.filter(_.matches("^user.+")) // BYO regexp: keep only columns starting with "user"
df.selectExpr(fields: _*)

This of course assumes that df contains your dataframe, loaded with sqlContext.read.parquet().
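Since the question is tagged pyspark, a rough Python equivalent might look like the sketch below (untested here, and "^user.+" is the same placeholder pattern). Note that PySpark's select already accepts column names directly, so the selectExpr workaround is only needed on the Scala side:

import re

# keep only the columns whose names match the pattern -- BYO regexp
fields = [c for c in df.columns if re.match(r"^user.+", c)]
df_qualified = df.select(*fields)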
