I am trying to find out whether a column is binary or not. If a column contains only 1s and 0s, I flag it as binary; otherwise as non-binary. Since I have a database background, I tried to achieve this with SQL-like statements. But when the file is big, this code performs poorly.
Could you please suggest how I can improve this code:
input_data = spark.read.csv("/tmp/sample.csv", inferSchema=True, header=True)
input_data.createOrReplaceTempView("input_data")
totcount = input_data.count()

from pyspark.sql.types import StructType, StructField, StringType

# column_name holds the column's name, so it must be a string field
profSchema = StructType([StructField("column_name", StringType(), True),
                         StructField("binary_type", StringType(), True)])
fin_df = spark.createDataFrame([], schema=profSchema)  # create empty result df

for colname in input_data.columns:
    # one aggregation query per column: count rows whose value is 0 or 1
    query = ("select '{f_name}' as column_name, "
             "case when sum(case when {f_name} in (1, 0) then 1 else 0 end) = {tot_cnt} "
             "then 'binary' else 'nonbinary' end as binary_type "
             "from input_data").format(f_name=colname, tot_cnt=totcount)
    join_df = spark.sql(query)
    fin_df = fin_df.union(join_df)
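For context on where the time goes: the loop above runs one full scan of the data per column plus a union per result. The classification itself only needs a single pass that tracks, per column, whether any value outside {0, 1} has been seen. That core idea, sketched in plain Python without Spark (the `classify_columns` helper and the sample rows are hypothetical, not from the original code):

```python
def classify_columns(rows):
    """Classify each column as 'binary' (all values 0/1) or 'nonbinary'.

    rows: iterable of dicts mapping column name -> value.
    Single pass over the data, no per-column query and no unions.
    """
    status = {}
    for row in rows:
        for col, val in row.items():
            if val not in (0, 1):
                status[col] = "nonbinary"       # one bad value decides it
            else:
                status.setdefault(col, "binary")  # keep 'nonbinary' if already set
    return status

rows = [{"a": 1, "b": 2}, {"a": 0, "b": 1}]
print(classify_columns(rows))  # {'a': 'binary', 'b': 'nonbinary'}
```

In Spark the same single-pass idea maps to one `input_data.agg(...)` call that builds a conditional-sum expression per column (e.g. with `F.sum(F.when(F.col(c).isin(0, 1), 0).otherwise(1))`), so the file is scanned once instead of once per column.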
