I am trying to find out whether a column is binary or not. If a column contains only 1s and 0s, I flag it as binary; otherwise as non-binary. Since I have a database background, I tried to achieve this with SQL-like statements. But when the file is big, this code performs poorly.
Could you please suggest how I can improve this code:
input_data = spark.read.csv("/tmp/sample.csv", inferSchema=True, header=True)
input_data.createOrReplaceTempView("input_data")
totcount = input_data.count()

from pyspark.sql.types import StructType, StructField, StringType

# column_name holds the column's name, so it must be a string field
profSchema = StructType([StructField("column_name", StringType(), True),
                         StructField("binary_type", StringType(), True)])
fin_df = spark.createDataFrame([], schema=profSchema)  # create empty result df

for colname in input_data.columns:
    # one aggregation query per column: count rows whose value is 0 or 1
    query = ("select '{f_name}' as column_name, "
             "case when sum(case when {f_name} in (1, 0) then 1 else 0 end) = {tot_cnt} "
             "then 'binary' else 'nonbinary' end as binary_type "
             "from input_data").format(f_name=colname, tot_cnt=totcount)
    join_df = spark.sql(query)
    fin_df = fin_df.union(join_df)
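For context on where the time goes: the loop above runs one full scan of the data per column plus a union per result. The classification itself only needs a single pass that tracks, per column, whether any value outside {0, 1} has been seen. That core idea, sketched in plain Python without Spark (the `classify_columns` helper and the sample rows are hypothetical, not from the original code):

```python
def classify_columns(rows):
    """Classify each column as 'binary' (all values 0/1) or 'nonbinary'.

    rows: iterable of dicts mapping column name -> value.
    Single pass over the data, no per-column query and no unions.
    """
    status = {}
    for row in rows:
        for col, val in row.items():
            if val not in (0, 1):
                status[col] = "nonbinary"       # one bad value decides it
            else:
                status.setdefault(col, "binary")  # keep 'nonbinary' if already set
    return status

rows = [{"a": 1, "b": 2}, {"a": 0, "b": 1}]
print(classify_columns(rows))  # {'a': 'binary', 'b': 'nonbinary'}
```

In Spark the same single-pass idea maps to one `input_data.agg(...)` call that builds a conditional-sum expression per column (e.g. with `F.sum(F.when(F.col(c).isin(0, 1), 0).otherwise(1))`), so the file is scanned once instead of once per column.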
