7

I'm trying to change the schema of an existing dataframe to the schema of another dataframe.

DataFrame 1:

Column A | Column B | Column C | Column D
   "a"   |    1     |   2.0    |   300
   "b"   |    2     |   3.0    |   400
   "c"   |    3     |   4.0    |   500

DataFrame 2:

Column K | Column B | Column F
   "c"   |    4     |   5.0
   "b"   |    5     |   6.0
   "f"   |    6     |   7.0

So I want to apply the schema of the first dataframe to the second. All the columns that the two have in common remain, the columns in dataframe 2 that are not in dataframe 1 get deleted, and the columns from dataframe 1 that are missing in dataframe 2 become "NULL".

Output

Column A | Column B | Column C | Column D
 "NULL"  |    4     |   "NULL" |  "NULL"
 "NULL"  |    5     |   "NULL" |  "NULL"
 "NULL"  |    6     |   "NULL" |  "NULL"

So I came up with a possible solution:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

val schema = df1.schema
val newRows: RDD[Row] = df2.map(row => {
  val values = row.schema.fields.map(s => {
    if (schema.fields.contains(s)) {
      row.getAs[Any](s.name).toString
    } else {
      "NULL"
    }
  })
  Row.fromSeq(values)
})
sqlContext.createDataFrame(newRows, schema)

Now, as you can see, this will not work because the schema contains String, Int and Double columns, while all of my row values are Strings.

This is where I'm stuck. Is there a way to automatically convert the type of my values to match the schema?

4 Answers

12

If the schema is flat, I would simply map over the pre-existing schema and select the required columns:

import org.apache.spark.sql.functions.{col, lit}

val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).printSchema

// root
//  |-- A: string (nullable = true)
//  |-- B: integer (nullable = false)
//  |-- C: double (nullable = true)
//  |-- D: integer (nullable = true)
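
For completeness, a minimal end-to-end sketch with toy data (the SparkSession setup and the shortened column names A/B/C/D and K/B/F are assumptions, chosen to match the printed schema above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// In spark-shell a `spark` session already exists; it is built here only to keep the sketch self-contained.
val spark = SparkSession.builder().master("local[*]").appName("schema-align").getOrCreate()
import spark.implicits._

// Toy DataFrames mirroring the question (df1 / df2 are assumed names).
val df1 = Seq(("a", 1, 2.0, 300), ("b", 2, 3.0, 400), ("c", 3, 4.0, 500)).toDF("A", "B", "C", "D")
val df2 = Seq(("c", 4, 5.0), ("b", 5, 6.0), ("f", 6, 7.0)).toDF("K", "B", "F")

// Same expressions as above: keep shared columns, fill the rest with nulls typed from df1's schema.
val exprs = df1.schema.fields.map { f =>
  if (df2.schema.fields.contains(f)) col(f.name)
  else lit(null).cast(f.dataType).alias(f.name)
}

df2.select(exprs: _*).show()
// +----+---+----+----+
// |   A|  B|   C|   D|
// +----+---+----+----+
// |null|  4|null|null|
// |null|  5|null|null|
// |null|  6|null|null|
// +----+---+----+----+

The point of lit(null).cast(f.dataType) is that each missing column inherits its type from df1's schema instead of ending up as a string "NULL".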

2 Comments

Thanks for the reply! I gave it a try and it worked, exactly what I needed! (cannot upvote yet)
How do I do this in PySpark? I have this exact same requirement but in Python
2

Below are simple PySpark steps to achieve the same:

import pyspark.sql.functions as f

df = <dataframe whose schema needs to be copied>
df_tmp = <dataframe with the result but fewer fields>
# Note: field names from df_tmp must match the field names from df

df_tmp_cols = [colmn.lower() for colmn in df_tmp.columns]
for col_dtls in df.dtypes:
    col_name, dtype = col_dtls
    if col_name.lower() in df_tmp_cols:
        df_tmp = df_tmp.withColumn(col_name, f.col(col_name).cast(dtype))
    else:
        df_tmp = df_tmp.withColumn(col_name, f.lit(None).cast(dtype))
df_fin = df_tmp.select(df.columns)  # Final dataframe


1

Working in 2018 (Spark 2.3), reading a .sas7bdat file:

Scala

val sasFile = "file.sas7bdat"
val dfSas = spark.sqlContext.sasFile(sasFile)
val myManualSchema = dfSas.schema //getting the schema from another dataframe
val df = spark.read.format("csv").option("header","true").schema(myManualSchema).load(csvFile)

P.S.: spark.sqlContext.sasFile uses the saurfang library; you could skip that part of the code and get the schema from any other dataframe.
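
To make that last point concrete, a minimal sketch assuming you already have some other DataFrame (referenceDf is a hypothetical name) whose schema you want to reuse:

// Borrow the schema from any existing DataFrame instead of the SAS reader
// (referenceDf and the CSV path are placeholders, not from the original answer).
val myManualSchema = referenceDf.schema

val df = spark.read
  .format("csv")
  .option("header", "true")
  .schema(myManualSchema)  // enforce the borrowed schema instead of inferring one
  .load("file.csv")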


0

You could simply do a LEFT JOIN on your dataframes with a query like this:

SELECT foo.`Column A`, foo.`Column B`, foo.`Column C`, foo.`Column D` FROM foo LEFT JOIN bar ON foo.`Column C` = bar.`Column C`

Please check out the answer by @zero323 in this post:

Spark specify multiple column conditions for dataframe join
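
For reference, a minimal DataFrame-API sketch of such a left join (df1/df2 and "Column B" as the shared key are assumptions; this only helps if the two dataframes actually share key values):

import org.apache.spark.sql.functions.col

// Left-join df2 against df1 on the assumed shared key, then keep only df1's columns.
val joined = df2.join(df1, Seq("Column B"), "left")
val result = joined.select(df1.columns.map(col): _*)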

Thanks, Charles.

1 Comment

Thanks for the reply, I edited my question. It is important to note that the dataframes are not relational.
