
I have a .csv file data.csv stored at location: dbfs:/raw/data/externalTables/emp_data_folder/emp_data.csv

Here is a sample of the data in the file:

Alice,25,50000,North
Bob,30,60000,South
Charlie,35,70000,East
David,40,80000,West
Eve,29,58000,North
Frank,50,90000,South
Grace,28,54000,East
Hannah,32,62000,West
Ian,45,72000,North
Jack,27,56000,South

Using this .csv file, I created an external table in Spark using the following SQL command:

%sql
CREATE TABLE IF NOT EXISTS tablesDbDef.emp_data_f (
    Name STRING,
    Age INTEGER,
    Salary INT,
    Region STRING
)
USING CSV
LOCATION '/raw/data/externalTables/emp_data_folder/'

The table is created successfully, and I can query it without any issues.

Next, I inserted a new record into the table using the following command:

%sql

INSERT INTO tablesDbDef.emp_data_f VALUES ('Mark', 20, 50000, 'South')

The record is inserted successfully and I can see it in SQL query results. My understanding is that when new data is inserted, Spark creates new files (.csv files in this case) for the newly inserted rows. However, when I check the emp_data_folder directory, I don't see any new files for this record. The only files present are the original emp_data.csv and a newly generated _SUCCESS file.

My question is: where is this newly inserted data stored, if not in files? I can see the inserted row in SQL queries, but no file was created for it.

1 Answer

When you create an external table using USING CSV LOCATION '/path', Spark reads data from the file but doesn’t manage the files or modify them when new data is inserted.
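You can confirm how Spark registered the table with DESCRIBE TABLE EXTENDED (a sketch; the table name follows the question):

```sql
-- Show the table's metadata: for an external table, the "Type" row
-- reads EXTERNAL and the "Location" row points at the CSV folder.
DESCRIBE TABLE EXTENDED tablesDbDef.emp_data_f;
```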

When you use INSERT INTO on an external table, Spark stores the new data in its internal metadata (e.g., Hive Metastore), not in the original CSV file.

Spark treats CSV as read-only and doesn’t append records to it. Instead, the new data is stored in Spark's managed storage, allowing it to be queried but not reflected in the CSV.

To write new data back to files, you’ll need to either convert the table to a managed table or write the updated data to a new location.
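Both options can be sketched in Spark SQL (table and path names here are hypothetical, chosen to match the question's naming):

```sql
-- Option 1: copy the table's current contents (original CSV rows plus
-- any inserted rows) into a managed table; emp_data_managed is a
-- hypothetical name.
CREATE TABLE tablesDbDef.emp_data_managed AS
SELECT * FROM tablesDbDef.emp_data_f;

-- Option 2: export the table's current contents as CSV files to a new,
-- hypothetical location, leaving the original folder untouched.
INSERT OVERWRITE DIRECTORY '/raw/data/externalTables/emp_data_export/'
USING CSV
SELECT * FROM tablesDbDef.emp_data_f;
```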


1 Comment

Thank you for your answer. But when we insert data into a partitioned external table, the inserted data is written to disk as files. I created a folder with manual partition subfolders, added data files to each one, created an external table over it, and ran the MSCK REPAIR TABLE command. After that, anything I inserted into the table was written as data files in the respective partition folder. Why is a file created in that case but not when the table is not partitioned?
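The partitioned setup described in the comment can be sketched as follows (a hypothetical example assuming Region is the partition column and the folder layout uses one subfolder per value, e.g. .../Region=North/):

```sql
-- Hypothetical partitioned external table; the partition column is
-- declared in the schema and referenced in PARTITIONED BY.
CREATE TABLE IF NOT EXISTS tablesDbDef.emp_data_part (
    Name STRING,
    Age INT,
    Salary INT,
    Region STRING
)
USING CSV
PARTITIONED BY (Region)
LOCATION '/raw/data/externalTables/emp_data_part_folder/';

-- Register the manually created partition subfolders with the metastore.
MSCK REPAIR TABLE tablesDbDef.emp_data_part;

-- An insert now targets a specific partition directory, so new CSV
-- files appear under the matching Region= subfolder.
INSERT INTO tablesDbDef.emp_data_part VALUES ('Mark', 20, 50000, 'South');
```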
