
I have a .csv file data.csv stored at location: dbfs:/raw/data/externalTables/emp_data_folder/emp_data.csv

Here is a sample of the data in the file:

Alice,25,50000,North
Bob,30,60000,South
Charlie,35,70000,East
David,40,80000,West
Eve,29,58000,North
Frank,50,90000,South
Grace,28,54000,East
Hannah,32,62000,West
Ian,45,72000,North
Jack,27,56000,South

Using this .csv file, I created an external table in Spark using the following SQL command:

%sql
CREATE TABLE IF NOT EXISTS tablesDbDef.emp_data_f (
    Name STRING,
    Age INTEGER,
    Salary INT,
    Region STRING
)
USING CSV
LOCATION '/raw/data/externalTables/emp_data_folder/'

The table is created successfully, and I can query it without any issues.

Next, I inserted a new record into the table using the following command:

%sql

INSERT INTO tablesDbDef.emp_data_f VALUES ('Mark', 20, 50000, 'South')

The record is inserted successfully and I can see it in SQL query results. My understanding is that when new data is inserted, Spark creates new files (.csv files in this case) for the newly inserted rows. However, when I check the emp_data_folder directory, I don't see any new files for this record. The only files present are the original emp_data.csv and a newly generated _SUCCESS file.

My question is: where is this newly inserted data stored, if not in files? I can see the inserted row in SQL queries, but no file was created for it.

1 Answer

When you create an external table using USING CSV LOCATION '/path', Spark reads data from the file but doesn’t manage the files or modify them when new data is inserted.
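You can confirm how Spark registered the table with DESCRIBE TABLE EXTENDED (a sketch; the table name follows the question):

```sql
-- Show the table's metadata: for an external table, the "Type" row
-- reads EXTERNAL and the "Location" row points at the CSV folder.
DESCRIBE TABLE EXTENDED tablesDbDef.emp_data_f;
```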

When you use INSERT INTO on an external table, Spark stores the new data in its internal metadata (e.g., Hive Metastore), not in the original CSV file.

Spark treats CSV as read-only and doesn’t append records to it. Instead, the new data is stored in Spark's managed storage, allowing it to be queried but not reflected in the CSV.

To write new data back to files, you’ll need to either convert the table to a managed table or write the updated data to a new location.
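Both options can be sketched in Spark SQL (table and path names here are hypothetical, chosen to match the question's naming):

```sql
-- Option 1: copy the table's current contents (original CSV rows plus
-- any inserted rows) into a managed table; emp_data_managed is a
-- hypothetical name.
CREATE TABLE tablesDbDef.emp_data_managed AS
SELECT * FROM tablesDbDef.emp_data_f;

-- Option 2: export the table's current contents as CSV files to a new,
-- hypothetical location, leaving the original folder untouched.
INSERT OVERWRITE DIRECTORY '/raw/data/externalTables/emp_data_export/'
USING CSV
SELECT * FROM tablesDbDef.emp_data_f;
```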


1 Comment

Thank you for your answer. But when we insert data into a partitioned external table, the inserted data is written to disk as files. I created a folder with manual partition subfolders, added data files to each one, created an external table over it, and ran the MSCK REPAIR TABLE command. After that, anything I inserted into the table was written as data files in the respective partition folder. Why is a file created in that case but not when the table is not partitioned?
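The partitioned setup described in the comment can be sketched as follows (a hypothetical example assuming Region is the partition column and the folder layout uses one subfolder per value, e.g. .../Region=North/):

```sql
-- Hypothetical partitioned external table; the partition column is
-- declared in the schema and referenced in PARTITIONED BY.
CREATE TABLE IF NOT EXISTS tablesDbDef.emp_data_part (
    Name STRING,
    Age INT,
    Salary INT,
    Region STRING
)
USING CSV
PARTITIONED BY (Region)
LOCATION '/raw/data/externalTables/emp_data_part_folder/';

-- Register the manually created partition subfolders with the metastore.
MSCK REPAIR TABLE tablesDbDef.emp_data_part;

-- An insert now targets a specific partition directory, so new CSV
-- files appear under the matching Region= subfolder.
INSERT INTO tablesDbDef.emp_data_part VALUES ('Mark', 20, 50000, 'South');
```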
