
Recently, I have been looking for a way to query CSV data files in blob storage with Python.

I currently have a naive solution that pulls down the file and then filters out unneeded rows, but I recently came across this Azure GitHub example repo as well as this Stack Overflow question, which references these official Azure SDK docs.

This looks like what I want. My questions are:

  1. What SQL variant is it using? And where can I find its docs and supported features?
  2. Does it support querying CSV files (or text files in general) line by line? I would like to be able to write a query like select line from file where line contains {search_string}. Does this blob query have that capability?

1 Answer


What SQL variant is it using? And where can I find its docs and supported features?

To query data stored in blobs, the query_blob function in the Azure Storage client library for Python uses SQL-like syntax. Although the syntax is similar to SQL, it is not a complete SQL implementation.

See this MS-DOCS page for details on the syntax and supported features.

Does it support querying CSV files (or text files in general) line by line? I would like to be able to write a query like select line from file where line contains {search_string}. Does this blob query have that capability?

As for your second question: as far as I can tell, query_blob does not let you query CSV files (or text files in general) line by line. Instead, it lets you query the blob data as a single block of text.

If you want to query CSV data line by line, you must first read the file and then apply your search criteria to each line.

In my environment, I have stored a CSV file with sample data in Azure Blob Storage.

The code below reads the CSV file line by line and looks for lines that contain a specified string.

Code:

from azure.storage.blob import BlobServiceClient
import pandas as pd

blob_service_client = BlobServiceClient.from_connection_string("Your-storage-connection-string")
blob_client = blob_service_client.get_blob_client(container="test", blob="sample.csv")

# Download the blob as text and split it into lines
search_string = "44.95"
lines = blob_client.download_blob().content_as_text().splitlines()

# The first line holds the column names
# (a plain split(',') does not handle commas inside quoted fields)
columns = lines[0].split(',')

# Keep every data line that contains the search string, not only the last match
data = [line.split(',') for line in lines[1:] if search_string in line]
df = pd.DataFrame(data, columns=columns)

print(df)

Output:

                   Title         Author     Genre  Price PublishDate                                  Description
0  XML Developer's Guide  "Gambardella"  Computer  44.95  2000-10-01  "An in-depth look at creating applications 
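
Splitting each line on commas, as the code above does, breaks when a quoted field itself contains a comma or a line break. As a hedged alternative, here is a minimal sketch that parses the downloaded text with Python's standard csv module instead. The helper name filter_csv_rows and the sample data are hypothetical and only for illustration; unlike the code above, it matches the search string against each parsed field rather than the raw line.

```python
import csv
import io


def filter_csv_rows(text, search_string):
    """Return (header, matching_rows) for CSV text already downloaded from a blob."""
    reader = csv.reader(io.StringIO(text))
    # The first record is treated as the header; empty input yields an empty header
    header = next(reader, [])
    # Keep a row when any of its fields contains the search string
    rows = [row for row in reader if any(search_string in field for field in row)]
    return header, rows


# Hypothetical sample standing in for the downloaded blob content
sample = (
    "Title,Author,Price\n"
    '"Guide, Second Edition",Smith,44.95\n'
    "Other Book,Jones,5.95\n"
)
header, rows = filter_csv_rows(sample, "44.95")
print(header)
print(rows)
```

To use it with the blob from the code above, pass the result of blob_client.download_blob().content_as_text() as text, then build the DataFrame with pd.DataFrame(rows, columns=header).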


1 Comment

Thank you for providing clarification and docs.
