
I have the task of transforming and consolidating millions of single JSON files into big CSV files.

The operation would be very simple using a copy activity and mapping the schemas; I have already tested this. The problem is that a massive number of the files have bad JSON formatting.

I know what the error is, and the fix is very simple too. I figured I could use a Python Databricks activity to fix the string and then pass the output to a copy activity that consolidates the records into a big CSV file.

I have something like this in mind, but I'm not sure if it is the proper way to address this task. I also don't know how to use the output of the Copy activity in the Databricks activity.

2 Answers


It sounds like you want to transform a large number of single JSON files using Azure Data Factory, but as @KamilNowinski said, that is not supported on Azure at the moment. However, since you are already using Azure Databricks, writing a simple Python script to do the same thing is easier. So a workaround is to use the Azure Storage SDK and the pandas Python package directly, in a few steps on Azure Databricks.

  1. Assuming these JSON files are all in a container of Azure Blob Storage, you need to list them in the container via list_blob_names and generate their URLs with a SAS token for the pandas read_json function, as in the code below.

    from azure.storage.blob.baseblobservice import BaseBlobService
    from azure.storage.blob import ContainerPermissions
    from datetime import datetime, timedelta
    
    account_name = '<your account name>'
    account_key = '<your account key>'
    container_name = '<your container name>'
    
    service = BaseBlobService(account_name=account_name, account_key=account_key)
    
    # Generate a read-only SAS token for the container, valid for one hour
    token = service.generate_container_shared_access_signature(
        container_name,
        permission=ContainerPermissions.READ,
        expiry=datetime.utcnow() + timedelta(hours=1))
    
    # List the blob names in the container and build a URL with the SAS token for each
    blob_names = service.list_blob_names(container_name)
    blob_urls_with_token = (
        f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{token}"
        for blob_name in blob_names)
    
    # print(list(blob_urls_with_token))
    
  2. Then you can read these JSON files directly from the blobs via the read_json function to create their pandas DataFrames.

    import pandas as pd
    
    # Each iteration reads one blob's JSON content into its own pandas DataFrame
    for blob_url_with_token in blob_urls_with_token:
        df = pd.read_json(blob_url_with_token)
    

    If you want to merge them into a big CSV file, you can first merge them into a big DataFrame via the pandas functions listed in Combining / joining / merging, such as append (or concat in newer pandas versions); a combined sketch follows after step 3.

  3. To write a DataFrame to a CSV file, it is very easy with the to_csv function. Or you can convert a pandas DataFrame to a PySpark DataFrame on Azure Databricks, as in the code below.

    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    
    # On Databricks a SparkContext already exists, so reuse it instead of creating a new one
    sc = SparkContext.getOrCreate()
    sqlContext = SQLContext(sc)
    
    # Convert the pandas DataFrame to a Spark DataFrame
    spark_df = sqlContext.createDataFrame(df)
    
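Putting steps 2 and 3 together, a minimal sketch of the pandas-only route (assuming the blob URLs from step 1 and that every file parses cleanly; the output path is just an example) could collect the per-blob DataFrames in a list, concatenate them, and write a single CSV:

    import pandas as pd
    
    # Read each blob's JSON into its own DataFrame and collect them in a list
    frames = [pd.read_json(url) for url in blob_urls_with_token]
    
    # Concatenate everything into one big DataFrame and write it out as a single CSV
    big_df = pd.concat(frames, ignore_index=True)
    big_df.to_csv('/dbfs/tmp/consolidated.csv', index=False)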

So whatever you want to do next, it's simple. And if you want to schedule the script as a notebook in Azure Databricks, you can refer to the official document Jobs to run Spark jobs.

Hope it helps.


1 Comment

Thank you! This helps. I won't be able to read the JSON files directly with pandas because a lot of the files have errors. Actually, I already have a Python script that reads the files from blob storage, fixes them if necessary, creates a CSV stream, and puts it back into another blob container. I was running it locally; now I will try it on Databricks. Those files could be pretty big: I've counted as many as 115k records per day of logs, and there are a couple of months to process, so I don't know if it will cause memory problems.
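For reference, a minimal sketch of that fix-and-write-back flow with the same legacy azure-storage SDK used above; fix_json is a hypothetical placeholder for whatever repair the files actually need, and the container names are assumptions:

    from azure.storage.blob import BlockBlobService
    import io
    import pandas as pd
    
    service = BlockBlobService(account_name=account_name, account_key=account_key)
    
    def fix_json(text):
        # Hypothetical placeholder: apply whatever string repair the files actually need
        return text
    
    for blob_name in service.list_blob_names('raw-json'):
        # Download the malformed JSON, repair it, and parse it with pandas
        raw = service.get_blob_to_text('raw-json', blob_name).content
        df = pd.read_json(io.StringIO(fix_json(raw)))
    
        # Write the repaired records as CSV into another container
        csv_text = df.to_csv(index=False)
        service.create_blob_from_text('fixed-csv', blob_name + '.csv', csv_text)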

Copy the JSON files to storage (e.g. Blob) so that you can access the storage from Databricks. Then you can fix the files using Python, and even transform them to the required format, with the cluster running.

So, in the Copy Data activity, copy the files to Blob storage if you don't have them there yet.
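A minimal sketch of what "get access to the storage from Databricks" can look like in a notebook, assuming account-key authentication and placeholder names (spark is the session that Databricks provides):

    # Make the storage account key available to Spark (placeholder names)
    spark.conf.set(
        "fs.azure.account.key.<your account name>.blob.core.windows.net",
        "<your account key>")
    
    # Read all JSON files in the container directly into a Spark DataFrame
    df = spark.read.json(
        "wasbs://<your container name>@<your account name>.blob.core.windows.net/*.json")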

