Can I split a column text as array using data factory data flow?

Question

Inside my data flow pipeline I would like to add a derived column whose datatype is array. I would like to split an existing column into chunks of at most 1000 characters without breaking words. I think we can use regexSplit:

regexSplit(<string to split> : string, <regex expression> : string) => array

But I do not know which regular expression to use to split the existing column without breaking words. Please help me figure it out.

By "split the existing column with 1000 characters," do you mean the total column length is currently 1,000 characters and you want to split it in half, or do you want each array element to be 1,000 characters?
– jdaz, Commented Jul 5, 2020 at 22:34
Yes @jdaz, I want each array element to be at most 1000 characters. If the 1000th character falls in the middle of a word, the array element should be shorter than 1000 characters.
– Syam Kumar, Commented Jul 6, 2020 at 5:50

Syam Kumar · Accepted Answer · 2020-07-07 17:23:42Z

1

I created a workaround for this and it works fine for me.

filter(split(regexReplace(regexReplace(text, `[\t\n\r]`, ``), `(.{1,1000})(?:\s|$)`, `$1~~`), '~~'), #item !="")

I think there may be a better solution than this.
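
To see what the nested expression above does stage by stage, here is a rough Python sketch of the same three steps (strip control characters, mark the cut points, split and filter). Python's re module is only a stand-in for the data flow regex engine, which may differ in details, and the names split_at_words and limit are made up for illustration.

```python
import re


def split_at_words(text: str, limit: int = 1000) -> list[str]:
    """Mimic the regexReplace / split / filter workaround (illustrative only)."""
    # Stage 1: drop tabs and line breaks, like the inner regexReplace.
    cleaned = re.sub(r"[\t\n\r]", "", text)
    # Stage 2: greedily take up to `limit` characters that are followed by
    # whitespace or end of input, and replace that whitespace with "~~".
    pattern = r"(.{1," + str(limit) + r"})(?:\s|$)"
    marked = re.sub(pattern, r"\1~~", cleaned)
    # Stage 3: split on the marker and discard empty pieces.
    return [piece for piece in marked.split("~~") if piece != ""]


print(split_at_words("the quick brown fox jumps", limit=10))
# ['the quick', 'brown fox', 'jumps']
```

Two side effects show up in this sketch and are worth checking against real data: a single word longer than the limit is left whole, so that element exceeds the limit, and removing line breaks without substituting a space joins the words on either side of them.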

gijswijs · Accepted Answer · 2020-07-06 07:11:22Z

0

I wouldn't use a regex for this, but a truncating function like this one, courtesy of TimS:

public static string TruncateAtWord(this string input, int length)
{
    if (input == null || input.Length < length)
        return input;
    int iNextSpace = input.LastIndexOf(" ", length, StringComparison.Ordinal);
    return string.Format("{0}…", input.Substring(0, (iNextSpace > 0) ? iNextSpace : length).Trim());
}

Translated into expression functions it would look something* like this.

substring(Input, 1, iif(locate(' ', Input, 1000) > 0, locate(' ', Input, 1000), length(Input)))

Since there is no lastIndexOf available as an expression function, you have to fall back to locate, which searches forward. That means this expression truncates the string at the first space at or after the 1000th character, so the result can be longer than 1000 characters.

*I don't have an environment where I can test this.
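
The difference between the two cut strategies described above can be sketched in Python. This is an illustration only, not tested against Data Factory: both function names are hypothetical, the first mirrors the C# helper without its trailing ellipsis, and the second assumes locate returns a 1-based position (0 when nothing is found) and that substring takes a 1-based start and a character count.

```python
def truncate_at_last_space(text: str, limit: int) -> str:
    """Cut at the last space at or before index `limit` (the C# approach)."""
    if len(text) < limit:
        return text
    # rfind over indices 0..limit matches LastIndexOf(" ", limit).
    cut = text.rfind(" ", 0, limit + 1)
    return text[: cut if cut > 0 else limit].strip()


def truncate_at_next_space(text: str, limit: int) -> str:
    """Cut at the first space at or after the limit-th character (the locate approach)."""
    # Convert Python's 0-based find result to a 1-based position; 0 means "not found".
    position = text.find(" ", limit - 1) + 1
    # Taking `position` characters keeps the space itself, as substring(Input, 1, n) would.
    return text[:position] if position > 0 else text


sample = "alpha beta gamma delta"
print(repr(truncate_at_last_space(sample, 12)))  # 'alpha beta'
print(repr(truncate_at_next_space(sample, 12)))  # 'alpha beta gamma '
```

With a limit of 12, the backward search stays under the limit, while the forward search overshoots it by finishing the current word.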

9 Comments

jdaz Over a year ago

Using regexSplit treats matched regex as delimiters that are removed. So I believe any extra letters matched with \S* would be lost.

Syam Kumar Over a year ago

@gijswijs we can't write code inside the Derived Column schema modifier in a Data Factory data flow. The only possibility is to write regular expressions. I tried your regular expression, but it did not work for me.

Syam Kumar Over a year ago

The only option is something like regexSplit(text, /.{0,50}\S*(?:\s|$)/m)

jdaz Over a year ago

What about (?<=.{1000})\s? If Azure applies that iteratively, it would work, but if it applies it all at once on the original string it won't.

gijswijs Over a year ago

@SyamKumar Do you have this overload available? learn.microsoft.com/en-us/dotnet/api/… You could use the startAt int to specify the length and then search for the first \s.
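
Two points raised in the comments above can be tried out with Python's re.split as a stand-in for regexSplit. This is a sketch under that assumption, since the data flow engine may behave differently, and both function names are hypothetical. The first function applies the lookbehind pattern in a single pass; the second uses a pattern that matches the text itself.

```python
import re


def split_after_prefix(text: str, min_prefix: int) -> list[str]:
    """Split on whitespace that has at least `min_prefix` characters before it."""
    # Applied in one pass, every whitespace past the prefix qualifies,
    # not just one per chunk.
    return re.split(r"(?<=.{" + str(min_prefix) + r"})\s", text)


def split_consuming(text: str, width: int) -> list[str]:
    """Split with a pattern that matches the content, not just a separator."""
    # Everything the pattern matches is treated as a delimiter and dropped.
    return re.split(r".{0," + str(width) + r"}\S*(?:\s|$)", text)


print(split_after_prefix("aa bb cc dd ee", 5))
# ['aa bb', 'cc', 'dd', 'ee']
print(all(piece == "" for piece in split_consuming("aa bb cc dd ee", 5)))
# True
```

In this stand-in, the lookbehind version produces one correct first chunk and then splits at every later space, and the content-matching version leaves only empty strings, which is consistent with the concern that matched text is lost.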
