3

Inside my data flow pipeline I would like to add a derived column whose data type is array. I would like to split an existing column into chunks of at most 1000 characters without breaking words. I think we can use regexSplit:

regexSplit(<string to split> : string, <regex expression> : string) => array

But I do not know which regular expression to use to split the existing column without breaking words. Please help me figure it out.

2
  • By "split the existing column with 1000 characters," do you mean the total column length is currently 1,000 characters and you want to split it in half, or do you want each array element to be 1,000 characters? Commented Jul 5, 2020 at 22:34
  • Yes @jdaz, I want each array element to be 1000 characters. If the 1000th character falls in the middle of a word, the array element should be shorter than 1000 characters. Commented Jul 6, 2020 at 5:50

2 Answers

1

I created a workaround for this and it works fine for me.

filter(split(regexReplace(regexReplace(text, `[\t\n\r]`, ``), `(.{1,1000})(?:\s|$)`, `$1~~`), '~~'), #item !="")

I suspect there is a better solution than this.
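To see why the workaround splits only at whitespace, its three steps (strip tabs and line breaks, append a `~~` marker after each run of at most N characters that ends at whitespace or at the end of the string, then split on the marker and drop empty pieces) can be mirrored in Python. This is only a sketch for experimenting outside the data flow: `chunk_at_whitespace` and its `limit` parameter are hypothetical names, Python's `re` module stands in for `regexReplace`, and it assumes the text contains no `~~` and no single word longer than the limit.

```python
import re

def chunk_at_whitespace(text, limit=1000):
    """Hypothetical Python mirror of the regexReplace/split/filter workaround."""
    # Step 1: drop tabs and line breaks, as the inner regexReplace does.
    flat = re.sub(r"[\t\n\r]", "", text)
    # Step 2: greedily take up to `limit` characters that are followed by
    # whitespace or the end of the string, and mark the cut with "~~".
    pattern = r"(.{1,%d})(?:\s|$)" % limit
    marked = re.sub(pattern, r"\1~~", flat)
    # Step 3: split on the marker and discard empty pieces.
    return [piece for piece in marked.split("~~") if piece != ""]

chunks = chunk_at_whitespace("the quick brown fox jumps", limit=10)
```

With a limit of 10, every piece stays within the limit and ends on a word boundary. Note that step 1 removes line breaks outright, so words separated only by a line break are joined together.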


0

I wouldn't use a regex for this, but a truncating function like this one, courtesy of TimS:

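// Truncates input at the last space at or before index `length`, then appends an ellipsis.
public static string TruncateAtWord(this string input, int length)
{
    // Null or short inputs are returned unchanged.
    if (input == null || input.Length < length)
        return input;
    // Search backwards from index `length` for the nearest space.
    int iNextSpace = input.LastIndexOf(" ", length, StringComparison.Ordinal);
    // Cut at that space, or hard-cut at `length` if there is no usable space.
    return string.Format("{0}…", input.Substring(0, (iNextSpace > 0) ? iNextSpace : length).Trim());
}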

Translated into expression functions it would look something* like this.

substring(Input, 1, iif(locate(' ', Input, 1000) > 0, locate(' ', Input, 1000), length(Input)))

Since you don't have a lastIndexOf available as an expression function, you would have to fall back on locate, which means that this expression truncates the string at the first space at or after the 1000th character, so the result can be longer than 1000 characters.

*I don't have an environment where I can test this.
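The difference between searching forwards and backwards from the limit can be illustrated in Python, where `str.find` plays the role of `locate` and `str.rfind` plays the role of the missing `lastIndexOf`. This is a sketch with hypothetical function names, not data flow expression code, and it uses 0-based Python indices rather than the 1-based positions of the expression functions.

```python
def truncate_after_limit(text, limit):
    """Cut at the first space at or after index `limit` (forward search)."""
    pos = text.find(" ", limit)
    # No space found: keep the whole string, as the expression does.
    return text if pos == -1 else text[:pos]

def truncate_before_limit(text, limit):
    """Cut at the last space at or before index `limit` (backward search)."""
    if len(text) <= limit:
        return text
    pos = text.rfind(" ", 0, limit + 1)
    # No usable space: fall back to a hard cut at `limit`.
    return text[:pos] if pos > 0 else text[:limit]

sample = "alpha beta gamma delta"
forward = truncate_after_limit(sample, 8)
backward = truncate_before_limit(sample, 8)
```

For this sample the forward search returns a result longer than the limit, while the backward search stays within it.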

9 Comments

regexSplit treats regex matches as delimiters and removes them, so I believe any extra letters matched by \S* would be lost.
@gijswijs we can't write code inside the derived column schema modifier in a Data Factory data flow. The only possibility is to write regular expressions. I tried your regular expression, but it did not work for me.
The only option is something like regexSplit(text, /.{0,50}\S*(?:\s|$)/m)
What about (?<=.{1000})\s? If Azure applies that iteratively, it would work, but if it applies it all at once on the original string it won't.
@SyamKumar Do you have this overload available? learn.microsoft.com/en-us/dotnet/api/… You could use the startAt int to specify the length and then search for the first \s.
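Two of the points raised in these comments can be checked with Python's `re.split`, used here as a stand-in for `regexSplit` on the assumption that both drop the matched text; the data flow function may behave differently. The sketch uses a limit of 10 instead of 1000, and the variable names are hypothetical.

```python
import re

text = "the quick brown fox jumps"

# The pattern consumes whole chunks, so the chunks themselves are the
# delimiters and nothing but empty strings is left between them.
consumed = re.split(r".{0,10}\S*(?:\s|$)", text)

# The lookbehind is evaluated against the original string in one pass, so
# every whitespace with at least 10 characters before it becomes a split point.
lookbehind = re.split(r"(?<=.{10})\s", text)
```

The first call yields only empty strings, and the second yields a first element of 15 characters followed by single words, so neither produces chunks capped at the limit.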