While working on a BigQuery user-defined function (UDF) to extract emails from messy data sets, I ran into an issue: ARRAY_AGG() is not allowed in the body of a temp UDF.

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
    ARRAY_AGG(
        DISTINCT 
        (SELECT * FROM 
            UNNEST(
                SPLIT(
                    REPLACE(
                        LOWER(
                            ARRAY_TO_STRING(emails, ",")
                        )," ", ""
                    )
                )
            ) AS e where e like '%@%'
        ) IGNORE NULLS
    )[SAFE_OFFSET(index)]
);

SELECT GET_EMAIL(["[email protected],[email protected]", "12345", "[email protected]"],1) as email_1

I tried to bypass ARRAY_AGG by selecting from UNNEST with WITH OFFSET and then filtering in the WHERE clause so that the offset equals the index.

However, that runs into a column limitation (a scalar subquery cannot have more than one column in its SELECT list), and the error suggests using SELECT AS STRUCT instead.

So I tried SELECT AS STRUCT:

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS (
   
    (SELECT AS STRUCT DISTINCT list.e, list.o FROM 
        UNNEST(
            SPLIT(
                REPLACE(
                    LOWER(
                        ARRAY_TO_STRING(emails, ", ")
                    )," ", ""
                )
            )
        ) AS list
        WITH OFFSET as list.o
        WHERE list.e like '%@%' AND list.o = index)
);

SELECT GET_EMAIL(["[email protected],[email protected]", "12345", "[email protected]"],1) as email_1

But it rejects the DISTINCT, and even with DISTINCT removed it fails to parse e and o.

So I'm out of ideas here; I've probably tied myself in a knot. Can anyone suggest how to do this inside a UDF? Thanks.

1 Answer

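The version below works. An aggregate function such as ARRAY_AGG cannot be used directly as the expression of a SQL UDF body, so the logic goes inside a scalar subquery (hence the double parentheses), and an ARRAY(subquery) expression takes the place of the aggregate: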

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
    SELECT ARRAY(
        SELECT * 
          FROM UNNEST(
                SPLIT(
                    REPLACE(
                        LOWER(
                            ARRAY_TO_STRING(emails, ",")
                        )," ", ""
                    )
                )
            ) AS e WHERE e LIKE '%@%'
    )[SAFE_OFFSET(index)]
));
SELECT GET_EMAIL(["[email protected],[email protected]", "12345", "[email protected]"], 1) AS email_1

with the result:

Row email_1  
1   [email protected]   

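Or the version below, which is just a slight correction of your original query: ARRAY_AGG now aggregates over a SELECT from the UNNEST instead of wrapping a subquery (DISTINCT and IGNORE NULLS are left out here):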

CREATE TEMP FUNCTION GET_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
  SELECT ARRAY_AGG(e)[SAFE_OFFSET(index)] 
  FROM UNNEST(
        SPLIT(
            REPLACE(
                LOWER(
                    ARRAY_TO_STRING(emails, ",")
                )," ", ""
            )
        )
    ) AS e WHERE e LIKE '%@%'
));
SELECT GET_EMAIL(["[email protected],[email protected]", "12345", "[email protected]"], 1) AS email_1     

This returns the same result.
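
Neither version above removes duplicates, which the DISTINCT in the question was aiming for. If that matters, one possible sketch (not part of the tested answer) is to group by the email and order by its first position, using WITH OFFSET with plain aliases. The function name GET_DISTINCT_EMAIL and the example.com addresses are made up for illustration:

```sql
-- Hypothetical variant: de-duplicate while keeping first-seen order.
CREATE TEMP FUNCTION GET_DISTINCT_EMAIL(emails ARRAY<STRING>, index INT64) AS ((
  SELECT ARRAY(
    SELECT e
    FROM UNNEST(
      SPLIT(REPLACE(LOWER(ARRAY_TO_STRING(emails, ",")), " ", ""))
    ) AS e WITH OFFSET AS o      -- aliases must be simple names, not list.e / list.o
    WHERE e LIKE '%@%'
    GROUP BY e                   -- one row per distinct email
    ORDER BY MIN(o)              -- keep the order of first appearance
  )[SAFE_OFFSET(index)]
));

-- Illustrative input only; the addresses are placeholders.
SELECT GET_DISTINCT_EMAIL(
  ["A@example.com, b@example.com", "12345", "a@example.com", "c@example.com"], 2
) AS email_2
```

With this placeholder input, the cleaned list is a@example.com, b@example.com, 12345, a@example.com, c@example.com, so index 2 should give c@example.com after de-duplication, whereas the versions above would give the repeated a@example.com. Grouping with ORDER BY MIN(o) is used instead of SELECT DISTINCT because DISTINCT on its own does not guarantee the order of the resulting array.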
