I'm trying to extract the tabular data from Apache PySpark logs using a Perl one-liner.

Below is a sample log file; it contains three tabular outputs from Spark:

24/06/19 01:00:00 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:01 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:02 INFO org.apache.spark.scheduler.DAGScheduler: Registering RDD 0 (text at <stdin>:1)
24/06/19 01:00:03 DEBUG pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: This is a debug message from my PySpark code.
+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+
24/06/19 01:00:04 INFO pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: Processing data in MyPySparkApp.
24/06/19 01:00:05 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on host localhost: Executor lost connection, trying to reconnect.
24/06/19 01:00:07 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:08 INFO org.apache.spark.SparkContext: Stopped SparkContext
+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+

24/06/19 01:00:09 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:09 INFO org.apache.spark.SparkContext: Stopped SparkContext

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+

root
 |-- empno: integer (nullable = true)
 |-- ename: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: integer (nullable = true)
 |-- hiredate: string (nullable = true)
 |-- sal: integer (nullable = true)
 |-- comm: integer (nullable = true)
 |-- deptno: integer (nullable = true)
 
24/06/19 01:00:20 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:21 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:22 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:23 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp2

When there is only one tabular output, the command below works:

perl -0777 -ne ' while(m/^\x2b(.+)\x2b$/gsm) { print "$&\n" } ' spark.log # \x2b="+"

But with multiple tabular outputs, it pulls all the text from the start of the first table to the end of the last one. How do I get all three tabular outputs from my sample log?

Expected output:

Table-1:

+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+

Table-2:

+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+

Table-3:

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+
  • .+ = problem ? Commented Jun 18 at 20:06
  • perl -ne'print if /^[|+]/' spark.log Commented Jun 18 at 20:11
  • Though that won't separate the different tables from each other. Commented Jun 18 at 20:12
  • /^\+-.*?-\+\R(?![+|])/sgm regex101.com/r/4fwfWs/1 Commented Jun 18 at 20:18
  • Tip: \+ is more readable than \x2b Commented Jun 18 at 23:05

3 Answers

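The problem with your attempt is that . matches any character, including line feeds, when the s modifier is used. Because .+ is also greedy, the match runs from the first + in the file to the last one, swallowing the log lines between the tables.
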
The following will do the trick, and it does it without loading the entire log file into memory:

perl -ne'print if /^[|+]/' spark.log 

Same idea, using grep:

grep '^[|+]' spark.log

The following version identifies the individual tables (allowing you to do something at the table level):

perl -gne'
   print "Table ", ++$i, ":\n", $&
      while /
         ^
         ( \+ .* \+\n )
         \| .* \|\n
         \1
         (?: \| .* \|\n )*
         \1
      /xmg;
' spark.log

The same, but using fewer lines:

perl -gne'
   print "Table ", ++$i, ":\n", $&
      while /^(\+.*\+\n)\|.*\|\n\1(?:\|.*\|\n)*\1/mg;
' spark.log

These are stricter as to what is considered a table, so they might work better than the first solution. Also, I'm assuming it's not a problem to read the whole log file into memory. It's not necessary, but it's simpler to write.

(Use -0777 instead of -g if your Perl is too old to support -g; -g was added in Perl 5.36 as an alias for -0777.)

4 Comments

You could replace "Table "... with a placeholder that could be used for HTML output.
print "Table ", ++$i, ":\n", $& is a placeholder for whatever you want to do with the table ($&).
I guess that is true. The OP didn't really define how to finalize his output.
@TLP, For all I know, the first solution is good enough. I just thought the OP might want to separate them or something, so I provided the second solution.

When you slurp a file with -0777 and use .+ (with the /s modifier) to match the text between two anchors, it selects the largest possible string: in your case, from the first + to the last. That is because .+ is greedy. Switching to the minimal match .+? does not help either, because the minimal string is not what you want.

What you could do is print every line that starts with + or |, though that will not separate the tables from each other in your output. For example:

perl -ne'print if /^[|+]/' spark.log 

For your sample, that prints:

+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+
+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+
+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+

10 Comments

The length of the lines is constant within each table... can some workaround be used to get the required output?
@briandfoy That is true. Fixed.
@stack0114106 This gets the required output. Not sure what you are referring to. Length of lines is irrelevant.
At some point though, the OP may want to separate tables.
@sln It might be possible to check for consecutive lines starting with +

I noticed in the comments on the previous solution that you wanted to add a description to each table so that you can send it via email. I have altered the previous solution slightly to include a description. Try this:

$ perl -ne '$s .= $_ if /^[|+]/; END{$s=~s/(\+.*)\n(\+.*|$)/$1\nDescription: <Enter Description Here>\n\n$2/g; print $s;}' spark.log
+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+
Description: <Enter Description Here>

+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+
Description: <Enter Description Here>

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+
Description: <Enter Description Here>

You could put the output in a file and have a different script parse and fill in the description for you if you wanted this done automatically.