Extract multiple tabular data from spark logs

Question

I'm trying to extract the tabular data from apache pyspark logs using a perl one-liner.

Below is the sample file log and there are 3 tabular output from the spark output:

24/06/19 01:00:00 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:01 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:02 INFO org.apache.spark.scheduler.DAGScheduler: Registering RDD 0 (text at <stdin>:1)
24/06/19 01:00:03 DEBUG pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: This is a debug message from my PySpark code.
+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+
24/06/19 01:00:04 INFO pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: Processing data in MyPySparkApp.
24/06/19 01:00:05 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on host localhost: Executor lost connection, trying to reconnect.
24/06/19 01:00:07 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:08 INFO org.apache.spark.SparkContext: Stopped SparkContext
+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+

24/06/19 01:00:09 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:09 INFO org.apache.spark.SparkContext: Stopped SparkContext

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+

root
 |-- empno: integer (nullable = true)
 |-- ename: string (nullable = true)
 |-- job: string (nullable = true)
 |-- mgr: integer (nullable = true)
 |-- hiredate: string (nullable = true)
 |-- sal: integer (nullable = true)
 |-- comm: integer (nullable = true)
 |-- deptno: integer (nullable = true)
 
24/06/19 01:00:20 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:21 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:22 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:23 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp2

When there is only one tabular output the below command works:

perl -0777 -ne ' while(m/^\x2b(.+)\x2b$/gsm) { print "$&\n" } ' spark.log # \x2b="+"

but for multiple tabular outputs, it pulls all the text from first occurrence to end of last tabular occurrence. How do I get all the 3 tabular output from my sample log?

Expected output:

Table-1:

+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+

Table-2

+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+

Table-3

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+

Though that wont separate the different tables from each other — TLP
– TLP, Commented Jun 18 at 20:12

ikegami · Accepted Answer · 2025-06-18 22:51:50Z

4

The problem with your attempt is that . matches any characters (including Line Feed characters) when the s modifier is used.

The following will do the trick, and it does it without loading the entire log file into memory:

perl -ne'print if /^[|+]/' spark.log

Same idea, using grep:

grep '^[|+]' spark.log

The following version identifies the individual tables (allowing you to do something at the table level):

perl -gne'
   print "Table ", ++$i, ":\n", $&
      while /
         ^
         ( \+ .* \+\n )
         \| .* \|\n
         \1
         (?: \| .* \|\n )*
         \1
      /xmg;
' spark.log

The same, but using fewer lines:

perl -gne'
   print "Table ", ++$i, ":\n", $&
      while /^(\+.*\+\n)\|.*\|\n\1(?:\|.*\|\n)*\1/mg;
' spark.log

These are stricter as to what is considered a table, so they might work better than the first solution. Also, I'm assuming it's not a problem to read the whole log file into memory. It's not necessary, but it's simpler to write.

(Use -0777 instead of -g if your Perl is too old to support -g.)

edited Jun 18 at 22:51

answered Jun 18 at 21:31

ikegami

391k17 gold badges291 silver badges555 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

TLP Jun 18 at 23:00

You could replace "Table "... with a placeholder that could be used for HTML output

ikegami Jun 18 at 23:06

print "Table ", ++$i, ":\n", $& is a placeholder for whatever you want do with the table ($&).

TLP Jun 18 at 23:35

I guess that is true. OP didnt really define how to finalize his output

ikegami Jun 18 at 23:53

@TLP, For all I know, the first solution is good enough. I just thought the OP might want to separate them or something, so I provided the second solution.

TLP · Accepted Answer · 2025-06-18 20:37:01Z

3

When you slurp a file with -0777 and use .+ (with the /s modifier) to match the space between two anchors, it will select the largest possible string. From the first + in your case, to the last. Because .+ is greedy. Nor does it help to use a minimal match option .+?, because the minimal string is not what you want.

What you could do is print every line that starts with + or |. Though that will not separate the fields in your output. For example:

perl -ne'print if /^[|+]/' spark.log

For your sample that will for me print:

+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+
+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+
+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+

edited Jun 18 at 20:37

answered Jun 18 at 20:17

TLP

68.2k10 gold badges97 silver badges156 bronze badges

10 Comments

stack0114106 Jun 18 at 20:33

The length of the lines for each table is constant across the lines... can some workaround be done to get the required output

TLP Jun 18 at 20:37

@briandfoy That is true. Fixed.

TLP Jun 18 at 20:38

@stack0114106 This gets the required output. Not sure what you are referring to. Length of lines is irrelevant.

sln Jun 18 at 20:39

At some point though, the op may want to separate tables.

TLP Jun 18 at 20:40

@sln It might be possible to check for consecutive lines starting with +

|

mickmackusa · Accepted Answer · 2025-06-26 22:48:58Z

I noticed in the comments along with the previous solution you wanted to add a description to each table so that you can send it via email. I have altered the previous solutions a bit so to include a description. Try this...

$ perl -ne '$s .= $_ if /^[|+]/; END{$s=~s/(\+.*)\n(\+.*|$)/\1\nDescription: <Enter Description Here>\n\n\2/g; print $s;}' spark.log 
+----+----------+-----+---+
|acct|        dt|  amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7|  2|
|ACC2|2010-06-22| 23.4|  2|
|ACC4|2010-06-21| 21.5|  2|
|ACC5|2010-06-23| 34.9|  2|
|ACC6|2010-06-25|100.0|  1|
+----+----------+-----+---+
Description: <Enter Description Here>

+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England   |true |
+----------+-----+
Description: <Enter Description Here>

+-----+------+---------+----+---------+----+----+------+
|empno| ename|      job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH|    CLERK|7902|17-Dec-80| 800|  20|    10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300|    30|
| 7521|  WARD| SALESMAN|7698|22-Feb-81|1250| 500|    30|
| 7566| JONES|  MANAGER|7839| 2-Apr-81|2975|   0|    20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400|    30|
| 7698| BLAKE|  MANAGER|7839| 1-May-81|2850|   0|    30|
| 7782| CLARK|  MANAGER|7839| 9-Jun-81|2450|   0|    10|
| 7788| SCOTT|  ANALYST|7566|19-Apr-87|3000|   0|    20|
| 7839|  KING|PRESIDENT|   0|17-Nov-81|5000|   0|    10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500|   0|    30|
| 7876| ADAMS|    CLERK|7788|23-May-87|1100|   0|    20|
+-----+------+---------+----+---------+----+----+------+
Description: <Enter Description Here>

You could put the output in a file and have a different script parse and fill in the description for you if you wanted this done automatically.

Collectives™ on Stack Overflow

Extract multiple tabular data from spark logs

3 Answers 3

4 Comments

10 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

10 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related