I'm trying to extract the tabular data from Apache PySpark logs using a Perl one-liner.
Below is a sample log file; it contains three tables in the Spark output:
24/06/19 01:00:00 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:01 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:02 INFO org.apache.spark.scheduler.DAGScheduler: Registering RDD 0 (text at <stdin>:1)
24/06/19 01:00:03 DEBUG pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: This is a debug message from my PySpark code.
+----+----------+-----+---+
|acct| dt| amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7| 2|
|ACC2|2010-06-22| 23.4| 2|
|ACC4|2010-06-21| 21.5| 2|
|ACC5|2010-06-23| 34.9| 2|
|ACC6|2010-06-25|100.0| 1|
+----+----------+-----+---+
24/06/19 01:00:04 INFO pyspark_logging_examples.workloads.sample_logging_job.SampleLoggingJob: Processing data in MyPySparkApp.
24/06/19 01:00:05 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) on host localhost: Executor lost connection, trying to reconnect.
24/06/19 01:00:07 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:08 INFO org.apache.spark.SparkContext: Stopped SparkContext
+----------+-----+
|inc |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England |true |
+----------+-----+
24/06/19 01:00:09 INFO org.apache.spark.scheduler.DAGScheduler: Job 0 finished: collect at <stdin>:1, took 7.0000 s
24/06/19 01:00:09 INFO org.apache.spark.SparkContext: Stopped SparkContext
+-----+------+---------+----+---------+----+----+------+
|empno| ename| job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH| CLERK|7902|17-Dec-80| 800| 20| 10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|22-Feb-81|1250| 500| 30|
| 7566| JONES| MANAGER|7839| 2-Apr-81|2975| 0| 20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839| 1-May-81|2850| 0| 30|
| 7782| CLARK| MANAGER|7839| 9-Jun-81|2450| 0| 10|
| 7788| SCOTT| ANALYST|7566|19-Apr-87|3000| 0| 20|
| 7839| KING|PRESIDENT| 0|17-Nov-81|5000| 0| 10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|23-May-87|1100| 0| 20|
+-----+------+---------+----+---------+----+----+------+
root
|-- empno: integer (nullable = true)
|-- ename: string (nullable = true)
|-- job: string (nullable = true)
|-- mgr: integer (nullable = true)
|-- hiredate: string (nullable = true)
|-- sal: integer (nullable = true)
|-- comm: integer (nullable = true)
|-- deptno: integer (nullable = true)
24/06/19 01:00:20 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:21 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp
24/06/19 01:00:22 INFO org.apache.spark.SparkContext: Running Spark version 3.5.1
24/06/19 01:00:23 INFO org.apache.spark.SparkContext: Submitted application: MyPySparkApp2
When there is only one table in the log, the command below works:
perl -0777 -ne ' while(m/^\x2b(.+)\x2b$/gsm) { print "$&\n" } ' spark.log # \x2b = "+"
but with multiple tables it pulls in everything from the first border line to the end of the last one, because the greedy .+ combined with /s (which lets . match newlines) spans across the log lines between tables. How do I get all three tables from my sample log separately?
Expected output:
Table-1:
+----+----------+-----+---+
|acct| dt| amt| rk|
+----+----------+-----+---+
|ACC3|2010-06-24| 35.7| 2|
|ACC2|2010-06-22| 23.4| 2|
|ACC4|2010-06-21| 21.5| 2|
|ACC5|2010-06-23| 34.9| 2|
|ACC6|2010-06-25|100.0| 1|
+----+----------+-----+---+
Table-2:
+----------+-----+
|inc |check|
+----------+-----+
|Australia |true |
|Bangladesh|false|
|England |true |
+----------+-----+
Table-3:
+-----+------+---------+----+---------+----+----+------+
|empno| ename| job| mgr| hiredate| sal|comm|deptno|
+-----+------+---------+----+---------+----+----+------+
| 7369| SMITH| CLERK|7902|17-Dec-80| 800| 20| 10|
| 7499| ALLEN| SALESMAN|7698|20-Feb-81|1600| 300| 30|
| 7521| WARD| SALESMAN|7698|22-Feb-81|1250| 500| 30|
| 7566| JONES| MANAGER|7839| 2-Apr-81|2975| 0| 20|
| 7654|MARTIN| SALESMAN|7698|28-Sep-81|1250|1400| 30|
| 7698| BLAKE| MANAGER|7839| 1-May-81|2850| 0| 30|
| 7782| CLARK| MANAGER|7839| 9-Jun-81|2450| 0| 10|
| 7788| SCOTT| ANALYST|7566|19-Apr-87|3000| 0| 20|
| 7839| KING|PRESIDENT| 0|17-Nov-81|5000| 0| 10|
| 7844|TURNER| SALESMAN|7698| 8-Sep-81|1500| 0| 30|
| 7876| ADAMS| CLERK|7788|23-May-87|1100| 0| 20|
+-----+------+---------+----+---------+----+----+------+
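As a first rough cut (not the full answer, since it doesn't separate or label the tables), you can simply keep the lines that begin with a border or pipe character. A sketch against a small stand-in log of my own; note the caveat that any other line starting with "|" at column 0 (e.g. printSchema output if its leading space was stripped) would be caught too:

```shell
# A small stand-in for spark.log with one table between log lines:
cat > /tmp/filter_demo.log <<'EOF'
24/06/19 01:00:03 INFO SparkContext: Running Spark version 3.5.1
+----+-----+
|acct|  amt|
+----+-----+
|ACC3| 35.7|
+----+-----+
24/06/19 01:00:09 INFO SparkContext: Stopped SparkContext
EOF

# Print only lines beginning with "+" or "|" (table borders and rows).
# Caveat: all tables run together in one stream, with no Table-N labels.
perl -ne 'print if /^[+|]/' /tmp/filter_demo.log
```

This drops every log line but leaves you to split the surviving lines into separate tables yourself.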
Comments:
- Is the greedy .+ the problem?
- Try: perl -ne'print if /^[|+]/' spark.log
- Try the regex /^\+-.*?-\+\R(?![+|])/sgm - see regex101.com/r/4fwfWs/1
- \+ is more readable than \x2b
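Putting those hints together: slurp the file with -0777, make the quantifier non-greedy, and end each match at the first border line whose newline is not followed by another table line (the (?![+|]) lookahead). A sketch, where the sample log path and the Table-N labels are my own, and which assumes every table ends with a +-...-+ border followed by a newline:

```shell
# A small stand-in for spark.log: two tables separated by log lines.
cat > /tmp/spark_sample.log <<'EOF'
24/06/19 01:00:03 INFO SparkContext: Running Spark version 3.5.1
+----+-----+
|acct|  amt|
+----+-----+
|ACC3| 35.7|
+----+-----+
24/06/19 01:00:04 INFO SampleLoggingJob: Processing data.
+----------+-----+
|inc       |check|
+----------+-----+
|Australia |true |
+----------+-----+
24/06/19 01:00:09 INFO SparkContext: Stopped SparkContext
EOF

# -0777 slurps the whole file; each iteration matches one table:
#   ^\+-        a border line starts the match (/m makes ^ match line starts)
#   .*?         non-greedy, so the match stops as early as possible
#   -\+\R       a border line's trailing "-+" and its newline
#   (?![+|])    ...that is NOT followed by another table line
perl -0777 -ne '
  my $i = 0;
  while (/^\+-.*?-\+\R(?![+|])/gms) {
    print "Table-", ++$i, ":\n$&\n";
  }
' /tmp/spark_sample.log
```

Because the lookahead only succeeds on the final border of each table, interior borders (between header and rows) don't terminate the match, and each table comes out as one labeled block.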