We receive blood test results for clients as HTML files, and I am trying to finish some PHP code that strips and cleans the markup (using string and regex replacements) so that I can assemble multiple files into a spreadsheet. The issue is that the HTML file is not playing ball: the results are laid out as preformatted text rather than a real table. If anyone can help get the (non-table) rows into an array, that would be most awesome.
Supplied HTML code (snippet):
<HR>
<PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>
HAEMOGLOBIN (g/L) 144 g/L 115 - 155
HCT 0.424 0.33 - 0.45
RED CELL COUNT 4.79 x10^12/L 3.95 - 5.15
MCV 88.5 fL 80 - 99
MCH 30.1 pg 27.0 - 33.5
Please note new reference range.
MCHC (g/L) 340 g/L 300 - 350
RDW 13.2 11.5 - 15.0
PLATELET COUNT <FONT Color="red"><B>* 407 x10^9/L 150 - 400</B></FONT>
MPV 9.6 fL 7 - 13
WHITE CELL COUNT 6.16 x10^9/L 3.0 - 10.0
Neutrophils 60.3% 3.71 x10^9/L 2.0 - 7.5
Lymphocytes 29.9% 1.84 x10^9/L 1.2 - 3.65
Monocytes 6.7% 0.41 x10^9/L 0.2 - 1.0
Eosinophils 2.1% 0.13 x10^9/L 0.0 - 0.4
Basophils 1.0% 0.06 x10^9/L 0.0 - 0.1
All cell populations appear normal.
<B><U><FONT COLOR="BLUE">BIOCHEMISTRY</FONT></U></B>
I have used a combination of str_replace, preg_replace and tag stripping to get to an output like this (using var_dump):
22 => string 'HAEMOGLOBIN 160 130' (length=98)
23 => string '170' (length=3)
24 => string 'HCT 0.468 0.37' (length=122)
25 => string '0.50' (length=4)
26 => string 'RED CELL COUNT 4.88 x10^12/L 4.40' (length=104)
27 => string '5.80' (length=4)
28 => string 'MCV 95.9 fL ' (length=117)
29 => string '80' (length=2)
30 => string '99' (length=2)
31 => string 'MCH 32.8 pg 27.0' (length=121)
32 => string '33.5' (length=4)
33 => string ' Please note new reference range.' (length=94)
34 => string 'MCHC 342 300' (length=106)
35 => string '350' (length=3)
36 => string 'RDW 12.4 11.5' (length=123)
37 => string '15.0' (length=4)
38 => string 'PLATELET COUNT 251 x10^9/L 150' (length=105)
39 => string '400' (length=3)
40 => string 'MPV 9.5 fL ' (length=118)
41 => string '7' (length=1)
42 => string '13' (length=2)
43 => string 'WHITE CELL COUNT 3.97 x10^9/L 3.0' (length=103)
My code is not elegant...
$fileString = file_get_contents($fileURL);
$parts = $fileString;
// Remove HTML tags
$parts = preg_replace('/<.*?>/', '', $parts);
// Remove unnecessary delimiters and units
$parts = str_replace('|', '', $parts);
$parts = str_replace('-', '', $parts);
$parts = str_replace('(g/L)', '', $parts);
$parts = str_replace('g/L', '', $parts);
$parts = str_replace(' ', ' ', $parts);
// Split the string on runs of two or more whitespace characters
$parts = preg_split('/\s\s+/', $parts, -1, PREG_SPLIT_NO_EMPTY);
// Collect the trimmed, non-empty parts
$cleanfile = [];
foreach ($parts as $part) {
    $part = trim($part);
    if ($part !== '') {
        $cleanfile[] = $part;
    }
}
var_dump($cleanfile);
I have tried various preg_replace options as well as HTML entity decoding, but cannot get an output that consistently splits the table as required. I am loath to split on string position, as the supplied files seem to change format and my code needs to flex to that.
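One thing that may be worth checking first: var_dump reports length=98 for a string that displays as only 19 characters ('HAEMOGLOBIN 160 130'), which suggests the string contains characters the browser does not show. Whether those are non-breaking spaces, literal &nbsp; entities or something else is a guess on my part. A minimal sketch for making them visible, where revealHidden() is a hypothetical helper name and the sample string is invented:

```php
<?php
// Hypothetical debugging helper: replace characters that are invisible
// in a browser with readable markers. Assumes the input is UTF-8.
function revealHidden(string $s): string
{
    // strtr() with an array replaces each key once, longest key first,
    // and never re-scans text it has already replaced.
    return strtr($s, [
        "\u{00A0}" => '[NBSP]',     // UTF-8 non-breaking space
        '&nbsp;'   => '[&nbsp;]',   // entity that was never decoded
        "\t"       => '[TAB]',
        "\r"       => '[CR]',
        "\n"       => '[LF]',
    ]);
}

// Invented sample, not taken from a real file.
$sample = "HCT\u{00A0}\u{00A0}0.424&nbsp;0.33\n";
$revealed = revealHidden($sample);
echo $revealed, PHP_EOL;
```

Running one of the problem strings through a helper like this should show which separator the split pattern needs to handle.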
[update]
I would like the original HTML code to be split into an array as below:
Currently:
22 => string 'HAEMOGLOBIN 160
130' (length=98)
Ideal array output:
22 => string 'HAEMOGLOBIN' (length...)
23 => string '160' (length...)
24 => string '130' (length...)
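To illustrate the kind of split described above, here is a rough sketch that works line by line instead of on the whole file, so a result row never runs into the next one. It is only one possible approach. The function name parseResultLine(), the field names, and the column spacing in the sample are all assumptions, since the whitespace of the real files is not visible in the snippet. It also assumes UTF-8 input and that every result line ends in a "low - high" range.

```php
<?php
// Hypothetical helper: turn one line of the <PRE> block into named fields,
// or return null for headings and free-text notes.
function parseResultLine(string $line): ?array
{
    // Drop tags such as <FONT> and <B>, decode entities, and turn
    // non-breaking spaces (assumed UTF-8) into ordinary spaces.
    $text = html_entity_decode(strip_tags($line), ENT_QUOTES, 'UTF-8');
    $text = trim(str_replace("\u{00A0}", ' ', $text));

    $num = '\d+(?:\.\d+)?';
    $pattern = '/^(?<name>.+?)\s+(?<flag>\*)?\s*'      // test name, optional "*" flag
             . '(?:(?<percent>' . $num . ')%\s+)?'     // optional percentage column
             . '(?<value>' . $num . ')\s+'             // measured value
             . '(?:(?<unit>\S+)\s+)?'                  // optional unit
             . '(?<low>' . $num . ')\s*-\s*(?<high>' . $num . ')$/';

    if (!preg_match($pattern, $text, $m)) {
        return null;
    }

    return [
        'name'    => $m['name'],
        'flag'    => $m['flag'],
        'percent' => $m['percent'],
        'value'   => $m['value'],
        'unit'    => $m['unit'],
        'low'     => $m['low'],
        'high'    => $m['high'],
    ];
}

// Sample lines modelled on the snippet; the spacing is invented.
$pre = <<<'TXT'
HAEMOGLOBIN (g/L)      144 g/L     115 - 155
HCT                    0.424       0.33 - 0.45
Please note new reference range.
PLATELET COUNT         <FONT Color="red"><B>*  407 x10^9/L  150 - 400</B></FONT>
Neutrophils   60.3%    3.71 x10^9/L    2.0 - 7.5
TXT;

$rows = [];
foreach (preg_split('/\R/', $pre) as $line) {
    $row = parseResultLine($line);
    if ($row !== null) {
        $rows[] = $row;
    }
}
print_r($rows);
```

Because the pattern anchors on the numeric range at the end of the line rather than on character positions, it should tolerate changes in column width. Each row can then be written to the spreadsheet with something like fputcsv(). Lines that do not match, such as section headings and notes, are skipped here but could just as easily be collected separately.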
DOMDocument or simple-php-dom.
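For the DOMDocument route, a minimal sketch might look like the following. It only pulls the text out of each PRE element and splits it into lines; the sample markup is shortened from the snippet above and its spacing is invented. Further splitting of each line into fields would still be needed afterwards.

```php
<?php
// Shortened sample modelled on the supplied file; spacing is assumed.
$html = '<HR><PRE><B><U><FONT COLOR="BLUE">HAEMATOLOGY</FONT></U></B>' . "\n"
      . 'HCT    0.424    0.33 - 0.45' . "\n"
      . '</PRE>';

// Collect parser warnings internally instead of emitting them,
// since old-style markup like this may trigger some.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$lines = [];
// The HTML parser lower-cases element names, so 'pre' matches <PRE>.
foreach ($doc->getElementsByTagName('pre') as $pre) {
    // textContent returns the text with all nested tags removed.
    foreach (preg_split('/\R/', $pre->textContent) as $line) {
        $line = trim($line);
        if ($line !== '') {
            $lines[] = $line;
        }
    }
}
print_r($lines);
```

This replaces the regex-based tag stripping with a real parser, which also decodes entities along the way. If the files contain non-ASCII characters, the encoding passed to loadHTML() may need attention.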