bakame/html-table
is a small PHP package that allows you to parse, import and manipualte
tabular data represented as HTML Table. Once installed you will be able to do the following:
use Bakame\HtmlTable\Parser;
$table = Parser::new()
->tableHeader(['rank', 'move', 'team', 'player', 'won', 'drawn', 'lost', 'for', 'against', 'gd', 'points'])
->parseFile('https://www.bbc.com/sport/football/tables');
$table
->filter(fn (array $row) => (int) $row['points'] >= 10)
->sorted(fn (array $rowA, array $rowB) => (int) $rowB['for'] <=> (int) $rowA['for'])
->fetchPairs('team', 'for');
// returns
// [
// "Brighton" => "15"
// "Man City" => "14"
// "Tottenham" => "13"
// "Liverpool" => "12"
// "West Ham" => "10"
// "Arsenal" => "9"
// ]
league\csv >= 9.11.0 library is required.
Use composer:
composer require bakame/html-table
The Parser
can convert a file (a PHP stream or a Path with an optional context like fopen
)
or an HTML document into a League\Csv\TabularData
implementing object. Once converted you
can use all the methods and feature made available by the interface (see ResultSet)
for more information.
The Parser
itself is immutable, whenever you change a configuration option a new instance is returned.
The Parser
constructor is private to instantiate the object you are required to use the new
method instead
use Bakame\HtmlTable\Parser;
$parser = Parser::new()
->ignoreTableHeader()
->ignoreXmlErrors()
->withoutFormatter()
->tableCaption('This is a beautiful table');
To extract and parse your table use either the parseHtml
or parseFile
methods.
If parsing is not possible a ParseError
exception will be thrown.
use Bakame\HtmlTable\Parser;
$parser = Parser::new();
$table = $parser->parseHtml('<table>...</table>');
$table = $parser->parseFile('path/to/html/file.html');
parseHtml
parses an HTML page represented by:
- a
string
, - a
Stringable
object, - a
DOMDocument
, - a
DOMElement
, - or a
SimpleXMLElement
whereas parseFile
works with:
- a filepath,
- or a PHP readable stream.
Both methods return a Table
instance which implements the League\Csv\TabularDataReader
interface and also give access to the table caption if present via the getCaption
method.
use Bakame\HtmlTable\Parser;
$html = <<<HTML
<div>
<table>
<caption>Songs</caption>
<thead>
<tr>
<th>Title</th>
<th>Singer</th>
<th>Country</th>
</tr>
</thead>
<tbody>
<tr>
<td>Nakei Nairobi</td>
<td>Mbilia Bel</td>
<td rowspan="3">DR Congo</td>
</tr>
<tr>
<td>Muvaro</td>
<td>Zaiko Langa Langa</td>
</tr>
<tr>
<td>Nzinzi</td>
<td>Emeneya</td>
</tr>
</tbody>
</table>
</div>
HTML;
$table = Parser::new()->parseHtml($html);
$table->getCaption(); //returns 'Songs'
$table->getHeader(); //returns ['Title','Singer', 'Country']
$table->nth(2); //returns ["Title" => "Nzinzi", "Singer" => "Emeneya", "Country" => "DR Congo"]
json_encode($table->slice(0, 1));
//{"caption":"Songs","header":["Title","Singer","Country"],"rows":[{"Title":"Nakei Nairobi","Singer":"Mbilia Bel","Country":"DR Congo"}]}
By default, when calling the Parser::new()
named constructor the parser will:
- try to parse the first table found in the page
- expect the table header row to be the first
tr
found in thethead
section of your table - exclude the table
thead
section when extracting the table content. - ignore XML errors.
- have no formatter attached.
- have no default caption to used if none is present in the table.
Each of the following settings can be changed to improve the conversion against your business rules:
Selecting the table to parse in the HTML page can be done using two (2) methods
Parser::tablePosition
and Parser::tableXpathPosition
If you know the table position in the page in relation with its integer offset or if
you know it's id
attribute value you should use Parser::tablePosition
otherwise
favor Parser::tableXpathPosition
which expects an xpath
expression.
If the expression is valid, and a list of table is found, the first result will be returned.
use Bakame\HtmlTable\Parser;
$parser = Parser::new()->tablePosition('table-id'); // parses the <table id='table-id'>
$parser = Parser::new()->tablePosition(3); // parses the 4th table of the page
$parser = Parser::new()->tableXPathPosition("//main/div/table");
//parse the first table that matches the xpath expression
Parser::tableXpathPosition
and Parser::tablePosition
override each other. It is
recommended to use one or the other but not both at the same time.
You can optionally define a caption for your table if none is present or found during parsing.
use Bakame\HtmlTable\Parser;
$parser = Parser::new()->tableCaption('this is a generated caption');
$parser = Parser::new()->tableCaption(null); // remove any default caption set
The following settings configure the Parser
in relation to the table header. By default,
the parser will try to parse the first tr
tag found in the thead
section of the table.
But you can override this behaviour using one of these settings:
Tells where to locate and resolve the table header
use Bakame\HtmlTable\Parser;
use Bakame\HtmlTable\Section;
$parser = Parser::new()->tableHeaderPosition(Section::Thead, 3);
// header is the 4th row in the <thead> table section
The method uses the Bakame\HtmlTable\Section
enum to designate which table section to use
to resolve the header
use Bakame\HtmlTable\Section;
enum Section
{
case thead;
case tbody;
case tfoot;
case tr;
}
If Section::tr
is used, tr
tags will be used independently of their section.
The second argument is the table header tr
offset; it defaults to 0
(ie: the first row).
Instructs the parser to resolve or not the table header using tableHeaderPosition
configuration.
If no resolution is done, no header will be included in the returned Table
instance.
use Bakame\HtmlTable\Parser;
$parser = Parser::new()->ignoreTableHeader(); // no table header will be resolved
$parser = Parser::new()->resolveTableHeader(); // will attempt to resolve the table header
You can specify directly the header of your table and override any other table header related configuration with this configuration
use Bakame\HtmlTable\Parser;
use Bakame\HtmlTable\Section;
$parser = Parser::new()->tableHeader(['rank', 'team', 'winner']);
If you specify a non-empty array as the table header, it will take precedence over any other table header related options.
Because it is a tabular data each cell MUST be unique otherwise an exception will be thrown
You can skip or re-arrange the source columns by skipping them by their offsets and/or by re-ordering the offsets.
use Bakame\HtmlTable\Parser;
use Bakame\HtmlTable\Section;
$parser = Parser::new()->tableHeader([3 => 'rank', 7 => 'winner', 5 => 'team']);
// only 3 column will be extracted the 4th, 6th and 8th columns
// and re-arrange as 'rank' first and 'team' last
// if a column is missing its value will be PHP `null` type
Tells which section should be parsed based on the Section
enum
use Bakame\HtmlTable\Parser;
use Bakame\HtmlTable\Section;
$parser = Parser::new()->includeSection(Section::Tbody); // thead and tfoot are included during parsing
$parser = Parser::new()->excludeSection(Section::Tr, Section::Tfoot); // table direct tr children and tfoot are not included during parsing
By default, the thead
section is not parse. If a thead
row is selected to be the header, it will
be parsed independently of this setting.
- Parser::new()->includeSection(Section::tbody);
+ Parser::new()->excludeSection(...Section::cases())->includeSection(Section::tbody);
The first call will still include the tfoot
and the tr
sections, whereas the second call
remove any previous setting guaranting that only the tbody
if present will be parsed.
Adds or remove a record formatter applied to the data extracted from the table before you can access it. The header is not affected by the formatter if it is defined.
use Bakame\HtmlTable\Parser;
$parser = Parser::new()->withFormatter($formatter); // attach a formatter to the parser
$parser = Parser::new()->withoutFormatter(); // removed the attached formatter if it exists
The formatter closure signature should be:
function (array $record): array;
If a header was defined or specified, the submitted record will have the header definition set, otherwise an array list is provided.
The following formatter will work on any table content as long as it is defined as a string.
$formatter = fn (array $record): array => array_map(strtolower(...), $record);
// the following formatter will convert all the fields from your table to lowercase.
The following formatter will only work if the table has a header attached to it with
a column named count
.
$formatter = function (array $record): array {
$record['count'] = (int) $record['count'];
return $record;
}
// the following formatter will convert the data of all count column into integer..
Tells whether the parser should ignore or throw in case of malformed HTML content.
use Bakame\HtmlTable\Parser;
$parser = Parser::new()->ignoreXmlErrors(); // ignore the XML errors
$parser = Parser::new()->failOnXmlErrors(3); // throw on XML errors
The library:
- has a PHPUnit test suite
- has a coding style compliance test suite using PHP CS Fixer.
- has a code analysis compliance test suite using PHPStan.
To run the tests, run the following command from the project folder.
composer test
If you discover any security related issues, please email [email protected] instead of using the issue tracker.
The MIT License (MIT). Please see License File for more information.