php-mf2
=======

php-mf2 is a pure, generic [microformats-2](http://microformats.org/wiki/microformats-2) parser. It makes HTML as easy to consume as JSON.

Instead of having a hard-coded list of all the different microformats, it follows a set of procedures to handle different property types (e.g. `p-` for plaintext, `u-` for URL, etc). This allows for a very small and maintainable parser.

## Installation

Install php-mf2 with [Composer](http://getcomposer.org) by adding `"mf2/mf2": "0.2.*"` to the `require` object in your `composer.json` and running <kbd>php composer.phar update</kbd>.

You could install it by just downloading `/Mf2/Parser.php` and including that, but please use Composer. Seriously, it’s amazing.

## Usage

mf2 is PSR-0 autoloadable, so all you have to do to load it is:

1. Include Composer’s auto-generated autoload file (`/vendor/autoload.php`)
1. Call `Mf2\parse()` with the HTML (or a DOMDocument), and optionally the URL to resolve relative URLs against.

## Examples

### Parsing implied microformats2

```php
<?php

namespace YourApp;

// Load Composer's autoloader, relative to this file.
require __DIR__ . '/vendor/autoload.php';

use Mf2;

$output = Mf2\parse('<p class="h-card">Barnaby Walters</p>');
```

`$output` is a canonical microformats2 array structure like:

```json
{
	"items": [{
		"type": ["h-card"],
		"properties": {
			"name": ["Barnaby Walters"]
		}
	}],
	"rels": {}
}
```

If no microformats are found, `items` will be an empty array.

Note that, whilst the property prefixes are stripped, the prefix of the `h-*` classname(s) in the "type" array is left on.
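
As a rough sketch of consuming this structure, the snippet below decodes the JSON shown above into PHP arrays with `json_decode()` instead of calling the parser, so that it runs without php-mf2 installed. In real code `$output` would come from `Mf2\parse()`; the variable names here are illustrative only.

```php
<?php

// Stand-in for the array returned by Mf2\parse(): the JSON above, decoded.
$output = json_decode(
    '{"items":[{"type":["h-card"],"properties":{"name":["Barnaby Walters"]}}],"rels":{}}',
    true
);

$names = array();
foreach ($output['items'] as $item) {
    // "type" keeps its h- prefix, so compare against the full classname.
    if (!in_array('h-card', $item['type'], true)) {
        continue;
    }
    // Property names have their prefix stripped, and every value is a list.
    if (isset($item['properties']['name'][0])) {
        $names[] = $item['properties']['name'][0];
    }
}

echo implode(', ', $names), "\n";
```

Because every property value is a list, checking for index `0` before reading it avoids notices when a microformat has no `name`.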

### Parsing a document with relative URLs

Most of the time you’ll be getting your input HTML from a URL. You should pass that URL as the second parameter to `Mf2\parse()` so that any relative URLs in the document can be resolved. For example, say you got the following HTML from `http://example.org/`:

```html
<div class="h-card">
	<h1 class="p-name">Mr. Example</h1>
	<img class="u-photo" alt="" src="photo.png" />
</div>
```

Parsing like this:

```php
$output = Mf2\parse($html, 'http://example.org');
```

will result in the following output, with relative URLs made absolute:

```json
{
	"items": [{
		"type": ["h-card"],
		"properties": {
			"name": ["Mr. Example"],
			"photo": ["http://example.org/photo.png"]
		}
	}],
	"rels": {}
}
```

php-mf2 correctly handles relative URL resolution according to the URI and HTML specs, including correct use of the `<base>` element.

### Parsing `rel` and `rel=alternate` values

php-mf2 also parses any link relations in the document, placing them into two top-level arrays: one for `rel=alternate` and another for all other rel values. For example, given:

```html
<a rel="me" href="https://twitter.com/barnabywalters">Me on twitter</a>
<link rel="alternate etc" href="http://example.com/notes.atom" />
```

parsing will result in the following keys:

```json
{
	"items": [],
	"rels": {
		"me": ["https://twitter.com/barnabywalters"]
	},
	"alternates": [{
		"url": "http://example.com/notes.atom",
		"rel": "etc"
	}]
}
```

Protip: if you’re not bothered about the microformats2 data and just want rels and alternates, you can improve performance by creating a `Mf2\Parser` object (see below) and calling `->parseRelsAndAlternates()` instead of `->parse()`, e.g.

```php
<?php

use Mf2;

$parser = new Mf2\Parser('<link rel="…');
$relsAndAlternates = $parser->parseRelsAndAlternates();
```

### Getting more control by creating a Parser object

The `Mf2\parse()` function covers the most common usage patterns by internally creating an instance of `Mf2\Parser` and returning the output all in one step. For some advanced usage you can also create an instance of `Mf2\Parser` yourself.

The constructor takes the input HTML (or a DOMDocument) and the URL to use as a base URL, plus an optional third argument that enables JSON mode (covered later in this README). Once you have a parser, there are a few other things you can do:

### Selectively parsing a document

There are several ways to selectively parse microformats from a document. If you wish to only parse microformats from an element with a particular ID, `Parser::parseFromId($id)` is the easiest way.

If your needs are more complex, `Parser::parse` accepts an optional context DOMNode as its second parameter. Typically you’d use `Parser::query` to run XPath queries on the document to get the element you want to parse from under, then pass it to `Parser::parse`. Example usage:

```php
$doc = 'More microformats, more microformats <div id="parse-from-here"><span class="h-card">This shows up</span></div> yet more ignored content';
$parser = new Mf2\Parser($doc);

// Returns a document with only the h-card descended from div#parse-from-here
$parser->parseFromId('parse-from-here');

// Parser::query runs an XPath query; take the first matching node
$elementIWant = $parser->query('an xpath query')->item(0);

// Returns a document with only the microformats under the selected element
$parser->parse(true, $elementIWant);
```
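
The string passed to `Parser::query` is an ordinary XPath expression. As a minimal sketch of what such a query selects, the snippet below uses PHP's built-in `DOMDocument` and `DOMXPath` directly instead of php-mf2 (an assumption made so that it runs standalone). The expression and variable names are illustrative, not part of the php-mf2 API.

```php
<?php

$html = '<p>ignored</p><div id="parse-from-here"><span class="h-card">This shows up</span></div>';

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select the element with the given id, wherever it sits in the tree.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//*[@id="parse-from-here"]');

// item(0) returns the first match, or null when nothing matched.
$element = $nodes->item(0);

echo $element->nodeName, "\n";
echo trim($element->textContent), "\n";
```

A node obtained this way is the kind of context element the second parameter of `Parser::parse` is described as accepting, provided it comes from the parser's own document.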

### Generating output for JSON serialization with JSON-mode

Due to a quirk with the way PHP arrays work, there is an edge case ([reported](https://github.com/indieweb/php-mf2/issues/29) by Tom Morris) in which a document with no rel values, when serialised as JSON, results in an empty array (`[]`) as the rels value rather than an empty object (`{}`). Replacing this in code with a stdClass breaks PHP iteration over the values.
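
The quirk can be reproduced with plain `json_encode()`, independent of php-mf2. The sketch below builds a hand-written array shaped like the output for a document with no rel values (an assumption for illustration, not real parser output) and shows how the empty rels value encodes in each representation.

```php
<?php

// Hand-written stand-in for parser output with no rel values.
$output = array('items' => array(), 'rels' => array());

// An empty PHP array has no string keys, so it encodes as a JSON array.
$phpFriendly = json_encode($output);
echo $phpFriendly, "\n"; // {"items":[],"rels":[]}

// A stdClass encodes as a JSON object, matching canonical mf2 JSON...
$forJson = $output;
$forJson['rels'] = new stdClass();
$jsonFriendly = json_encode($forJson);
echo $jsonFriendly, "\n"; // {"items":[],"rels":{}}

// ...but it is no longer an array, so array-based PHP code stops working.
var_dump(is_array($forJson['rels'])); // bool(false)
```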

As of version 0.2.6, the default behaviour is back to being PHP-friendly. If you want to produce results specifically for serialisation as JSON (for example if you run an HTML-to-JSON service, or want to run tests against JSON fixtures), enable JSON mode:

```php
// …by passing true as the third constructor argument:
$jsonParser = new Mf2\Parser($html, $url, true);
```

### Classic Microformats Markup

php-mf2 has some support for parsing classic microformats markup. It’s enabled by default, but can be turned off by calling `Mf2\parse($html, $url, false);` or `$parser->parse(false);` if you’re instantiating a parser yourself.

In previous versions of php-mf2 you could also add your own class mappings — officially this is no longer supported.

* If the built in mappings don’t successfully parse some classic microformats markup then raise an issue and we’ll fix it.
* If you want to screen-scrape websites which don’t use mf2 into mf2 data structures, consider contributing to [php-mf2-shim](https://github.com/indieweb/php-mf2-shim)
* If you *really* need to make one-off changes to the default mappings… It is possible. But you have to figure it out for yourself ;)

## Security

**No filtering of content takes place in `Mf2\Parser`, so treat its output as you would any untrusted data from the source of the parsed document.**

Some tips:

* All content apart from the 'html' key in dictionaries produced by parsing an `e-*` property is not HTML-escaped. For example, `<span class="p-name">&lt;code&gt;</span>` will result in `"name": ["<code>"]`. At the very least, HTML-escape all properties before echoing them out in HTML
* If you’re using the raw HTML content under the 'html' key of dictionaries produced by parsing `e-*` properties, you SHOULD purify the HTML before displaying it to prevent injection of arbitrary code. For PHP I recommend using [HTML Purifier](http://htmlpurifier.org)
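
As a minimal sketch of the first tip, the snippet below escapes a property value with PHP's `htmlspecialchars()` before echoing it into HTML. The `$output` array is hand-written to mirror the `<code>` example above (an assumption, so that it runs without php-mf2); in real code it would be the parser's result.

```php
<?php

// Hand-written stand-in for parsed output whose p-name contains "<code>".
$output = array('items' => array(array(
    'type' => array('h-card'),
    'properties' => array('name' => array('<code>')),
)));

$name = $output['items'][0]['properties']['name'][0];

// Escape &, <, > and both quote styles before placing the value in markup.
$safeName = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo '<p>' . $safeName . '</p>', "\n"; // <p>&lt;code&gt;</p>
```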

TODO: move this section to a security/consumption best practises page on the wiki

## Contributing

Issues and bug reports are very welcome. If you know how to write tests then please do so as code always expresses problems and intent much better than English, and gives me a way of measuring whether or not fixes have actually solved your problem. If you don’t know how to write tests, don’t worry :) Just include as much useful information in the issue as you can.

Pull requests are very welcome. Please try to maintain stylistic, structural and naming consistency with the existing codebase, and don’t be too upset if I make naming changes :)

### How to make a Pull Request

1. Fork the repo to your github account
2. Clone a copy to your computer (simply installing php-mf2 using composer only works for using it, not developing it)
3. Install the dev dependencies with `./composer.phar install`
4. Run PHPUnit with `./vendor/bin/phpunit`
5. Make your changes
6. Add PHPUnit tests for your changes, either in an existing test file if suitable, or a new one
7. Make sure your tests pass (`./vendor/bin/phpunit`)
8. Go to your fork of the repo on github.com and make a pull request, preferably with a short summary, detailed description and references to issues/parsing specs as appropriate
9. Bask in the warm feeling of having contributed to a piece of free software

## Testing

Tests are written with PHPUnit and are contained within `/tests/`. Running <kbd>./vendor/bin/phpunit</kbd> from the root dir will run them all.

There are enough tests to warrant putting them into separate suites for maintenance. They should be fairly self-explanatory.

php-mf2 can also be hooked up to the official, cross-platform [microformats2 test suite](https://github.com/microformats/tests). TODO: write a guide on how to do this, make a public endpoint for people to look at the results

### Changelog

#### v0.2.6

* Added JSON mode as long-term fix for #29
* Fixed bug causing microformats nested under multiple property names to be parsed only once

#### v0.2.5

* Removed conditional replacing empty rel list with stdclass. Original purpose was to make JSON-encoding the output from the parser correct but it also caused Fatal Errors due to trying to treat stdclass as array.

#### v0.2.4

#### v0.2.3

* Made p-* parsing consistent with implied name parsing
* Stopped collapsing whitespace in p-* properties
* Implemented unicodeTrim which removes &nbsp; characters as well as regex \s
* Added support for implied name via abbr[title]
* Prevented excessively nested value-class elements from being parsed incorrectly, removed incorrect separator which was getting added in some cases
* Updated u-* parsing to be spec-compliant, matching [href] before value-class and only attempting URL resolution for URL attributes
* Added support for input[value] parsing
* Tests for all the above

#### v0.2.2

* Made resolveUrl method public, allowing advanced parsers and subclasses to make use of it
* Fixed bug causing multiple duplicate property values to appear

#### v0.2.1

* Fixed bug causing classic microformats property classnames to not be parsed correctly

#### v0.2.0 (BREAKING CHANGES)

* Namespace change from mf2 to Mf2, for PSR-0 compatibility
* `Mf2\parse()` function added to simplify the most common case of just parsing some HTML
* Updated e-* property parsing rules to match mf2 parsing spec: instead of producing inconsistent HTML content, it now produces dictionaries like <pre><code>
{
	"html": "<b>The Content</b>",
	"value": "The Content"
}
</code></pre>
* Removed `htmlSafe` options as new e-* parsing rules make them redundant
* Moved a whole load of static functions out of the class and into standalone functions
* Changed autoloading to always include Parser.php instead of using classmap

#### v0.1.23

* Made some changes to the way back-compatibility with classic microformats are handled, ignoring classic property classnames inside mf2 roots and outside classic roots
* Deprecated ability to add new classmaps, removed twitter classmap. Use [php-mf2-shim](http://github.com/indieweb/php-mf2-shim) instead, it’s better

#### v0.1.22

* Converts classic microformats by default

#### v0.1.21

* Removed webignition dependency, also removing ext-intl dependency. php-mf2 is now a standalone, single file library again
* Replaced webignition URL resolving with custom code passing almost all tests, courtesy of <a class="h-card" href="http://aaronparecki.com">Aaron Parecki</a>

#### v0.1.20

* Added in almost-perfect custom URL resolving code

#### v0.1.19 (2013-06-11)

* Required stable version of webignition/absolute-url-resolver, hopefully resolving versioning problems

#### v0.1.18 (2013-06-05)

* Fixed problems with isElementParsed, causing elements to be incorrectly parsed
* Cleaned up some test files

#### v0.1.17

* Rewrote some PHP 5.4 array syntax which crept into 0.1.16 so php-mf2 still works on PHP 5.3
* Fixed a bug causing weird partial microformats to be added to parent microformats if they had doubly property-nested children
* Finally actually licensed this project under a real license (MIT, in composer.json)
* Suggested barnabywalters/mf-cleaner in composer.json

#### v0.1.16

* Ability to parse from only an ID
* Context DOMElement can be passed to `Parser::parse`
* Parser::query runs XPath queries on the current document
* When parsing e-* properties, elements with @src, @data or @href have relative URLs resolved in the output

#### v0.1.15

* Added html-safe options
* Added rel+rel-alternate parsing