You can not select more than 25 topics
Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
229 lines
8.3 KiB
229 lines
8.3 KiB
1 month ago
|
# sax js
|
||
|
|
||
|
A sax-style parser for XML and HTML.
|
||
|
|
||
|
Designed with [node](http://nodejs.org/) in mind, but should work fine in
|
||
|
the browser or other CommonJS implementations.
|
||
|
|
||
|
## What This Is
|
||
|
|
||
|
* A very simple tool to parse through an XML string.
|
||
|
* A stepping stone to a streaming HTML parser.
|
||
|
* A handy way to deal with RSS and other mostly-ok-but-kinda-broken XML
|
||
|
docs.
|
||
|
|
||
|
## What This Is (probably) Not
|
||
|
|
||
|
* An HTML Parser - That's a fine goal, but this isn't it. It's just
|
||
|
XML.
|
||
|
* A DOM Builder - You can use it to build an object model out of XML,
|
||
|
but it doesn't do that out of the box.
|
||
|
* XSLT - No DOM = no querying.
|
||
|
* 100% Compliant with (some other SAX implementation) - Most SAX
|
||
|
implementations are in Java and do a lot more than this does.
|
||
|
* An XML Validator - It does a little validation when in strict mode, but
|
||
|
not much.
|
||
|
* A Schema-Aware XSD Thing - Schemas are an exercise in fetishistic
|
||
|
masochism.
|
||
|
* A DTD-aware Thing - Fetching DTDs is a much bigger job.
|
||
|
|
||
|
## Regarding `<!DOCTYPE`s and `<!ENTITY`s
|
||
|
|
||
|
The parser will handle the basic XML entities in text nodes and attribute
|
||
|
values: `& < > ' "`. It's possible to define additional
|
||
|
entities in XML by putting them in the DTD. This parser doesn't do anything
|
||
|
with that. If you want to listen to the `ondoctype` event, and then fetch
|
||
|
the doctypes, and read the entities and add them to `parser.ENTITIES`, then
|
||
|
be my guest.
|
||
|
|
||
|
Unknown entities will fail in strict mode, and in loose mode, will pass
|
||
|
through unmolested.
|
||
|
|
||
|
## Usage
|
||
|
|
||
|
```javascript
|
||
|
var sax = require("./lib/sax"),
|
||
|
strict = true, // set to false for html-mode
|
||
|
parser = sax.parser(strict);
|
||
|
|
||
|
parser.onerror = function (e) {
|
||
|
// an error happened.
|
||
|
};
|
||
|
parser.ontext = function (t) {
|
||
|
// got some text. t is the string of text.
|
||
|
};
|
||
|
parser.onopentag = function (node) {
|
||
|
// opened a tag. node has "name" and "attributes"
|
||
|
};
|
||
|
parser.onattribute = function (attr) {
|
||
|
// an attribute. attr has "name" and "value"
|
||
|
};
|
||
|
parser.onend = function () {
|
||
|
// parser stream is done, and ready to have more stuff written to it.
|
||
|
};
|
||
|
|
||
|
parser.write('<xml>Hello, <who name="world">world</who>!</xml>').close();
|
||
|
|
||
|
// stream usage
|
||
|
// takes the same options as the parser
|
||
|
var saxStream = require("sax").createStream(strict, options)
|
||
|
saxStream.on("error", function (e) {
|
||
|
// unhandled errors will throw, since this is a proper node
|
||
|
// event emitter.
|
||
|
console.error("error!", e)
|
||
|
// clear the error
|
||
|
this._parser.error = null
|
||
|
this._parser.resume()
|
||
|
})
|
||
|
saxStream.on("opentag", function (node) {
|
||
|
// same object as above
|
||
|
})
|
||
|
// pipe is supported, and it's readable/writable
|
||
|
// same chunks coming in also go out.
|
||
|
fs.createReadStream("file.xml")
|
||
|
.pipe(saxStream)
|
||
|
.pipe(fs.createWriteStream("file-copy.xml"))
|
||
|
```
|
||
|
|
||
|
|
||
|
## Arguments
|
||
|
|
||
|
Pass the following arguments to the parser function. All are optional.
|
||
|
|
||
|
`strict` - Boolean. Whether or not to be a jerk. Default: `false`.
|
||
|
|
||
|
`opt` - Object bag of settings regarding string formatting. All default to `false`.
|
||
|
|
||
|
Settings supported:
|
||
|
|
||
|
* `trim` - Boolean. Whether or not to trim text and comment nodes.
|
||
|
* `normalize` - Boolean. If true, then turn any whitespace into a single
|
||
|
space.
|
||
|
* `lowercase` - Boolean. If true, then lowercase tag names and attribute names
|
||
|
in loose mode, rather than uppercasing them.
|
||
|
* `xmlns` - Boolean. If true, then namespaces are supported.
|
||
|
* `position` - Boolean. If false, then don't track line/col/position.
|
||
|
* `strictEntities` - Boolean. If true, only parse [predefined XML
|
||
|
entities](http://www.w3.org/TR/REC-xml/#sec-predefined-ent)
|
||
|
(`&`, `'`, `>`, `<`, and `"`)
|
||
|
* `unquotedAttributeValues` - Boolean. If true, then unquoted
|
||
|
attribute values are allowed. Defaults to `false` when `strict`
|
||
|
is true, `true` otherwise.
|
||
|
|
||
|
## Methods
|
||
|
|
||
|
`write` - Write bytes onto the stream. You don't have to do this all at
|
||
|
once. You can keep writing as much as you want.
|
||
|
|
||
|
`close` - Close the stream. Once closed, no more data may be written until
|
||
|
it is done processing the buffer, which is signaled by the `end` event.
|
||
|
|
||
|
`resume` - To gracefully handle errors, assign a listener to the `error`
|
||
|
event. Then, when the error is taken care of, you can call `resume` to
|
||
|
continue parsing. Otherwise, the parser will not continue while in an error
|
||
|
state.
|
||
|
|
||
|
## Members
|
||
|
|
||
|
At all times, the parser object will have the following members:
|
||
|
|
||
|
`line`, `column`, `position` - Indications of the position in the XML
|
||
|
document where the parser currently is looking.
|
||
|
|
||
|
`startTagPosition` - Indicates the position where the current tag starts.
|
||
|
|
||
|
`closed` - Boolean indicating whether or not the parser can be written to.
|
||
|
If it's `true`, then wait for the `ready` event to write again.
|
||
|
|
||
|
`strict` - Boolean indicating whether or not the parser is a jerk.
|
||
|
|
||
|
`opt` - Any options passed into the constructor.
|
||
|
|
||
|
`tag` - The current tag being dealt with.
|
||
|
|
||
|
And a bunch of other stuff that you probably shouldn't touch.
|
||
|
|
||
|
## Events
|
||
|
|
||
|
All events emit with a single argument. To listen to an event, assign a
|
||
|
function to `on<eventname>`. Functions get executed in the this-context of
|
||
|
the parser object. The list of supported events are also in the exported
|
||
|
`EVENTS` array.
|
||
|
|
||
|
When using the stream interface, assign handlers using the EventEmitter
|
||
|
`on` function in the normal fashion.
|
||
|
|
||
|
`error` - Indication that something bad happened. The error will be hanging
|
||
|
out on `parser.error`, and must be deleted before parsing can continue. By
|
||
|
listening to this event, you can keep an eye on that kind of stuff. Note:
|
||
|
this happens *much* more in strict mode. Argument: instance of `Error`.
|
||
|
|
||
|
`text` - Text node. Argument: string of text.
|
||
|
|
||
|
`doctype` - The `<!DOCTYPE` declaration. Argument: doctype string.
|
||
|
|
||
|
`processinginstruction` - Stuff like `<?xml foo="blerg" ?>`. Argument:
|
||
|
object with `name` and `body` members. Attributes are not parsed, as
|
||
|
processing instructions have implementation dependent semantics.
|
||
|
|
||
|
`sgmldeclaration` - Random SGML declarations. Stuff like `<!ENTITY p>`
|
||
|
would trigger this kind of event. This is a weird thing to support, so it
|
||
|
might go away at some point. SAX isn't intended to be used to parse SGML,
|
||
|
after all.
|
||
|
|
||
|
`opentagstart` - Emitted immediately when the tag name is available,
|
||
|
but before any attributes are encountered. Argument: object with a
|
||
|
`name` field and an empty `attributes` set. Note that this is the
|
||
|
same object that will later be emitted in the `opentag` event.
|
||
|
|
||
|
`opentag` - An opening tag. Argument: object with `name` and `attributes`.
|
||
|
In non-strict mode, tag names are uppercased, unless the `lowercase`
|
||
|
option is set. If the `xmlns` option is set, then it will contain
|
||
|
namespace binding information on the `ns` member, and will have a
|
||
|
`local`, `prefix`, and `uri` member.
|
||
|
|
||
|
`closetag` - A closing tag. In loose mode, tags are auto-closed if their
|
||
|
parent closes. In strict mode, well-formedness is enforced. Note that
|
||
|
self-closing tags will have `closeTag` emitted immediately after `openTag`.
|
||
|
Argument: tag name.
|
||
|
|
||
|
`attribute` - An attribute node. Argument: object with `name` and `value`.
|
||
|
In non-strict mode, attribute names are uppercased, unless the `lowercase`
|
||
|
option is set. If the `xmlns` option is set, it will also contains namespace
|
||
|
information.
|
||
|
|
||
|
`comment` - A comment node. Argument: the string of the comment.
|
||
|
|
||
|
`opencdata` - The opening tag of a `<![CDATA[` block.
|
||
|
|
||
|
`cdata` - The text of a `<![CDATA[` block. Since `<![CDATA[` blocks can get
|
||
|
quite large, this event may fire multiple times for a single block, if it
|
||
|
is broken up into multiple `write()`s. Argument: the string of random
|
||
|
character data.
|
||
|
|
||
|
`closecdata` - The closing tag (`]]>`) of a `<![CDATA[` block.
|
||
|
|
||
|
`opennamespace` - If the `xmlns` option is set, then this event will
|
||
|
signal the start of a new namespace binding.
|
||
|
|
||
|
`closenamespace` - If the `xmlns` option is set, then this event will
|
||
|
signal the end of a namespace binding.
|
||
|
|
||
|
`end` - Indication that the closed stream has ended.
|
||
|
|
||
|
`ready` - Indication that the stream has reset, and is ready to be written
|
||
|
to.
|
||
|
|
||
|
`noscript` - In non-strict mode, `<script>` tags trigger a `"script"`
|
||
|
event, and their contents are not checked for special xml characters.
|
||
|
If you pass `noscript: true`, then this behavior is suppressed.
|
||
|
|
||
|
## Reporting Problems
|
||
|
|
||
|
It's best to write a failing test if you find an issue. I will always
|
||
|
accept pull requests with failing tests if they demonstrate intended
|
||
|
behavior, but it is very hard to figure out what issue you're describing
|
||
|
without a test. Writing a test is also the best way for you yourself
|
||
|
to figure out if you really understand the issue you think you have with
|
||
|
sax-js.
|