SgmlReader is an XmlReader API over any SGML document (including built-in support for HTML). A command line utility is also provided, which outputs the well-formed XML result.
Download the zip file including the standalone executable and the full source code: SgmlReader.zip
See the online demo at demo.aspx.
See also the online source.
The command line executable version has the following options:
sgmlreader <options> [InputUri] [OutputFile]
| Option | Description |
|---|---|
| -e "file" | Specifies a file to write error output to. The default is to generate no errors. The special name "$stderr" redirects errors to the stderr output stream. |
| -proxy "server" | Specifies the proxy server to use to fetch DTDs through the firewall. |
| -html | Specifies that the input is HTML. |
| -dtd "uri" | Specifies some other SGML DTD. |
| -base | Add an HTML base tag to the output. |
| -pretty | Pretty print the output. |
| -encoding name | Specify an encoding for the output file (default UTF-8). |
| -noxml | Stops generation of the XML declaration in the output. |
| -doctype | Copy <!DOCTYPE tag to the output. |
| InputUri | The input file name or URL. Default is stdin. If this is a local file name then it also supports wildcards. |
| OutputFile | The optional output file name. Default is stdout. If the InputUri contains wildcards then this just specifies the output file extension, the default being ".xml". |
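
The wildcard rule for OutputFile can be illustrated with a small language-neutral sketch in Python. It is not taken from the SgmlReader source: the helper name `output_name` is hypothetical, and it assumes that each matched input file keeps its own name and only has its extension replaced, which is one reading of "this just specifies the output file extension".

```python
from pathlib import PurePosixPath
from typing import Optional

DEFAULT_EXT = ".xml"  # default extension given in the OutputFile description


def output_name(input_file: str, output_arg: Optional[str] = None) -> str:
    """Derive the output name for one wildcard match (hypothetical helper).

    Assumption: only the extension of output_arg is used, and it replaces
    the extension of the matched input file.
    """
    ext = PurePosixPath(output_arg).suffix if output_arg else ""
    return str(PurePosixPath(input_file).with_suffix(ext or DEFAULT_EXT))


# No OutputFile given: fall back to ".xml".
default_result = output_name("pages/index.htm")

# OutputFile given: only its extension is taken into account.
custom_result = output_name("pages/index.htm", "anything.xhtml")
```

Under these assumptions, an input pattern such as `pages/*.htm` would produce one output file per match, next to its input.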
SgmlReader is an implementation of the XmlReader API, so the only thing you really need to know is how to construct it. SgmlReader has a default constructor; after constructing it, you need to set some of the following properties. To load a DTD you must either specify DocType="HTML" or provide a SystemLiteral. To specify the SGML document you must provide either the InputStream or the Href. Everything else is optional.
Then you can read from this reader like any other XmlReader class.
SGML CDATA to XML <![CDATA[...]]> conversion
SGML DTDs describe a special DTD element type named "CDATA". HTML uses it for <SCRIPT>, for example: the contents of the script block can be any text terminated by </SCRIPT>, including script code that contains the "<" symbol and so forth. Such text would not be well formed in an XML document, so the contents of the script block are automatically converted to an XML CDATA block.
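
The idea behind this conversion can be shown with a minimal sketch in Python. This is an illustration of the general technique, not SgmlReader's actual implementation: the helper `to_cdata` is hypothetical, and splitting an embedded "]]>" across two CDATA sections is the usual XML workaround rather than something the text above describes.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap raw text in an XML CDATA section (hypothetical helper).

    A literal "]]>" inside the text would end the section early, so it is
    split across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Script code containing "<" and "&", which is not well formed as plain XML text.
script_body = 'if (a < b && c > d) { alert("hi"); }'
xml = "<script>" + to_cdata(script_body) + "</script>"

# An XML parser now accepts the element and returns the script text unchanged.
parsed_body = ET.fromstring(xml).text
```

Wrapping the text in CDATA leaves the script readable in the output, whereas escaping every "<" and "&" as entities would also be well formed but harder to read.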
Please email bugs, feedback and/or feature requests to Chris Lovett.
| Version | Description |
|---|---|
| 1.7 | Fix lots of reported bugs. |
| 1.6 | Improve wrapping of HTML content with auto-generated <html></html> container tags. |
| 1.5 | Fix detection of ContentType=text/html and switch to HTML mode. |
| 1.4 | Added UserAgent string "Mozilla/4.0 (compatible;);" so that SgmlReader gets the right content from webservers. Fixed handling of HTML that does not start with root <html> element tag. Fixed handling of built in HTML entities. |
| 1.3 | Changed ToUpper to CaseFolding enum and added support for "auto-folding" based on input. |
| 1.2 | Converted back to Visual Studio 7.0 since this is the lowest common denominator. Added ToUpper switch for upper case folding, instead of the default lower case. Fix handling of UNC paths. Added OFX test suite. Fixed bug in parsing CDATA type elements (like <script><!-- --></script>) |
| 1.1 | Upgraded project to Visual Studio 7.1. |
| 1.0.4 | Added -encoding option so you can change the encoding of the output file. |
| 1.0.3.26932 | Implemented ReadOuterXml and ReadInnerXml and fix some bugs in dealing with xmlns attributes and dealing with non-HTML tags. |
| 1.0.3 | Fixed some CLS compliance problems with using SgmlReader from VB and a null reference exception bug when loading SgmlReader from XmlDocument |
| 1.0.2.21225 | Fixed bug in handling of encodings. Now uses the correct encoding returned from the HTTP server |
| 1.0.2.21105 | Fixed bug in handling of input that contains blank lines at the top. |
| 1.0.2 | Added fix for the way IE & Netscape deal with characters in the range 0x80 through 0x9F in HTML. |
| 1.0.1 | Fixed bug in handling of empty elements, like <INPUT> |
| 1.0 | Add wildcard support for command line utility. |
| 0.5 | Initial |