SgmlReader is a versatile .NET library written in C# for parsing HTML/SGML files. The original SgmlReader community was hosted on GotDotNet, which has since been phased out (update: the code appears to have resurfaced on MSDN Code Gallery, but without any updates). MindTouch Dream and MindTouch Deki use the SgmlReader library extensively, and we found and fixed a few bugs in it along the way. In the spirit of the original author, we are contributing these changes back on the MindTouch Developer Center site.
Sample Usage
The following code parses an HTML document into an XmlDocument:
using System.IO;   // TextReader
using System.Xml;  // XmlDocument, WhitespaceHandling

XmlDocument FromHtml(TextReader reader) {

    // set up SgmlReader to parse the input as HTML
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;  // fold names to lower case
    sgmlReader.InputStream = reader;

    // create the document, keeping whitespace and disabling external resolution
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    doc.Load(sgmlReader);
    return doc;
}
Release History
Release notes for 1.8.2
- fixed regression introduced by fixing bug 5150
- (bug 5443) an extra open quote/double-quote prevents the entire element from being read properly
- replaced == string equality with culture invariant string.Compare
- return 'null' as NameTable since none is used
- added '-noformat' switch for regression tests to suppress automatic reformatting (useful for formatting tests)
Release notes for 1.8.1
- (bug 5144) Unclosed HTML comment causes infinite loop
- (bug 5150) don't use XmlNameTable with object comparisons; it becomes unreliable after a while
Release notes for 1.8.0
- BREAKING CHANGE: requires .NET 2.0
- major code clean-up (thx jamesgmbutler for the contribution!)
- (bug 4606) Add XML-only entity &apos; to HTML DTD
Release notes for 1.7.5
- (bug 4410) Missing quote in attribute value causes catastrophic failure
- (bug 4409) Unknown prefixes cannot be mapped to the same namespace
Release notes for 1.7.4
- (bug 4179) &sup2; entity is not recognized correctly
- added test for entities with digits
Release notes for 1.7.3
- never close the BODY tag early (it causes loss of content)
- remove "<![CDATA[" inside CDATA sections
- remove "]]>" inside CDATA sections
- (bug 3513) convert elements with invalid tag names into text (e.g. <foo@bar.com>)
Release notes for 1.7.2
- fixed bug where parsing CDATA section skipped first character
- don't double parse commented out CDATA sections
- added support for namespaces on elements and attributes
- unknown prefixes on attributes and elements resolve to '#unknown' namespace
- fix bug when parsing down-level comments, like <![if IE]>
- don't allow attribute with invalid names (e.g. <p foo:="invalid" ;="bad">, etc.)
Release notes for 1.7.1
- added 'GetLiteralEntitiesLookup()' method
- fixed bugs with namespace prefixes on attributes and elements; prefixes are now stripped automatically
- added SgmlReader constructor with XmlNameTable argument to avoid failed comparisons when reusing the DTD
- ensured that SgmlReader is initialized identically when reusing a DTD
Release notes for 1.7
- Fix bug reported by chriswang - MoveToAttribute didn't save state properly.
- Fix bug reported by starascendent - build on Visual Studio 2003 was broken.
- Fix bug reported by sanchen - ExpandCharEntity was messed up on hex entities.
- Fix bug reported by kojiishi - off by one bug in SniffName()
- Fix bug reported by kojiishi - bug in loading XmlDocument from SgmlReader - this was caused by the HTML document containing an embedded <?xml version='1.0'?> declaration, so the SgmlReader now strips these.
- Added special stripping of punctuation characters between attributes like ",".
Release notes for 1.6
- Improve wrapping of HTML content with auto-generated <html></html> container tags.
Release notes for 1.5
- Fix detection of ContentType=text/html and switch to HTML mode.
- Fix problems parsing DOCTYPE tag when case folding is on.
- Fix reading of XHTML DTD.
- Fix parsing of content of type CDATA that resulted in the error message 'Cannot have ']]>' inside an XML CDATA block'.
- Fix parsing of http://www.virtuelvis.com/download/162/evilml.html.
- Fix parsing of attributes missing the equals sign: height"4" (thanks to Ulrich Schwanitz for his fix).
- Fix 'SniffWhitespace' thanks to "Windy Winter".
- Added TestSuite project.
Release notes for 1.4
- Added UserAgent string "Mozilla/4.0 (compatible;);" so that SgmlReader gets the right content from webservers.
- Fixed handling of HTML that does not start with root <html> element tag.
- Fixed handling of built in HTML entities.
Release notes for 1.3
- Changed ToUpper to CaseFolding enum and added support for "auto-folding" based on input.
- Added support for <![CDATA[...]]> blocks.
- Added proper encoding support, including support for HTML <META http-equiv="content-type". This means output now has the correct XML declaration (unless you specify the new -noxml option) and any existing XML declarations in the input are stripped out so you don't end up with two.
- Added support for ASP <%...%> blocks (thanks to Dan Whalin).
- Now strips out DOCTYPE by default, since HTML DocTypes can cause problems for XmlDocument when it tries to load the HTML DTD. Added a "-doctype" switch for those who really need it to come through.
- Fix handling of Office 2000 <?xml:namespace .../> declarations.
- Remove bogus attributes that have no name, in cases like <class= "test">.
Release notes for 1.2
- Converted back to Visual Studio 7.0 since this is the lowest common denominator.
- Added ToUpper switch for upper case folding, instead of the default lower case.
- Fix handling of UNC paths.
- Added OFX test suite.
- Fixed bug in parsing CDATA type elements (like <script><!-- --></script>)
Release notes for 1.1
- Upgraded project to Visual Studio 7.1.
- Fixed bug in accessing https authenticated sites.
- Fixed bug in handling of content that contains nulls.
- Improved handling of <!DOCTYPE with PUBLIC and no SYSTEM literal.
- Fixed bug in losing attributes when auto-closing tags.
- Fixed pretty printing output by adding WhitespaceHandling flag to SgmlReader.
Release notes for 1.0.4
- Added -encoding option so you can change the encoding of the output file.
Release notes for 1.0.3.26932
- Implemented ReadOuterXml and ReadInnerXml and fixed some bugs in handling xmlns attributes and non-HTML tags.
Release notes for 1.0.3
- Fixed some CLS compliance problems with using SgmlReader from VB and a null reference exception bug when loading SgmlReader from XmlDocument
Release notes for 1.0.2.21225
- Fixed bug in handling of encodings. Now uses the correct encoding returned from the HTTP server
Release notes for 1.0.2.21105
- Fixed bug in handling of input that contains blank lines at the top.
Release notes for 1.0.2
- Added fix for the way IE & Netscape deal with characters in the range 0x80 through 0x9F in HTML.
Release notes for 1.0.1
- Fixed bug in handling of empty elements, like <INPUT>
Release notes for 1.0
- Add wildcard support for command line utility.