SgmlReader is a versatile .NET library written in C# for parsing HTML/SGML files.  The original community around SgmlReader was hosted on GotDotNet, which has since been phased out. MindTouch Dream and MindTouch Deki use the SgmlReader library extensively, and we have found and fixed a few bugs in it along the way.  In the spirit of the original author, we're contributing these changes back on the MindTouch Developer Center site.

The latest version of SgmlReader can be downloaded from SourceForge.net or from our public SVN repository.  If you find or fix issues in SgmlReader, please post them in the SgmlReader forum.

The following sample code parses an HTML document into an XmlDocument:

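// requires: using System.IO; using System.Xml;
XmlDocument FromHtml(TextReader reader) {
    // set up SgmlReader to parse the input as HTML
    Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
    sgmlReader.DocType = "HTML";
    sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
    sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
    sgmlReader.InputStream = reader;
    // create the document, keeping whitespace and disabling external resource resolution
    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true;
    doc.XmlResolver = null;
    // SgmlReader is an XmlReader, so XmlDocument can load from it directly
    doc.Load(sgmlReader);
    return doc;
}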
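
As a usage sketch, the method above could be called on an in-memory string roughly as follows. The class name HtmlToXmlExample, the Main method, and the sample markup are illustrative assumptions, not part of the SgmlReader distribution, and the project is assumed to reference the SgmlReader assembly. FromHtml is repeated here only so that the example compiles on its own.

```csharp
using System;
using System.IO;
using System.Xml;

public static class HtmlToXmlExample {

    // Same conversion as the sample above, repeated so this file is self-contained.
    public static XmlDocument FromHtml(TextReader reader) {
        Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = reader;

        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.XmlResolver = null;
        doc.Load(sgmlReader);
        return doc;
    }

    public static void Main() {
        // Hypothetical input: upper-case tags, with <P> and <BR> never closed.
        string html = "<HTML><BODY><P>Hello<BR>world</BODY></HTML>";

        using (StringReader reader = new StringReader(html)) {
            XmlDocument doc = FromHtml(reader);
            // Write the XML produced from the HTML input to the console.
            Console.WriteLine(doc.OuterXml);
        }
    }
}
```

Because SgmlReader derives from XmlReader, the same reader could also be passed to other XmlReader consumers instead of XmlDocument.Load.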

Release History

Release notes for 1.8.0

  • BREAKING CHANGE: requires .NET 2.0
  • major code clean-up (thx jamesgmbutler for the contribution!)
  • (bug 4606) Add XML-only entity &apos; to HTML DTD

Release notes for 1.7.5

  • (bug 4410) Missing quote in attribute value causes catastrophic failure
  • (bug 4409) Unknown prefixes cannot be mapped to the same namespace

Release notes for 1.7.4

  • (bug 4179) &sup2; entity is not recognized correctly
  • added test for entities with digits

Release notes for 1.7.3

  • never close the BODY tag early (it causes loss of content)
  • remove "<![CDATA[" inside CDATA sections
  • remove "]]>" inside CDATA sections
  • (bug 3513) convert elements with invalid tag names into text (e.g. <foo@bar.com>)

Release notes for 1.7.2

  • fixed bug where parsing CDATA section skipped first character
  • don't double parse commented out CDATA sections
  • added support for namespaces on elements and attributes
  • unknown prefixes on attributes and elements resolve to '#unknown' namespace
  • fix bug when parsing down-level comments, like <![if IE]>
  • don't allow attributes with invalid names (e.g. <p foo:="invalid" ;="bad">, etc.)

Release notes for 1.7.1

  • added 'GetLiteralEntitiesLookup()' method
  • fixed bugs with namespace prefixes on attributes and elements; prefixes are now stripped automatically
  • added SgmlReader constructor with XmlNameTable argument to avoid failed comparisons when reusing the DTD
  • ensured that SgmlReader is initialized identically when reusing a DTD

            
