What's all this about LibTidy?
LibTidy, like it sounds, is a library version of Dave Raggett's popular HTML Tidy. In fact, one of the motivations for starting the Source Forge project was to refactor HTML Tidy as a callable library. Although the command line tool is great, it is difficult and inefficient to integrate into other software.
Requirements
We had several informal requirements for the library:
- You Can Get There From Here
Probably the most important requirement is that the library be easy to integrate. Because of the almost universal adoption of C linkage, a C interface may be called from a great many programming languages. This, and the fact that the code was already in C and the team was already most comfortable with C, led to the decision that the library's public interface should be kept in C.
The other major design decision was to use opaque types in the public interface. This allows the application to simply pass an integer around, which minimizes the need to transform data types between languages.
This strategy has already paid off. It was straightforward to write very thin library wrappers for C++, Pascal, and COM/ATL. It was also quick to generate a Perl wrapper using SWIG. SWIG wrappers for Python, Ruby, Java and others should also be possible.
- Don't Break Anything
Of course, Tidy must remain Tidy. It wasn't acceptable to introduce bugs or drop (many) features. In the end, the body of test documents proved invaluable to getting things working.
- Thread Safe / Reentrant
Because there are many uses for HTML Tidy, from content validation and content scraping to conversion to XHTML, it was important to make LibTidy run reasonably well within server applications as well as on the client side.
This requirement implies that the library be fully re-entrant so that it may be used within multi-threaded applications.
- Adaptable I/O
As part of the larger integration strategy, it was decided to fully abstract all I/O. This means a (relatively) clean separation between character encoding processing and shovelling bytes back and forth. Internally, the library reads from "sources" and writes to "sinks". This abstraction is used for both markup and configuration "files". Concrete implementations are provided for file and memory I/O. But new sources and sinks may be provided via the public interface.
We had some prior art to follow as well. Most notably, Marc-Andre Lemburg's mxTidy. In the process of writing a Python wrapper for Tidy, Marc-Andre applied these principles and built a C library. LibTidy can be seen as a completion of Marc's work.
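To make the opaque-type requirement above more concrete, here is a minimal sketch of the general technique in C. It is not LibTidy's actual declaration style: `DocHandle`, `DocImpl`, and the `doc_*` functions are hypothetical names invented for this illustration. The point is only that callers hold a handle and never see the structure behind it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Public side: what a header would expose (all names hypothetical). */
typedef struct DocImpl* DocHandle;   /* opaque: callers never see the fields */

DocHandle doc_create(void);
int       doc_set_title(DocHandle doc, const char* title);
size_t    doc_title_length(DocHandle doc);
void      doc_release(DocHandle doc);

/* Private side: the layout would live only in the library's own .c file. */
struct DocImpl {
    char* title;
};

DocHandle doc_create(void)
{
    /* calloc zeroes the struct, so title starts out NULL */
    return calloc(1, sizeof(struct DocImpl));
}

int doc_set_title(DocHandle doc, const char* title)
{
    assert(doc != NULL && title != NULL);
    size_t len = strlen(title);
    char* copy = malloc(len + 1);
    if (copy == NULL)
        return -1;                   /* negative value signals failure */
    memcpy(copy, title, len + 1);    /* copies the terminating NUL too */
    free(doc->title);                /* drop any previous title */
    doc->title = copy;
    return 0;
}

size_t doc_title_length(DocHandle doc)
{
    assert(doc != NULL);
    return doc->title ? strlen(doc->title) : 0;
}

void doc_release(DocHandle doc)
{
    if (doc != NULL) {
        free(doc->title);
        free(doc);
    }
}
```

Because the handle is just a pointer-sized value, a wrapper in another language only has to store it and pass it back; it never needs to mirror the structure's layout.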
This Getting Started content is obsolete. Please consult the current documentation in our source repository, and then skip forward to Example.
Getting Started
Get The Source
The best way to get the lib sources is directly from CVS. If you have CVS installed (recommended!), just execute the following commands:
C:\src> mkdir tidylib
C:\src> cd tidylib
C:\src\tidylib> set TIDYCVSROOT=:pserver:[email protected]:/cvsroot/tidy
C:\src\tidylib> cvs -d %TIDYCVSROOT% login
C:\src\tidylib> cvs -d %TIDYCVSROOT% export -d C:\src\tidylib -r HEAD _
build console htmldoc include src test
When CVS prompts you for the password, just hit ENTER. The underscore (_) above denotes line continuation. Do not type it in, just use one long command line. The procedure is similar for Unix variants. Just translate to the appropriate path separator for your file system and do not use the -d <dir> option. Copy and paste the above into a script or batch file. For the truly lazy, you can pull a gzipped source tarball from the Tidy Project Page.
Build It
For an overview of build options, see build/readme.txt. It describes the overall layout and more info on supported build systems.
Unix / GNU
For GNU gcc, just use the gmake build/gmake/Makefile.
The usual target is all. If you want a debug build, use
the debug target. For other Unix compilers, you may have
to set the CC macro to point to your compiler, usually just
cc. The same large number of Unix systems is supported
"out of the box" as with Tidy Classic. Tidy usually does a good job of
automatically identifying the current platform. If not, tweak
platform.h as needed and send us a patch!
If you are using GCC/MinGW, you should use gmake as well.
In addition, there are targets for clean and
install. Be sure to look at the Makefile before using
install to make sure the binaries, headers and library
go where you want. By default, these are /usr/bin,
/usr/include, and /usr/lib, respectively.
There are macros in the Makefile to customize your installation.
make all
Windows / Visual C++
For VC++, you can use either msvc/Makefile.vc6 on
the command line or build/msvc/tidy.dsw in the IDE. As
the names imply, these work with Visual C++ version 6.0. Service
Pack 3 is highly recommended. Makefile.vc6 supports the same targets:
all, debug, clean and
install are all available.
nmake /f Makefile.vc6 all
GNU AutoConf/AutoMake
The input files to drive the GNU AutoConf tool set have been added.
See build/gnuauto/readme.txt for instructions on how to
use GNU build tools with Tidy.
Example
Perhaps the easiest way to understand how to call Tidy is to see a simple program that uses it. A basic thing to know about the API is that functions that return an integer use the following values:
- 0 == Success
Good to go.
- 1 == Warnings, No Errors
Check error buffer or track error messages for details.
- 2 == Errors and Warnings
By default, Tidy will not produce output. You can force output with the
TidyForceOutput option. As with warnings, check error buffer or track error messages for details.
- <0 == Severe error
Usually the value equals -errno. See errno.h.
Also, by default, warning and error messages are sent to stderr.
You can redirect diagnostic output using either tidySetErrorFile()
or tidySetErrorBuffer(). See tidy.h for details.
#include <tidy.h>
#include <buffio.h>
#include <stdio.h>
#include <errno.h>

int main(int argc, char **argv )
{
  const char* input = "<title>Foo</title><p>Foo!";
  TidyBuffer output = {0};
  TidyBuffer errbuf = {0};
  int rc = -1;
  Bool ok;

  TidyDoc tdoc = tidyCreate();                     // Initialize "document"
  printf( "Tidying:\t%s\n", input );

  ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes );  // Convert to XHTML
  if ( ok )
    rc = tidySetErrorBuffer( tdoc, &errbuf );      // Capture diagnostics
  if ( rc >= 0 )
    rc = tidyParseString( tdoc, input );           // Parse the input
  if ( rc >= 0 )
    rc = tidyCleanAndRepair( tdoc );               // Tidy it up!
  if ( rc >= 0 )
    rc = tidyRunDiagnostics( tdoc );               // Kvetch
  if ( rc > 1 )                                    // If error, force output.
    rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
  if ( rc >= 0 )
    rc = tidySaveBuffer( tdoc, &output );          // Pretty Print

  if ( rc >= 0 )
  {
    if ( rc > 0 )
      printf( "\nDiagnostics:\n\n%s", errbuf.bp );
    printf( "\nAnd here is the result:\n\n%s", output.bp );
  }
  else
    printf( "A severe error (%d) occurred.\n", rc );

  tidyBufFree( &output );
  tidyBufFree( &errbuf );
  tidyRelease( tdoc );
  return rc;
}
Look Ma, no temp files!
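The integer return convention described before the example can be wrapped in a couple of small helpers. The following is only a sketch: `status_label`, `status_errno`, and `report_status` are hypothetical names and are not part of the library. Note that the text says a negative value "usually" equals -errno, so the errno interpretation is a best guess, not a guarantee.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: maps an integer status that follows the
   convention above to a short label. */
const char* status_label(int rc)
{
    if (rc < 0)  return "severe error";
    if (rc == 0) return "success";
    if (rc == 1) return "warnings";
    return "errors";                 /* 2 == errors and warnings */
}

/* Hypothetical helper: for a severe error, recover the presumed errno
   value. Returns 0 when rc does not indicate a severe error. */
int status_errno(int rc)
{
    return rc < 0 ? -rc : 0;
}

/* Prints one line describing the status. The strerror() text is only
   meaningful when the negative value really was derived from errno. */
void report_status(FILE* out, int rc)
{
    fprintf(out, "status %d: %s", rc, status_label(rc));
    if (rc < 0)
        fprintf(out, " (possibly: %s)", strerror(status_errno(rc)));
    fputc('\n', out);
}
```

In the example program, a call such as `report_status( stderr, rc )` could stand in for the final "severe error" message.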
Application Notes
Of course, there are functions to parse and save both markup and configuration files. For the adventurous, it is possible to create new input sources and output sinks. For example, a URL source could pull the markup from a given URL.
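As a rough illustration of what a pluggable input source involves, the sketch below defines a byte-oriented source as a context pointer plus callbacks, with a memory-backed implementation. All names here (`ByteSource`, `MemSource`, `mem_source_init`, `count_open_angles`) are hypothetical and chosen for this example; the library's real source and sink types are declared in tidy.h and may differ in detail.

```c
#include <stddef.h>

#define SRC_EOF (-1)

/* Hypothetical source interface: a context pointer plus callbacks. */
typedef struct {
    void* ctx;
    int  (*get_byte)(void* ctx);              /* next byte, or SRC_EOF */
    void (*unget_byte)(void* ctx, int byte);  /* push one byte back */
    int  (*at_eof)(void* ctx);                /* nonzero when exhausted */
} ByteSource;

/* Concrete implementation: reads from a block of memory. */
typedef struct {
    const unsigned char* data;
    size_t size;
    size_t pos;
} MemSource;

static int mem_get_byte(void* ctx)
{
    MemSource* m = ctx;
    return m->pos < m->size ? m->data[m->pos++] : SRC_EOF;
}

static void mem_unget_byte(void* ctx, int byte)
{
    MemSource* m = ctx;
    (void)byte;          /* the data is still in memory; just step back */
    if (m->pos > 0)
        m->pos--;
}

static int mem_at_eof(void* ctx)
{
    MemSource* m = ctx;
    return m->pos >= m->size;
}

/* Binds a MemSource to the generic interface. */
ByteSource mem_source_init(MemSource* m, const void* data, size_t size)
{
    ByteSource src;
    m->data = data;
    m->size = size;
    m->pos = 0;
    src.ctx = m;
    src.get_byte = mem_get_byte;
    src.unget_byte = mem_unget_byte;
    src.at_eof = mem_at_eof;
    return src;
}

/* A consumer that knows only the interface: counts '<' bytes. */
size_t count_open_angles(ByteSource* src)
{
    size_t n = 0;
    int c;
    while ((c = src->get_byte(src->ctx)) != SRC_EOF)
        if (c == '<')
            n++;
    return n;
}
```

A URL source would follow the same shape: its context would hold the connection state, and its callbacks would hand out bytes as they arrive, while the consumer stays unchanged.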
It is also worth remembering that
an application may instantiate any number of document and
buffer objects. They are fairly cheap to initialize and destroy (just
memory allocation and zeroing, really), so they may be created
and destroyed locally, as needed. There is no problem keeping them
around a while for keeping state. For example, a server app might
keep a global document as a master configuration. As documents are
parsed, they can copy their configuration data from the master
instance. See tidyOptCopyConfig(). If the master copy is
initialized at startup, no synchronization is necessary.
API Docs
Several API Docs have been added to Tidy header files and generated using Doxygen.
This Nightly Build content is obsolete.
Nightly Build
The build procedures on the Source Forge Compile Farm have been updated to produce the command line driver based on the library sources. See Tidy Binaries.
This Future Directions content is obsolete. Please consult our roadmap in our Community repository.
Future Directions
The ink isn't dry yet on LibTidy and already folks want more! Well, waddaya expect? Several ideas have been discussed on the dev mailing list.
- Character Encoding
Currently, all character encoding support is hard wired into the library. This means we do a poor job of supporting many popular encodings such as GB2312, euc-kr, eastern European languages, Cyrillic, etc. Text in any of these encodings must first be transcoded into ISO-10646/Unicode before Tidy can work with it.
Two basic approaches have been proposed: just use iconv or adapt Clark Cooper's XML::Encoding as a callable library. On the face of it, iconv is preferable. Because it is GPL'ed, however, the license may be incompatible. Also, there are transcription issues related to Big5 and other code sets that may or may not be addressed by iconv. XML::Encoding, on the other hand, uses the Perl Artistic License and explicitly supports all alternate transcriptions for Big5 and others. For more info, see CPAN and Tidy Issues.
- Error Handling
  - Categorize errors
  - Improve message localization
  - Improve separation of parsing and diagnostics
- Content Model
  - Per-element-and-version attribute support
  - DTD Internal Subset support
  - Modular XHTML support (XHTML 1.1)
Editorial changes on 23 November 2015 by J. Derry
Page last updated on 26 November 2002 by C. Reitzel