Set up your own local W3C XML/HTML file validator

You can use the public W3C Markup Validation Service to validate your HTML and XML files. However, there are times when you need to set up your own local validator. For example, your web site may be an internal one that is inaccessible from W3C's server, or you may be concerned about uploading your internal files to a public server. Since the W3C validator is open source, you can download and install it yourself. Although the W3C Markup Validator comes with installation documentation, here we describe in detail how to install the validator as a non-root user and in a custom directory.

Prerequisites: LAMP development environment and supporting Perl packages

As a prerequisite we suggest you set up a basic LAMP (Linux, Apache, MySQL, PHP/Perl) development environment and install the Perl packages useful for web site development.

Get the source files and install them to the right place

You can download the latest tarballs for the W3C markup validator and the Document Type Definitions (DTDs). In the following we assume that you work with version 0.8.3.

Unpack the two tarballs in your build directory; both unpack into the same directory, validator-0.8.3. Browse the directory and get familiar with the file layout of the source distribution. To keep maximum flexibility, we put all the files in a separate directory under our web server's document root, e.g., $HTDOCS/validator. If you followed our earlier article on setting up a LAMP environment, then $HTDOCS is /opt/dev/apache/htdocs.

Now you can create the $HTDOCS/validator directory and copy all the files under validator-0.8.3/htdocs to it. Basically the validator-0.8.3/htdocs directory contains all the files that will be read and served by the Apache server.

The workhorse for the markup validation is a single CGI script, check, in the validator-0.8.3/httpd/cgi-bin/ directory. You need to copy it to your Apache's actual cgi-bin/ directory ($HTDOCS/../cgi-bin/ in our case). At a minimum you need to modify the first line of the script to point to your local Perl executable (/opt/dev/perl/bin/perl in our case). Later on we will describe some additional changes this script requires. You can also copy another script, sendfeedback.pl, to your cgi-bin/ directory.
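The copy steps described so far can be sketched as a small Python helper. This is only an illustration under assumptions: the function name install_validator_files and its parameters are hypothetical, and the directory layout is the one described in this article, so adapt the paths to your own installation.

```python
import shutil
import stat
from pathlib import Path


def install_validator_files(src_root, htdocs, cgi_bin, perl_path):
    """Copy the static files and the check CGI script, then fix its first line.

    src_root  -- the unpacked validator-0.8.3 directory
    htdocs    -- Apache's document root ($HTDOCS in this article)
    cgi_bin   -- Apache's cgi-bin directory
    perl_path -- the local Perl executable, e.g. /opt/dev/perl/bin/perl
    """
    src_root, htdocs, cgi_bin = Path(src_root), Path(htdocs), Path(cgi_bin)

    # Copy everything under <src_root>/htdocs into <htdocs>/validator.
    shutil.copytree(src_root / "htdocs", htdocs / "validator", dirs_exist_ok=True)

    # Copy the CGI script into Apache's cgi-bin directory.
    cgi_bin.mkdir(parents=True, exist_ok=True)
    check = cgi_bin / "check"
    shutil.copy2(src_root / "httpd" / "cgi-bin" / "check", check)

    # Rewrite only the first line so that it points to the local Perl,
    # keeping any switches (such as -T) that followed the interpreter path.
    first, sep, rest = check.read_bytes().partition(b"\n")
    if first.startswith(b"#!"):
        switches = first[2:].split()[1:]
        new_first = b"#!" + b" ".join([perl_path.encode(), *switches])
        check.write_bytes(new_first + sep + rest)

    # Make sure the script is executable.
    check.chmod(check.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return check
```

With the layout used in this article you would call it as install_validator_files("validator-0.8.3", "/opt/dev/apache/htdocs", "/opt/dev/apache/cgi-bin", "/opt/dev/perl/bin/perl"). The sketch needs Python 3.8 or later because of the dirs_exist_ok argument.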

Next you need to copy the additional Apache configuration file for the validator, validator-0.8.3/httpd/conf/httpd.conf, to your Apache's configuration directory under the name validator.conf. In our case, the directory is $HTDOCS/../conf. Then add the line Include conf/validator.conf to your main Apache configuration file httpd.conf so that it is loaded.

Now you need to modify the validator.conf file to provide the right locations for all the supporting files. For example, for the AliasMatch directive, you should change it to something like
AliasMatch ^/+w3c-validator/+check(/+referer)?$ /opt/dev/apache/cgi-bin/check
AliasMatch ^/+w3c-validator/+feedback(\.html)?$ /opt/dev/apache/cgi-bin/sendfeedback.pl

to match the locations of the CGI scripts. You also need to modify the Alias and the Directory directives to point to your $HTDOCS/validator directory:

Alias /w3c-validator/ /opt/dev/apache/htdocs/validator/
<Directory /opt/dev/apache/htdocs/validator/>
...
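Before restarting Apache, it can help to check which request paths the two AliasMatch patterns above actually cover. The following Python sketch does that; the function name route is hypothetical, and Python's re module is close to but not identical to the regular expression engine used by Apache, so treat the result as a sanity check rather than a guarantee.

```python
import re

# The two patterns from the AliasMatch directives above.
CHECK_PATTERN = re.compile(r"^/+w3c-validator/+check(/+referer)?$")
FEEDBACK_PATTERN = re.compile(r"^/+w3c-validator/+feedback(\.html)?$")


def route(url_path):
    """Return the CGI script a URL path would be mapped to, or None."""
    if CHECK_PATTERN.match(url_path):
        return "check"
    if FEEDBACK_PATTERN.match(url_path):
        return "sendfeedback.pl"
    # Not matched by either AliasMatch; the plain Alias may still apply.
    return None
```

For example, route("/w3c-validator/check/referer") returns "check", while route("/w3c-validator/checker") returns None because the pattern is anchored at the end.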

If you use mod_perl version 2.0 and it has been loaded before the validator.conf file is loaded, then you also need to comment out the block that contains
<IfDefine MODPERL2>

You may need to comment out the block that contains the Proxy directive as well if you don't configure Apache as a proxy server.

Install additional Perl packages

Now you can restart the Apache server and resolve any startup issues you may have. Once it restarts successfully, you can point your browser to http://<host>:<port_number>/w3c-validator/ and try it out. Most likely you will get a 500 Internal Server Error. In this case, you have to check the Apache error log ($HTDOCS/../logs/error_log in our case) to find out why it fails. To save you some time, we describe the additional changes you have to make.

First you need to install a series of additional Perl modules, because the CGI script relies on them to do the heavy lifting. These are the additional Perl packages you need to install:

Config::General
Encode::HanExtra
Encode::JIS2K
HTML::Encoding
HTML::Template
SGML::Parser::OpenSP
XML::LibXML
Net::IP
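To see which of these modules are still missing, you can ask your Perl executable to load each one. The sketch below relies on the fact that perl -MModule -e 1 exits with a non-zero status when the module cannot be loaded. The function name missing_modules and the run parameter (which only exists to make the function easy to test) are our own choices, and the default Perl path is the one used in this article.

```python
import subprocess

REQUIRED_MODULES = [
    "Config::General",
    "Encode::HanExtra",
    "Encode::JIS2K",
    "HTML::Encoding",
    "HTML::Template",
    "SGML::Parser::OpenSP",
    "XML::LibXML",
    "Net::IP",
]


def missing_modules(perl="/opt/dev/perl/bin/perl", modules=REQUIRED_MODULES,
                    run=subprocess.run):
    """Return the modules that the given Perl executable cannot load."""
    missing = []
    for module in modules:
        # Try to load the module and run a trivial program.
        result = run([perl, "-M" + module, "-e", "1"], capture_output=True)
        if result.returncode != 0:
            missing.append(module)
    return missing
```

Calling missing_modules() returns a list of module names you still have to install; an empty list means all eight modules load. If the Perl path itself is wrong, subprocess.run raises FileNotFoundError.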

Most of the above packages are easy to install. Here we only provide some more details on the SGML::Parser::OpenSP module. If you don't have OpenSP installed, you first need to download and install it from the OpenJade distribution site.

There seem to be some minor problems with the source code distribution of OpenSP 1.5.1, and we had to make the following two changes to compile the code successfully.

First in the include/RangeMap.cxx file, we have to add the following line

#include "constant.h"

otherwise the compiler complains that wideCharMax is not defined, although it is in fact defined in the constant.h file.

Second we have to modify the include/InternalInputSource.h file and change this line

InternalInputSource *InternalInputSource::asInternalInputSource();

to

InternalInputSource* asInternalInputSource();

Once you have installed OpenSP, you need to download the SGML-Parser-OpenSP package manually. Unpack the tarball and modify the Makefile.PL file: update $options{LIBS} to add the proper library path to OpenSP (-L/opt/dev/lib in our case) and INC to add the proper include path to OpenSP (-I/opt/dev/include in our case). Then in the OpenSP.xs file, you also need to comment out the following two lines

if (_hv_fetch_SvTRUE(hv, "show_error_numbers", 18))
  pk.setOption(ParserEventGeneratorKit::showErrorNumbers);

because OpenSP's interface has changed slightly.

Now you can run

perl Makefile.PL
gmake
gmake install

to install the SGML-Parser-OpenSP package.

After you install all the required Perl packages, you still have some more hurdles to overcome. :) Check your Apache error log for the details. Here we outline the changes to save you time.

Update configuration files for the chosen layout

First you need to copy the template directory validator-0.8.3/share to your $HTDOCS/validator directory so Apache can access the templates to generate the resulting HTML files that make up the validator frontend pages.

Second you need to modify the $HTDOCS/validator/config/validator.conf configuration file to make sure that the right paths are set. Please note that this is a different configuration file from the one used by Apache ($HTDOCS/../conf/validator.conf). To be specific, you need to modify the Base, Templates, and Library settings in the Paths section.

In our case, the right settings are

Base = /opt/dev/apache/htdocs/validator
Templates = $Base/share/templates
Library = $Base/sgml-lib
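A wrong path here typically shows up only as another error in the Apache log, so it can be worth checking the settings before you restart. The sketch below is a deliberately simplified reader for Name = value lines that expands references such as $Base; it is not the Config::General parser the validator itself uses, and the function names resolve_paths and missing_directories are hypothetical.

```python
from pathlib import Path


def resolve_paths(settings_text):
    """Parse 'Name = value' lines and expand $Name references to earlier settings."""
    resolved = {}
    for line in settings_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = (part.strip() for part in line.split("=", 1))
        # Expand references such as $Base using the values seen so far.
        for known, known_value in resolved.items():
            value = value.replace("$" + known, known_value)
        resolved[name] = value
    return resolved


def missing_directories(resolved):
    """Return the names of settings whose value is not an existing directory."""
    return [name for name, value in resolved.items() if not Path(value).is_dir()]
```

Feeding the three lines above to resolve_paths yields /opt/dev/apache/htdocs/validator/share/templates for Templates, and missing_directories then tells you which of the three directories do not exist on your machine.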

I know it has been a long journey to get this far. Now restart Apache, open your favorite browser and point it to http://<host>:<port_number>/w3c-validator/. Voila, it works, and you can start validating your markup files with your own server.

Automate site validation

Once you have tested that your local W3C markup validator works, you should consider automating the validation of all the pages you own. Again you can use the powerful WWW::Mechanize Perl package to do the heavy lifting. You can also combine it with the Test-Simple Perl package to write test scripts that can become part of your site's test plans.

To begin with, you need to compile the list of URIs of all your markup pages, which you can easily get from your sitemap file. If you haven't created a sitemap file yet, we suggest you do so; you can find more details in our article on submitting your web site to the popular search engines. Then your script can simply iterate through the URIs and use WWW::Mechanize to submit each one to your local W3C markup validator.
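The same loop can be sketched with Python's standard library instead of WWW::Mechanize. This is not the sample script mentioned in this article, only a minimal illustration. It assumes that your sitemap follows the sitemaps.org format, that the validator accepts the page address in a uri query parameter of the check script, and that it reports the result in an X-W3C-Validator-Status response header; verify these last two assumptions against your own installation. All function names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by sitemaps.org sitemap files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def uris_from_sitemap(sitemap_xml):
    """Extract every <loc> value from a sitemap document given as a string."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def check_url(validator_base, page_uri):
    """Build the URL that asks the local validator to check page_uri."""
    query = urllib.parse.urlencode({"uri": page_uri})
    return validator_base.rstrip("/") + "/check?" + query


def validate(validator_base, page_uri):
    """Submit one page and return the status header, or 'Unknown' if it is absent."""
    with urllib.request.urlopen(check_url(validator_base, page_uri)) as response:
        return response.headers.get("X-W3C-Validator-Status", "Unknown")


def validate_site(validator_base, sitemap_xml):
    """Validate every page in the sitemap and return a {uri: status} mapping."""
    return {uri: validate(validator_base, uri) for uri in uris_from_sitemap(sitemap_xml)}
```

You would call validate_site("http://<host>:<port_number>/w3c-validator", sitemap_text) with the contents of your sitemap file and then report every page whose status is not Valid, for example from a test script that is part of your site's test plan.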

You can download the sample script that uses WWW::Mechanize to automate the validation of yuonlamp.com's markup pages and adapt it to your local environment.
