One of my previous customers has a Jakarta Tapestry [3.0.x] based site. The site is subscription-based, but it also has a public area: if you browse each and every link, you should be able to view a few thousand [dynamically generated] pages. No SEO* consulting was involved in building the site. To put it simply: I got some specs and HTML templates, then developed, deployed, bugfixed and hasta la vista…

More than six months later [!], the site is still alive, which is good, but it doesn't really sport impressive traffic figures or growth. Basically, all the traffic it gathers seems to come from existing subscribers and paid ads: very little organic traffic, and almost none from major engines such as Google (although it was submitted to quite a lot of engines and directories).
Lo and behold, there must be something really nasty going on, since a quick query on Google for site:www.javaguysite.com** gives exactly one freaking result: the home page. Which means Google has indexed ONLY the entry page, and the same thing happens with all the other major search indexes. And guess what: nobody is going to find the content if it isn't even indexed by search engines.

Making friends with your public URLs

The problem: Tapestry URLs are too ugly for search engines. Looking at the source of my navigation menu, I found little beasts such as

http://www.javaguysite.com/b2b?service=direct/1/Home/$Border.$Header.$LoggedMenu.$CheckedMenu$1.$DirectLink&sp=SSearchOffers#a

For a Tapestry programmer this is a simple direct link from inside a component embedded in other components, but for a search engine bot it is an overly complex link to a dynamic page, which will NOT be followed. Thus, if you want these little buggers to crawl all over your site and index all the pages, make 'em think it's a simple static site, with URLs such as

http://www.javaguysite.com/b2b/bla/bla/bla/SearchOffers.html

In SEO consultants' slang, these are called "friendly URLs"***. You don't have to make all your links friendlier. For instance, there is no need to touch the pages available only to subscribers, as they will never be available for public searching. In the public area, make friendly URLs only for those pages containing relevant content.
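To give the idea, here is what such a mapping could look like as an Apache mod_rewrite rule. This is a sketch only: the RewriteRule syntax is real, but the page name and the Tapestry service parameters are illustrative assumptions, not the actual site's configuration.

```apache
# Sketch: map the "static-looking" friendly URL back onto the real
# Tapestry service URL before it reaches the servlet container.
# (Page name and service parameters are hypothetical.)
RewriteEngine On

# [PT] passes the rewritten URL on for further processing (e.g. to the
# proxied servlet container); [L] stops rewriting after this rule.
RewriteRule ^/b2b/offers/SearchOffers\.html$ /b2b?service=page/SearchOffers [PT,L]
```

The bot only ever sees /b2b/offers/SearchOffers.html; the application still receives the dynamic URL it expects.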
The method is called URL rewriting. Rewriting means that the web server transforms the request URL behind the scenes, using regular expressions, in a totally transparent manner. Thus, the client browser or the bot "thinks" it reaches a certain URL, while a different address is sent to the servlet container. The rewriting is performed either by:

1. using a servlet filter such as urlrewrite, or
2. using mod_rewrite in Apache.

I already use Apache as a proxy server, in order to perform transparent and efficient gzip-ing on the fly, as described in one of my previous blog posts. Now I only had to add the mod_rewrite rules and I was ready to go. Only minor syntax differences exist between the regular expressions in the filter and the Apache module, so I was able to switch seamlessly between the two: the servlet filter in the development environment, Apache proxying in production.

The devil is in the details

Now we're sure that the dynamic pages from the public area will be searchable after the Google bot crawls them. Problem is: all pages of a single category will have the same title. Like, for instance, "Company details" for all the pages containing… umm, company details. And when you have thousands of companies in the database, that makes a helluva lot of pages with the same title! Besides, keywords contained in the page title play an important role in the search position for those specific keywords. The conclusion: make your page titles as customised as possible. Put in not only the page type, but also relevant data from the page content: in our case, the company name and, why not, the city where the business is located. This is easy with Tapestry: bind the title element to a page property, and then define a customised public String getPageTitle(); in all the page classes (with maybe an abstract getPageTitle in the base page class, supposing you have one defined in the project, which one normally should).
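A minimal sketch of the per-page title idea, assuming a common base page class; the class and property names below are illustrative, not the actual project's code. In a Tapestry 3 template the title would be bound to the property with something like <title><span jwcid="@Insert" value="ognl:pageTitle"/></title>.

```java
// Base page class: forces every page to provide its own title.
// (Hypothetical names; in the real project this would extend
// org.apache.tapestry.html.BasePage.)
abstract class BasePage {
    public abstract String getPageTitle();
}

// A concrete page: the title combines the page type with
// content-specific data, so thousands of pages get distinct titles.
class CompanyDetailsPage extends BasePage {
    // In the real application these values would come from the database.
    private final String companyName = "Acme Trading";
    private final String city = "Bucharest";

    public String getPageTitle() {
        return "Company details: " + companyName + " (" + city + ")";
    }
}
```

Every "Company details" page now carries a unique, keyword-bearing title instead of a generic shared one.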
The same reasoning applies to the page keywords and description meta tags, as they are taken into account by most search engines. Use them, make them dynamic and insert content-relevant data. Don't rely only on generic keywords, as the competition on those is huge: a bit of keyword long tail can do wonders. Don't overdo it, though, and don't try to trick Google, as you may get some nasty surprises in the process. And if you can afford it, get some SEO consulting for the keyword and title content.

There's another rather obsolete but nevertheless important HTML tag: the H1. Who on Earth needs H1 when you've got CSS and can name your headings according to a coherent naming scheme? Well, apparently Google needs H1 tags, reads H1 tags and uses the content of H1 tags to compute search relevancy. So make sure to restyle H1 in your CSS and use it inside the page content. People seem to believe it has something to do with a sort of HTML semantics…

That's it, at least from a basic technical point of view. For a longer discussion about SEO, read Roger Johansson's Basics of search engine optimisation, as well as the massive amount of literature freely available on the web on the subject. Just… search for it.

*SEO = Search Engine Optimization.
**Names changed to protect the innocent.
***Supposedly, Tapestry 3.1, currently in alpha stage, has way friendlier URLs than 3.0. However, don't use an alpha API on a production site.