Saturday, November 10, 2007

Parsing HTML with java

Sometimes you need to parse HTML to extract data from it, for example to pull a particular ID out of a page. This can be a problem because HTML is often not well formed: it is full of tags that need not be closed, such as the br tag. To get around this, use the HTMLEditorKit. The kit can also help you integrate an HTML solution with Swing. Here is some code.

HTMLEditor kit parser:

import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HTMLParser {
    public static void main(String[] args) throws Exception {
        HTMLEditorKit.ParserCallback callback = new CallBack();
        // try-with-resources closes the file once parsing is done
        try (Reader reader = new FileReader("d:/test.html")) {
            new ParserDelegator().parse(reader, callback, false);
        }
    }
}

// Implement the callback class, just like a SAX content handler.
// Callbacks that are not overridden keep their empty default implementations.
class CallBack extends HTMLEditorKit.ParserCallback {
    private final Deque<HTML.Tag> stack = new ArrayDeque<HTML.Tag>();

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet a, int pos) {
        // get a tag and push it onto the stack
        stack.push(tag);
        System.out.println("Tag: " + tag);
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        // the element is finished, so it is no longer the current tag
        if (!stack.isEmpty()) {
            stack.pop();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // look at the latest tag processed; only text inside a span is of interest here
        HTML.Tag current = stack.peek();
        if (current == null || !"span".equals(current.toString())) {
            return;
        }
        System.out.println("Text: " + new String(data));
    }
}

The parser will tolerate tags that are not closed.
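
Since pulling IDs out of a page was the motivating example, here is a minimal sketch of that use case with the same callback mechanism. The class name IdExtractor, the method extractIds and the sample markup are made up for illustration; only the javax.swing.text.html calls come from the JDK. It collects attributes from both handleStartTag and handleSimpleTag, because tags without a closing tag (such as br) are reported through the latter.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original post.
class IdExtractor {

    /** Returns the value of every id attribute, in the order the tags appear. */
    static List<String> extractIds(String html) throws IOException {
        final List<String> ids = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(attrs);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Tags without a closing tag (br, img, ...) are reported here.
                collect(attrs);
            }

            private void collect(MutableAttributeSet attrs) {
                Object id = attrs.getAttribute(HTML.Attribute.ID);
                if (id != null) {
                    ids.add(id.toString());
                }
            }
        };
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(reader, callback, true);
        }
        return ids;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup (an assumption); note the p tag that is never closed.
        String html = "<html><body><div id=\"main\"><h1 id=\"title\">Big Header</h1>"
                + "<br><p id=\"note\">unclosed paragraph</div></body></html>";
        System.out.println(extractIds(html));
    }
}
```

Reading from a StringReader keeps the sketch independent of any file on disk; a FileReader works the same way.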

If you would prefer a DOM solution to the parser problem have a look at jTidy

A DOM solution is appropriate for HTML documents that are not too large and that need random access and in-memory modification. I have not tried jTidy myself; the lack of documentation made me stay away. The documentation available at SourceForge was pretty bad, with sample programs whose lines of code were all fused into one continuous run of characters.

Another DOM like solution is HTML-Parser. Here is the link

This parser is more powerful. You can use a lightweight or heavy-duty solution depending on your requirements. Here is some code for a lightweight Lexer parser. Documentation for this parser was pretty good.

Lexer code: [image not preserved]

Here is the output: [not preserved; only the text "New Document" survives]

One more solution is to use the swing Parser class. Here is some code

Swing parser:
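import java.io.File;
import java.io.FileReader;
import javax.swing.text.ChangedCharSetException;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.Parser;
import javax.swing.text.html.parser.TagElement;

public class SwingParserDemo {
    public static void main(String[] args) throws Exception {
        DTD dtd = DTD.getDTD("html.dtd");
        // The callbacks are protected, so override them in an anonymous subclass
        Parser parser = new Parser(dtd) {
            @Override
            protected void handleText(char[] data) {
                System.out.println("Text: " + new String(data));
            }
            @Override
            protected void startTag(TagElement element) throws ChangedCharSetException {
                System.out.println("Start tag: " + element.getElement().getName());
            }
        };
        parser.parse(new FileReader(new File("d:/test2.html")));
    }
}

Here is the output: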

Start tag: html
Start tag: head
Start tag: title
Text: New Document
Start tag: body
Start tag: test
Text: testamondo
Start tag: h1
Text: Big Header

This parser is DTD driven. It is more suited to a SAX type solution.

In conclusion

  • Use the HTML Parser when you need to perform complex operations. You can choose between lightweight and heavy-duty implementations.
  • Use the HTMLEditorKit or the Swing Parser when you simply intend to parse and read the HTML for specific data.

I am refraining from suggesting jTidy. I have not found any documentation as yet that will let me compare it with the other parsers. If I do I will update this article.

Monday, October 29, 2007

ZoneInfo error ?

I ran into this error some time back when I was loading a Sybase driver into my application code using Class.forName()

ZoneInfo: null\lib\zi\ZoneInfoMappings (The system cannot find the path specified)

Analysis of the stack trace revealed that the system was trying to find out which time zone my JVM belonged to. The trace also mentioned that it was unable to find java.home. This was quite weird, since my computer was in the right time zone and the java.home property was not set anywhere in my code or in the Eclipse runtime properties.

After searching for a while I found out where the problem was. I was loading some properties into my system properties and while loading my properties the JVM wiped out the existing properties, thus creating the error. I wanted to use the name value pairs inside a .properties file and then have my application look up a system property every time it needed a configuration detail. Here was the problem

Not so good code:

java.util.Properties props = new java.util.Properties();
props.load(new FileInputStream(new File(fileName)));
System.setProperties(props); // Don't do this: it replaces the whole set of system properties.

The System.setProperties(props) method wiped out existing properties (like java.home or time zone information ) and wrote my properties into the system properties. I got around the problem by setting the system properties one by one in a loop using the System.setProperty() method.

Better code:

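Iterator iterator = props.keySet().iterator();
while (iterator.hasNext()) {
    String key = (String) iterator.next();
    String value = props.getProperty(key);
    System.setProperty(key, value); // adds or overwrites one property at a time
    System.out.println(key + " " + value);
}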

However, the approach shown above is verbose for what it does. The Properties class extends Hashtable, which makes it a Map, so the following code adds the properties to the existing system properties in a single call.

Best code:

java.util.Properties props = new java.util.Properties();
props.load(new FileInputStream(new File(fileName)));
// Keeps all existing system properties; only keys that also appear in props get their values replaced.
System.getProperties().putAll(props);
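
To see the difference for yourself, here is a minimal sketch that merges a couple of properties and then checks that java.home is still there. The class name, the method name and the app.* keys are made up for the example, and the properties are read from a string rather than a file so that it runs anywhere.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical demo class, not part of the original post.
class SystemPropertiesMerge {

    /** Loads key=value pairs and merges them into the system properties. */
    static void mergeIntoSystemProperties(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        // putAll adds or overwrites only the keys present in props;
        // everything else (java.home, user.dir, ...) is left untouched.
        System.getProperties().putAll(props);
    }

    public static void main(String[] args) throws IOException {
        // The app.* keys are assumptions made for this example.
        mergeIntoSystemProperties("app.name=demo\napp.mode=test\n");
        System.out.println("app.name  = " + System.getProperty("app.name"));
        System.out.println("java.home = " + System.getProperty("java.home"));
    }
}
```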

I should also mention that this problem is not always consistent. Sometimes the JVM wipes out the existing properties and other times it does not. With JDK 1.3.1 hosted on Windows 2000 Server I received this error, but with JDK 1.5 and Windows XP this error does not occur. Perhaps it was fixed in JDK 1.5, or the environment plays a role here. Either way, it is better to avoid such random behavior if we are to write production-quality code.

You might also want to consider saving an instance of the Properties class or loading the name value pairs into a resource bundle in case you do not want to load runtime configuration into a system property.

Sunday, October 21, 2007

SOA Certifications

When I think about taking certifications that are based on SOA I am put off quite badly. Here is a list of certifications that I could probably get my hands on

  • Sun - Java web services (Not really SOA, but covers a part of it)
  • BEA's SOA certification - Currently I work with this server
  • IBM's SOA certification

Sun's certification is quite neutral, in that it does not attempt to push its own products in the name of certifying you. However, this certification is quite outdated. The last time I wanted to host a web service I used Apache Axis. More recently, after moving to WebLogic 9, the integrated web services look promising and they are available via the Workshop IDE for WebLogic as well. This certification does not give as much juice as you would like to have.

BEA's SOA certification strategy was starting to look promising until I started hunting for their study material. The courses were pretty costly and there was no way to pass the certification unless I bought them. This really put me off. If a client were to approach me and ask me how they could service enable their apps, I might end up saying 'uuhhhhhh......'. This is not to say that their SOA resource center is bad. They do have a couple of articles on what their products are. However, there is no series of articles that lists the products, teaches you what they can and cannot do, and explains when and where you can apply them. I cannot pay the earth for the courses that they offer.

IBM's SOA certifications are divided into layers. From associate, you climb your way up. The training material is available via the IBM website and it was much more easily accessible. A few courses require you to shell out some cash, but if your company is partnering with IBM, you can get them for free. My point is that their products and view of SOA are quite accessible. However, I do not use IBM products at my current project, so that leaves this certification out as well.

Any certification should aim at giving the candidate an overview of a technology or product and of its boundaries. Architecture certifications should go much further and give a clear understanding of what the technologies are and how to put them together. Right now SOA is a mix of both. It does have some standards, like BPEL, and some common SOA solutions, like web services with SOAP. However, each vendor has its own view of what SOA really is. A SOA certification at this point in time is certainly going to be targeted at a particular vendor. You learn about the vendor's product more than SOA itself.

Currently IBM seems to be pushing for a so-called SOA foundation. I attended a seminar where Rob High, chief architect for the IBM SOA foundation, spoke of how they were aiming to standardize SOA, programming models, definitions, and the like. That is quite interesting. I wonder how that will turn out. Perhaps we can have an open standard and allow vendors to implement it in their own way, much like the JVM, MQ, JNDI technologies of today. That might still not help SOA certifications be vendor neutral, but at least they will be less divergent in their views of SOA.

Tuesday, October 16, 2007

Recursively recursing xsl

If you have ever programmed in XSLT you will have come across the problem of not being able to reassign variable values. XSLT is declarative in nature when it comes to variables. That is, you can assign a variable once, and after that it is immutable. To get around that you can write recursive functions that do the math for you by returning the desired result in steps. Such functions are pretty darn solid and work well once written, since variables inside them are immutable. Here's an example


Evaluate the sum of the prices in this XML using any one of these XSLTs. Various kinds of recursive logic can be employed.

Recursively add each price by selecting one price at a time:

This logic is pretty simple. Here is how it works
  • Get first price
  • Get second price + first price
  • Get third price + second price + first price
and so on until the total count of prices is reached. In this case it is 5. The terminating condition is to return the last price instead of making a recursive call again.
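
The same index-driven recursion can be sketched in Java (used here only to keep the examples on this page in one language; it is not the XSLT itself). The class name, the method name and the five prices are made up for the example, and the sketch assumes a non-empty list of prices.

```java
// Hypothetical illustration of "select one price at a time".
class PriceSumByIndex {

    /** Sums prices[index..end]; the last price ends the recursion. */
    static int sumFrom(int[] prices, int index) {
        if (index == prices.length - 1) {
            // Terminating condition: return the last price, no further call.
            return prices[index];
        }
        return prices[index] + sumFrom(prices, index + 1);
    }

    public static void main(String[] args) {
        int[] prices = {10, 20, 30, 40, 50}; // assumed sample data
        System.out.println(sumFrom(prices, 0)); // prints 150
    }
}
```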

Recursively manipulate a node list.
Add the first price to a recursive result each time:

With this logic you begin with a node list of 5 products. This is what happens
  • Get price of first product
  • Reduce node list size by moving to the next product
  • The node list size is now 4. First product has been removed
  • Result = Price of first product + Recursively call the same method
  • Get price of first product (which used to be the second product in first iteration)
  • Reduce node list size by moving to the next product
  • The node list size is now 3. The product that used to be second has been removed
  • Result = Price of first product(1st Iter) + Price of first product(2nd Iter) + Recursively call the same method
and so on... The terminating condition is to check for the existence of any nodes in the product list. If there are none, return 0. So the addition becomes product1 + product2 + ... + product5 + 0.

You should prefer tail recursion where possible since optimizers can execute it as iteration. More on tail recursions here
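
Here is a rough Java sketch of the shrinking-list logic, followed by a tail-recursive variant that carries the running total along as a parameter. The names and the sample prices are made up for illustration. Note that Java itself does not turn tail calls into loops, so the second method only shows the shape that an optimizing XSLT processor could take advantage of.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of recursing over a shrinking list.
class PriceSumByList {

    /** Price of the first product plus the sum of the rest; an empty list adds 0. */
    static int sum(List<Integer> prices) {
        if (prices.isEmpty()) {
            return 0;
        }
        return prices.get(0) + sum(prices.subList(1, prices.size()));
    }

    /** Tail-recursive shape: the recursive call is the very last thing that happens. */
    static int sumTail(List<Integer> prices, int total) {
        if (prices.isEmpty()) {
            return total;
        }
        return sumTail(prices.subList(1, prices.size()), total + prices.get(0));
    }

    public static void main(String[] args) {
        List<Integer> prices = Arrays.asList(10, 20, 30, 40, 50); // assumed sample data
        System.out.println(sum(prices));        // prints 150
        System.out.println(sumTail(prices, 0)); // prints 150
    }
}
```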

You can also use the element where applicable. The trick is to know where to use recursion and where not to. For example factorials are not such a bad example of recursion. With factorials you return values like this

long fac(int number) {
    // last number reached: stop recursing
    if (number <= 1) return 1;
    return number * fac(number - 1);
}

This is ok. The method call stack is proportional to the number of numbers to find the factorial for. That is 5! requires 5 method calls on the call stack.

Method stack:
  • 5 * 4!
  • 4 * 3!
  • 3 * 2!
  • 2 * 1!
  • 1

This becomes a bad idea with Fibonacci. Here is some logic to do Fibonacci math in recursion

if number <= 1 return 1;
else return fibo(number-1) + fibo(number-2);

If I want to find the 4th Fibonacci number, this is how the logic divides it

Method stack:

  • 4 -> fibo (3) + fibo(2)
  • 3 -> fibo(2) + fibo(1)
  • 2 -> fibo(1) + fibo(0)
  • 1-> 1
  • 0 -> 1

That's a lotta calls. Because fibo(2) and fibo(1) are worked out more than once, fibo(4) alone makes 8 recursive calls (9 invocations counting the original one). This logic ends up breaking each result down into a sum of 1s, for example fibo(3) = 3 = 1 + 1 + 1. A large number will take forever to calculate using this logic. It is better not to use recursion for cases like this. You also don't want recursion or XSLT to do too much processing if the browser is going to do the work of translating the stuff into HTML for you.
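
To make the cost visible, here is a small Java sketch that counts how often the recursive version is invoked and compares it with a plain loop. The class and method names are made up for the example; the base case returns 1 for both 0 and 1, as in the logic above.

```java
// Hypothetical illustration: naive recursion versus a loop.
class FiboCalls {
    static long calls; // number of times fiboRecursive has been entered

    static long fiboRecursive(int number) {
        calls++;
        if (number <= 1) {
            return 1;
        }
        return fiboRecursive(number - 1) + fiboRecursive(number - 2);
    }

    /** Same results, but each value is computed exactly once. */
    static long fiboIterative(int number) {
        long previous = 1; // fibo(0)
        long current = 1;  // fibo(1)
        for (int i = 2; i <= number; i++) {
            long next = previous + current;
            previous = current;
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        calls = 0;
        System.out.println(fiboRecursive(4)); // prints 5
        System.out.println(calls);            // prints 9
        System.out.println(fiboIterative(4)); // prints 5
    }
}
```

The call count grows roughly as fast as the Fibonacci numbers themselves, while the loop needs one pass per step.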

For more advanced topics on recursion you might want to look at this link

Friday, October 12, 2007

Solving the erroneous handlers error for weblogic

I was running my weblogic 9 server at work and one of the JSP pages threw this at me.

java.lang.InternalError: erroneous handlers
    at com.sun.facelets.compiler.TextUnit.addChild(
    at com.sun.facelets.compiler.CompilationManager.startUnit(
    at com.sun.facelets.compiler.CompilationManager.pushNamespace(
    at com.sun.facelets.compiler.SAXCompiler$CompilationHandler.startPrefixMapping(

After being puzzled for a few minutes I figured out that I was using the BEA JRockit VM to run my weblogic server and it was not giving me a proper description of what was actually wrong. If you encounter the stack trace shown above in your application (or something with erroneous handlers in it), do the following from weblogic workshop to learn what the error really is. Probably a class file is missing. You should get a NoClassDefFoundError after you do these steps, if that is what the problem really is.

Configure a server in weblogic workshop and double click on it to get this screen

Server in Workshop:

Click on launch configuration:

Set the JAVA_VENDOR flag to 'Sun' to use the Sun JVM or 'BEA' to use the JRockit JVM. In our case we set it to 'Sun'. The Sun JDK should now give us more information about the mysterious error. If the property JAVA_VENDOR is not available, add it.

If you have not configured anything in workshop, you can always set the JAVA_VENDOR variable in the setDomainEnv.cmd file located in your domain bin directory using the line SET JAVA_VENDOR=Sun

This should force weblogic to use the sun JVM instead. You can verify this by looking for the startup path from which java.exe is called from your console (the server log console not the web console)
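
Another way to check which JVM you ended up on is to ask the JVM itself. This is a minimal sketch using the standard java.vendor, java.vm.name and java.home system properties; the class and method names are made up, and where you call it from (a startup class, a JSP, a quick main method run with the same java.exe) is up to you.

```java
// Hypothetical helper for checking which JVM is actually running.
class JvmInfo {

    /** Builds a one-line description from standard system properties. */
    static String describeJvm() {
        return "java.vendor=" + System.getProperty("java.vendor")
                + ", java.vm.name=" + System.getProperty("java.vm.name")
                + ", java.home=" + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        System.out.println(describeJvm());
    }
}
```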

Friday, October 5, 2007

Common code and the open closed policy

Common Application

Often a group of applications write the same code to perform a particular task such as looking up user information or looking up EJB services through a service locator, or executing a query after fetching a connection to the database. These common services can be made available using EJB or POJO objects.
It is often a good idea to write a separate application that handles all the common functionality for you. The application would have to be designed well enough to allow you to work with existing logic without having to change it every time the common application undergoes a change. The common application services can be accessed via EJB, Spring POJO, or simple POJO distributed through JAR files. The JAR file becomes part of the library for other applications to use.

Imagine that you have changed this common application and that the JAR file has to be updated in 20 other applications that use it. It would be cumbersome to test each application that uses the common services every time a change is made. Running a series of JUnit tests is acceptable, but manually testing every application each time a change is made is not.

Code that is written using the open-closed principle will not need such exhaustive testing. When you write code, always make sure that your existing functionality is wrapped nicely and is robust, so that you need not touch it at any point in time. Also reduce the number of methods that a caller needs to invoke in a common library to satisfy a particular service. Whatever your services do internally should not be exposed to the outside world. When you write code to this formula, your code is open for enhancement yet closed to changes in the existing code that already works for you. All existing code should be frozen and extra features should be add-ons to the existing code, without changing it in any way.
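
As a rough sketch of what this can look like in a common library, consider a user lookup service. All the names here (UserLookup, InMemoryUserLookup, CachingUserLookup) are invented for the example. The new caching feature is added as a separate class that wraps the existing one, so the existing class is not edited and callers keep using the same one-method interface.

```java
import java.util.HashMap;
import java.util.Map;

// The only thing the 20 client applications get to see.
interface UserLookup {
    String findDisplayName(String userId);
}

// Existing, tested code: frozen, never edited again.
final class InMemoryUserLookup implements UserLookup {
    private final Map<String, String> users = new HashMap<String, String>();

    InMemoryUserLookup() {
        users.put("jdoe", "John Doe"); // assumed sample data
    }

    @Override
    public String findDisplayName(String userId) {
        return users.get(userId);
    }
}

// New feature added as an add-on: it wraps any UserLookup without changing it.
final class CachingUserLookup implements UserLookup {
    private final UserLookup delegate;
    private final Map<String, String> cache = new HashMap<String, String>();

    CachingUserLookup(UserLookup delegate) {
        this.delegate = delegate;
    }

    @Override
    public String findDisplayName(String userId) {
        // Ask the wrapped lookup only the first time a user id is seen.
        return cache.computeIfAbsent(userId, delegate::findDisplayName);
    }
}

class OpenClosedDemo {
    public static void main(String[] args) {
        UserLookup lookup = new CachingUserLookup(new InMemoryUserLookup());
        System.out.println(lookup.findDisplayName("jdoe")); // prints John Doe
    }
}
```

Only the code that wires the objects together changes; InMemoryUserLookup and its tests stay exactly as they were.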

More on the open closed principle and other patterns/principles in this book, which I would highly recommend - Head first design patterns