Friday, November 20, 2009

Representation of Correlation, classification as well as hierarchy into DB

One fine day while I was driving to the office, I show a modified car in front of my car and I started thinking how to represent it into relational model(ERD). From that my thought process started. I was thinking about few things such as:
  • How to represent a car as a vehicle
  • How to represent make of a car
  • How to represent model of a car
  • How to represent car's association to a person using number plate
  • How the modified car fits into the model which I have derived just now etc....
While I was thinking all these things 3 things came in my mind : correlation between entities, classification of entities and representation of entities into hierarchy. Then I was trying to answer following questions about those three things:
1. Should we represent them into ERD or not?
2. If yes then how should they be represented?

My thoughts on those things are as following:
Correlation : It must be represented into ERD by dedicated column(s) into a table which works as cor-relationship id.
Classification:Classification should not be represented into ERD. Reason behind this is classification depends on criteria to be used for classifying things at that point in time. Here moving parts are : fixed set of criteria to be used, validity of criteria to be applied at given point in time, acceptance of results at that point in time for those input parameter to that person etc. So rather then representing classification its always better to capture values of various criteria into ERD driving the classification.
Hierarchy : Hierarchy should be represented into system. But how to represent an hierarchy is a difficult decision. One can simply say that it depends on business; I would say its 100% correct but do you know your current business 100% and can you envision 70% of your business future shifts in terms of hierarchy. Most of the time when business changes; reorganization of existing entities happens. This reorganization is based on our better understanding of current model as well as demand of changes at that point in time. Other point which add complexity into representation of hierarchy is related to graph theory. How do you want it to be represented it? Directional or non-directional and if its directional then which direction is correct?

Wednesday, July 1, 2009

To be careful while serializing an object into BLOB

Sometimes its very difficult to represent some information into relational manner even if its easy to represent into objects. That time you might want to serialize your object into DB. One of the example of what I am taking about is storing expressions into DB. Its very difficult to find tabular representation of expressions e.g. How to represent x*y + (z-10) * (k ^ 3) into DB? But its very easy to represent them into object. In fact this expression is very primarily expression though.

I was working on RulesManagement module and as a part of implementation of the module I have defined certain classes representing various language elements like operators, conditional elements, variable representation etc. As we know rule requires condition to be defined as a part of it, and condition can be represented by expression. I was zeroed down to serialze object into DB. My decision worked well, I got lots of freedom on defining expressions. I was happy and everyday all my testcases were going green
as a part of build process. One day suddenly I got all RED. I was clueless as I have not made any changes in my code since last 3 months. I and my team mates started doing root cause analysis. We spend around 2 hours and the result of this process was bit frustrating. My RuleSet class was extending Persistable class. Persistable class was part of our persistance framework and so we were extending all our object from this class if we want them to be persisted into DB. So I have extended RuleSet class from that one. That day somebody has made changes into Peristable class and so I was getting serial version UID error during deserialization of already persisted rulesets. Its very serious problem as business can never accept persisted rule not being retrieved back.

Lesson learnt:
Object to be serialized into DB should not extend from any class . It should simply implemetat Serializable interface. Also class to be serialized into DB must have default serial verison UID. If you go with generated serial version UID and over the period of time somebody modifies that because of some change in that class, you are screwed.

Friday, May 22, 2009

Experience with using multiple datasources in JEE application

We started with multiple data sources in one of our project. One of the reasons why it was decided to use multiple data sources was to reduce performance impact. It was SOA so we have started assigning a dedicated data source with dedicated authorization id to each service. The very first problem what we encountered here was maintenance of authorization ids. As different data sources were using different authorization id, we had to keep track of what data source uses which authorization id. After having around 8-10 data sources this maintenance became bizarre. We learn from that experience and made all data sources using same authorization id in a given environment. So now we have all data sources using same id in DEV environment and other id into QA environment and all. This model works pretty well with us.

Second problem in using multiple data sources was mysterious dead lock issue . We were using all these data sources for accessing same DB. When multiple data sources are nvolved into single logical unit of work (i.e. logical transaction), each data source starts its own transactional branch. If this transaction with isolation level set to READ_COMMITTED involves multiple tables and some of them are having foreign key relationship then it might happen that one transaction branch insert a record into table x and other transaction branch also tries to insert a record into table y which has foreign key relationship to table x. In this scenario table y will time out if record to be inserted into table y is related to a record inserted into table x. Reason behind this is, record inserted into table x is locked by transaction A so record to be inserted into table y can not read the record from table x for checking referential integrity as its in other transaction and isolation level is set to READ_COMMITTED.

So important points to be considered in using multiple data sources are:
  1. Please try to stick to same authorization id for a given environment, if not required.
  2. Multiple data sources should not be pointing to same database, otherwise it might lead to dead lock issue quite often. Use multiple data sources if its required to point to different databases.
  3. In case of multiple data source and remote EJB call scenario one should use XA data source as it might lead to a XA transaction.

Sunday, February 15, 2009

My study on using stored procedures in JEE applications

Pros:
  1. Good at performance as all operations are happens inside DBMS system itself. Also nowadays most of DBMS system has mechanism of caching query execution plan so it can perform even faster at future calls. Query execution plan remains same during each round of executions of a stored procedure.
  2. Provides another layer of abstraction by providing separation between application and underlying DBMS system.
  3. Reduces network traffic by avoiding repetitive calls to DBMS system.
Cons:
  1. There are very high chances that business logic gets spread at two places. So it’s difficult to manage the business logic.
  2. We need versioning of stored procedures as its part of project deliverable and we can not check out a copy of stored procedure from DBMS system. We are left with two copies, one which is present into DBMS system and the other one which is present at SVN. So it’s hard to maintain latest valid copy of a stored procedure at DBMS system.
  3. Stored procedures are bad at scalability. We can do horizontal as well as vertical scaling of J2EE application server for doing load balancing but we can not do the same for DBMS system as they don’t support clustering. Oracle has some support for clustering. So stored procedure execution can become bottleneck into multi user system with high load.
  4. Using stored procedure means another layer of abstraction and another language to learn.
  5. Hard to debug from the same IDE which is used for development. So its affects productivity.
  6. There are very high chance of java developers writing improper stored procedure like stored procedure with less efficient queries, open cursors, improper error handling, improper state management etc.
  7. Stored procedure language is procedural language. So it’s very difficult to write a complex logic into it. Even if we write it ships with all disadvantages of procedural language. E.g. If we need to do two operations on a Customer then we need to write down two procedures. Lots of code used in these two stored procedures may be same but we can not reuse it. So we might end up in having lots of code duplication.
  8. Reduces testability of application.

My comments:
Each and every person in team should understand clear distinction between data modeling operation and business operation for a given functionality. So business logic should go into service layer and the data related operation if it’s too complex then should go into stored procedures. I believe that if we have very big DBMS schema with large number of records in it, we should keep data related operation (E.g. data formatting, data aggregation etc.) into stored procedures. We should have proper process in place for managing this kind clear distinction from the starting of the project itself otherwise it might become very difficult to control it at later point of time.

Most of the above cons can be overcome by having proper processes in place as well as better tooling support.

Personally I don’t like to use stored procedures, if I have other good options at Java space for given requirements. But if stored procedure is the only good option which can fit well to the given requirements then I don’t mind in doing that keeping above distinctions in mind.

I may be wrong in stating above comment. I would like to correct my perceptions if they are wrong. It will really help me going further. Thanks in advance for the same.

Some quotes:
"If you want portability keep logic out of SQL." - Martin Fowler

"Do not use stored procedures to implement business logic. This should be done in Java business objects." - Rod Johnson

References:
http://codebetter.com/blogs/jeremy.miller/archive/2005/07/06/130094.aspx
http://www.martinfowler.com/articles/dblogic.html
http://thinkoracle.blogspot.com/2006/01/plsql-vs-j2ee.html
http://www.codinghorror.com/blog/archives/000117.html
http://www.theserverside.com/discussions/thread.tss?thread_id=32141