One issue: an intranet Oracle “enterprise” web application went down last week after concurrent users spiked to 350 from an average of fewer than 250. Two management meetings.
The first one –
Support Engineer: We’ve had two similar outages earlier this year. After each one, Oracle told us to tune certain database parameters, but the system still went down a third time last week…
Manager (interrupting): We as employees must take full ownership of any problem. Even though it may sometimes seem like a vendor’s issue, ultimately it’s still our responsibility to meet the SLA for our users… How did you do the capacity planning?
Support Engineer: We did the capacity planning at the end of last year…
Manager (interrupting again): If you did the capacity planning, why did the problem still occur?
Support Engineer: Well, there was a spike in usage. Usually there are fewer than 250 concurrent users, but for no apparent reason it shot up to above 350 last week. Theoretically that shouldn’t pose a threat, as our Oracle license allows up to a few thousand concurrent sessions.
Manager: I still don’t understand it — you know, I used to be an engineer myself — why did the problem occur when our Oracle can support a few thousand sessions?
Support Engineer (reluctantly): We don’t either… We are working with Oracle to find a solution. They are probably going to tell us to tune some more parameters.
Manager: How do you know if that’s going to work?
Support Engineer: We will monitor the system closely.
Manager B: Just get 50 guys to punch the keyboard to test.
Support Engineer: <keeps quiet>
Manager: Our Oracle enterprise system is of critical importance to the company. Our users rely on us to provide excellent support and we can’t let them down. We must strive to do our due diligence, perform proper capacity planning, and work closely with the vendor to ensure a top-notch SLA all the time. We have high expectations of you guys and I don’t want to see such a problem again at next month’s meeting.
Support Engineer: Can I raise a related issue?
Manager: Yes?
Support Engineer: We are more than pleased to see an increase in our business users, since that means the company is growing, and we feel really bad when we miss the SLA. However, if the vendor can’t provide a solution for a major performance boost, it means our hardware has reached its capacity limit. And since the business is growing, doesn’t that justify an investment in better hardware to satisfy the business needs?
Manager: I understand what you are trying to say. In fact, this is what the company forum is for; raise it over there. You should try harder, be persistent, and your voice will be heard. It has happened before. But remember, you must support your request with facts and strong financial justification. Any other issue? OK, the meeting is over.
**
The second one –
Support Engineer: We’ve had two similar outages earlier this year. After each one, Oracle told us to tune certain database parameters, but the system still went down a third time last week…
Manager (interrupting): Are they all performance issues?
Support Engineer: Yes. Usually there are fewer than 250 concurrent users in our system, but for no apparent reason it shot up to above 350 last week. Theoretically that shouldn’t pose a threat, as our Oracle license allows up to a few thousand concurrent sessions.
Manager: Taking a vendor’s word at face value can hurt us. If the problem is caused simply by concurrent web sessions, we can easily simulate the load and run a stress test. What tool do you use? I know a simple HTTP benchmarking utility from the Apache Software Foundation called ab. <switches to her Linux terminal and starts typing the command>
ab -c 350 -t 60 http://uat_intranet.mycompany.com/
Manager: Enable your logging, I’m about to fire it up.
<a minute later>
Manager: OK, so it died. What does the log say?
Support Engineer: There isn’t anything definitive in the log…
Manager: Let’s try it again, this time with your profiler hooked up. We’ll see which is the part that cracks under pressure.
<another minute later>
Support Engineer: I think we’ve got something: it’s the database connection function that has stopped responding.
Manager: Good, do we have its source code?
Support Engineer: Unfortunately, no. It’s one of the proprietary functions from Oracle.
Manager: Contact Oracle support, tell them exactly what we’ve done and press them to provide an urgent fix. Then we’ll iterate the stress test over and over until our performance needs are satisfied, preferably even for the spikes.
Support Engineer: OK
Manager: In the meantime, is it possible to throttle the application server so that, before the crashing point is reached, excess users are redirected to a temporary page with a simple “System under maximum load, please try again after a few minutes” message?
Support Engineer: I’ll look into it.
Manager: BTW, is the hardware capacity adequate to support the projected growth for the next 12 months?
Support Engineer: I don’t think so. Our existing hardware is already under heavy load.
Manager: I can borrow bigger iron to run capacity load tests, and we’ll request new hardware once we have the data. Any other issue?
Support Engineer: None.
Manager: Let’s get going. Keep up the good work.
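(An aside on the throttling idea raised in this meeting: it can be sketched in a few lines of Python. This is only an illustration under assumptions, not how any particular application server does it. The names `LoadShedder`, `handle`, and `max_in_flight` are hypothetical, and instead of redirecting to a temporary page the sketch simply returns the message with an HTTP 503 status.)

```python
import threading

BUSY_MESSAGE = "System under maximum load, please try again after a few minutes"


class LoadShedder:
    """Admit at most max_in_flight requests at once and turn the rest away."""

    def __init__(self, max_in_flight):
        # One semaphore slot per request allowed to run concurrently.
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, serve):
        # Non-blocking acquire: if no slot is free, refuse immediately
        # rather than queueing and adding to the pressure on the backend.
        if not self._slots.acquire(blocking=False):
            return 503, BUSY_MESSAGE
        try:
            return 200, serve()
        finally:
            # Free the slot even if serve() raised.
            self._slots.release()
```

A limit would be chosen somewhat below the point where the stress test showed the system failing, for example `LoadShedder(max_in_flight=300)` if the crash appears near 350; that number is a guess for illustration, not a measurement.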
**
Note that no definitive solution to the problem was reached in either meeting.
And did I mention that one of them is a real-life meeting that I went through today? Guess which one.