db2 list tablespaces show detail Python Script

January 27th, 2009

A short time ago I needed to understand how much free space I had in each table space of a DB2 database, and I did not have access to any management tools that would have made the task easier. My only option was to run db2 list tablespaces show detail at the command line, but the output of that command cannot be imported into Excel easily. The solution I settled on was to modify a Python script originally written to read ini files. The script is listed below. To use it, pipe the output of db2 list tablespaces show detail into a file (for example, db2 list tablespaces show detail > c:\list_tablespaces_output.txt), then edit the script so it reads from that file. The script prints a tab-delimited list of table spaces that can be copied and pasted into Excel.

import re

# A new table space block starts at each "Tablespace ID = n" line;
# every other "name = value" line is a detail of the current table space.
SECTION = re.compile(r'^\s*Tablespace ID\s*=\s*([0-9]+)\s*$')
PARAM = re.compile(r'^\s*(.*)\s*=\s*(.*)\s*$')
# Skip comment lines, a leftover from the original ini-file parser.
COMMENT = re.compile(r'^\s*;.*$')

d = {}
# Edit this path to point at the captured command output.
f = open('c:\\list_tablespaces_output.txt')
for line in f:
    if COMMENT.match(line): continue
    m = SECTION.match(line)
    if m:
        # Start a new dictionary for this table space ID.
        section, = m.groups()
        d[section.strip()] = {}
    m = PARAM.match(line)
    if m:
        # Store every "name = value" pair under the current table space.
        key, val = m.groups()
        d[section.strip()][key.strip()] = val.strip()

f.close()

print "Name\tUsed pages\tTotal pages\tUsed pages\tFree pages\tHigh water mark (pages)\tPage size (bytes)\t% Free"
for k, v in d.items():
    # SMS table spaces report "Not applicable" instead of a page count.
    if v["Free pages"] != "Not applicable":
        # Fraction of free pages; format the column as a percentage in Excel.
        pctfree = float(v["Free pages"]) / float(v["Total pages"])
    else:
        pctfree = 0

    print "{0}\t{1}\t{2}\t{3}\t{4}\t{5}\t{6}\t{7}".format(v["Name"]
        , v["Used pages"]
        , v["Total pages"]
        , v["Used pages"]
        , v["Free pages"]
        , v["High water mark (pages)"]
        , v["Page size (bytes)"]
        , pctfree)

MySQL vs. PostgreSQL for Data Warehouses that use Views

November 7th, 2006

As part of the Netezza replacement project I’ve been working on, I ran a quick evaluation of MySQL 5.x before testing PostgreSQL. I didn’t have time to conduct a thorough evaluation, but I quickly found out that MySQL does not handle queries run against views very well. I ran some tests by creating simple views on top of some very large tables (40+ million rows). Running a query against the base table was many times faster than running the same query against a view, even though the view did not contain a WHERE clause. I’ll need to do some additional research, but it seems that the implementation of views in MySQL lags far behind the PostgreSQL implementation. That is not very surprising, since support for views was only added to MySQL in version 5.0.
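
The test was roughly the following (the table and view names are illustrative, and the query is simplified):

-- Create a trivial view on top of a large fact table.
create view sales_fact_v as
select * from sales_fact;

-- Query the base table directly...
select count(*) from sales_fact;

-- ...then run the identical query through the view; in my tests this
-- was many times slower.
select count(*) from sales_fact_v;

-- explain shows whether MySQL merged the view into the query or
-- materialized it in a temporary table first.
explain select count(*) from sales_fact_v;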

PostgreSQL Performance Checklist

November 7th, 2006

I found some very valuable tips on the web site of an unpublished book: Power PostgreSQL by Josh Berkus and Joe Conway. If you are working with large data sets and complex queries, pay special attention to the section that discusses work_mem.
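
As an illustration of the kind of adjustment involved (the table name and the value below are made up), work_mem can be raised for a single session before running a memory-hungry sort or aggregation; a bare number is interpreted as kilobytes:

-- Raise the per-operation sort/hash memory for this session only (value in KB).
set work_mem = 262144;

-- Re-run the heavy query and compare its timing against the default setting.
explain analyze
select customer_id, sum(amount)
from sales_fact
group by customer_id;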

Tuning PostgreSQL for Data Warehousing

October 26th, 2006

I’m currently working on a project that will replace a Netezza database with PostgreSQL 8.1. Early test results have been very encouraging, even though I haven’t found much information online about how best to tune PostgreSQL when dealing with complex queries running against very large tables. I will share what I learn in future posts.

Tuning Methodology

February 24th, 2006
  1. The first step is to identify the root causes of the problem. The IT team may claim that the reporting software does not work well due to the complexity of the database schema, while the business users may claim that even simple reports take forever to run. Regardless of what either group says, you must gather hard data so that the real issues reveal themselves. The first question to ask the IT team is: are you capturing DW usage statistics? If usage statistics are not available, skip to the next step. If usage is being tracked, you should be able to answer the following questions (a sketch of the kind of SQL that answers them follows the list):
    1. Who are the top report users?
    2. Who are the top report creators?
    3. Which users generate the most errors?
    4. What are the most common errors?
    5. What are the most popular reports?
    6. Which reports take the longest to run?
    7. Which database tables are used most frequently?
    8. What is the row count of the most frequently used tables?
    9. What table columns are used the most?
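
If the reporting tool does not provide canned reports for these statistics, most of the questions can be answered with simple queries against whatever usage log the environment keeps. The table and column names below are purely illustrative:

-- Top report users (question 1).
select user_name, count(*) as report_runs
from report_usage_log
group by user_name
order by report_runs desc;

-- Reports that take the longest to run (question 6).
select report_name, avg(run_seconds) as avg_run_seconds
from report_usage_log
group by report_name
order by avg_run_seconds desc;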

Performance Tuning Guide

February 23rd, 2006

I’m using this blog to create a performance tuning guide for data warehousing professionals. The guide will document the steps I follow to diagnose and solve what my clients perceive to be performance problems.

Let me first describe a typical scenario that I have encountered during my consulting career. After several years of development, a large corporation has finally managed to implement a corporate-wide data warehouse (DW). A dedicated IT development group has built and deployed a reporting solution on top of the DW. Unfortunately, the users of the DW, typically on the business side of the corporation, are very unhappy with the reporting environment. They claim that performance is extremely slow. Reports take forever to run. Many reports return cryptic errors that only IT can decipher. The IT group is bombarded with support requests, and must allocate most of its time to answering support calls. Furthermore, since the corporation’s business is changing constantly, business users demand that the DW evolve to reflect how the business works. However, since IT spends most of its time fixing problems with the existing reporting environment, it cannot implement new features in the DW.

The scenario described above creates a dynamic whereby business users grow increasingly disillusioned with the DW and the IT group in charge of it, while the IT team grows frustrated with the business users and their insatiable demand for support and feature improvements. Both groups become frustrated with the reporting software, which is the most visible part of the DW.

When the situation becomes critical enough, the IT team decides to bring in an external consultant to “fix the problem”.

In my next post I will list the high-level steps that a consultant must follow to solve the problem.

DB2 Configuration Parameters for Data Warehousing

June 1st, 2005

An essential part of the performance tuning process is verifying that your database is configured properly. Scott Hayes and Philip Gunning provide detailed tips for setting DB2 configuration parameters in Tuning Up for OLTP and Data Warehousing.

Measure, Improve, Repeat

May 24th, 2005

Scott Hayes wrote Measure, Improve, Repeat, an article that offers some practical advice on tuning DB2. The article does not focus on data warehouse applications, but some of the tips still apply. I found his tip about getting all performance data with just one command particularly useful:

$ db2 "get snapshot for all on DBNAME" > allsnap.txt

Teradata data compression

May 24th, 2005

One of the interesting ways to improve performance in Teradata is through data compression. The linked article covers some of the details behind how compression works.
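
As a rough sketch of what this looks like (the table is hypothetical and the compressed values are made up), Teradata’s multi-value compression is declared per column in the table DDL, listing the most frequent values so they are stored once per table rather than in every row:

-- Compress the most common values of a low-cardinality column.
create table transaction_fact
( account_id         integer
, transaction_type   char(10) compress ('deposit', 'withdrawal')
, day_id             integer
, transaction_amount decimal(18,2)
)
primary index (account_id);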

Aggregate Tables

May 20th, 2005

I’ll be gathering all my notes on aggregate tables in this post.

What are aggregate tables?

Aggregate tables, also known as summary tables, are fact tables that contain data summarized to a coarser level of detail. For example, let’s say that your data warehouse contains a transaction table with the following characteristics (I’ll use a banking example):

Table columns: account id, transaction type, day id (the dimensions), transaction amount (the measure)
Average number of transactions per day: 30 million
Number of days stored in the transaction table: 30
Approximate number of rows: 900 million rows

Let’s pretend that half of the daily transactions are deposits, so there are approximately 450 million rows that represent deposit transactions. The other half are withdrawals.

Suppose a DW user wants to know how much money was deposited into the bank during the past month. The user, through the reporting software, will issue a query similar to:

select sum(transaction_amount)
from transaction_fact
where transaction_type = 'deposit'

Pretend that your DW platform can scan 10 million rows per second; therefore, the approximate time to complete the query will be:

query time = number of rows / scan rate

which in our example translates into:

query time = 900 million rows / 10 million rows/second

query time = 90 seconds

Waiting 90 seconds for such a simple query is unacceptable, and this is where an aggregate table can help. To answer our hypothetical question, we will build an aggregate table that summarizes the transaction table by day and transaction type. The aggregate may be defined as follows:

create table transaction_aggregate as
select day_id, transaction_type, sum(transaction_amount) as transaction_amount
from transaction_fact
group by day_id, transaction_type

We said before that there are only two transaction types and thirty days of data, so the new table will contain only 60 rows! (table size = 30 days * 2 transaction types)

The SQL needed to get the answer is:

select sum(transaction_amount)
from transaction_aggregate
where transaction_type = 'deposit'

And the answer will come back in 0.000006 seconds (60 rows / 10 million rows/second). The result: happy users!