Japanese


Welcome to the pg_bulkload Project Home Page


pg_bulkload is a high speed data loading utility for PostgreSQL.

2015.07.03 The project is moved to GitHub.

Please follow latest information from http://github.com/ossc-db/pg_bulkload

日本語ページはこちら


Documentation


Performance Results

Here are comparisons between COPY and pg_bulkload. Performance was measured with basic-tuned PostgreSQL server.

Table definition
Customer table in DBT-2 benchmark, that is an implementation of TPC-C.
Index definition
There are 2 indexes. The first one is a primary key with one ascending integer column. The second one is a non-unique index with one random integer column.

There are the following measurement patterns.

  1. Initial data loading to an empty table for 4GB of data
  2. Appended data loading to a table with 4GB of data for 1GB of new data
  3. Performance efficiencies by maintenance_work_mem and FILTER features

Result 1: Initial data loading

Pg_bulkload on WRITER = PARALLEL mode can load data with an almost half time against COPY. In PARALLEL mode, performance will be improved on multi-CPU server because reading an input file and writing rows to a table are done with two processes.

COPY also can be more fast by loading into table without indexes and create indexes after it.

Initial Load (4GB)
Item Duration Duration comparison
COPY with indexes 500 sec -
COPY without indexes
+ CREATE INDEX
333 sec
(229 sec + 51 sec+ 53 sec)
66.7 %
pg_bulkload (DIRECT)
with indexes
334 sec 66.8 %
pg_bulkload (PARALLEL)
with indexes
221 sec 44.2 %

Result 2: Appended data loading

Pg_bulkload on WRITER = PARALLEL mode can load data with an almost half time against COPY in this case also. COPY without indexes is not faster than COPY with indexes. because it has to create indexes for total records of the table from initial.

Appended Load (1GB)
Item Duration Duration comparison
COPY with indexes 140 sec -
COPY without indexes
+ CREATE INDEX
187 sec
(62 sec + 60 sec + 65 sec)
133.6 %
pg_bulkload (DIRECT)
with indexes
93 sec 66.4 %
pg_bulkload (PARALLEL)
with indexes
70 sec 50.0 %

Result 3: Influence from parameters and features

The maintenece_work_mem, PostgreSQL parameter, affects the performance of pg_bulkload. The duration becomes almost 15 % shorter if this parameter is changed from 64 MB to 1 GB.

FILTER feature transforms input data in various operations, but it's not free. The actual measurement shows the loading time is increased to almost 240 % with SQL functions and almost 140 % with C functions.

pg_bulkload features
項目 時間
Initial (4GB) Appended (1GB)
MWM = 64 MB DIRECT 397 sec 109 sec
MWM = 1 GB
DIRECT 334 sec 93 sec
DIRECT with SQL FILTER 801 sec 216 sec
DIRECT with C FILTER 456 sec 126 sec

Conditions

ItemValue
ServerDell PowerEdge R410
CPUIntel Xeon E5645 (2.4GHz) 12 core * 2
Memory32GB
DisksSAS 10000rpm 2TB * 4
OSCentOS 6.2 (64bit)
PostgreSQL version9.3.4
pg_bulkload version3.1.6
shared_buffers3210MB
checkpoint_segments300
checkpoint_timeout15min
work_mem1MB
maintenance_work_mem1GB
Table definitionDBT-2 customer table
Indexed columnsc_id (PRIMARY KEY)
c_d_id (non-unique B-Tree)
ConstraintsNOT NULL for all columns
Input file formatCSV