pg_bulkload is a high speed data loading utility for PostgreSQL.
Here are comparisons between COPY and pg_bulkload. Performance was measured with basic-tuned PostgreSQL server.
There are the following measurement patterns.
Pg_bulkload on WRITER = PARALLEL mode can load data with an almost half time against COPY. In PARALLEL mode, performance will be improved on multi-CPU server because reading an input file and writing rows to a table are done with two processes.
COPY also can be more fast by loading into table without indexes and create indexes after it.
Item | Duration | Duration comparison |
---|---|---|
COPY with indexes | 500 sec | - |
COPY without indexes + CREATE INDEX |
333 sec (229 sec + 51 sec+ 53 sec) |
66.7 % |
pg_bulkload (DIRECT) with indexes |
334 sec | 66.8 % |
pg_bulkload (PARALLEL) with indexes |
221 sec | 44.2 % |
Pg_bulkload on WRITER = PARALLEL mode can load data with an almost half time against COPY in this case also. COPY without indexes is not faster than COPY with indexes. because it has to create indexes for total records of the table from initial.
Item | Duration | Duration comparison |
---|---|---|
COPY with indexes | 140 sec | - |
COPY without indexes + CREATE INDEX |
187 sec (62 sec + 60 sec + 65 sec) |
133.6 % |
pg_bulkload (DIRECT) with indexes |
93 sec | 66.4 % |
pg_bulkload (PARALLEL) with indexes |
70 sec | 50.0 % |
The maintenece_work_mem, PostgreSQL parameter, affects the performance of pg_bulkload. The duration becomes almost 15 % shorter if this parameter is changed from 64 MB to 1 GB.
FILTER feature transforms input data in various operations, but it's not free. The actual measurement shows the loading time is increased to almost 240 % with SQL functions and almost 140 % with C functions.
項目 | 時間 | ||
---|---|---|---|
Initial (4GB) | Appended (1GB) | ||
MWM = 64 MB | DIRECT | 397 sec | 109 sec |
MWM = 1 GB | |||
DIRECT | 334 sec | 93 sec | |
DIRECT with SQL FILTER | 801 sec | 216 sec | |
DIRECT with C FILTER | 456 sec | 126 sec |
Item | Value |
---|---|
Server | Dell PowerEdge R410 |
CPU | Intel Xeon E5645 (2.4GHz) 12 core * 2 |
Memory | 32GB |
Disks | SAS 10000rpm 2TB * 4 |
OS | CentOS 6.2 (64bit) |
PostgreSQL version | 9.3.4 |
pg_bulkload version | 3.1.6 |
shared_buffers | 3210MB |
checkpoint_segments | 300 |
checkpoint_timeout | 15min |
work_mem | 1MB |
maintenance_work_mem | 1GB |
Table definition | DBT-2 customer table |
Indexed columns | c_id (PRIMARY KEY) |
c_d_id (non-unique B-Tree) | |
Constraints | NOT NULL for all columns |
Input file format | CSV |