2019-06-26T13:20:12

PostgreSQL and Perl bulk inserts & updates speed testing

To evaluate the speed of different methods of inserting and updating data in PostgreSQL, I've created a simple test repository: https://github.com/jozef/pg-n-pl-bulking.

short-story:

Chunked inserts (100 to 200 rows at a time) using the PostgreSQL >=9.5 "ON CONFLICT" feature turned out to be the most elegant and fastest way to insert and update thousands of rows. Here is an example SQL to give an idea:

INSERT INTO pg_n_pl_bulking (title,num,meta,ident)
VALUES
    ('title1',1,'{}','id1'),
    ('title4',4,'{}','id4'),
    ('title3',3,'{}','id3'),
    ('title2',2,'{}','id2'),
    …
ON CONFLICT (ident) DO
UPDATE SET title=EXCLUDED.title, num=EXCLUDED.num, meta=EXCLUDED.meta;
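
The same statement can be built from Perl with DBI placeholders. Below is a minimal sketch of how such a chunked upsert could look - it is not the exact code from the repository's test files, and the connection handle, chunk size and helper name are assumptions of mine:

    use strict;
    use warnings;
    use DBI;

    # hypothetical connection; the tests use the pg_n_pl_bulking_db service (see below)
    my $dbh = DBI->connect('dbi:Pg:service=pg_n_pl_bulking_db', undef, undef,
        {RaiseError => 1, AutoCommit => 1});

    # upsert one chunk of rows, each row being [title, num, meta, ident]
    sub upsert_chunk {
        my (@rows) = @_;
        return unless @rows;
        my $sql =
              'INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES '
            . join(',', ('(?,?,?,?)') x scalar @rows)
            . ' ON CONFLICT (ident) DO UPDATE SET'
            . ' title=EXCLUDED.title, num=EXCLUDED.num, meta=EXCLUDED.meta';
        $dbh->do($sql, undef, map {@$_} @rows);
    }

    # usage: collect 100 to 200 rows and flush them in one statement
    upsert_chunk(
        ['title1', 1, '{}', 'id1'],
        ['title2', 2, '{}', 'id2'],
    );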

To run the tests, see README.pod#database-setup and then simply run make inside the repository checkout. You should see output similar to this one, executed on my ThinkPad T420 with 4x i5-2520M CPU @ 2.50GHz and a Samsung SSD 850 EVO 250GB disk:

pg-n-pl-bulking$ make
prove -l t/01_test-clean-db.t && make test
t/01_test-clean-db.t .. ok
All tests successful.
Files=1, Tests=3,  2 wallclock secs ( 0.04 usr  0.00 sys +  0.16 cusr  0.00 csys =  0.20 CPU)
Result: PASS
make[1]: Entering directory '/home/jozef/prog/pg-n-pl-bulking'
prove -l t/
t/01_test-clean-db.t ................................... ok
t/02_generate-import-data.t ............................ ok
t/03.1_insert-data_standard.t .......................... 5/? # 23000 rows total, speed 0.639k/s
t/03.1_insert-data_standard.t .......................... ok
t/03.2_insert-data_standard_delete.t ................... 5/? # 23000 rows total, speed 0.638k/s
t/03.2_insert-data_standard_delete.t ................... ok
t/03.3_insert-data_on_conflict.t ....................... 5/? # 23000 rows total, speed 0.719k/s
t/03.3_insert-data_on_conflict.t ....................... ok
t/04.1_insert-data_standard_pre_select_chunks.t ........ 5/? # 23000 rows total, speed 0.717k/s
t/04.1_insert-data_standard_pre_select_chunks.t ........ ok
t/04.2_insert-data_standard_pre_select_chunks_async.t .. 5/? # 23000 rows total, speed 0.742k/s
t/04.2_insert-data_standard_pre_select_chunks_async.t .. ok
t/04.3_insert-data_standard_pre_select_chunks_multi.t .. 5/? # 23000 rows total, speed 8.914k/s
t/04.3_insert-data_standard_pre_select_chunks_multi.t .. ok
t/04.4_insert-data_standard_pre_select_chunks_copy.t ... 5/? # 23000 rows total, speed 9.649k/s
t/04.4_insert-data_standard_pre_select_chunks_copy.t ... ok
t/05.1_insert-data_standard_multi_on_conflict.t ........ 5/? # 23000 rows total, speed 11.251k/s
t/05.1_insert-data_standard_multi_on_conflict.t ........ ok
t/05.2_insert-data_standard_multi_on_conflict_async.t .. 5/? # 23000 rows total, speed 11.534k/s
t/05.2_insert-data_standard_multi_on_conflict_async.t .. ok
All tests successful.
Files=11, Tests=69, 191 wallclock secs ( 0.08 usr  0.02 sys + 20.34 cusr  9.04 csys = 29.48 CPU)
Result: PASS

long-story:

These speed tests need one database and will create one table in it, which will be populated with generated data. A connection to that database, with the right to create tables, must be specified in a Postgres service file. An example ~/.pg_service.conf file:

[pg_n_pl_bulking_db]
host=localhost
port=5432
user=nobody
dbname=pg_n_pl_bulking_db

Executing PGSERVICE=pg_n_pl_bulking_db psql from the command line has to work, as this is how the test files are configured to connect to the database.
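
The tests are Perl and use DBI/DBD::Pg; connecting through the same service entry could look like this (a sketch - both the PGSERVICE environment variable and a service key in the DSN end up in the same libpq service lookup, as far as I know):

    use DBI;

    # name the service in the DSN, or leave the DSN empty ('dbi:Pg:')
    # and run with PGSERVICE=pg_n_pl_bulking_db exported in the environment
    my $dbh = DBI->connect(
        'dbi:Pg:service=pg_n_pl_bulking_db',
        undef, undef,
        {RaiseError => 1, AutoCommit => 1},
    );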

Here is the file structure with descriptions:

├── Makefile
├── sql
│   └── ddl.sql
├── lib
│   └── testlib.pm
├── t
│   ├── 01_test-clean-db.t
│   ├── 02_generate-import-data.t
│   ├── 03.1_insert-data_standard.t
│   ├── 03.2_insert-data_standard_delete.t
│   ├── 03.3_insert-data_on_conflict.t
│   ├── 04.1_insert-data_standard_pre_select_chunks.t
│   ├── 04.2_insert-data_standard_pre_select_chunks_async.t
│   ├── 04.3_insert-data_standard_pre_select_chunks_multi.t
│   ├── 04.4_insert-data_standard_pre_select_chunks_copy.t
│   ├── 05.1_insert-data_standard_multi_on_conflict.t
│   └── 05.2_insert-data_standard_multi_on_conflict_async.t
└── tmp
    └── .exists

Makefile: supports the all, test and clean targets. Executing make will trigger the DB connection test and the execution of all tests.

sql/ddl.sql: the test table definition. Table pg_n_pl_bulking has a serial id, one unique text field, one text field, one number field, one jsonb field and two datetime fields - one for the creation timestamp and one updated by a trigger on row UPDATE statements.
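
The authoritative definition is sql/ddl.sql in the repository; as a rough sketch only (the id and timestamp column names and their exact types below are my assumptions, and the trigger maintaining the update timestamp is left out):

    # $dbh: a connected DBD::Pg handle
    # sketch of the table shape; see sql/ddl.sql for the real DDL and the trigger
    $dbh->do(q{
        CREATE TABLE pg_n_pl_bulking (
            id          serial PRIMARY KEY,
            ident       text NOT NULL UNIQUE,              -- used by ON CONFLICT (ident)
            title       text,
            num         integer,
            meta        jsonb,
            created     timestamptz NOT NULL DEFAULT now(),
            last_update timestamptz                        -- refreshed by an UPDATE trigger
        )
    });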

lib/testlib.pm: shared code used by all tests so they can stay minimalistic, mostly containing only the actual insert/update logic. testlib->insert_or_update_row( sub { __code_to_insert_update__ } ); will execute the callback function for each row that is supposed to be inserted/updated into the database. The rest are helper functions to orchestrate test data generation & reading, plus database helpers and configuration.
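
With that, a single test file can shrink to roughly the following sketch (the callback arguments shown are an assumption of mine - check testlib.pm for what it actually passes):

    use lib 'lib';
    use testlib;

    # hypothetical callback signature; testlib.pm may pass the row data differently
    testlib->insert_or_update_row(sub {
        my ($dbh, $row) = @_;
        $dbh->do(
            'INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES (?,?,?,?)'
            . ' ON CONFLICT (ident) DO UPDATE SET'
            . ' title=EXCLUDED.title, num=EXCLUDED.num, meta=EXCLUDED.meta',
            undef,
            @{$row}{qw(title num meta ident)},
        );
    });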

t/01_test-clean-db.t: tests the DB connection, drops the old pg_n_pl_bulking table and creates a new one.

t/02_generate-import-data.t: generates all test data files defined in testlib::data_files() once and stores them as tmp/*.tsv.

Next, the speed tests. Note that the "speed" in thousands of rows per second is only meaningful when comparing one method to another. The actual numbers will depend on the database table, data, hosts and network characteristics.

t/03.*.t: different ways to insert/update the data one row at a time (a Perl sketch of the t/03.1 approach follows the three variants below):

# speed 0.639k/s
t/03.1_insert-data_standard.t:
    UPDATE pg_n_pl_bulking SET title=?,num=?,meta=? WHERE ident=?;
    INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES (?,?,?,?);
        -- ^^^ if update returns 0 updated rows
-- this test will also generate a database dump in tmp/test_dump-pg_n_pl_bulking.tsv
--    to verify that the other methods insert/update the data correctly

# speed 0.638k/s
t/03.2_insert-data_standard_delete.t
    DELETE FROM pg_n_pl_bulking WHERE ident=?;
    INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES (?,?,?,?);

# speed 0.719k/s
t/03.3_insert-data_on_conflict.t
    INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES (?,?,?,?)
    ON CONFLICT (ident) DO UPDATE
        SET title=EXCLUDED.title, num=EXCLUDED.num, meta=EXCLUDED.meta;
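
A Perl sketch of the per-row approach, following the t/03.1 update-then-insert logic (the statement handles, loop and sample data here are assumptions, not the repository code):

    # $dbh: a connected DBD::Pg handle; prepare once, execute per row
    my $upd = $dbh->prepare(
        'UPDATE pg_n_pl_bulking SET title=?,num=?,meta=? WHERE ident=?');
    my $ins = $dbh->prepare(
        'INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES (?,?,?,?)');

    my @rows = (    # stand-in for the generated test data
        ['title1', 1, '{}', 'id1'],
        ['title2', 2, '{}', 'id2'],
    );
    for my $row (@rows) {
        # execute() returns "0E0" (numerically zero) when the UPDATE matched no rows
        my $updated = $upd->execute(@$row);
        $ins->execute(@$row) if $updated == 0;
    }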

t/04.*.t: select/update/insert in chunks of rows:

# speed 0.717k/s
t/04.1_insert-data_standard_pre_select_chunks.t:
    SELECT ident FROM pg_n_pl_bulking WHERE ident IN (?,?,?,?,?,?,?,?,?,…)
    -- then
        UPDATE pg_n_pl_bulking SET title=?,num=?,meta=? WHERE ident=?
    -- or
        INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES (?,?,?,?)
    -- depending on whether the ident was found by the select or not

# speed 0.742k/s
t/04.2_insert-data_standard_pre_select_chunks_async.t
    -- same as above but using PG_ASYNC + PG_OLDQUERY_WAIT options

# speed 8.914k/s
t/04.3_insert-data_standard_pre_select_chunks_multi.t
    SELECT ident FROM pg_n_pl_bulking WHERE ident IN (?,?,?,?,?,?,?,?,?,…)
    -- then
        UPDATE pg_n_pl_bulking SET
            title   = data_table.title,
            num     = data_table.num,
            meta    = data_table.meta
        FROM
            unnest(?::text[],?::text[],?::integer[],?::jsonb[])
                AS data_table(ident,title,num,meta)
        WHERE
            pg_n_pl_bulking.ident = data_table.ident;
    -- or
        INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES
            (?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),
            (?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),
            …
    -- depending on whether the ident was found by the select or not
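
DBD::Pg can bind Perl array references straight to the ?::text[] style placeholders of the unnest() update above. A sketch of that part only (assuming the chunk has already been split into parallel arrays; the real test code may differ):

    # $dbh: a connected DBD::Pg handle
    # parallel arrays for the rows the pre-select found to exist already
    my @idents = ('id1', 'id2');
    my @titles = ('title1', 'title2');
    my @nums   = (1, 2);
    my @metas  = ('{}', '{}');

    $dbh->do(q{
        UPDATE pg_n_pl_bulking SET
            title = data_table.title,
            num   = data_table.num,
            meta  = data_table.meta
        FROM unnest(?::text[],?::text[],?::integer[],?::jsonb[])
            AS data_table(ident,title,num,meta)
        WHERE pg_n_pl_bulking.ident = data_table.ident
    }, undef, \@idents, \@titles, \@nums, \@metas);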

# speed 9.649k/s
t/04.4_insert-data_standard_pre_select_chunks_copy.t
    SELECT ident FROM pg_n_pl_bulking WHERE ident IN (?,?,?,?,?,?,?,?,?,…)
    -- then
        UPDATE pg_n_pl_bulking SET
            title   = data_table.title,
            num     = data_table.num,
            meta    = data_table.meta
        FROM
            unnest(?::text[],?::text[],?::integer[],?::jsonb[])
                AS data_table(ident,title,num,meta)
        WHERE
            pg_n_pl_bulking.ident = data_table.ident;
    -- or
        COPY pg_n_pl_bulking (title,num,meta,ident) FROM STDIN;
        title1  num1    meta1   ident1
        title2  num2    meta2   ident2
        title3  num3    meta3   ident3
        …
    -- depending on whether the ident was found by the select or not
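
The COPY part maps to DBD::Pg's COPY interface: start the COPY ... FROM STDIN statement, push tab-separated lines with pg_putcopydata() and finish with pg_putcopyend(). A sketch with made-up rows:

    # $dbh: a connected DBD::Pg handle; stream the rows the pre-select did not find
    $dbh->do('COPY pg_n_pl_bulking (title,num,meta,ident) FROM STDIN');
    for my $row (['title1', 1, '{}', 'id1'], ['title2', 2, '{}', 'id2']) {
        # values must be escaped for the COPY text format (tabs, backslashes, newlines)
        $dbh->pg_putcopydata(join("\t", @$row) . "\n");
    }
    $dbh->pg_putcopyend();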

t/05.*.t: using INSERT with multiple VALUES rows and ON CONFLICT DO UPDATE:

# speed 11.251k/s
t/05.1_insert-data_standard_multi_on_conflict.t
    INSERT INTO pg_n_pl_bulking (title,num,meta,ident) VALUES
        (?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),
        (?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),(?,?,?,?),
        …
        ON CONFLICT (ident) DO
            UPDATE SET title=EXCLUDED.title, num=EXCLUDED.num, meta=EXCLUDED.meta

# speed 11.534k/s
t/05.2_insert-data_standard_multi_on_conflict_async.t
    -- same as above but using PG_ASYNC + PG_OLDQUERY_WAIT options
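
The async flavour uses DBD::Pg's asynchronous query support. A sketch of the relevant piece, assuming the multi-row upsert SQL from above is already built in $sql with its bind values in @binds:

    use DBD::Pg qw(:async);

    # PG_ASYNC sends the statement without blocking; PG_OLDQUERY_WAIT tells DBD::Pg
    # to first wait for a previously issued async statement instead of failing
    $dbh->do($sql, {pg_async => PG_ASYNC + PG_OLDQUERY_WAIT}, @binds);

    # ... read and prepare the next chunk here while the server works ...

    $dbh->pg_result;    # wait for and collect the result of the last async statement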
    

round-up:

If speed matters, data has to be processed in chunks. The per-command overhead is simply too big for bulk inserts. On Postgres version >=9.5, using the ON CONFLICT feature gives the best results. For pure inserts, \copy commands would be the fastest way to put data in, but possible conflicts have to be handled beforehand. The Postgres async options did not yield much of a performance gain, as there is not much extra work for processing the input that could be done while the database is executing queries. With heavier input pre-processing than reading from a tsv file on disk, as in this test case, async could improve speed considerably more.


more samples

tested with a DB connection over the network and 13+ million rows of data:

pg-n-pl-bulking[master]$ make clean test
rm -rf tmp/*.tsv tmp/*.tsv_from-last-test
prove -l t/
t/01_test-clean-db.t ................................... ok
t/02_generate-import-data.t ............................ ok
t/03.1_insert-data_standard.t .......................... 10/? # 13123000 rows total, speed 0.962k/s
t/03.1_insert-data_standard.t .......................... ok
t/03.2_insert-data_standard_delete.t ................... 10/? # 13123000 rows total, speed 0.549k/s
t/03.2_insert-data_standard_delete.t ................... ok
t/03.3_insert-data_on_conflict.t ....................... 10/? # 13123000 rows total, speed 0.973k/s
t/03.3_insert-data_on_conflict.t ....................... ok
t/04.1_insert-data_standard_pre_select_chunks.t ........ 10/? # 13123000 rows total, speed 1.061k/s
t/04.1_insert-data_standard_pre_select_chunks.t ........ ok
t/04.2_insert-data_standard_pre_select_chunks_async.t .. 10/? # 13123000 rows total, speed 1.111k/s
t/04.2_insert-data_standard_pre_select_chunks_async.t .. ok
t/04.3_insert-data_standard_pre_select_chunks_multi.t .. 10/? # 13123000 rows total, speed 6.877k/s
t/04.3_insert-data_standard_pre_select_chunks_multi.t .. ok
t/04.4_insert-data_standard_pre_select_chunks_copy.t ... 10/? # 13123000 rows total, speed 6.798k/s
t/04.4_insert-data_standard_pre_select_chunks_copy.t ... ok
t/05.1_insert-data_standard_multi_on_conflict.t ........ 10/? # 13123000 rows total, speed 7.898k/s
t/05.1_insert-data_standard_multi_on_conflict.t ........ ok
t/05.2_insert-data_standard_multi_on_conflict_async.t .. 10/? # 13123000 rows total, speed 7.898k/s
t/05.2_insert-data_standard_multi_on_conflict_async.t .. ok

pg_n_pl_bulking_db=# select count(*) from pg_n_pl_bulking;
 count
--------
 999999
(1 row)

pg_n_pl_bulking_db=# \dt+ pg_n_pl_bulking
                        List of relations
 Schema |      Name       | Type  | Owner  |  Size  | Description
--------+-----------------+-------+--------+--------+-------------
 public | pg_n_pl_bulking | table | nobody | 989 MB |
(1 row)