Replicate multiple PostgreSQL servers to a single MongoDB server using logical decoding output plugin

1. Overview

“Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance.” This is the highlight of PostgreSQL in a sentence from Wikipedia. Yes, the extensibility of PostgreSQL is extremely useful when you have some special requirements. This blog will discuss how to use logical decoding output plugin to replicate multiple PostgreSQL servers to a single MongoDB server and the environment setup using docker compose for a quick Proof of Concept.

2. Logical decoding output plugin

Start from version 9.5, PostgreSQL provides a solution to allow users to write their own decoding logic to reformat the output before sending it to the subscriber through a wal sender. This feature is called logical decoding output plugin. With this feature, other applications can get a specific data format from PostgreSQL database easily, and then use their own existing tools or algorithms to continue the data analysis. For example, replicating multiple Postgres servers to a single MongoDB server and then perform the real-time analytics and data visualization in MongoDB. There are many different decoding plugins available at here, however wal2mongo is only one can generate a JavaScript format data that can be used by MongoDB directly. You can use it following this blog to practice how data replication can be done from multiple Postgres databases to a MongoDB database using logical decoding output plugin.

Logical decoding output plugin framework

The logical decoding output plugin framework has a few predefined callback interfaces. These callbacks are registered in an initialization function named _PG_output_plugin_init, which will be called when the plugin is loaded as a shared library. After hookup with these interfaces by referring to the example test_decoding, the output decoding plugin will get notified about the changes that is happening via various callback interfaces. In these callback interfaces, the one will get noticed each time when an insert, update or delete happens is LogicalDecodeChangeCB. Most of the output formatting related logic should be done in a function which is registered to this interface, such as, map a Postgres data type to a MongoDB native data type.
Before writing your own logical decoding plugin, two basic things need to be mentioned here: one is the change callback interface is triggered based on each row, i.e. if you have multiple rows changed in one transaction, then you will get multiple times call back; the other is that the callback is triggered only by the row changes that have been safely saved to WAL files. The changes that were rolled back by the transaction will not trigger the callback.

wal2mongo decoding plugin

wal2mongo is a logical decoding plugin, it mainly converts the DML operation such as insert, update and delete into a JavaScript format which later can be imported into MongoDB by using mongo client.

Logical decoding data changes flow

The diagram above shows the data changes flow. We can use psql client to simulate the Application and perform database operation such as create table, insert records, update records and then delete records. The Postgres server backend will save the database changes to WAL first and then notify the WAL sender. The WAL sender will read the changes from WAL and get the output decoded by wal2mongo and then send the changes to the connected subscriber, i.e pg_recvlogical in this blog. pg_recvlogical will save the output as a JavaScript file, then later it will be imported to MongoDB using mongo client.

To exercise this, we need to change wal_level to logical and make sure max_wal_senders is at least 2, then restart Postgres server after the changes.

Example of wal2mongo output

wal2mongo is designed to replicate the data changes from Postgres to MongoDB, not like a Postgres standby which need to replicate everything. Therefore, the focus is on the mostly used DML operation such as INSERT, UPDATE and DELETE. The diagram below shows the steps inclduing create a table, insert some records, do some update, and then perform a delete and how the data output looks like. Here, you have to change the table replica identity to full, otherwise, you won’t be able to replicate the UPDATE correctly.

2. Automatic the environment setup using docker compose

Setup an environment like the one mentioned above to test a logical output plugin may take some time, especially when a user just wants to give a quickly try. I will introduce an easy way using docker to do the work. A Docker environment can be very useful when you want to practice a new technology which requires a complicated setup, or multiple machines to be available. It also help when you need such an environment from time to time, especially when you need it to be available “immediately” and then discard it after you finished your work.
To achieve such an environment with multiple Postgres servers and one MongoDB server, I built two docker images: one is a dedicated postgres with wal2mongo installed, and the other one is a dedicated mongod with the pg_recvlogical tools installed. Here is a cluster environment running on a sinlge physical machine using the files available at here.

Dockerfile for postgres
This Dockerfile is to build a docker image with the latest PostgreSQL 12.2 offical release and the wal2mongo installed.

FROM centos:centos7
MAINTAINER The CentOS Project <cloud-ops@centos.org>

RUN yum -y install https://download.postgresql.org/pub/repos/yum/reporpms/EL-7-x86_64/pgdg-redhat-repo-latest.noarch.rpm
RUN yum -y update; yum -y install sudo epel-release yum-utils net-tools which \
		postgresql12-server postgresql12 \
		git gcc make clang zlib-devel readline-devel postgresql12-devel; yum clean all

RUN usermod -aG wheel postgres
RUN echo "export PATH=/usr/pgsql-12/bin:$PATH" | tee -a /var/lib/pgsql/.bashrc
RUN echo "postgres ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/postgres && chmod 0440 /etc/sudoers.d/postgres

USER postgres
RUN eval "sudo -E env "PATH=$PATH" USE_PGXS=1 make CLANG=/usr/bin/clang with_llvm=no install"

VOLUME  ["/var/log/pgsql", "/var/lib/pgsql"]
EXPOSE 5432

COPY run.sh .
CMD ["/run.sh"]

What inside the run.sh script is just a simple postgres start command. The way I used it here is to avoid rebuild the image when you want to start the container in different ways.

1 2	#!/bin/bash /usr/pgsql-12/bin/pg_ctl -D /var/lib/pgsql/12/data -l /var/log/pgsql/logfile start

Dockfile for mongod
The dockerfile for mongod has the offical MongoDB 4.25 release and pg_recvlogical 12.2 installed.

FROM centos:centos7
MAINTAINER The CentOS Project <cloud-ops@centos.org>

COPY mongodb.repo /etc/yum.repos.d/
RUN yum -y install mongodb-org-4.2.5 mongodb-org-server-4.2.5 	\
		mongodb-org-shell-4.2.5 mongodb-org-mongos-4.2.5 		\
		mongodb-org-tools-4.2.5 postgresql12-contrib; yum clean all
RUN mkdir -p /data/db

RUN usermod -aG wheel mongod
RUN echo "mongod ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/mongod && \
		chmod 0440 /etc/sudoers.d/mongod
RUN echo "export PATH=/usr/pgsql-12/bin:$PATH" | tee -a /root/.bashrc
VOLUME  ["/data/db", "/var/log/mongodb"]

EXPOSE 27017
ENTRYPOINT ["/usr/bin/mongod", "--bind_ip_all"]

Service compose file
The service docker compose file is to help setup two Postgres servers and one MongodDB servere, and make sure they are be able to communicate using hostname.

version: '3.0'

services:
  mongo:
    restart: always
    image: "mongod:4.2.5"
    ports:
      - "27017:27017"
    volumes:
      - ./data/mgdata:/data/db
      - ./scripts/p2m.sh:/p2m.sh

  pg1:
    restart: always
    image: "postgres:12.2"
    ports:
      - "5432"
    volumes:
      - ./data/pg1data:/var/lib/pgsql/12/data
      - ./scripts/data_gen.sh:/data_gen.sh

  pg2:
    restart: always
    image: "postgres:12.2"
    ports:
      - "5432"
    volumes:
      - ./data/pg2data:/var/lib/pgsql/12/data
      - ./scripts/data_gen.sh:/data_gen.sh

You can easily extend the Postgres servers by adding more instances to this service docker compose file.

To build the containers, run command docker-compose -f service-compose.yml up -d. After a while, you should see two Postgres servers and one MongoDB server are running like below.

$ docker ps
CONTAINER ID        IMAGE                 COMMAND                  CREATED              STATUS              PORTS                      NAMES
935a3d339e71        mongod:4.2.5          "/usr/bin/mongod --b…"   About a minute ago   Up About a minute   0.0.0.0:27017->27017/tcp   pg2mongo_mongo_1
d9497be07ce5        postgres:12.2         "/run.sh"                About a minute ago   Up About a minute   0.0.0.0:54322->5432/tcp    pg2mongo_pg2_1
50eb555b5719        postgres:12.2         "/run.sh"                About a minute ago   Up 38 seconds       0.0.0.0:54321->5432/tcp    pg2mongo_pg1_1

To setup the logical replication slot on each postgres server, and connect to each slot using pg_recvlogical then pipe the data changes to mongo client and feed the changes to MongoDB automaticlly, run a command like this docker exec -it pg2mongo_mongo_1 bash /p2m.sh 1 2. Where the p2m.sh is a simple demo script to allow you using pg_recvlogical to create slot, and connect to the slot then dump the message to a internal pipe. Then the mongo client will read the pipe and import the changes to MongoDB. You can check it by log into the mongod container and type a ps -ef to see the running services.

Generate database changes
To simulate the data changes, you can use pgbench. There are two ways to do it: one is to log into each postgres instance and type the pgbench command; the other way is to map a local script to all postgres instances, then run the script from your host machine. For example docker exec -it pg2mongo_pg1_1 bash /data_gen.sh
Verify the data changes
To verify the data changes, you can log into the mongod container, and then use the mongo cient to check the data changes imported automatically. For example,
1
2
3
4
> use mycluster_postgres_w2m_slot2;
switched to db mycluster_postgres_w2m_slot2
> db.pgbench_accounts.count();
10028

Modify the output plugin and reinstall it
You can log into any postgres server and find out the source code and make some changes, then recompile, install it and run a regression test, etc. For example,

docker exec -it pg2mongo_pg1_1 bash
bash-4.2$ cd /var/lib/pgsql/sources/wal2mongo/
...
bash-4.2$ USE_PGXS=1 make clean CLANG=/usr/bin/clang with_llvm=no
...
bash-4.2$ USE_PGXS=1 make CLANG=/usr/bin/clang with_llvm=no
...
bash-4.2$ USE_PGXS=1 make installcheck-force  CLANG=/usr/bin/clang with_llvm=no
...
bash-4.2$ sudo -E env "PATH=$PATH" USE_PGXS=1 make CLANG=/usr/bin/clang with_llvm=no install
...

3. logical decoding output plugin limitations

The logical decoding output plugin is pretty useful feature, but still has some limitations. For example,

Only physical relation tables can trigger logical decoding callback, views and sequences cannot trigger
The tables must have primary key and replica identity need to be set properly
Database schemas are not be able to be replicated
No large objects can be replicated
…

4. Summary

In this blog, we discussed how to use the logical decoding output plugin to replicate multiple PostgreSQL servers to a single MongoDB server, and provide a docker environment setup to perform a quick Proof Of Concept.