Main

Main.XeniaReplicationr1.4 - 01 Mar 2025 - 17:53 - JeremyCothrantopic end

This twiki page discusses the details of replicating a Xenia database for purposes of archiving/analysis of data between Xenia instances.

I considered table partitioning and replication solutions like slony but these didn't do quite what I was looking for and have their own related set of maintenance issues.

The below scheme is fairly generalized and is mainly specific to the XeniaPackageV2 and 'extra' table schema in that they have the system timestamp row_entry_date and row_update_date columns used to determine which rows should be selected, copied or deleted according to a given time range.

This replication scheme could be adapted to any table schema (the tables, dependency order and columns are spelled out in copy_tables.xml ) which has a row_entry_date type column present on each of the tables of interest.

The copy files created are in .csv format and could be shared via http access to others as interested. Versions of Postgres 8.1 and greater include a 'copy with csv header' option which can be included to include a header line in the csv copy file. For this example the copy_tables.xml file will serve to describe the columns of each of the csv files.

Creating additional Xenia instances
General discussion
Getting started
- create
- copy
- delete
Further notes
Setting up repeated automatic replication with cron
Final steps

Creating additional Xenia instances

As the postgres user, create the new target database instance (db_xenia_v2_archive in the below example)

createdb db_xenia_v2_archive

Use the 'pg_dump' function to dump the existing source schema only and run this schema and associated functions against the new target database.

pg_dump -U postgres db_xenia_v2 -s > archive.sql
psql -U postgres -d db_xenia_v2_archive -f archive.sql

The below step assumes that you need to also copy the source dbPostGIS specific spatial_ref_sys and geometry_columns associated row populations into your target db. If not using PostGIS^?, then skip the following step.

pg_dump -U postgres db_xenia_v2 -t spatial_ref_sys -a > spatial.sql
psql -U postgres -d db_xenia_v2_archive -f spatial.sql
pg_dump -U postgres db_xenia_v2 -t geometry_columns -a > geometry.sql
psql -U postgres -d db_xenia_v2_archive -f geometry.sql

General discussion

The way the row replication process works is that it uses the row_entry_date field present on all tables of interest which contains the server time when the row was entered into the database as part of the query selection process on the source database to determine which rows should be sent to a copy file used for populating the target database.

The main perl script used to run the copy process is process_copy.pl which uses the server environment details located in environment_copy.xml (database connection data, pathing) in conjunction with the table information copy_tables.xml (which tables to copy, in their dependency order and column information) to

'create' copy files from the source database
'copy' copy files to a target database
'delete' rows from the source database

The general format for running the command is

perl process_copy.pl environment_copy.xml <create,copy,delete> 2007_01_01_00-05 2007_02_01_00-05
#my ($env_path,$mode,$time_start,$time_stop) = @ARGV;

The last two parameters are the time_start and time_stop which are in the format YYYY_MM_DD_HH and suffixed with the timezone. Note that the time resolution is at the hour level. The query selection is data >= start_time and data < stop_time.

The copy_tables.xml file is currently configured to take into account all Xenia tables/columns in their dependency order for Xenia version 2 and should not require any editing unless there are additions/edits to the default tables/columns.

The process_copy.pl script requires the perl XML::XPath package for processing the xml configuration files if that package is not already installed.

Getting started

The first step is to configure the environment_copy.xml file to match your database connection and server path settings.

For both source and target database you'll want to provide arguments for database name, username and host. You can provide database password also if you're using passwords, but you'll need to edit process_copy.pl to add the password parameter to the psql calls

Under the 'path' section, add the path to

your psql command
your copy_tables.xml file

The following 3 paths after that relate to whether your source and target database reside on the same server or separate servers. If both your source and target database are on the same server, then these can all be the same directory path

dir_tmp_cpy_local - where the source database should create the copy files
dir_tmp_cpy_host - where the target database looks for the copy files to load
dir_cpy - where the copy files are finally gzipped(.gz) and stored

create

If you have your final copy folder and optionally temporary copy folder created with permissions set so that the process can write to them, then you should now be able to run something like the following command to create the copy files for the given row time_start to time_stop dates.

perl process_copy.pl environment_copy.xml create 2007_01_01_00-05 2007_02_01_00-05

This should create copy files (which are actually csv files) in your dir_tmp_cpy_local directory with the filename format of each file being <table_name>:<time_start>:<time_stop>

multi_obs:2007_01_01_00-05:2007_02_01_00-05.csv

If a table does not contain any rows within the time_start to time_stop interval , then no copy file will be created.

copy

Running the following command will attempt to copy (load) all the copy (.csv) files using the given time range to the target database. The command also after copying to the target database, gzips(compresses) the file, removes the original .csv and places the .csv.gz file into the dir_cpy folder.

perl process_copy.pl environment_copy.xml copy 2007_01_01_00-05 2007_02_01_00-05

Note that the table indexes are still in effect and any duplicate rows will cause the index to reject those rows and the entire copy operation will fail for that table. The idea in general is that the create, copy and delete steps are run regularly(cron) and that the time ranges used are adjacent but not overlapping. The copy files are still present and the copy command can be re-run once any problems have been addressed.

delete

Running the following command will delete (in the proper table dependency order) rows from the tables of interest for the given time range. This step is optional depending on what your database/storage plans are. If your source table is performing fine without deleting anything, then it's fine to skip this step - deletion is meant as a final step for users tight on space or needing to limit table sizes for performance reasons.

An additional step one could add as a precaution is to check for the existence of the earlier copy files before proceeding with the delete. Assuming the copy files are still present and the delete command can be run at any point as needed.

perl process_copy.pl environment_copy.xml delete 2007_01_01_00-05 2007_02_01_00-05

Further notes

Generally speaking after adding or removing rows from your databases, you'll want to run 'vacuum analyze' and 'reindex' commands as necessary to maintain db performance.

Setting up repeated automatic replication with cron

The below crontab entry should run this copy_steps.pl file once a week every Sunday at midnight, writing any errors/output to a log file.

0 0 * * 0 cd /usr2/home/jcothran/cc ; perl copy_steps.pl >> /tmp/copy.log 2>&1

The copy_steps.pl file will get the current date, hour and timezone and create a time_start argument which is the $time_offset number of hours older than the current time and a $time_stop argument which is $time_interval number of hours older than the time_start argument. You'll want to change the $time_offset and $time_interval number of hours to match your needs. The default setup is designed to work with a weekly cron copying 1 week of data older than 2 weeks.

# $time_offset = number of hours older than current time to assign time_start ; this is working on the idea that new data is arriving into the database all the time, but after a certain number of hours we can assume that we have collected all the data we are going to collect
my $time_offset = 336; #336 = 2 weeks
# $time_interval = number of hours between time_start and time_stop ; should be equal to the time interval between when the cron job which calls this process is run, so if 168 = 1 week, then the cron should also be run once a week
my $time_interval = 168; #168 = 1 week

The default copy_steps.pl program also contains the following lines which are designed to run the create, copy and delete steps in sequence - initially the delete step is commented out and can be optionally uncommented if needed after testing that the create and copy steps are completing successfully.

`perl process_copy.pl environment_copy.xml create $time_start $time_stop`;
`perl process_copy.pl environment_copy.xml copy $time_start $time_stop`;
#the below delete step is commented out initially, uncomment when confident that the earlier steps and copy file creation is working
#`perl process_copy.pl environment_copy.xml delete $time_start $time_stop`;

Final steps

Let the first couple of crons run without the delete step just to see how things are going and if things look well from there, optionally add the delete step.

The last step would be to handle older data missed by the initial cron. Manually run through the create, copy(and optionally delete) steps with an older initial time range large enough to capture this data.

And to repeat, generally speaking after adding or removing rows from your databases, you'll want to run 'vacuum analyze' and 'reindex' commands as necessary to maintain db performance.
to top

End of topic
Skip to action links | Back to top

Edit | Attach image or document | Printable version | Raw text | More topic actions
Revisions: | r1.4 | > | r1.3 | > | r1.2 | Total page history | Backlinks

You are here: Main > XeniaReplication

to top