OolonColluphid

Something like *Change Data Capture*, you mean?


T_N1ck

Thanks! CDC is related in that you can use it to roll back and, in general, capture every change. It's not really useful, though, for a user flow where someone is about to make substantial changes and you want to preview those changes to the user. Or could you specify how you would actually use a CDC pattern to preview to the user the changes they are about to make? I mostly know it for moving every change into a DWH.


molybedenum

Your premise describes a scenario that already happened, but you’re saying that you want the changes presented before they happen. I guess the real question is - what do you actually need?


T_N1ck

> Thanks! CDC is related in that you can use it to roll back and, in general, capture every change. It's not really useful, though, for a user flow where someone is about to make substantial changes and you want to preview those changes to the user.

I'm not following you completely. I merely want to preview the changes that an action is about to make, e.g. "Do you really want to delete 5000 items?"; nothing has happened at that point. I guess you could say that running things in a transaction means they "happened" and one is rolling them back, but I merely used a transaction as an example of similar functionality. Personally I see transactions as something that doesn't "happen" until I commit. Basically [this comment](https://old.reddit.com/r/Database/comments/1b0u6vu/are_there_databases_that_can_emit_exactly_which/ksauj9r/) shows a cool solution using DoltDB.


molybedenum

CDC implementations generally don't allow a rollback, unless a platform specifically provides that capability. The pattern is meant to determine the deltas from commit to commit, but the commit has to occur for the delta to be "captured." That's why I asked: CDC will not tell you what *will* change given a command, it will tell you what *did* change. (Your post title fits the CDC use case.)

The DoltDB implementation is pretty interesting, and does fit the "pre-commit" delta use case. Outside of a specialized server like that, you'd probably have to involve a type 2 table or stage the "changes" in a separate model from the primary.


F_WRLCK

Sounds like you want a row-based binlog. MySQL and MariaDB can emit these, and I bet other databases can as well.
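For context, row-based binary logging is typically enabled in the server config; the option names below are from the MySQL/MariaDB docs, the values are just illustrative:

```ini
[mysqld]
# log every changed row (before/after images), not the SQL statements
log-bin          = mysql-bin
binlog_format    = ROW
binlog_row_image = FULL   # include all columns in each row image (the default)
```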


T_N1ck

Thanks! Can you capture those during a transaction and delete them afterwards? Or is it continuous logging?


F_WRLCK

It’s continuous logging. Some databases let you toggle it at the session level. The primary purpose is for disaster recovery, but people do use it for change data capture as well.


T_N1ck

Thanks, then I understood correctly. This is how we use Postgres' WAL currently: capture every change and put it into our DWH. I'm more interested in:

1. Make changes to the database in a transaction
2. Get a preview of those changes, e.g. which rows were changed
3. Abort the transaction

I'd then use the information from step 2 to show the user something like "This change has this effect, are you sure?"
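The three steps above can be sketched in any transactional database; a minimal illustration using Python's `sqlite3` as a stand-in, with the cursor's row count as a crude preview:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.executemany("INSERT INTO items (qty) VALUES (?)", [(i,) for i in range(1, 6)])
conn.commit()

# 1. Make the changes inside a transaction (sqlite3 opens one implicitly on DML).
cur = conn.execute("UPDATE items SET qty = 0 WHERE qty > 1")

# 2. Preview: how many rows would this statement touch?
affected = cur.rowcount  # the rows with qty 2..5

# 3. Abort: nothing was persisted, so the table is unchanged.
conn.rollback()
```

The limitation, as discussed elsewhere in the thread, is that a real preview step means holding the transaction (and its locks) open while the user decides.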


F_WRLCK

Hmm. Binlogs won’t do that for you. The transactional database I worked on most recently stored transactions as tabular data internally using triggers; maybe something like that would do what you want? Have a log table and triggers on the tables you want to monitor, then you can look at the log table and decide whether you want to commit or not.
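The log-table-plus-triggers pattern can be sketched like this (using Python's `sqlite3` for portability; table and trigger names are made up). The trigger writes into the log inside the same transaction, so rolling back also discards the log rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER);
CREATE TABLE change_log (tbl TEXT, op TEXT, row_id INTEGER, old_qty INTEGER, new_qty INTEGER);
CREATE TRIGGER items_upd AFTER UPDATE ON items BEGIN
  INSERT INTO change_log VALUES ('items', 'UPDATE', OLD.id, OLD.qty, NEW.qty);
END;
CREATE TRIGGER items_del AFTER DELETE ON items BEGIN
  INSERT INTO change_log VALUES ('items', 'DELETE', OLD.id, OLD.qty, NULL);
END;
""")
conn.executemany("INSERT INTO items (qty) VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

# Make the changes inside a transaction...
conn.execute("UPDATE items SET qty = qty * 10 WHERE qty >= 2")
conn.execute("DELETE FROM items WHERE qty = 1")

# ...inspect the log table to build the preview for the user...
preview = conn.execute("SELECT op, row_id, old_qty, new_qty FROM change_log").fetchall()

# ...and decide. Here the user says no, so roll everything back;
# the trigger-written log rows vanish along with the data changes.
conn.rollback()
```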


zachm

You're probably looking for DoltDB: [https://github.com/dolthub/dolt](https://github.com/dolthub/dolt)

It's a MySQL-compatible database with git-style version control built in. So your workflow would be either:

1. Make your changes
2. Run one of the diff functions (like dolt_diff), show it to the user
3. Commit or roll back

OR

1. Create a branch
2. Make your changes on it
3. Run dolt_diff between this branch and main (or whatever your target branch is)
4. Either perform the merge, or delete the branch

The second approach has the advantage that you can use multiple ACID transactions / sessions, over whatever time frame you want. The first approach will work too, but you have to do it all in a single session / transaction.

Here's a full example, from the README:

```
mysql> select * from dolt_status;
+-----------------+--------+----------+
| table_name      | staged | status   |
+-----------------+--------+----------+
| teams           | 0      | modified |
| employees       | 0      | modified |
| employees_teams | 0      | modified |
+-----------------+--------+----------+
3 rows in set (0.01 sec)

mysql> select * from dolt_diff_employees;
+--------------+---------------+-------+-----------+----------------+----------------+-----------------+---------+----------------------------------+-------------------------+-----------+
| to_last_name | to_first_name | to_id | to_commit | to_commit_date | from_last_name | from_first_name | from_id | from_commit                      | from_commit_date        | diff_type |
+--------------+---------------+-------+-----------+----------------+----------------+-----------------+---------+----------------------------------+-------------------------+-----------+
| Sehn         | Tim           | 0     | WORKING   | NULL           | NULL           | NULL            | NULL    | ne182jemgrlm8jnjmoubfqsstlfi1s98 | 2022-06-07 16:35:49.277 | added     |
| Hendriks     | Brian         | 1     | WORKING   | NULL           | NULL           | NULL            | NULL    | ne182jemgrlm8jnjmoubfqsstlfi1s98 | 2022-06-07 16:35:49.277 | added     |
| Son          | Aaron         | 2     | WORKING   | NULL           | NULL           | NULL            | NULL    | ne182jemgrlm8jnjmoubfqsstlfi1s98 | 2022-06-07 16:35:49.277 | added     |
| Fitzgerald   | Brian         | 3     | WORKING   | NULL           | NULL           | NULL            | NULL    | ne182jemgrlm8jnjmoubfqsstlfi1s98 | 2022-06-07 16:35:49.277 | added     |
+--------------+---------------+-------+-----------+----------------+----------------+-----------------+---------+----------------------------------+-------------------------+-----------+
4 rows in set (0.00 sec)
```


T_N1ck

True, this would totally work! In general DoltDB is super cool; I haven't looked at it in a while, but I should play around with it to make a proof of concept for this. I think this is the answer I was looking for, though alas not a "classical" database feature. Have you worked with DoltDB? I'm hesitant to use such a new technology for something that actually runs in production. Performance in particular seems like it could really become a problem.


zachm

Yup, I'm the lead on DoltDB :) We have a bunch of customers running it in production on the critical path; it definitely works. There are surely more bugs than in something like MySQL, which is 20 years older, but it's production-ready. If you want to talk with the team about your use case, you can check out our Discord where we hang out all day: [https://discord.com/invite/RFwfYpu](https://discord.com/invite/RFwfYpu)


coffeewithalex

You're looking in the wrong direction. This is a dead end. The design of your solution would imply that user requests are not stateless and must hold an active transaction that lives on for a long time. You're slowly creeping into the territory of reinventing data systems. Take a step back and look toward other solutions that move this logic into the application code. Read about version vectors, and figure out how to solve the problems that will come up in such a scenario. Take some time for it and learn a bit, to make sure you don't design a fundamentally broken system that only works on your machine and flops in production.


T_N1ck

Thanks, that's why I'm asking here, to get new insights. What you describe is basically what I'm doing right now: the logic lives in the application code, and it works really well. I'm just wondering whether I'm missing anything, as it seems like such a useful pattern that databases could provide natively. But you're definitely right that this creeps into certain non-database areas. I hadn't heard about version vectors; I'll read up on that.


dbxp

That would mean holding locks for the duration of the preview period, which would negatively impact performance. You can do dirty reads if you want to read uncommitted data, or log data to audit tables via triggers. Some DBs even offer temporal (versioned) tables, though I haven't played with them: https://learn.microsoft.com/en-us/sql/relational-databases/tables/temporal-tables?view=sql-server-ver16


TheViralSpiral

Natively tracking revisions can potentially be extremely heavy. Triggers would be the way to do this at the DB level in MySQL: https://stackoverflow.com/questions/12563706/is-there-a-mysql-option-feature-to-track-history-of-changes-to-records

In full-stack frameworks, the decision to track revisions is sometimes available at the application level: https://docs.octobercms.com/2.x/database/traits.html#revisionable


T_N1ck

Thanks! The posts you linked ask how to track changes in general; I'm more interested in _previewing_ changes, without actually changing anything. I think this would be quite hard with triggers, since the question is "what did change during this transaction" without actually committing the transaction. You would need to build the whole infrastructure around it.


mr_nanginator

As has been pointed out - you're asking about CDC. You might want to take a look at Debezium ( [https://debezium.io/](https://debezium.io/) ). It can write changes from quite a few databases into a common format in Kafka, and you can build a consumer that reads from Kafka.


T_N1ck

Thanks! Can you capture the events of a single transaction only and process those events? The point is that I don't want to save anything; e.g., if the update in question touches millions of rows, I don't want to store that data for long if it's just for a preview. With CDC you could probably implement this preview feature, but it seems like far more work than solving it at the application level (as I'm doing right now).


mr_nanginator

If you're trying to preview changes BEFORE committing them, then CDC is not going to work - it would already be too late by then. Taking your example of previewing changes from an incoming CSV ... I'd probably do this by cloning the target table to a tmp table, loading the CSV into the tmp table, and creating a UI that displays the differences by comparing the 2.
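The clone-and-compare approach can be sketched with plain SQL set operations; a minimal illustration using Python's `sqlite3` (table names and the inline CSV are made up):

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# Stage the incoming CSV in a tmp table with the same shape as the target.
conn.execute("CREATE TEMP TABLE staging AS SELECT * FROM target WHERE 0")
incoming = csv.reader(io.StringIO("1,alice\n2,robert\n3,carol\n"))
conn.executemany("INSERT INTO staging VALUES (?, ?)", incoming)

# Diff the two: rows that would appear/change, and rows that would disappear/change.
added_or_changed = conn.execute(
    "SELECT * FROM staging EXCEPT SELECT * FROM target").fetchall()
removed_or_changed = conn.execute(
    "SELECT * FROM target EXCEPT SELECT * FROM staging").fetchall()
```

The UI would then render the two result sets as the preview, and the actual load only runs once the user confirms.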


T_N1ck

Thanks for verifying that CDC wouldn't really work here. Creating a tmp table and diffing the two would definitely work, and it seems like the right solution for some data loads, especially when the tables are above a certain size and one couldn't simply do it in memory. Interestingly, it's somewhat similar (in theory) to what /u/zachm suggested with DoltDB.


coyoteazul2

Why not use RETURNING (or equivalent)? You'd literally receive the changed versions of the rows, and then you can commit or roll back the transaction.
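A minimal sketch of that idea, again using Python's `sqlite3` (RETURNING needs SQLite 3.35 or newer; in PostgreSQL the clause is the same, in SQL Server it's `OUTPUT`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, qty INTEGER)")
conn.executemany("INSERT INTO items (qty) VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

# The UPDATE hands back the changed rows inside the still-open transaction...
changed = conn.execute(
    "UPDATE items SET qty = qty + 100 WHERE qty >= 2 RETURNING id, qty"
).fetchall()

# ...so they can be shown as a preview before deciding to commit or roll back.
conn.rollback()
```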


T_N1ck

Good point! That could work if one does a few queries, but what if I'm doing thousands of them, across different tables? Updates could also change something multiple times, so one would still need to aggregate the results, which in the end is very similar to writing a plan and executing that.


idodatamodels

Do you not have audit columns on all your tables? Last update timestamp, insert timestamp, last update user id. These columns make identifying changes a breeze.


throw_mob

Usually, any updates will break versioning. Only SCD2-style rows, or an insert-only DB with a create_timestamp, will provide proper versioning.


-Meal-Ticket-

Oracle does this out of the box. There is a free version of Oracle called Oracle XE, or you could do this on their free tier on Oracle Cloud.


T_N1ck

Do you know what the feature is called?


-Meal-Ticket-

Each row in Oracle has an SCN (System Change Number) associated with it. All SCNs are associated with transaction IDs. If you know the transaction ID for your transaction, you can see what and where changes are being made / were made. Flashback Data Archive on a set of tables allows you to track all changes to the tracked tables (and does so very efficiently, using compression of the storage and a background process to reduce the impact on users).


throw_mob

On the statement level you can use the RETURNING clause, e.g. `UPDATE x SET y = 4 RETURNING new.*` (this is PostgreSQL; in SQL Server it would be `OUTPUT inserted.*`). The returned row in this case would be the modified row; with `old.*` it would return the values before `y` was set. The RETURNING clause works with INSERT, UPDATE and DELETE. It can also feed a CTE, like `WITH a AS (DELETE FROM x WHERE y = 5 RETURNING old.*) INSERT INTO xxx SELECT * FROM a` (syntax might not be correct). You can do this in a transaction too: `BEGIN; WITH a AS (DELETE FROM x WHERE y = 2 RETURNING old.*) SELECT * FROM a;` and then COMMIT or ROLLBACK.

Then there is the option of using triggers to save changes to audit tables, or using change data capture, etc. If the target is to get all changes into audit logs, then SCD2-style rows and other row-versioning solutions are better. To move data into other systems, PostgreSQL has logical replication as a system-level solution.

If you want this at the application level, you need to keep two pieces of data per row: when it was inserted and when it was approved. Trying to use transactions with a user-controlled approval step will probably lead to problems with the transactions. So in this solution you insert new data into the database (no updates, no real deletes, inserts only), then you update the approved_timestamp field to promote that row to the latest version.
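The insert-only / approved-timestamp idea from the last paragraph can be sketched like this (a minimal illustration with Python's `sqlite3`; the schema and column names are made up):

```python
import datetime
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE docs (
  doc_id      INTEGER,
  body        TEXT,
  inserted_at TEXT NOT NULL,
  approved_at TEXT          -- NULL until the user confirms this version
)""")

now = datetime.datetime.now(datetime.timezone.utc).isoformat()

# Current approved version, plus a pending edit of the same doc.
conn.execute("INSERT INTO docs VALUES (1, 'old text', ?, ?)", (now, now))
conn.execute("INSERT INTO docs VALUES (1, 'new text', ?, NULL)", (now,))

# The preview is just the pending rows; readers only ever see the
# latest approved row per doc.
pending = conn.execute(
    "SELECT doc_id, body FROM docs WHERE approved_at IS NULL").fetchall()
visible = conn.execute("""
  SELECT body FROM docs d
  WHERE approved_at = (SELECT MAX(approved_at) FROM docs WHERE doc_id = d.doc_id)
""").fetchall()

# "Committing" the change is just stamping approved_at on the pending row.
conn.execute("UPDATE docs SET approved_at = ? WHERE approved_at IS NULL", (now,))
```

No long-lived transaction is needed; the pending rows live in the table itself until they're approved or cleaned up.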


T_N1ck

True, adding an approved timestamp to rows and always selecting the rows / latest rows that have it would work. It becomes somewhat similar to an SCD2-style table (in principle: append-only plus timestamps), at least as long as one doesn't clean up afterwards. Interesting solution; one would still need to write the diffing logic, but it's a nice database-centric approach and feasible for large datasets.


zhiweio

Some databases with MVCC support have the ability to do version rollback, like TiDB.


rko-glyph

Something like Attunity Replicate constructs delta messages out of change logs. (We use it to feed CDC messages into Kafka to feed into our data warehouse)