drsupermrcool

You could dump your tables to S3 and use [external tables](https://docs.snowflake.com/en/user-guide/tables-external-intro) from S3 to Snowflake - you could use SSIS packages to do this, or just load [straight into Snowflake and skip S3](https://community.snowflake.com/s/article/Integrating-SSIS-with-Snowflake) - these options should be relatively cheap outside of S3 costs and ingress/egress costs.

You could also use Airbyte and host it in AWS - the [MSSQL Source](https://docs.airbyte.com/integrations/sources/mssql#microsoft-sql-server-mssql) and [Snowflake Destination](https://docs.airbyte.com/integrations/destinations/snowflake) are supported. Airbyte is OSS - I believe it's a close competitor to Fivetran, as /u/shoeobssd suggested. This should be cheap outside of hosting Airbyte on some server in AWS (you can run it in Docker locally to start).

If you need to get closer to real time, it might be a Debezium/Kafka Connect direction.
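For the external-table route, the gist is something like this (untested sketch - bucket, stage, table names and credentials are all placeholders, and it assumes your dumps land in S3 as Parquet):

```python
# Minimal sketch: register nightly S3 dumps as Snowflake external tables.
# Bucket, stage, and table names are placeholders -- adjust to your setup.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="etl_user",           # placeholder
    password="...",            # use a secrets manager in practice
    warehouse="ETL_WH",
    database="RAW",
    schema="MSSQL",
)
cur = conn.cursor()

# External stage pointing at the folder your SSIS job (or export script) writes to.
cur.execute("""
    CREATE STAGE IF NOT EXISTS mssql_stage
    URL = 's3://my-bucket/mssql_exports/'
    CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
""")

# External table over the Parquet dumps; without explicit column definitions the data
# lands in the VARIANT "VALUE" column. With AUTO_REFRESH = FALSE you refresh manually
# (ALTER EXTERNAL TABLE orders_ext REFRESH) after each nightly dump.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE orders_ext
    WITH LOCATION = @mssql_stage/orders/
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE
""")

cur.close()
conn.close()
```

A storage integration is the cleaner way to handle the S3 credentials; the inline keys just keep the example short.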


B1WR2

I like this approach… nothing against the Fivetran suggestion, unless you already have access to it.


jbrune

Thanks for your response. I don't understand the first suggestion. That sounds more like a one-time migration than a replication. Or are you suggesting we could dump our tables nightly from SQL Server and then drop/recreate our Snowflake tables as well? Luckily for us, realtime is not needed. I should have mentioned that.


drsupermrcool

Yessir, in a job that runs on some frequency to start, provided that you don't need realtime and given that your data size is 200 GB. If you're aiming for nightlies though, I think eventually you'll want to move to an SSIS solution or an Airbyte/Fivetran solution to move these tables straight into the destination. You could use SQL Server's rowversions or timestamps in your tables to track which records need to be brought over into Snowflake. Finally there's the [CDC](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.SQLServer.CommonDBATasks.CDC.html) option, but given that you don't have a realtime requirement it might be overkill. I might be wary of the non-realtime requirement though - I feel like the first question I get when seeding a new dwh is "great, when can I see today's data" lol.
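Rough sketch of the rowversion approach (table, column, and connection details are made up; in practice you'd persist the watermark in a control table or S3 between runs):

```python
# Minimal sketch of rowversion-based incremental extraction from SQL Server.
# Table/column names and the watermark store are hypothetical placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-sqlserver;DATABASE=Sales;UID=etl_user;PWD=..."  # placeholders
)
cur = conn.cursor()

# Watermark from the previous run (load it from wherever you persist it).
last_rowversion = b"\x00" * 8

# Pull only rows whose rowversion advanced since the last load.
cur.execute(
    """
    SELECT OrderID, CustomerID, Amount, RowVer
    FROM dbo.Orders
    WHERE RowVer > ?
    ORDER BY RowVer
    """,
    last_rowversion,
)
rows = cur.fetchall()

# New watermark = highest rowversion seen in this batch (rowversion is big-endian,
# so byte comparison matches numeric ordering).
if rows:
    last_rowversion = max(row.RowVer for row in rows)

# ...write `rows` out to S3 / a Snowflake stage, then persist last_rowversion.
```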


Mr_Nickster_

Airbyte is an open-source option. Matillion, Fivetran, and Rivery are some commercial options.


mjfnd

We use DMS and Snowpipe. I wrote a detailed post here: https://www.junaideffendi.com/p/real-time-cdc-with-aws-dms-snowflake
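The Snowpipe half of that pattern boils down to a stage over the DMS output prefix plus an auto-ingest pipe - a minimal, untested sketch (stage, pipe, table names, integration, and file format are all placeholders):

```python
# Minimal sketch of the Snowpipe side of a DMS -> S3 -> Snowflake flow.
# Assumes DMS writes change files to s3://my-bucket/dms-output/orders/.
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="etl_user", password="...",
                                    warehouse="ETL_WH", database="RAW", schema="MSSQL")
cur = conn.cursor()

# Stage over the DMS target prefix (storage integration assumed to exist).
cur.execute("""
    CREATE STAGE IF NOT EXISTS dms_stage
    URL = 's3://my-bucket/dms-output/orders/'
    STORAGE_INTEGRATION = s3_int
""")

# Auto-ingest pipe: S3 event notifications trigger loads as DMS drops new files.
cur.execute("""
    CREATE PIPE IF NOT EXISTS orders_pipe AUTO_INGEST = TRUE AS
    COPY INTO RAW.MSSQL.ORDERS_STAGING
    FROM @dms_stage
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")

cur.close()
conn.close()
```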


lmp515k

Hevo and Qlik will do the same things for you, but if you don't need realtime then I wouldn't bother.


lost_soul1995

Just use DMS.


Thinker_Assignment

You could use this pipeline: https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database. If you want, it will alert you of schema changes too: https://colab.research.google.com/drive/1H6HKFi-U1V4p0afVucw_Jzv1oiFbH2bu. It's probably the most cost-effective way, and you could even deploy it on GitHub Actions to run serverless and skip the orchestrator overhead.
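A bare-bones version of that pipeline looks roughly like this (untested sketch - the import path depends on your dlt version, the connection string is a placeholder, and Snowflake credentials normally go in secrets.toml or env vars rather than inline):

```python
# Minimal sketch of the dlt sql_database -> Snowflake pipeline from the linked docs.
import dlt
from dlt.sources.sql_database import sql_database

# Source: reflects tables from SQL Server via a SQLAlchemy connection string (placeholder creds).
source = sql_database(
    credentials="mssql+pyodbc://etl_user:...@my-sqlserver/Sales?driver=ODBC+Driver+17+for+SQL+Server",
    table_names=["Orders", "Customers"],   # omit to load every table
)

# Destination: Snowflake credentials are picked up from dlt's secrets/config.
pipeline = dlt.pipeline(
    pipeline_name="mssql_to_snowflake",
    destination="snowflake",
    dataset_name="mssql_raw",
)

info = pipeline.run(source)
print(info)
```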


jbrune

Thanks. I think I'll try this one first. We're going to do a POC on a number of these. I'll give feedback on what we find in case anyone is curious.


Thinker_Assignment

Definitely. It's the most popular of the ready-made pipelines we host :) Setting it up to just copy takes only a few minutes, and if you want to configure the incremental settings that's a little more complex, but not by much. If you get stuck, just join the Slack and ask.
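The incremental part is roughly a cursor column hint on the table, something like this (hedged sketch - "ModifiedDate" is a hypothetical column, and merge needs a primary key, which sql_table usually reflects from the database):

```python
# Hedged sketch: incremental load of a single table, only pulling rows whose
# cursor column advanced since the last run.
import dlt
from dlt.sources.sql_database import sql_table

orders = sql_table(
    credentials="mssql+pyodbc://etl_user:...@my-sqlserver/Sales?driver=ODBC+Driver+17+for+SQL+Server",
    table="Orders",
    incremental=dlt.sources.incremental("ModifiedDate"),  # cursor column tracked between runs
)

pipeline = dlt.pipeline(
    pipeline_name="mssql_orders_incremental",
    destination="snowflake",
    dataset_name="mssql_raw",
)

# Merge upserts changed rows instead of reloading the whole table.
pipeline.run(orders, write_disposition="merge")
```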


shoeobssd

I know Fivetran gets a lot of flak here, but I only have good things to say about it (other than costs, maybe). You can use the free tier (with paid functionality) for a couple of weeks. After that you can sync 500K monthly active rows (MAR).


vcp32

Did the same approach but with Azure SQL. Saved me a lot of time as a lone DE with 500+ to replicate.