My dev blog where I dive deep into TypeScript, Postgres, Data science, Infrastructure, Ethereum, and more...

Snowflake vs Clickhouse for tracking event data

3rd Aug 2023

As our product progresses further, our data stack becomes more and more important.

If we have a good, scalable, easy-to-use data stack, we could capture lots of data and build actionable analytics and recommendation engines. With UserAddresses, our data already has potential to be our most valuable asset. If we could hook in other sources, and collect a variety of event data, we could further enrich this proposition.

Specifically, this is what I wanted:

  • A function in our TS code we can use to send events with some data.
    • void sendEvent("QUEST_VIEW", {...})
  • Make the data available in BI tools (Tableau for example)
  • It shouldn’t be too expensive. We should feel that we can register lots of data without worrying about spiraling costs.

The data landscape

There are lots of tools in the data landscape.

Here is my rough take on what’s out there:

  • CDI - Tools to collect behavioral data or event data from apps/websites and put it in a data warehouse
    • Snowplow, Rudderstack (when sending events to it)
  • ETL - scripts for extracting data from a source place and put it in a data warehouse
    • ETL-as-a-service: Stitch, Rudderstack (when syncing from a source)
  • Data warehouse
    • Boring: Google BigQuery, Amazon Redshift, Azure Synapse Analytics
    • Cool: Snowflake, ClickHouse, Firebolt (built on ClickHouse), Databricks

We should use a data warehouse

My gut feel is that starting to store our data in a data warehouse is a good idea.

  • Native integration with lots of BI tools
  • We own the data, it’s not stuck in different 3rd party tools
  • Sets us up for a future where we use the data in production queries (example: recommandation engine that uses more data than just postgres data), and with data science teams

We should not use CDP platforms like Mixpanel

CDPs like mixpanel make it super easy to track events:

void mixPanel.sendEvent("QUEST_VIEW", {...})

However there are two drawbacks:

  • They don’t integrate natively with BI tools (like Tableau). Meaning we need an ETL service to sync things to a data warehouse.
  • Since we use Tableau, the whole data visualization part of the CDP is kind of useless. We are paying for more than we need if we only use it for data capture (like a CDI).
 
 
We should not use services to transform data (ETL services)

Lots of tools are centered around transformations and moving data around.

Example: Stitch (get data from source A to destination B), or getDbt (transforming data inside data warehouse)

It makes sense when you are a shitty corportation with lots of unstructured and weird data at different places.

We should avoid these workflows. We are a small team, and one of our advantages is that we can keep the data structured and clean most of the way in the pipeline.

 

Getting data into the warehouse: CDI vs custom direct code

Since we don’t want to use ETL services, there are two ways to get data into the data warehouse.

  • CDIs. Services that provide a nice TS library, and stores data in the data warehouse for us.
    • Cons: price, data might not be structured as we want
  • Direct interface. (connect from our ts code to the sql server)
    • Cons: need more code. Handle db table creation, migration (no prisma magic). More manual things to take care of.

I implemented our first test stack with a CDI (Rudderstack). It ended up storing userIds as strings (super annoying for yahya) and hit the 1.000.000 event cap after 10 days. (steep pricing curve after that). In the end of the day, it seems not worth it for a service that basically just shuffles data from A to B.

Choosing a warehouse: Snowflake vs Clickhouse

“Snowflake is like a shipyard, clickhouse is like a jet ski.” (quote)

  • Snowflake is more feature rich
  • Snowflake scales better for big use-cases (separated storage compute)
  • ClickHouse is really fast
  • ClickHouse seems to be more a DB for analytics stuff, where Snowflake is more dump all your data and query it
  • Snowflake has broader BI support
  • 💰 for pricing I don’t know 🤷
  • ClickHouse is open source, there are multiple providers
ClickHouse is cool on twitter
  • ClickHouse has a better npm package
 

It would be cool to try ClickHouse. My gut feel is there now.

 

Tools