Any experience using Apache Cassandra for websites?

         

lammert

5:56 pm on Jan 19, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



My current website database backend is a group of MariaDB servers synchronized with Galera Cluster. It is a multi-master setup, which is great for availability. Availability is one of the main concerns for this cluster because of some of the applications using the database. If one server is taken out of the cluster, the other nodes just continue. Failover between servers is automatic: through HAProxy for internal clients, and through DNS failover for external clients.

As the database is growing, I will need to increase the cluster size around the end of this year. Basically there are two options. One is to scale up and use larger, more expensive hardware per node. The other is to scale out and use more commodity hardware nodes. When increasing the storage capacity, I also plan to increase performance, so that it does not become the bottleneck in the future.

MariaDB with Galera doesn't scale out very well. All data is stored on every node, not sharded between multiple nodes. Increasing the number of nodes in a MariaDB/Galera cluster also increases write latency, which can become a bottleneck. I am therefore looking at other high-availability architectures that still allow one or more nodes to crash and that can be distributed over multiple data centers. Apache Cassandra seems the best candidate so far. Write performance increases as the number of nodes increases, and sharding is part of the basic architecture.
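To make the storage side of that comparison concrete, here is a rough back-of-the-envelope sketch. It assumes data is spread evenly over the nodes and ignores indexes, compaction overhead and free-space headroom. The function name and the example numbers are made up for illustration and are not measurements from my cluster.

```python
from typing import Optional


def per_node_storage_gb(dataset_gb: float, nodes: int,
                        replication_factor: Optional[int] = None) -> float:
    """Approximate amount of data each node has to hold.

    replication_factor=None models full replication, where every node
    stores the complete dataset (the Galera-style setup described above).
    Otherwise each item is stored on `replication_factor` nodes and the
    copies are assumed to be spread evenly over the cluster.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    if replication_factor is None:
        # Full replication: adding nodes does not reduce per-node storage.
        return dataset_gb
    if not 1 <= replication_factor <= nodes:
        raise ValueError("replication factor must be between 1 and the node count")
    # Sharded with replication: total copies divided over all nodes.
    return dataset_gb * replication_factor / nodes


# Hypothetical example: 1200 GB of data on a 6-node cluster.
full = per_node_storage_gb(1200, 6)        # every node holds everything
sharded = per_node_storage_gb(1200, 6, 3)  # each item kept on 3 of the 6 nodes
```

With full replication, adding nodes never lowers the per-node storage requirement, while in the sharded case it drops as the cluster grows for a fixed number of copies.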

Can anyone here share experience in running Cassandra as a database backend for websites?

pontifex

6:31 pm on Feb 10, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Lammert, I hope the storm passed you quickly and safely last night :-)

Regarding your question, I would first start a little discussion about your setup:

In my experience, there are many ways to solve scalability issues. It can be hardware, it can be a specific configuration of an existing suite, and it can also be a migration. In your case I would first ask: you have no experience with Cassandra, but you do with MariaDB? In that case, I am always hesitant about migrating to a new software solution (here Cassandra) because of the lack of know-how within the organisation.

I would only recommend such a step if you have a staging option, so that you can stress-test the new software in your own environment and become proficient at administering the setup.

MariaDB can scale massively if handled right.

This is a good read: [blog.scottlogic.com...] (not affiliated, no ad!)

Are you sure that scaling and sharding with MariaDB, in combination with an indexing server (Lucene, sphinxsearch, etc.), would not be easier for you to handle, given that you already know MariaDB well? How many write operations do you have per day? How big are the tables?

Cheers,
Ralf

lammert

9:56 pm on Feb 10, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The storm was just a minor annoyance. We sent it to our eastern neighbors to deal with :)

I understand your hesitation about switching architectures unless the new one is fully tested and accepted as the better solution. It can be a recipe for disaster. I am currently in the luxurious position that I haven't yet outgrown my current setup. As I mentioned in the first post, I expect that to happen around the end of the year. I also have a cluster available on which I can run Cassandra in parallel for a prolonged time before making the switch.

The database usage is somewhat unusual, in that there are many more insert operations than read operations. It is not a regular website but an online service where product and production data is stored for multiple clients. The data grows steadily for each client, as does the number of clients connected to the system.

They use it to check daily operations and to search for historic trends to predict future production needs. Normally you would archive older, unused chunks of data and store them in a safe place, but due to regulations this isn't possible. As the products are public health-related, there are limits on the time allowed, after a product or production failure, to determine which other products are related and must be recalled. Recalls can be based on issues in multiple paths of the supply and production chain, and you won't know in advance which ones. Therefore all information must be available in near real-time to be able to make a fast recall decision when a problem arises.

With the current multi-master MariaDB setup, all writes are distributed to all nodes. If you increase the cluster size, the time this takes increases, as you can also see in the graphs in the post you mentioned (which I had already found and studied earlier, BTW). More nodes also increase the chance of deadlocks, because one client can stream data to multiple masters for redundancy purposes.

While researching, I found that Cassandra matches my needs best. You can set the number of nodes on which each data item must be replicated. This can be high enough to provide redundancy, but lower than the number of servers. With my current MariaDB multi-master setup, the number of storage nodes for an item is always the total number of servers.
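As a rough illustration of "replicated on some nodes, but not all", here is a minimal sketch of replica placement on a hash ring. It is a simplification and not Cassandra's actual implementation: I use MD5 from the Python standard library as a stand-in hash, one token per node, and no rack or data center awareness. The `Ring` class, the node names and the example key are all hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (stand-in hash, assumption)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Toy hash ring: each key is stored on `replication_factor` nodes."""

    def __init__(self, nodes, replication_factor):
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and the node count")
        self.rf = replication_factor
        # Each node owns one position on the ring, sorted by token.
        self._ring = sorted((token(name), name) for name in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str):
        """Return the nodes holding `key`: the first node clockwise from
        the key's token, plus the next rf - 1 nodes on the ring."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.rf)]


# Hypothetical 5-node cluster where each item is stored on 3 nodes.
ring = Ring(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
owners = ring.replicas("client42:batch-0007")
```

Every key lands on three of the five nodes, so losing one or two nodes still leaves a copy of each item available, while no node has to hold the full dataset.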

Another reason why I prefer Cassandra over other NoSQL databases like MongoDB is that it has no single bottleneck. In a MongoDB replica set, writes pass through one primary, which replicates them to the secondaries.

But if you think MariaDB can easily scale on write operations without adding much overhead, I would be happy to learn how. It is always better to stick with something that works, and has proven itself over the last few years, if there is no reason to replace it.