
Conversation

@epugh (Contributor) commented Oct 19, 2024

https://issues.apache.org/jira/browse/SOLR-17492

Description

Add recommendations on best practices for deploying Solr.

Solution

I am starting with the approach that I shared at Community Over Code NA, just to get us moving. I would love the wisdom of the community. We have many areas where different folks have knowledge, and it's all pretty tribal. I'd like to get it all written down so folks don't have to relearn the same things over and over.

Tests

No tests, but this does need eyeballs!

epugh added 3 commits October 19, 2024 11:08
This is going to be a lot of text and diagrams and may become multiple pages.
github-actions bot added the "documentation" (Improvements or additions to documentation) and "tool:build" labels on Oct 19, 2024
@epugh (Contributor, Author) commented Oct 19, 2024

We have diagrams generated in our Markdown!
[image]

@epugh (Contributor, Author) commented Oct 22, 2024

First pass is done! I have marked a number of places with NOTE: where more input is needed. I think this could be a good page to discuss as a group at a Community Meetup, to make sure we are going in a direction that the community supports.

@ardatezcan1 commented

Whether you're just getting started with Solr or looking to fine-tune an existing setup, these practical tips and real-world scenarios may help you get the most out of this powerful search platform.

Best Practices for Using Solr

1. Run Solr as a Cluster for Better Performance
Solr works best when deployed as a cluster. Start with at least three nodes for fault tolerance and scalability, and scale horizontally as your needs grow.

  • Sharding and Replication: Break your data into shards for parallel processing and use replicas for redundancy. A good starting point is two replicas per shard, but adjust this based on your workload (see the sketch after this list).

  • Optimize Indexing: Carefully plan your schema to ensure efficient indexing and querying. Use dynamic fields and copy fields where appropriate to keep things flexible without overloading your system.

  • Caching for Speed: Solr provides powerful caching options like query, document, and filter caches. Use these for frequently accessed data to speed up query times significantly.

  • Tune the JVM: Since Solr is Java-based, JVM tuning is crucial. Adjust heap size to balance memory usage and garbage collection. Monitor GC logs and experiment with collectors like G1GC (or CMS on older JDKs) for optimal performance.
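To make the sharding and replication bullet above concrete, here is a minimal sketch using the Collections API. The collection name, configset name, and exact counts are placeholders for illustration, not recommendations from this PR:

```bash
# Hypothetical example: create a "products" collection with 4 shards and
# 2 replicas per shard, using a configset already uploaded to ZooKeeper.
curl -G "http://localhost:8983/solr/admin/collections" \
  --data-urlencode "action=CREATE" \
  --data-urlencode "name=products" \
  --data-urlencode "numShards=4" \
  --data-urlencode "replicationFactor=2" \
  --data-urlencode "collection.configName=products_conf"
```

The right shard and replica counts depend on index size and query volume; the two-replicas-per-shard figure simply mirrors the starting point suggested above.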

2. Always Use Solr in Cloud Mode
For a robust, scalable setup, SolrCloud mode is the way to go. This setup requires ZooKeeper, which manages cluster coordination, leader election, and configuration.
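As a rough sketch of what starting a node in cloud mode against an external ensemble can look like (hostnames and the ZooKeeper chroot are placeholders, and flags may vary slightly between Solr versions):

```bash
# Hypothetical: start each of the three (or more) nodes in cloud mode (-c),
# all pointing at the same external ZooKeeper ensemble (-z).
bin/solr start -c -z "zk1:2181,zk2:2181,zk3:2181/solr"
```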

  • ZooKeeper’s Role: ZooKeeper ensures your Solr cluster runs smoothly by handling shard placement, failover, and configuration changes dynamically.

  • Backups and Security:
    - Always back up your Solr and ZooKeeper data regularly. Use Solr's built-in backup tools or external snapshot mechanisms for safety (a backup sketch follows this list).
    - Secure your cluster with SSL/TLS, and set up role-based access control, ideally with tools like Apache Ranger. If Ranger isn't an option, manual permissions management works too.

  • Monitoring is Essential: Keeping an eye on your Solr cluster is crucial for ensuring smooth operations. A great place to start is the Solr Web UI, which provides a user-friendly interface to monitor metrics like query performance, index health, and cache usage. It's easy to use and perfect for quickly spotting any issues. For more advanced needs, you may integrate tools like Prometheus and Grafana for custom dashboards and alerting. However, I should mention that I don’t have direct experience with Prometheus or Grafana specifically when working with Solr.
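Picking up the backup bullet from the list above, here is an illustrative sketch of backing up a collection with the Collections API. The backup name, collection, and location are placeholders; the location generally needs to be a shared filesystem path or a configured backup repository reachable from every node:

```bash
# Hypothetical nightly backup of the "products" collection.
curl -G "http://localhost:8983/solr/admin/collections" \
  --data-urlencode "action=BACKUP" \
  --data-urlencode "name=products-nightly" \
  --data-urlencode "collection=products" \
  --data-urlencode "location=/mnt/backups/solr"
```

A matching RESTORE action exists for bringing a backup back into the cluster.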

Usage Scenarios: Real-World Applications of Solr
1. Managing Solr for a Large Dataset
I used open-source Solr as a search engine for a mobile app. Instead of interacting with Solr directly, I managed the setup via ZooKeeper APIs. Here’s what that looked like:

  • Cluster Configuration:
    The cluster handled over 100 TB of data spread across 11 physical machines, each running 16 Solr instances.
  • Sharding and Replication:
    Data was stored in shards, with each shard having two replicas to ensure fault tolerance and load balancing.
  • Data Storage:
    Data was stored directly on the local file system, which was a great fit for this use case.
  • Management Approach:
    Instead of accessing Solr directly, I managed the system via ZooKeeper APIs. This approach, even with an embedded ZooKeeper, worked efficiently under heavy load.
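For readers who have not worked this way, here is a small sketch of peeking at cluster state through ZooKeeper with the Solr CLI rather than the Solr HTTP APIs (the ZooKeeper address is a placeholder):

```bash
# List the nodes currently registered as live, straight from ZooKeeper.
bin/solr zk ls /live_nodes -z zk1:2181

# List the collections known to the cluster.
bin/solr zk ls /collections -z zk1:2181
```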

2. Using Solr with Cloudera and HDFS
Another scenario involved deploying Solr in a Cloudera ecosystem with HDFS for storage. Here’s what worked and what didn’t:

  • Cluster Management:
    ZooKeeper handled cluster coordination, while Ranger (and previously Sentry) managed permissions.
  • Challenges:
    Occasionally, node failures caused HDFS file locks, which were difficult to resolve without downtime. These required manual fixes and a lot of patience!

If you’ve got questions or need help with something specific, just let me know. I’m happy to share more!

@github-actions bot commented

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

github-actions bot added the "stale" (PR not updated in 60 days) label on Jan 30, 2025
@epugh (Contributor, Author) commented Jan 30, 2025

This remains on my "must do" list for Solr 10, and I will pick it up as we get closer ;-).

github-actions bot removed the "stale" label on Jan 31, 2025
@github-actions bot commented Apr 2, 2025

This PR has had no activity for 60 days and is now labeled as stale. Any new activity will remove the stale label. To attract more reviewers, please tag people who might be familiar with the code area and/or notify the dev@solr.apache.org mailing list. To exempt this PR from being marked as stale, make it a draft PR or add the label "exempt-stale". If left unattended, this PR will be closed after another 60 days of inactivity. Thank you for your contribution!

github-actions bot added the "stale" label on Apr 2, 2025
@github-actions bot commented Jun 2, 2025

This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again.

github-actions bot added the "closed-stale" (Closed after being stale for 60 days) label on Jun 2, 2025
github-actions bot closed this on Jun 2, 2025
epugh added the "exempt-stale" (Prevent a PR from going stale) label and removed the "stale" and "closed-stale" labels on Jun 2, 2025
@epugh (Contributor, Author) commented Jun 2, 2025

I am kind of waiting for the 10.x release cycle to spin up to push this along. There are some things I would change/update in this doc if we get some nicer ZK quorum stuff and role stuff done...

epugh reopened this on Jun 2, 2025
@epugh (Contributor, Author) commented Jun 18, 2025

@tboeghk this is what we talked about in line for lunch!! Would really appreciate your perspective.

@tboeghk (Contributor) commented Jul 22, 2025

In addition to the great summary from @ardatezcan1 above, here are my practical tips and real-world scenarios for running Solr in a high-RPM (requests per minute), low-to-medium dataset-size environment (like e-commerce applications).

Best practices for using Solr in high-RPM environments

Before starting to optimize your Solr setup, make sure to have strong observability in place. In addition to the Solr Prometheus and Grafana setup, I strongly recommend setting up the Node Exporter to gather and correlate machine metrics.
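As a sketch of what that wiring can look like, the commands below start the Prometheus exporter that ships with Solr and a Node Exporter on a Solr machine. Paths, flags, and ports are from memory and vary by Solr version, so treat them as assumptions to verify against your distribution:

```bash
# Hypothetical: expose Solr metrics for Prometheus via the bundled exporter,
# pointing it at the SolrCloud cluster through ZooKeeper.
prometheus-exporter/bin/solr-exporter \
  -p 9854 \
  -z "zk1:2181,zk2:2181,zk3:2181/solr" \
  -f prometheus-exporter/conf/solr-exporter-config.xml

# Node Exporter runs as its own daemon on every Solr machine; Prometheus then
# scrapes both endpoints (e.g. :9854 for Solr, :9100 for machine metrics) so
# the two can be correlated in Grafana.
./node_exporter
```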

  • Use Solr in cloud mode: Running Solr in cloud mode with a ZooKeeper ensemble is a prerequisite for the following best practices. Cloud mode enables easy addition and removal of Solr cluster nodes depending on the current traffic.
  • Sharding: Request processing in Solr is a single-threaded operation, so the larger your dataset, the more latency you add to request processing. The only sustainable way to make query processing multi-threaded is to shard your index. Depending on your workload, you could simply run multiple Solr instances on the same machine, though I recommend a single Solr instance per machine.
  • Indexing and optimization strategies: Indexing into a live collection adds significant latency to your search requests. Each commit flushes the internal caches, and those caches are what keep Solr running fast, so avoid any unnecessary cache flushes!
    • Optimize your index: Manually optimizing your index is generally not recommended, but it delivers the best query performance because deleted documents are pruned from the index.
    • Rotate collections: For smaller to medium datasets it can be a good strategy to periodically index your data into a new collection instead of updating an existing one. That way, request caches stay warm for the lifetime of a collection and a manual optimize is possible. Use collection aliases to switch clients to the new collection (see the sketch after this list).
  • Use dedicated node setups: In high-traffic environments, separation of concerns becomes more important. Use dedicated node types and machine sizing tailored to each machine's role for optimal performance.
    • Indexer: Used solely for indexing products. Set up as the TLOG replica type. Must not be used for request processing; exclude TLOG nodes from request processing using the shards.preference parameter configured on your request handlers (see the sketch after this list).
    • Data: Set up as PULL replicas. These replicate their index from the indexer nodes via SolrCloud. Using TLOG and PULL replicas keeps indexing work off the data nodes (unlike NRT replicas, where every replica indexes documents itself).
    • Coordinator: In sharded SolrCloud setups, these nodes coordinate the distributed request flow and assemble the final search result. This is a very CPU-intensive operation and is usually shared among the data nodes. Dedicated coordinator nodes move the compute overhead of coordinating distributed requests off the data nodes, and adding them to a SolrCloud setup will drop resource usage on the data nodes significantly. To make full use of coordinator nodes, direct all incoming request traffic to them.
  • JVM tuning: I highly recommend running Solr on the G1 garbage collector. Keep in mind the golden rule of giving at most 50% of a machine's RAM to the heap, leaving the rest for the OS disk cache, on data and indexer nodes. As coordinator nodes are stateless, you can boost their performance significantly with the ZGC garbage collector, which cuts collection pauses from milliseconds to well under a millisecond.
  • Cloud setup: Most SolrCloud setups will run in some kind of cloud environment. Here are some tips for setting up an elastic Solr environment.
    • Autoscaling: Use a dedicated autoscaling group for each node type and each shard. Use tags to mark which instance should replicate which shard. Configure your heap settings dynamically and allow a wide range of instance types. Build a custom script to replicate data upon instance start, and use the Solr Collections API to remove a node from the cluster during instance termination.
    • Spot instances: Coordinator and data nodes are great candidates for spot instances, which can save a significant amount of cloud spend.
    • ARM instance types: Use ARM instance types wherever possible. The Solr Docker image is also pre-built for ARM architectures. ARM CPUs offer the best bang for the buck and more consistent response latency (as their CPUs are not power managed).
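To tie the collection-rotation and dedicated-node points together, here is an illustrative sketch using standard SolrCloud features (tlogReplicas/pullReplicas at collection creation, shards.preference on queries, and CREATEALIAS). Collection names, counts, and hosts are placeholders:

```bash
# Hypothetical: create the next collection generation with TLOG replicas for
# the indexer tier and PULL replicas for the data tier (no NRT replicas).
curl -G "http://localhost:8983/solr/admin/collections" \
  --data-urlencode "action=CREATE" \
  --data-urlencode "name=products_v2" \
  --data-urlencode "numShards=4" \
  --data-urlencode "nrtReplicas=0" \
  --data-urlencode "tlogReplicas=1" \
  --data-urlencode "pullReplicas=2" \
  --data-urlencode "collection.configName=products_conf"

# Keep queries off the TLOG (indexer) replicas by preferring PULL replicas;
# the same parameter can be set as a default on request handlers.
curl -G "http://localhost:8983/solr/products_v2/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "shards.preference=replica.type:PULL"

# Once products_v2 is fully indexed (and optionally optimized), switch clients
# over atomically with a collection alias.
curl -G "http://localhost:8983/solr/admin/collections" \
  --data-urlencode "action=CREATEALIAS" \
  --data-urlencode "name=products" \
  --data-urlencode "collections=products_v2"
```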

If you need more information, or help compiling all of this into a single document, let me know!

@epugh (Contributor, Author) commented Jul 22, 2025

@tboeghk so I have a new taking-solr-to-production.adoc doc that tries to be an opinionated scaling guide. I think that a LOT of what you mentioned makes sense at the "Moving Beyond the Basic Cluster" scaling point, which I listed as roughly six to 12 nodes in your cluster. I know all "best practices" could be done earlier, but I'm trying to frame this as "when you get to this size, you need to do this"... Thoughts? The number of nodes, while a simplistic measure, is to me also the easiest to explain, versus query load, index load, or data load, which would be more complex for deciding "where am I".

github-actions bot added the "dependencies" (Dependency upgrades) label on Aug 18, 2025
@epugh (Contributor, Author) commented Aug 18, 2025

@tboeghk and @ardatezcan1 I've updated this branch to run with the latest version of Solr. My goal is to get this doc in (in one form or another) before Solr 10 comes out. If either of you wants to edit the doc to factor in your suggestions, please feel free. Otherwise I will try to harvest your comments and add them, but it'll be more from my own personal perspective.

@epugh (Contributor, Author) commented Aug 18, 2025

For those who haven't seen it, we are now generating diagrams from ASCII markup! I am excited to make it easier to add diagrams to Solr that don't require a binary image, which is then hard to update.

[image]
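For anyone curious what this looks like in the page source, here is a rough sketch of a text-described diagram in an .adoc file. I am assuming an asciidoctor-diagram style block with PlantUML; the actual backend wired into the Ref Guide build by this PR may differ:

```asciidoc
// Hypothetical snippet from a deployment page: the diagram is written as text
// and rendered to an image at build time.
[plantuml, basic-solr-cluster, svg]
----
node "Solr node 1" as s1
node "Solr node 2" as s2
node "Solr node 3" as s3
node "ZooKeeper ensemble" as zk
s1 --> zk
s2 --> zk
s3 --> zk
----
```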

@epugh (Contributor, Author) commented Sep 16, 2025

Some good progress. If #2391 happens, then this is good to go. If #2391 doesn't land before 10, then I'll edit this and merge it.

@epugh (Contributor, Author) commented Oct 9, 2025

Hi all who have contributed to this long-lived PR! With Solr 10 close to being released, I wanted to bend this towards something mergeable. I've edited the doc down, and there is only one TBD that needs editing before this can be merged.

The doc is narrower than this PR suggests; however, I think there is an "Extreme Scale" (or some such) doc that could be made that would take in a lot of the feedback provided.

@epugh (Contributor, Author) commented Oct 13, 2025

In order to not have "forward-looking" text in the Ref Guide, we need #2391 to get in... I am going to take a stab at it tomorrow.

@epugh (Contributor, Author) commented Nov 29, 2025

See https://issues.apache.org/jira/projects/SOLR/issues/SOLR-17507 for when we get this in. Maybe break it up into two: one side for small examples, and then the full doc in 10.1 or later?

