Tuning orcharhino

Introduction to Performance Tuning

This document provides guidelines for tuning orcharhino for performance and scalability.

Performance Tuning Quickstart

You can tune your orcharhino Server based on the expected number of managed hosts and the available hardware by using the built-in tuning profiles, which you apply with the installation routine's --tuning flag. For more information, see Tuning orcharhino Server with Predefined Profiles in Installing orcharhino Server.

There are five sizes provided, based on the number of managed hosts your orcharhino manages. You can find the specific tuning settings for each profile in the configuration files contained in /usr/share/foreman-installer/config/foreman.hiera/tuning/sizes.

Name                Number of managed hosts   Recommended RAM   Recommended Cores

default             0-5000                    20 GiB            4
medium              5000-10000                32 GiB            8
large               10000-20000               64 GiB            16
extra-large         20000-60000               128 GiB           32
extra-extra-large   60000+                    256 GiB+          48+

Procedure
  1. Select an installation size: default, medium, large, extra-large, or extra-extra-large. The default value is default.

  2. Run foreman-installer:

    # foreman-installer --tuning "My_Installation_Size"
  3. Optional: Tune the Ruby app server directly using the Puma Tuning section. For more information, see Puma Tunings.
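
For example, to apply the medium profile on an orcharhino Server expected to manage between 5000 and 10000 hosts, you would run:

# foreman-installer --tuning medium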

Top Performance Considerations

You can improve the performance and scalability of orcharhino by addressing the following areas:

  1. Configure httpd

  2. Configure Puma to increase concurrency

  3. Configure Candlepin

  4. Configure Pulp

  5. Configure Foreman’s performance and scalability

  6. Configure Dynflow

  7. Deploy external orcharhino Proxies instead of relying on internal orcharhino Proxies

  8. Configure katello-agent for scalability

  9. Configure Hammer to reduce API timeouts

  10. Configure qpid and qdrouterd

  11. Improve PostgreSQL to handle more concurrent loads

  12. Configure the storage for DB workloads

  13. Ensure the storage requirements for Content Views are met

  14. Ensure the system requirements are met

  15. Improve the environment for remote execution

Performance Tuning System Requirements

You can find the hardware and software requirements in Preparing your Environment for Installation in the Installing orcharhino guide.

Configuring orcharhino Environment for Performance

CPU

The more physical cores that are available to orcharhino, the higher the throughput it can achieve for its tasks. Some orcharhino components, such as Puppet and PostgreSQL, are CPU-intensive applications and benefit significantly from a higher number of available CPU cores.

Memory

The more memory that is available in the system running orcharhino, the better the response times for orcharhino operations. Because orcharhino uses PostgreSQL as its database solution, any additional memory, combined with the tunings described below, improves application response times due to increased data retention in memory.

Disk

orcharhino performs heavy IOPS due to repository synchronization, package data retrieval, and high-frequency database updates for the subscription records of content hosts. Install orcharhino on high-speed SSDs to avoid performance bottlenecks caused by increased disk reads and writes. orcharhino requires disk IO at or above 60-80 MB/s of average throughput for read operations. Anything below this value can have severe implications for the operation of orcharhino. orcharhino components such as PostgreSQL benefit from using SSDs due to their lower latency compared to HDDs.

Network

Communication between orcharhino Server and orcharhino Proxies is affected by network performance. A reliable network with minimal jitter and low latency is required for operations such as synchronization between orcharhino Server and orcharhino Proxies to run without problems; at a minimum, ensure that the network is not causing connection resets.

Server Power Management

By default, your server is likely configured to conserve power. While this is a good approach to keep maximum power consumption in check, it also lowers the performance that orcharhino can achieve. For a server running orcharhino, set the BIOS to run the system in performance mode to raise the maximum performance levels that orcharhino can reach.
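
To check whether power saving is currently limiting the CPUs, you can inspect the active CPU frequency scaling governor. This is a quick spot check, assuming the kernel exposes cpufreq information in sysfs:

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

A value of powersave indicates that the CPU frequency is scaled down to save power, whereas performance keeps the CPU at its highest available frequency.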

Benchmarking Disk Performance

We are working to update foreman-maintain so that it only warns users when its internal quick fio benchmark produces numbers below our recommended throughput, rather than requiring a whitelist parameter to continue.

We are also working on an updated benchmark script, which will likely be integrated into foreman-maintain in the future, to provide more accurate, real-world storage information.

  • You may have to temporarily reduce the RAM in order to run the IO benchmark. For example, if the system has 256 GiB of RAM, that is a lot of Pulp space to cover, so add mem=20G as a kernel option in grub (as shown in the grubby example below). This is needed because the script executes a series of fio-based IO tests against a target directory specified at execution time. The test creates a file that is double (2x) the size of the physical RAM on the system, to ensure that we are not just testing the OS-level caching of the storage.

  • Bear the above in mind when benchmarking other filesystems, if you have them (such as the PostgreSQL storage), which might have a significantly smaller capacity than the Pulp storage and might be on a different set of storage (SAN, iSCSI, and so on).

This test does not use directio and utilizes the OS and its caching, as normal operations would.
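
One way to add the temporary mem=20G kernel option is with grubby. This is only a sketch, assuming an EL system where grubby manages the boot entries; remember to remove the option again after the benchmark:

# grubby --update-kernel=ALL --args="mem=20G"
# reboot

After the benchmark, remove the option and reboot once more:

# grubby --update-kernel=ALL --remove-args="mem"
# reboot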

You can find our first version of the script as storage-benchmark. To execute it, download the script to your orcharhino, make it executable with chmod +x, and run:

# ./storage-benchmark /var/lib/pulp

As noted in the README block of the script, you generally want to see an average of 100 MB/s or higher in the tests below:

  • Local SSD-based storage should see values of 600 MB/s or higher.

  • Spinning disks should see values in the range of 100-200 MB/s or higher.

If you see values below this, please open a support ticket for assistance.
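
If you only want a rough manual spot check instead of running the full script, a single sequential read test with fio approximates what the script measures. This is a sketch, not the script itself; the 64G file size is an illustrative value chosen to exceed twice the RAM of a 20 GiB system:

# fio --name=seq-read --directory=/var/lib/pulp --rw=read --bs=1M --size=64G --ioengine=libaio --direct=0

Compare the reported read bandwidth against the thresholds listed above.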

Configuring orcharhino for Performance

orcharhino comes with a number of components that communicate with each other. You can tune these components independently of each other to achieve the maximum possible performance for your scenario.

Enabling Tuned Profiles

CentOS 7 enables the tuned daemon by default during installation. On bare metal, ATIX AG recommends running the throughput-performance tuned profile on orcharhino Server and orcharhino Proxies. On virtual machines, ATIX AG recommends running the virtual-guest profile.

Procedure
  1. Check if tuned is running:

    # systemctl status tuned
  2. If tuned is not running, enable it:

    # systemctl enable --now tuned
  3. Optional: View a list of available tuned profiles:

    # tuned-adm list
  4. Enable a tuned profile depending on your scenario:

    # tuned-adm profile "My_Tuned_Profile"

Disabling Transparent Huge Pages

Transparent Huge Pages is a memory management technique used by the Linux kernel that reduces the overhead of the Translation Lookaside Buffer (TLB) by using larger memory pages. Because databases have sparse rather than contiguous memory access patterns, database workloads often perform poorly when Transparent Huge Pages is enabled. To improve the performance of PostgreSQL, disable Transparent Huge Pages. In deployments where the PostgreSQL database runs on a separate server, there may be a small benefit to using Transparent Huge Pages on the orcharhino Server only.
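
One way to disable Transparent Huge Pages persistently is to create a small custom tuned profile that extends the profile you already run. This is a sketch, assuming the throughput-performance profile is active; the profile name orcharhino-no-thp is illustrative:

# mkdir /etc/tuned/orcharhino-no-thp
# cat > /etc/tuned/orcharhino-no-thp/tuned.conf <<EOF
[main]
include=throughput-performance

[vm]
transparent_hugepages=never
EOF
# tuned-adm profile orcharhino-no-thp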

Puma Tunings

Puma is the Ruby application server that serves Foreman-related requests to clients. For any orcharhino configuration that is expected to handle a large number of clients or frequent operations, it is important that Puma is tuned appropriately.

Puma Threads

Fewer minimum Puma threads lead to higher memory usage on your orcharhino Server.

For example, we have compared these two setups:

orcharhino VM with 8 CPUs, 40 GiB RAM           orcharhino VM with 8 CPUs, 40 GiB RAM

--foreman-foreman-service-puma-threads-min=0    --foreman-foreman-service-puma-threads-min=16
--foreman-foreman-service-puma-threads-max=16   --foreman-foreman-service-puma-threads-max=16
--foreman-foreman-service-puma-workers=2        --foreman-foreman-service-puma-workers=2

Setting the minimum Puma threads to 16 results in about 12% less memory usage compared to a minimum of 0.

Puma Workers and Threads Auto-Tuning

If you do not provide any Puma workers and thread values with foreman-installer or they are not present in your orcharhino configuration, the foreman-installer configures a balanced number of workers. It follows this formula:

min(CPU*1.5, RAM_IN_GB - 1.5)

In some cases this can be too much with regard to memory: there have been cases where too many workers triggered OOM conditions on orcharhino.

The default should be fine for most cases, but some usage patterns require tuning, either to limit the amount of resources dedicated to Puma (so that other orcharhino components can use them) or for other reasons. Each Puma worker consumes around 1 GiB of RAM.

View your current orcharhino Server settings:

# cat /etc/systemd/system/foreman.service.d/installer.conf

View the currently active Puma workers:

# systemctl status foreman.service

Puma Workers and Threads Recommendations

To recommend thread and worker configurations for the different tuning profiles, we conducted Puma tuning tests on orcharhino with different tuning profiles. The main test performed was concurrent registration, run with the combinations below along with different worker and thread counts. Our recommendation is based purely on concurrent registration performance, so it might not reflect your exact use case. For example, if your setup is strongly content oriented, with many publishes and promotes, you might want to limit the resources consumed by Puma in favor of Pulp and PostgreSQL.

Name                Number of managed hosts   RAM        Cores   Recommended Puma Threads (min and max)   Recommended Puma Workers

default             0-5000                    20 GiB     4       16                                       4-6
medium              5000-10000                32 GiB     8       16                                       8-12
large               10000-20000               64 GiB     16      16                                       12-18
extra-large         20000-60000               128 GiB    32      16                                       16-24
extra-extra-large   60000+                    256 GiB+   48+     16                                       20-26
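
To apply a recommendation from the table, pass the corresponding options to foreman-installer. For example, for the medium profile, with 10 workers chosen from the recommended 8-12 range:

# foreman-installer \
  --foreman-foreman-service-puma-workers=10 \
  --foreman-foreman-service-puma-threads-min=16 \
  --foreman-foreman-service-puma-threads-max=16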

Configuring Puma Workers

More workers reduce the time needed to register hosts in parallel. For example, we have compared these two setups:

orcharhino VM with 8 CPUs, 40 GiB RAM            orcharhino VM with 8 CPUs, 40 GiB RAM

--foreman-foreman-service-puma-threads-min=16    --foreman-foreman-service-puma-threads-min=8
--foreman-foreman-service-puma-threads-max=16    --foreman-foreman-service-puma-threads-max=8
--foreman-foreman-service-puma-workers=2         --foreman-foreman-service-puma-workers=4

Using more workers with the same total number of threads results in about an 11% speedup in a highly concurrent registration scenario. Moreover, adding more workers yielded this extra performance without consuming more CPU and RAM.

Configuring Puma DB Pool

The effective value of $db_pool is automatically set to the maximum of $foreman::db_pool and $foreman::foreman_service_puma_threads_max. Both have a default value of 5, so any increase of the maximum number of threads above 5 automatically increases the database connection pool by the same amount.

Tuning the number of workers is the more important aspect here; in some cases we have seen up to a 52% performance increase. Although the installer uses 5 min/max threads by default, we recommend 16 threads with all the tuning profiles in the table above. That is because we have seen up to a 23% performance increase with 16 threads (compared to 14% with 8 threads and 10% with 32 threads) relative to a setup with 4 threads.

To obtain these numbers, we used the concurrent registration test case, which is a very specific use case. Results can differ on your orcharhino, which might have a more balanced use case (not only registrations). Keeping the default 5 min/max threads is a good choice as well.

These are some of the measurements that led us to these recommendations:

              4 workers, 4 threads   4 workers, 8 threads   4 workers, 16 threads   4 workers, 32 threads

Improvement   0%                     14%                    23%                     10%

Use 4-6 workers on a default setup (4 CPUs). We have seen about 25% higher performance with 5 workers compared to 2 workers, but 8% lower performance with 8 workers compared to 2 workers; see the table below:

              2 workers, 16 threads   4 workers, 16 threads   6 workers, 16 threads   8 workers, 16 threads

Improvement   0%                      26%                     22%                     -8%

Use 8-12 workers on a medium setup (8 CPUs); see the table below:

              2 workers, 16 threads   4 workers, 16 threads   8 workers, 16 threads   12 workers, 16 threads   16 workers, 16 threads

Improvement   0%                      51%                     52%                     52%                      42%

Use 16-24 workers on a 32-CPU setup. This was tested on a machine with 90 GiB of RAM, and memory turned out to be a limiting factor as the system started swapping; a proper extra-large setup should have 128 GiB. A higher number of workers was problematic at the higher registration concurrency levels we tested, so we cannot recommend it.

              4 workers, 16 threads   8 workers, 16 threads   16 workers, 16 threads   24 workers, 16 threads   32 workers, 16 threads   48 workers, 16 threads

Improvement   0%                      37%                     44%                      52%                      too many failures        too many failures

Apache HTTPD Performance Tuning

Apache httpd forms a core part of orcharhino and acts as the web server handling requests made through the orcharhino management UI or the exposed APIs. To increase the concurrency of operations, httpd is the first point where tuning can help boost the performance of your orcharhino.

Configuring how many processes can be launched by Apache httpd

By default, httpd uses the prefork request-handling mechanism. With the prefork model, httpd launches a new process to handle each incoming client connection.

When the number of requests exceeds the maximum number of child processes that can be launched to handle the incoming connections, httpd raises an HTTP 503 Service Unavailable error. When httpd runs out of processes, incoming connections can also lead to multiple component failures on the orcharhino side, because components such as Pulp depend on the availability of httpd processes.

You can adapt the configuration of HTTPD prefork to handle more concurrent requests based on your expected peak load.

The following example modifies the prefork configuration for a server that needs to handle 150 concurrent content host registrations to orcharhino. It uses the custom-hiera.yaml file, which modifies the configuration file /etc/httpd/conf.modules.d/prefork.conf:

You can modify /etc/foreman-installer/custom-hiera.yaml:

apache::mod::prefork::serverlimit: 582
apache::mod::prefork::maxclients: 582
apache::mod::prefork::startservers: 10
  • Set the ServerLimit parameter to raise MaxClients value.

    For more information, see ServerLimit Directive in the httpd documentation.

  • Set the MaxClients parameter to limit the maximum number of child processes that httpd can launch to handle incoming requests.

    For more information, see MaxRequestWorkers Directive in the httpd documentation.
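
Because custom-hiera.yaml is consumed by the installer, re-run it afterwards so that the new prefork settings are written to /etc/httpd/conf.modules.d/prefork.conf and httpd is reconfigured:

# foreman-installer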

Configuring the Open Files Limit for Apache HTTPD

With this tuning in place, Apache httpd can easily open a large number of file descriptors on the server, which may exceed the default limit of most Linux systems. To avoid issues caused by exceeding the maximum open files limit, create the following file and directory, and set the contents of the file as shown in the example below:

Procedure
  1. Set the maximum open files limit in /etc/systemd/system/httpd.service.d/limits.conf:

    [Service]
    LimitNOFILE=640000
  2. Reload systemd and restart httpd to apply the new limit:

    # systemctl daemon-reload
    # systemctl restart httpd
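
To confirm that the new limit is applied to the running service, you can query systemd. This is an optional check, not part of the original procedure:

# systemctl show httpd.service --property=LimitNOFILE
LimitNOFILE=640000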

qdrouterd and qpid Tuning

Calculating the maximum open files limit for qdrouterd

In deployments using katello-agent infrastructure with a large number of content hosts, it may be necessary to increase the maximum open files for qdrouterd.

Calculate the limit for open files in qdrouterd using this formula: (N x 3) + 100, where N is the number of content hosts. Each content host may consume up to three file descriptors in the router, and 100 file descriptors are required to run the router itself.
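
For example, for 10,000 content hosts the limit must be at least (10,000 x 3) + 100 = 30,100 open files.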

The following settings permit orcharhino to scale up to 10,000 content hosts.

Procedure
  1. Set the maximum open files limit in /etc/foreman-installer/custom-hiera.yaml:

    qpid::router::open_file_limit: "My_Value"

    The default value is 150100.

  2. Apply your changes:

    # foreman-installer

Calculating the maximum open files limit for qpidd

In deployments using katello-agent infrastructure with a large number of content hosts, it may be necessary to increase the maximum open files for qpidd.

Calculate the limit for open files in qpidd using this formula: (N x 4) + 500, where N is the number of content hosts. A single content host can consume up to four file descriptors and 500 file descriptors are required for the operations of Broker (a component of qpidd).
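
For example, for 10,000 content hosts the limit must be at least (10,000 x 4) + 500 = 40,500 open files.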

Procedure
  1. Set the maximum open files limit in /etc/foreman-installer/custom-hiera.yaml:

    qpid::open_file_limit: "My_Value"

    The default value is 65536.

  2. Apply the changes:

    # foreman-installer

Configuring the Maximum Asynchronous Input-Output Requests

In deployments using katello-agent infrastructure with a large number of content hosts, it may be necessary to increase the maximum allowable concurrent AIO requests. You can increase the maximum number of allowed concurrent AIO requests by increasing the kernel parameter fs.aio-max-nr.

Procedure
  1. Set the value of fs.aio-max-nr to the desired maximum in a file in /etc/sysctl.d:

    fs.aio-max-nr=My_Maximum_Concurrent_AIO_Requests

    Ensure this number is bigger than 33 multiplied by the maximum number of content hosts you plan to register to orcharhino.

  2. Apply the changes:

    # sysctl -p
  3. Optional: Reboot your orcharhino Server to ensure that this change is applied.
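
As a worked example for step 1: for 30,000 content hosts, fs.aio-max-nr must be greater than 30,000 x 33 = 990,000, so a rounded-up setting (the value is illustrative) could be:

fs.aio-max-nr=1000000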

Storage Considerations

Ensure that you provide enough storage space for /var/lib/qpidd in advance when you are planning an installation that will use katello-agent extensively. On orcharhino Server, /var/lib/qpidd requires 2 MiB of disk space per content host.

Configuring the QPID mgmt-pub-interval Parameter

On CentOS 7, you might see the following error in the journal (use the journalctl command to access it):

orcharhino.example.com qpidd[92464]: [Broker] error Channel exception: not-attached: Channel 2 is not attached(/builddir/build/BUILD/qpid-cpp-0.30/src/qpid/amqp_0_10/SessionHandler.cpp: 39
orcharhino.example.com qpidd[92464]: [Protocol] error Connectionqpid.10.1.10.1:5671-10.1.10.1:53790 timed out: closing

This error message appears because qpid maintains management objects for queues, sessions, and connections and recycles them every ten seconds by default. The same object with the same ID is created, deleted, and created again. The old management object is not yet purged, which is why qpid throws this error.

Procedure
  1. Set the mgmt-pub-interval parameter in /etc/foreman-installer/custom-hiera.yaml:

    qpid::mgmt_pub_interval: 5
  2. Apply your changes:

    # foreman-installer

Dynflow Tuning

Dynflow is the workflow management system and task orchestrator. It is an orcharhino plugin used to execute orcharhino tasks in an out-of-order execution manner. When many clients are checking in on orcharhino and a number of tasks are running, Dynflow benefits from additional tuning that specifies how many executors it can launch.

For more information about the tunings involved related to Dynflow, see https://orcharhino.example.com/foreman_tasks/sidekiq.

Increase sidekiq workers

orcharhino contains a Dynflow service called dynflow-sidekiq that performs tasks scheduled by Dynflow. Sidekiq workers can be grouped into various queues to ensure that a large number of tasks of one type does not block the execution of tasks of another type.

ATIX AG recommends increasing the number of Sidekiq workers to scale the Foreman tasking system for bulk concurrent tasks, for example multiple Content View publications and promotions, content synchronizations, and synchronizations to orcharhino Proxies. There are two options available:

  • You can increase the number of threads used by a Dynflow worker (the worker's concurrency). This has limited impact for values larger than five due to the Ruby implementation of thread concurrency.

  • You can increase the number of workers, which is recommended.

Procedure
  1. Increase the number of workers from one to three while keeping five threads (concurrency) per worker:

    # foreman-installer --foreman-dynflow-worker-instances 3    # optionally, add --foreman-dynflow-worker-concurrency 5
  2. Optional: Check if there are three worker services:

    # systemctl -a | grep dynflow-sidekiq@worker-[0-9]
    dynflow-sidekiq@worker-1.service        loaded    active   running   Foreman jobs daemon - worker-1 on sidekiq
    dynflow-sidekiq@worker-2.service        loaded    active   running   Foreman jobs daemon - worker-2 on sidekiq
    dynflow-sidekiq@worker-3.service        loaded    active   running   Foreman jobs daemon - worker-3 on sidekiq

PostgreSQL Tuning

PostgreSQL is the primary SQL-based database used by orcharhino to store persistent context across the wide variety of tasks that orcharhino performs. The database is used extensively to provide orcharhino with the data it needs to function smoothly, which makes PostgreSQL a heavily used process that, when tuned, can benefit the overall operational response of orcharhino.

You can apply a set of tunings to PostgreSQL to improve its response times, which will modify the postgresql.conf file.

Procedure
  1. Append the following to /etc/foreman-installer/custom-hiera.yaml to tune PostgreSQL:

    postgresql::server::config_entries:
      max_connections: 1000
      shared_buffers: 2GB
      work_mem: 8MB
      autovacuum_vacuum_cost_limit: 2000

    You can use this to effectively tune down your orcharhino instance irrespective of a tuning profile.

  2. Rerun the foreman-installer.

    # foreman-installer

In the above tuning configuration, we have altered the following keys:

  • max_connections: Defines the maximum number of connections that the running PostgreSQL processes can accept.

  • shared_buffers: Defines the memory used by all active connections inside PostgreSQL to store data for the different database operations. The optimal value varies from 2 GiB up to a maximum of 25% of your total system memory, depending on the frequency of the operations being conducted on orcharhino.

  • work_mem: The memory allocated per PostgreSQL process, used to store the intermediate results of the operations performed by that process. Setting this value to 8 MB should be more than enough for most intensive operations on orcharhino.

  • autovacuum_vacuum_cost_limit: Defines the cost limit for the vacuuming operation inside the autovacuum process, which cleans up dead tuples in database relations. The cost limit defines the number of tuples that can be processed in a single run by the process. ATIX AG recommends setting the value to 2000, as is done for the medium, large, extra-large, and extra-extra-large profiles, based on the general load that orcharhino puts on the PostgreSQL server process.
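
After re-running the installer, you can confirm that the values are active. This is a quick check, assuming PostgreSQL runs locally and is accessible to the postgres system user:

# su - postgres -c "psql -c 'SHOW max_connections;'"
# su - postgres -c "psql -c 'SHOW shared_buffers;'"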

Benchmarking raw DB performance

You can use the pgbench utility to measure raw PostgreSQL performance on your system. Note that you may need to resize the PostgreSQL data directory /var/opt/rh/rh-postgresql12/lib/pgsql/ to 100 GiB, or to whatever size the benchmark requires. Install pgbench with yum install postgresql-contrib. A minimal example run is shown after the notes below.

The choice of filesystem for the PostgreSQL data directory might matter as well.

  • Never test on a production system, and never test without a valid backup.

  • Before you start testing, check how big the database files are. Testing with a very small database would not produce meaningful results. For example, if the database is only 20 GiB and the buffer pool is 32 GiB, it will not reveal problems with a large number of connections because the data will be completely buffered.
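
A minimal pgbench run could look like the following. This is a sketch that uses a throwaway database so production data stays untouched; the database name and the scale factor are illustrative, and you should size the scale factor so that the generated data is large enough, as noted above:

# su - postgres -c "createdb pgbench_test"
# su - postgres -c "pgbench -i -s 5000 pgbench_test"          # initialize roughly 75-80 GiB of test data
# su - postgres -c "pgbench -c 50 -j 4 -T 300 pgbench_test"   # 50 clients, 4 worker threads, 5 minutes
# su - postgres -c "dropdb pgbench_test"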

orcharhino Proxy Configuration Tuning

orcharhino Proxies are meant to offload part of the orcharhino load and to provide access to different networks when distributing content to clients, but they can also be used to execute remote execution jobs. What they cannot help with is anything that makes extensive use of the orcharhino API, such as host registration or package profile updates.

orcharhino Proxy Performance Tests

We have measured multiple test cases on multiple orcharhino Proxy configurations:

orcharhino Proxy HW configuration   CPUs   RAM

minimal                             4      12 GiB
large                               8      24 GiB
extra large                         16     46 GiB

Content delivery use case

In a download test where we concurrently downloaded a 40 MB repository of 2000 packages on 100, 200, ..., 1000 hosts, we saw roughly a 50% improvement in average download duration every time we doubled the orcharhino Proxy resources. For more precise numbers, see the comparison below.

Average improvement in download duration, per comparison (the examples in parentheses are for 700 concurrent downloading hosts):

  • Minimal (4 CPU, 12 GiB RAM) → Large (8 CPU, 24 GiB RAM): ~50% (on average 9 seconds vs. 4.4 seconds per package)

  • Large (8 CPU, 24 GiB RAM) → Extra Large (16 CPU, 46 GiB RAM): ~40% (on average 4.4 seconds vs. 2.5 seconds per package)

  • Minimal (4 CPU, 12 GiB RAM) → Extra Large (16 CPU, 46 GiB RAM): ~70% (on average 9 seconds vs. 2.5 seconds per package)

When we compared download performance from orcharhino Server with download performance from an orcharhino Proxy, we saw only about a 5% speedup. That is expected, because an orcharhino Proxy's main benefit is getting content closer to geographically distributed clients (or clients in different networks) and handling part of the load that orcharhino Server would otherwise have to handle itself. In a smaller hardware configuration (8 CPUs and 24 GiB), orcharhino Server was not able to handle downloads from more than 500 concurrent clients, while an orcharhino Proxy with the same hardware configuration was able to serve more than 1000.

Frequent registrations use case

For concurrent registrations, the bottleneck is CPU speed, but all configurations were able to handle even high concurrency without swapping. The hardware resources used for the orcharhino Proxy have only a minimal impact on registration performance. For example, an orcharhino Proxy with 16 CPUs and 46 GiB RAM showed at most a 9% improvement in registration speed compared to an orcharhino Proxy with 4 CPUs and 12 GiB RAM.

Remote execution use case

We have tested executing remote execution jobs via both the SSH and Ansible backends on 500, 2000, and 4000 hosts. All configurations were able to handle all of the tests without errors, except for the smallest configuration (4 CPUs and 12 GiB memory), which failed to finish on all 4000 hosts.

Content synchronization use case

In a sync test where we synced CentOS 6, 7, 8 BaseOS, and 8 AppStream repositories, we did not see significant differences between orcharhino Proxy configurations. Results will differ when syncing a higher number of Content Views in parallel.

The text and illustrations on this page are licensed by ATIX AG under a Creative Commons Attribution–Share Alike 3.0 Unported ("CC-BY-SA") license. This page also contains text from the official Foreman documentation which uses the same license ("CC-BY-SA").