Graylog is a centralized log management solution for capturing, storing and analyzing log data in real time. With the latest minor release, 4.3, Graylog announced that it will not support Elasticsearch (ES) releases beyond 7.10 because of the licensing and structural changes Elastic introduced in v7.11. That leaves 7.10 as the last supported ES version, and it already reached EOL on May 11, 2022.
Fortunately, Graylog is aware of this and recommends that users switch to OpenSearch (OS), even though the switch is not enforced yet and ES 7.10 continues to work for now [1]. Since you usually don't want to operate software that no longer receives security updates, I started looking into a migration and prepared the container and Ansible setup. My first mistake on this journey was believing I could just use the latest OpenSearch release. Had I read the documentation [2] more carefully, I would have saved myself a lot of trouble…
Anyway, the actual migration of the cluster from ES v7.10 to OS v2.1 went surprisingly smoothly. Well, almost: I had to rewrite the complete Ansible role because OS 2.x has changed almost all configuration parameters and API calls 🎉 But as you can imagine, everything exploded when I tried to start Graylog again. Dang. Simply downgrading OpenSearch was not possible either, as the cluster and all indices had already been migrated. To get back to a working state, I decided to reset the entire cluster and restore the snapshots from the S3 backup repository before making another attempt, this time with a supported OS 1.x version 🤞 At least I have already completed the ES disaster recovery test for this year.
- Read the documentation and upgrade instructions more carefully
- Make sure you have a working backup
- Test your recovery process frequently to stay calm and comfortable in case of an emergency
- Test upgrades in a staging environment whenever possible
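Part of the first lesson can be automated with a small pre-flight check before Graylog is started against a cluster. The following is only a minimal sketch: the supported ranges (ES 7.x up to 7.10, OS 1.x) are taken from my experience described above and not from an official compatibility matrix, and the function names and the `http://localhost:9200` URL are placeholders I made up for illustration.

```python
import json
from urllib.request import urlopen


def is_supported_backend(root_info: dict) -> bool:
    """Return True if the search backend looks usable for Graylog 4.3.

    `root_info` is the parsed JSON returned by `GET /` on the cluster.
    Assumption (from this post, not an official matrix):
    Elasticsearch 7.x up to 7.10 and OpenSearch 1.x are fine.
    """
    version = root_info.get("version", {})
    # OpenSearch reports "distribution": "opensearch"; ES 7.10 has no such field.
    distribution = version.get("distribution", "elasticsearch")
    parts = version.get("number", "0.0").split(".")
    major, minor = int(parts[0]), int(parts[1])

    if distribution == "opensearch":
        return major == 1
    return major == 7 and minor <= 10


def fetch_root_info(base_url: str) -> dict:
    """Fetch the cluster root document (no authentication or TLS handling)."""
    with urlopen(base_url) as response:
        return json.load(response)


if __name__ == "__main__":
    # Hypothetical endpoint; adjust to your own cluster.
    info = fetch_root_info("http://localhost:9200")
    print("supported" if is_supported_backend(info) else "NOT supported")
```

Run in a staging environment (or as an Ansible pre-task), a check like this would have flagged my OS 2.1 cluster before Graylog ever tried to connect to it.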
What annoys me a bit about the whole situation is the back and forth and the rather poor communication from Graylog in the past [3]. The situation with OpenSearch is not really better either: it is unclear whether, for example, version 1.3 is still supported, and general lifecycle information is still missing [4].