-
-
Notifications
You must be signed in to change notification settings - Fork 782
Register services in service registry during the service bootstrap phase #4548
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
members with their capabilities.
"service_setup" function.
Add or move the parsing of test configs to the top of affected test modules and make sure the scheduler default config options do not conflict with test configs.
m4dcoder
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall. Good Job @Kami. I assume unit tests are coming.
| do_register_cli_opts(cli_opts) | ||
| config.parse_args() | ||
|
|
||
| main(group_id=cfg.CONF.group_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add an admin only API endpoint and a command in the st2 CLI to list services, members and capabilities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea - I was thinking about it.
I will add it and also add a corresponding CLI command.
| metrics_initialize() | ||
|
|
||
| # Register service in the service registry | ||
| if service_registry: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a config check here? If user specifies a backend with no service registry support, we log and warn users and start st2 w/o service registry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you address this? Or is this a bad idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already log a warning when get_coordinator is called inside register_service_in_service_registry function.
1. ``GET /v1/service_registry/groups`` - list available groups. 2. ``GET /v1/service_registry/groups/<group id>/members`` - list available / active members in a particular group. NOTE: This functionality is only accessible to RBAC admins.
|
Should the heartbeat be started for the coordinator/locking driver instances? I believe in some implementations that if locks don't get a heartbeat then they are released. |
|
@nmlaudy I think you are correct, good catch 👍 I will also make that change, it will likely require test changes though. I already have some issues with this change (tests are getting stuck, even though I correctly shut down the coordinator everywhere, could also be a bug in a test driver we use). |
use coordinator backend for locking purpose. This way it's safe and also works correctly with drivers which are timeout based.
|
Just a heads up - I also pushed a change so we start heartbeat process also in places where we use coordination backend only for locking - 98b77d1. This is a safer default since it means it will work correctly with backends which are timeout based. |
…es where we" This reverts commit 98b77d1.
|
Closes #4015 |
… in places where we"" This reverts commit f530cbe.
m4dcoder
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall but got some minor things that I like you to address.
CHANGELOG.rst
Outdated
| * Fix improper CORS where request from an origin not listed in ``allowed_origins`` will be responded | ||
| with ``null`` for the ``Access-Control-Allow-Origin`` header. The fix returns the first of our | ||
| allowed origins if the requesting origin is not a supported origin. Reported by Barak Tawily. | ||
| Bllowed origins if the requesting origin is not a supported origin. Reported by Barak Tawily. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix typo here.
| coordinator = coordination.get_coordinator() | ||
|
|
||
| group_ids = list(coordinator.get_groups().get()) | ||
| group_ids = [group_id_.decode('utf-8') for group_id_ in group_ids] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the underscore at the end of group_id_ mean something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, group_id variable is already defined in the method signature and I don't want to re-define / overwrite it in a list comprehension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Practice wise, we don't usually put an underscore at the end of a variable. This is confusing and error prone. Can you rename group_id_ to something else? Maybe item just for this list comprehension.
|
|
||
| class ServiceRegistry(core.Resource): | ||
| _alias = 'service-registry' | ||
| _display_name = 'service registry' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please capitalize first letters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This attribute is really just a place holder, it's not used anywhere.
The class (model) is just needed to get st2 service-registry foo bar command structure to work since our, not so ideal CLI code, depends on it.
I can remove unused attributes all together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok these attributes are still here. So, placeholder or not, please capitalize service registry to Service Registry. This is so the values are consistent with other models.
| metrics_initialize() | ||
|
|
||
| # Register service in the service registry | ||
| if service_registry: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you address this? Or is this a bad idea?
| return True | ||
|
|
||
|
|
||
| class NoOpAsyncResult(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a short description why we need to wrap result with this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to correctly mimic tooz API - most of the tooz methods return an async result (future).
I can add that as a docstring.
|
|
||
| if not use_cache: | ||
| coordinator = coordinator_setup(start_heart=start_heart) | ||
| return coordinator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return in one line? return coordinator_setup(start_heart=start_heart)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a personal preference - I can prefer return in a separate line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. My suggestion here is because we are not doing anything with the variable coordinator so why not just return here. This is also 1 line of code less. What's the reasoning for your preference?
| # Include common capabilities such as hostname and process ID | ||
| proc_info = system_info.get_process_info() | ||
| capabilities['hostname'] = proc_info['hostname'] | ||
| capabilities['pid'] = proc_info['pid'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't get_member_id already included this info? Maybe doesn't matter much, but seems like we're calling system_info.get_process_info twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, this function doesn't include member info. Just hostname and pid.
| fi | ||
|
|
||
| VIRTUALENV=${VIRTUALENV_DIR:-${ST2_REPO}/virtualenv} | ||
| VIRTUALENV=$(readlink -f ${VIRTUALENV}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is to accommodate symlink in your dev env. Will this break existing setup for us?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's so an absolute path is used everywhere (just resolves relative path to an absolute one). Everything should still work the same.
| ST2_CONF=${ST2_REPO}/conf/st2.dev.conf | ||
| fi | ||
|
|
||
| ST2_CONF=$(readlink -f ${ST2_CONF}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is to accommodate symlink in your dev env. Will this break existing setup for us?
| coordinator = coordination.get_coordinator() | ||
|
|
||
| group_ids = list(coordinator.get_groups().get()) | ||
| group_ids = [group_id_.decode('utf-8') for group_id_ in group_ids] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Practice wise, we don't usually put an underscore at the end of a variable. This is confusing and error prone. Can you rename group_id_ to something else? Maybe item just for this list comprehension.
|
|
||
| class ServiceRegistry(core.Resource): | ||
| _alias = 'service-registry' | ||
| _display_name = 'service registry' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok these attributes are still here. So, placeholder or not, please capitalize service registry to Service Registry. This is so the values are consistent with other models.
|
|
||
| if not use_cache: | ||
| coordinator = coordinator_setup(start_heart=start_heart) | ||
| return coordinator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. My suggestion here is because we are not doing anything with the variable coordinator so why not just return here. This is also 1 line of code less. What's the reasoning for your preference?
That's a relatively common pattern in Python world. Also when variable names would clash with built-ins such as |
This pull request updates service bootstrap phase (
st2common.service_setup.common_setupfunction) to register every StackStorm service in the service registry.Background and Context
Registering each service in a service segistry will allow us to have a better overview of the active and running services. This will allow us to implement things such as more dynamic (utilization and capability based) action execution scheduling / routing and more.
The goal was so it also works outside of the Kubernetes context in a traditional HA environment. It's also worth noting that Kubernetes has it's own service registry concept, but that's based on a slightly higher level Kubernetes view of a service (two related concepts, but not exactly the same).
Implementation Details
Implementation utilizes group membership primitives from the OpenStack tooz library.
There were multiple reasons (as discussed before) for utilizing tooz library:
We already use it in other places for distributed locking.
We will officially probably only support a single backend (e.g. ZooKeeper or etcd), but people will still be able to user other tooz supported backends at their own risk and this will also help with testing.
Keep in mind that only the following tooz drivers support group membership primitives which we utilize: zookeeper, redis, etcd3, etcd3gw, memcached.
To be able to implement it, I needed to change
coordinator_setupandget_coordinatorfunction signature so we can passstart_heartbeatargument to it. This way coordination service automatically starts periodic heartbeats which tell the backend that a particular service (member) is still alive.In addition to that, I added
tools/list_group_members.pyscript which allows user to list active services.Terminology
OpenStack tooz library exposes a couple of different terms. Here I describe how they fit into our implementation.
Each service is part of a single group. In our case, group represents a service type (e.g.
api,actionrunner,scheduler, etc.). To view all the running services of a particular type, we just need to list all the members in a particular group.A service is represented as a member which is part of a particular group.
Optional metadata associated with a service (service name, listen host / port, service type, hostname, pid and other info). In the future, action runners will also utilize this to advertise their capabilities (e.g. linux, windows, etc.).
Example Output
Example output of
tools/list_group_members.pyscript when all the services are running using launchdev.sh script (to test it locally, I used the redis backend).TODO
Resolves #4015.