
Conversation


@Kami Kami commented Feb 13, 2019

This pull request updates the service bootstrap phase (the st2common.service_setup.common_setup function) to register every StackStorm service in the service registry.

Background and Context

Registering each service in a service registry will give us a better overview of the active and running services. This will allow us to implement things such as more dynamic (utilization- and capability-based) action execution scheduling / routing and more.

The goal was for this to also work outside of the Kubernetes context, in a traditional HA environment. It's also worth noting that Kubernetes has its own service registry concept, but that one is based on a slightly higher-level Kubernetes view of a service (the two concepts are related, but not exactly the same).

Implementation Details

Implementation utilizes group membership primitives from the OpenStack tooz library.

There were multiple reasons (as discussed before) for utilizing the tooz library:

  1. Consistency

We already use it in other places for distributed locking.

  2. Support for multiple drivers / adapters

We will probably officially support only a single backend (e.g. ZooKeeper or etcd), but people will still be able to use other tooz-supported backends at their own risk, and this will also help with testing.

Keep in mind that only the following tooz drivers support group membership primitives which we utilize: zookeeper, redis, etcd3, etcd3gw, memcached.

To be able to implement it, I needed to change the coordinator_setup and get_coordinator function signatures so we can pass a start_heart argument to them. This way the coordination service automatically starts periodic heartbeats which tell the backend that a particular service (member) is still alive.
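As a rough sketch of that flow (the backend URL and member ID below are illustrative examples, not the actual st2common code):

```python
# Illustrative sketch of the changed setup flow, assuming the tooz library is
# installed and a redis backend is running. Names are hypothetical.


def coordinator_setup(start_heart=True):
    from tooz import coordination  # requires the tooz library

    coordinator = coordination.get_coordinator(
        'redis://localhost:6379', b'localhost_21837')
    # With start_heart=True, tooz runs periodic heartbeats in the background
    # which tell the backend this member is still alive
    coordinator.start(start_heart=start_heart)
    return coordinator
```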

In addition to that, I added the tools/list_group_members.py script which allows users to list active services.

Terminology

OpenStack tooz library exposes a couple of different terms. Here I describe how they fit into our implementation.

  1. Group

Each service is part of a single group. In our case, a group represents a service type (e.g. api, actionrunner, scheduler, etc.). To view all the running services of a particular type, we just need to list all the members of the corresponding group.

  2. Member

A service is represented as a member which is part of a particular group.

  3. Capabilities

Optional metadata associated with a service (service name, listen host / port, service type, hostname, pid and other info). In the future, action runners will also utilize this to advertise their capabilities (e.g. linux, windows, etc.).
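As a sketch of how those three terms map onto tooz's group membership calls (the group ID, member ID, and function below are illustrative, not the actual implementation):

```python
# Sketch mapping group / member / capabilities onto tooz group-membership
# calls. A started coordinator from a backend that supports group membership
# (e.g. redis, zookeeper) is assumed; IDs are made-up examples.


def register_service(coordinator):
    # Capabilities: optional metadata advertised alongside the member
    capabilities = {'type': 'passive', 'hostname': 'localhost',
                    'pid': 21837, 'name': 'actionrunner'}

    group_id = b'actionrunner'      # group = one service type
    member_id = b'localhost_21837'  # member = one running service instance

    # create_group raises tooz's GroupAlreadyExist if the group exists, so a
    # real implementation would guard against that
    coordinator.create_group(group_id).get()
    coordinator.join_group(group_id, capabilities=capabilities).get()

    # Listing all running services of a type = listing members of its group
    return coordinator.get_members(group_id).get()
```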

Example Output

Example output of the tools/list_group_members.py script when all the services are running via the launchdev.sh script (to test it locally, I used the redis backend).

$ python ./tools/list_group_members.py --config-file conf/st2.dev.conf 
Available groups (13):
 - actionrunner
 - garbagecollector
 - workflow_engine
 - resultstracker
 - stream
 - timer_engine
 - auth
 - sensorcontainer
 - api
 - scheduler
 - rulesengine
 - notifier

Members in group "actionrunner" (1):
 - localhost_21837 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21837, 'name': 'actionrunner'})
Members in group "garbagecollector" (1):
 - localhost_21841 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21841, 'name': 'garbagecollector'})
Members in group "workflow_engine" (1):
 - localhost_21834 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21834, 'name': 'workflowengine'})
Members in group "resultstracker" (1):
 - localhost_21871 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21871, 'name': 'resultstracker'})
Members in group "stream" (1):
 - localhost_21930 (capabilities={'listen_host': '127.0.0.1', 'name': 'stream', 'listen_port': 9102, 'hostname': 'localhost', 'pid': 21930, 'type': 'active'})
Members in group "timer_engine" (1):
 - localhost_21853 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21853, 'name': 'timerengine'})
Members in group "auth" (1):
 - localhost_21927 (capabilities={'listen_host': '127.0.0.1', 'name': 'auth', 'listen_port': 9100, 'listen_ssl': False, 'hostname': 'localhost', 'pid': 21927, 'type': 'active'})
Members in group "sensorcontainer" (1):
 - localhost_21849 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21849, 'name': 'sensorcontainer'})
Members in group "api" (1):
 - localhost_21925 (capabilities={'listen_host': '127.0.0.1', 'name': 'api', 'listen_port': 9101, 'hostname': 'localhost', 'pid': 21925, 'type': 'active'})
Members in group "scheduler" (1):
 - localhost_21846 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21846, 'name': 'scheduler'})
Members in group "rulesengine" (1):
 - localhost_21850 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21850, 'name': 'rulesengine'})
Members in group "notifier" (1):
 - localhost_21870 (capabilities={'type': 'passive', 'hostname': 'localhost', 'pid': 21870, 'name': 'notifier'})

TODO

  • Changelog entry
  • Tests

Resolves #4015.

@Kami Kami added feature HA StackStorm in High Availability labels Feb 13, 2019
@Kami Kami added this to the 3.0.0 milestone Feb 13, 2019
@Kami Kami requested a review from m4dcoder February 13, 2019 13:24
m4dcoder and others added 2 commits February 13, 2019 14:58
Add or move the parsing of test configs to the top of affected test modules and make sure the scheduler default config options do not conflict with test configs.

@m4dcoder m4dcoder left a comment


LGTM overall. Good job @Kami. I assume unit tests are coming.

do_register_cli_opts(cli_opts)
config.parse_args()

main(group_id=cfg.CONF.group_id)
Contributor


Should we add an admin only API endpoint and a command in the st2 CLI to list services, members and capabilities?

Member Author


That's a good idea - I was thinking about it.

I will add it and also add a corresponding CLI command.

metrics_initialize()

# Register service in the service registry
if service_registry:
Contributor


Should we add a config check here? If a user specifies a backend with no service registry support, should we log a warning and start st2 without the service registry?

Contributor


Did you address this? Or is this a bad idea?

Member Author


We already log a warning when get_coordinator is called inside the register_service_in_service_registry function.

@Kami Kami changed the title [WIP] Register services in service registry during the service bootstrap phase Register services in service registry during the service bootstrap phase Feb 15, 2019

nmaludy commented Feb 19, 2019

Should the heartbeat be started for the coordinator/locking driver instances? I believe that in some implementations, locks are released if they don't receive a heartbeat.


Kami commented Mar 3, 2019

@nmaludy I think you are correct, good catch 👍

I will also make that change, although it will likely require test changes. I already have some issues with this change (tests are getting stuck even though I correctly shut down the coordinator everywhere; it could also be a bug in a test driver we use).


Kami commented Mar 11, 2019

Just a heads up - I also pushed a change so we start the heartbeat process also in places where we use the coordination backend only for locking - 98b77d1.

This is a safer default since it means it will work correctly with backends which are timeout-based.
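To illustrate why this matters for locking (the lock name and helper below are hypothetical, not from the PR):

```python
# Sketch: with TTL/timeout-based backends such as redis, a held lock expires
# unless the client keeps heartbeating, so heartbeats are now started even
# when the coordinator is only used for locking. Names are illustrative.


def run_exclusively(coordinator, work_fn):
    lock = coordinator.get_lock(b'st2-exclusive-task')
    with lock:
        # Background heartbeats keep the lock from timing out mid-work
        return work_fn()
```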


nmaludy commented Mar 12, 2019

Closes #4015


@m4dcoder m4dcoder left a comment


LGTM overall, but I have some minor things I'd like you to address.

CHANGELOG.rst Outdated
* Fix improper CORS where request from an origin not listed in ``allowed_origins`` will be responded
with ``null`` for the ``Access-Control-Allow-Origin`` header. The fix returns the first of our
allowed origins if the requesting origin is not a supported origin. Reported by Barak Tawily.
Bllowed origins if the requesting origin is not a supported origin. Reported by Barak Tawily.
Contributor


Please fix typo here.

coordinator = coordination.get_coordinator()

group_ids = list(coordinator.get_groups().get())
group_ids = [group_id_.decode('utf-8') for group_id_ in group_ids]
Contributor


Does the underscore at the end of group_id_ mean something?

Member Author


Yes, the group_id variable is already defined in the method signature and I don't want to re-define / overwrite it in the list comprehension.

Contributor


Practice wise, we don't usually put an underscore at the end of a variable. This is confusing and error prone. Can you rename group_id_ to something else? Maybe item just for this list comprehension.


class ServiceRegistry(core.Resource):
_alias = 'service-registry'
_display_name = 'service registry'
Contributor


Please capitalize first letters.

Member Author


This attribute is really just a placeholder; it's not used anywhere.

The class (model) is just needed to get the st2 service-registry foo bar command structure to work, since our (not so ideal) CLI code depends on it.

I can remove the unused attributes altogether.

Contributor


Ok, these attributes are still here. So, placeholder or not, please capitalize service registry to Service Registry so the values are consistent with other models.


return True


class NoOpAsyncResult(object):
Contributor


Can you provide a short description why we need to wrap result with this class?

Member Author


It's to correctly mimic the tooz API - most of the tooz methods return an async result (future).

I can add that as a docstring.
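A minimal sketch of such a wrapper (the actual st2common class may differ in detail):

```python
class NoOpAsyncResult(object):
    """Mimics the async result (future) object which most tooz coordinator
    methods return, so callers can uniformly call .get() even when no real
    coordination backend is configured.
    """

    def __init__(self, result=None):
        self._result = result

    def get(self, timeout=None):
        # Nothing to wait for in the no-op case; return the canned result
        return self._result
```

For example, `NoOpAsyncResult(result=True).get()` returns `True`, matching the shape of `coordinator.join_group(...).get()`.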


if not use_cache:
coordinator = coordinator_setup(start_heart=start_heart)
return coordinator
Contributor


return in one line? return coordinator_setup(start_heart=start_heart)

Member Author


That's a personal preference - I prefer return on a separate line.

Contributor


Ok. My suggestion here is because we are not doing anything with the coordinator variable, so why not just return here? It's also one line of code less. What's the reasoning for your preference?

# Include common capabilities such as hostname and process ID
proc_info = system_info.get_process_info()
capabilities['hostname'] = proc_info['hostname']
capabilities['pid'] = proc_info['pid']
Contributor


Doesn't get_member_id already include this info? Maybe it doesn't matter much, but it seems like we're calling system_info.get_process_info twice.

Member Author


Nope, that function doesn't include member info - just the hostname and pid.
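For reference, the merge in the quoted snippet can be sketched as follows - here get_process_info is a simplified stand-in for st2common's system_info.get_process_info, and build_capabilities is a hypothetical helper:

```python
import os
import socket


def get_process_info():
    # Simplified stand-in for st2common's system_info.get_process_info:
    # just the hostname and pid, no member info
    return {'hostname': socket.gethostname(), 'pid': os.getpid()}


def build_capabilities(capabilities=None):
    capabilities = dict(capabilities or {})
    # Include common capabilities such as hostname and process ID
    proc_info = get_process_info()
    capabilities['hostname'] = proc_info['hostname']
    capabilities['pid'] = proc_info['pid']
    return capabilities
```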

fi

VIRTUALENV=${VIRTUALENV_DIR:-${ST2_REPO}/virtualenv}
VIRTUALENV=$(readlink -f ${VIRTUALENV})
Contributor


I assume this is to accommodate a symlink in your dev env. Will this break our existing setups?

Member Author


It's so an absolute path is used everywhere (it just resolves a relative path to an absolute one). Everything should still work the same.
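For illustration (throwaway paths, assuming GNU readlink on Linux):

```shell
# readlink -f resolves a relative path (and any symlinks) to an absolute,
# canonical path; the directory below is a made-up example
mkdir -p /tmp/st2_demo/virtualenv
cd /tmp/st2_demo
VIRTUALENV="./virtualenv"
VIRTUALENV=$(readlink -f "${VIRTUALENV}")
echo "${VIRTUALENV}"
```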

ST2_CONF=${ST2_REPO}/conf/st2.dev.conf
fi

ST2_CONF=$(readlink -f ${ST2_CONF})
Contributor


I assume this is to accommodate a symlink in your dev env. Will this break our existing setups?



Kami commented Mar 18, 2019

@m4dcoder

Practice wise, we don't usually put an underscore at the end of a variable. This is confusing and error prone. Can you rename group_id_ to something else? Maybe item just for this list comprehension.

That's a relatively common pattern in the Python world. It's also used when a variable name would otherwise clash with a built-in such as id.
