activator for 0->1 case by akyyy · Pull Request #341 · knative/serving

akyyy · 2018-03-09T22:22:37Z

yanweiguo · 2018-03-09T22:41:43Z

+kind: Pod
+metadata:
+  name: ela-activator
+  # TODO: use namespace: ela-system for activator


Just curious: why can't we use namespace ela-system right now?

There are two reasons:

the ingress object we created has to be in the same namespace as the revision.

the ingress object which defines the backend service (activator-service in this case) has to be in the same namespace as the backend service.
Cross-namespace Ingress kubernetes/kubernetes#17088

It seems the activator is similar to autoscaler on this. #324

I thought in Istio route rules you specified both a service name and its namespace... I may be misremembering.

https://github.com/elafros/elafros/blob/4865b17e86a07bdafa886316b121163da179222d/pkg/controller/route/ela_ingress.go#L62
The ingress namespace needs to be the same as the revision, which is default.

I saw your link for Istio service namespace. I'll try it in the next change. Thanks!

mattmoor · 2018-03-10T04:53:45Z

@@ -0,0 +1,159 @@
+package main


split main package into //cmd/activator/

mattmoor · 2018-03-10T04:54:49Z

+
+// A layer on top of http.DefaultTransport, with retries.
+// Forked from https://github.com/fission/fission/blob/746c51901da590cff09317dbe59aa19241211812/router/functionHandler.go#L53
+type RetryingRoundTripper struct {


put this into its own package?

I put this in its own file.

mattmoor · 2018-03-10T04:55:56Z

 			istiov1alpha2.DestinationWeight{
 				Destination: istiov1alpha2.IstioService{
-					Name: route.Service,
+					Name: "activator-service.default.svc.cluster.local",


This probably shouldn't hardcode "default"

mattmoor · 2018-03-10T04:56:29Z

+	}
+	kc, err := kubernetes.NewForConfig(clusterConfig)
+	if err != nil {
+		panic(err)


glog.Fatalf

mattmoor · 2018-03-10T04:58:41Z

+kind: Pod
+metadata:
+  name: ela-activator
+  # TODO: use namespace: ela-system for activator


I thought in Istio route rules you specified both a service name and its namespace... I may be misremembering.

mattmoor · 2018-03-10T04:59:02Z

+			return resp, nil
+		}
+		timeout *= time.Duration(2)
+		log.Printf("Retrying request to %v in %v", req.URL.Host, timeout)


mattmoor · 2018-03-10T05:00:27Z

+	log.Println("Sending a proxy request to", targetURL)
+	target, err := url.Parse(targetURL)
+	if err != nil {
+		panic(err.Error())


glog.Fatalf, although I'm worried about the implications of that.

mattmoor · 2018-03-10T05:04:20Z

+			destinationWeights = append(destinationWeights,
+				istiov1alpha2.DestinationWeight{
+					Destination: istiov1alpha2.IstioService{
+						Name: route.Service,


https://istio.io/docs/reference/config/istio.routing.v1alpha1.html#IstioService

I see a "namespace" here?

ah, thanks for the pointer, that may help. I'll try it in the next change.

mattmoor

Just to manage my own expectations: this is WIP, and some of the TODOs will be done before this goes in, right?

Is so, let's change the title to [WIP] title so it's clear to reviewers, and remove that when we think it's complete.

Does that SGTY?

mattmoor · 2018-03-12T18:34:37Z

+
+	a, err := activator.NewActivator(kubeClient, elaClient)
+	if err != nil {
+		glog.Fatal(err)


Add a message with context?

mattmoor · 2018-03-12T18:35:12Z

+    srcs = [
+        "activator.go",
+        "retryingtripper.go",
+        ],


nit: buildify your BUILD files.

mattmoor · 2018-03-12T18:35:57Z

-	}
-	// finally, one more retry with the default timeout
-	return http.DefaultTransport.RoundTrip(req)
+// NewActivator returns an activator.


nit: capitalize Activator

mattmoor · 2018-03-12T18:38:28Z

 	target, err := url.Parse(targetURL)
 	if err != nil {
-		panic(err.Error())
+		glog.Fatal(err.Error())


Can we avoid glog.Fatalf and panic in libraries? I'd expect this to propagate the error and for the caller to handle it appropriately.

Especially since Activator is a shared component, I worry about one tenant being able to DoS other tenants by crash looping the activator pods.

mattmoor · 2018-03-12T18:39:38Z

+	revisionClient := a.elaClient.ElafrosV1alpha1().Revisions("default")
 	// TODO: determine the revision to be activated.
-	revision, err := revisionClient.Get("p-d22006ac-ef13-45cb-ae6a-069dfe2c81a5", metav1.GetOptions{})
+	revision, err := revisionClient.Get("p-c106f1fa-082e-41ac-a506-a9c1016dec77", metav1.GetOptions{})


um.... LOL. I didn't notice this before.

mattmoor · 2018-03-12T18:49:02Z

+// A layer on top of http.DefaultTransport, with retries.
+// Forked from https://github.com/fission/fission/blob/746c51901da590cff09317dbe59aa19241211812/router/functionHandler.go#L53
+type RetryingRoundTripper struct {
+	maxRetries     int


mattmoor · 2018-03-12T18:50:28Z

+
+func (rrt RetryingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
+	timeout := rrt.initialTimeout
+	transport := http.DefaultTransport.(*http.Transport)


Please construct the above with an inner transport vs. assuming DefaultTransport.

Do you mean add a field of transport in struct RetryingRoundTripper?

I think RetryingRoundTripper should have an inner http.RoundTripper

mattmoor · 2018-03-12T18:54:40Z

+		transport.DialContext = (&net.Dialer{
+			Timeout:   timeout,
+			KeepAlive: 30 * time.Second,
+		}).DialContext


Whoa, tampering with the settings of the DefaultTransport feels super sketchy.

I also don't like that we can't operate on an http.RoundTripper and need an *http.Transport.

Is there a reason that the timeout manipulation needs to involve DialContext?

It seems timeout is a field in transport.DialContext?

mattmoor · 2018-03-12T18:55:25Z

+		if err == nil {
+			return resp, nil
+		}
+		timeout *= time.Duration(2)


I think this is doubling the timeout each iteration, but I find the fact that it's going through time.Duration(2) to make this harder to read than simply writing: timeout = timeout + timeout. WDYT?

mattmoor · 2018-03-12T18:56:02Z

+		}
+		timeout *= time.Duration(2)
+		glog.Info("Retrying request to %v in %v", req.URL.Host, timeout)
+		time.Sleep(timeout)


This is actually how I would have expected the timeout logic to creep up over time. See my question above about DialContext.

mattmoor · 2018-03-14T20:54:11Z

+    image: ela-activator:latest
+    ports:
+    - name: http
+      containerPort: 8080


nit: newline

If this single Pod fail or is evicted, which happens regularly (e.g. cluster resizing), then it will not be restarted. Instead of creating a single pod, you should create a Deployment scaled to 1 which ensure the correct number of Pods remains running.

mattmoor · 2018-03-14T20:55:53Z

 	a, err := activator.NewActivator(kubeClient, elaClient)
 	if err != nil {
-		glog.Fatal(err)
+		glog.Fatalf("Failed to create an activator: %s", err.Error())


Drop the .Error() and switch %s to %v

mattmoor · 2018-03-14T20:59:56Z

+	glog.Info("Sending a proxy request to ", targetURL)
+	target, err := url.Parse(targetURL)
+	if err != nil {
+		glog.Errorf("Failed to parse target URL: %s. Error: %s", targetURL, err.Error())


Again, you don't need .Error(), use %v

mattmoor · 2018-03-14T21:33:37Z

 				Destination: istiov1alpha2.IstioService{
-					// TODO: update activator-service namespace to be ela-system.
-					Name: "activator-service.default.svc.cluster.local",
+					Name: controller.GetElaK8SActivatorServiceName(),


placeHolderK8SServiceName?

mattmoor · 2018-03-14T21:35:31Z

 	// Construct a hostname that the ingress accepts traffic for.
 	hostRule := u.Spec.DomainSuffix

+	ns := namespace


just overwrite namespace?

mattmoor · 2018-03-14T21:50:17Z

-		if err != nil {
-			glog.Info("Failed to update the revision: ", revision)
-		}
+		_, _ = a.retry(updateRevisionMaxRetries, 50*time.Millisecond, revision, a.updateRevision)


Don't ignore errors here.

mattmoor · 2018-03-14T21:51:28Z

 	case v1alpha1.RevisionServingStateActive:
 		// The revision is already active. Forward the request to k8s deployment.
-		serviceURL, err := a.getRevisionTargetURL(revision)
+		serviceURL, err := a.retry(getTargetURLMaxRetries, 50*time.Millisecond, revision, a.getRevisionTargetURL)


This should be done via Watch.

mattmoor · 2018-03-14T22:01:58Z

+	glog.Info("End proxy request")
+}
+
+func (a *Activator) retry(maxRetries int, sleep time.Duration, revision *v1alpha1.Revision, runner func(*v1alpha1.Revision) (string, error)) (string, error) {


Can we drop the retry logic from the first cut of this, and add it back once the base design is in place?

It is my expectation that the core of this will make use of Watch and shouldn't need retries. Right now it's using retry to poll IIUC, and that is more expensive and in my experience slower.

+1. The retries were originally part of making this work with Fission and maybe they aren't needed. Let's test it first.

mattmoor · 2018-03-14T22:02:35Z

 	revisionClient := a.elaClient.ElafrosV1alpha1().Revisions("default")
 	// TODO: determine the revision to be activated.
-	revision, err := revisionClient.Get("p-c106f1fa-082e-41ac-a506-a9c1016dec77", metav1.GetOptions{})
+	revision, err := revisionClient.Get("p-dd33338c-7389-4abe-8020-774c2d085d05", metav1.GetOptions{})


What is our plan for obtaining the revision ID?

mattmoor · 2018-03-14T22:03:23Z

 	default:
-		panic("unrecognized revision serving status: " + revision.Spec.ServingState)
+		glog.Fatal("unrecognized revision serving status: " + revision.Spec.ServingState)
 	}


This isn't exactly what I meant, but I'd like to see us drop the retry and get a solid core before they come back.

josephburnett · 2018-03-15T15:38:39Z

+    image: ela-activator:latest
+    ports:
+    - name: http
+      containerPort: 8080


If this single Pod fail or is evicted, which happens regularly (e.g. cluster resizing), then it will not be restarted. Instead of creating a single pod, you should create a Deployment scaled to 1 which ensure the correct number of Pods remains running.

josephburnett · 2018-03-15T15:41:15Z

@@ -0,0 +1 @@
+// TODO


You can always provide a link to the //pkg/autoscaler/README.md which explains the role of the activator. Would be better to have all the docs in one place anyway.

Added a link.

josephburnett · 2018-03-15T15:43:28Z

+	glog.Info("End proxy request")
+}
+
+func (a *Activator) retry(maxRetries int, sleep time.Duration, revision *v1alpha1.Revision, runner func(*v1alpha1.Revision) (string, error)) (string, error) {


+1. The retries were originally part of making this work with Fission and maybe they aren't needed. Let's test it first.

josephburnett · 2018-03-15T15:45:45Z

+func (a *Activator) getRevisionTargetURL(revision *v1alpha1.Revision) (string, error) {
+	services := a.kubeClient.CoreV1().Services(revision.GetNamespace())
+	// k8s service name
+	svc, err := services.Get(revision.GetName()+"-service", metav1.GetOptions{})


You can use GetElaK8SServiceNameForRevision instead of hardcoding "-service": https://github.com/elafros/elafros/blob/41acb4c66c1457b23472d4b65b1ca9eac1f1df40/pkg/controller/names.go#L53

josephburnett · 2018-03-15T15:46:25Z

+func (a *Activator) handler(w http.ResponseWriter, r *http.Request) {
+	revisionClient := a.elaClient.ElafrosV1alpha1().Revisions("default")
+	// TODO: determine the revision to be activated.
+	revision, err := revisionClient.Get("p-dd33338c-7389-4abe-8020-774c2d085d05", metav1.GetOptions{})


Work in progress?

josephburnett · 2018-03-15T15:48:00Z

+		proxyRequest(w, r, serviceURL)
+	case v1alpha1.RevisionServingStateReserve:
+		// The revision is inactive. Enqueue the request and activate the revision
+		// TODO: check if the route traffic (route.Spec.Traffic) contains the revision


Maybe you don't need to do this. The Route Controller (or someone) will make sure the Revision states are updated. You can just rely on the Revision states.

Removed TODO.

josephburnett · 2018-03-15T15:53:05Z

+	glog.Info("Found revision ", revision.GetName())
+	glog.Info("Start to proxy request...")
+	switch revision.Spec.ServingState {
+	case v1alpha1.RevisionServingStateActive:


The Revision may be Active, but the Service may not have any endpoints in it. There is a delay between when the Revision.Spec is changed to Active and when the controller creates the resources and they all come online.

So I would expect that requests which arrive immediately after the first one will take this code branch, but get a 5xx error instead of reaching a Pod through the Service.

I suggest you don't worry about that right now. Just get this working. But create a separate issue to close the gap. You can either 1) retry on 5xx requests to the Service or 2) proxy all requests directly to Pods you find through labels (my preference for simplicity).

I'd recommend Watching Revisions and the Revision's Status should have a Condition reflecting the readiness of the K8s Service.

Tracked in #552

Yeah, the Spec here is not what we want to key off and Revision status should indeed be what we use to determine the actual status of k8s service.

josephburnett · 2018-03-15T15:55:20Z

+		// TODO: check if the route traffic (route.Spec.Traffic) contains the revision
+		glog.Info("the revision is inactive. Activating it and enqueuing request")
+		revision.Spec.ServingState = v1alpha1.RevisionServingStateActive
+		_, _ = a.retry(updateRevisionMaxRetries, 50*time.Millisecond, revision, a.updateRevision)


If you get a large burst of traffic you will create a lock train on the Revision by trying to update it to Active for every request. Instead, you deduplicate activation requests.

Again, just take a note and fix this later. Get it working in a happy, single-request scenario first.

josephburnett · 2018-03-15T15:56:30Z

+	InitialTimeout time.Duration
+}
+
+func (rrt RetryingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {


As mentioned above, this might not be necessary.

josephburnett · 2018-03-15T15:58:49Z

 	// By default we map to the placeholder service directly.
-	// This would point to 'router' component if we wanted to use
-	// this method for 0->1 case.
+	// This would point to 'activator' component if enableActivator is true.


Does enableActivator mean an experiment is on? Or that the revision is in Reserve state and should be routed to the Activator? The name should clarify.

The feature experiment flag is enableActivatorExperiment. By default it's false. When it's true, the traffic will be directed to activator-service. Right now activator uses the hard coded revision "configuration-example-00001". This will be addressed by #552.

josephburnett · 2018-03-30T21:45:41Z

 	}, []string{"status"})
+	// The experiment flag to turn on activator feature. If set to true, the traffic
+	// will be directed to activator-service.
+	enableActivatorExperiment = false


Nit: maybe you could say "...the traffic will always be directed to the activator..." just to make it clear that this is a development checkpoint. And you should probably link to an issue that tracks the work to route traffic away from the activator when the revision is in the active state.

josephburnett · 2018-03-30T21:47:03Z

This is a nice checkpoint for development. The activator isn't currently part of the serving path, so checking this in doesn't disrupt anything. I saw it successfully activate a revision so we know it works. :)

mdemirhan · 2018-04-02T16:36:48Z

+)
+
+func main() {
+	glog.Info("Starting the elafros activator...")


flag.Parse() before using glog. Otherwise, you will get error messages prefixed into your logs.

mdemirhan · 2018-04-02T16:53:40Z

+	glog.Info("Found revision ", revision.GetName())
+	glog.Info("Start to proxy request...")
+	switch revision.Spec.ServingState {
+	case v1alpha1.RevisionServingStateActive:


Do we plan to add throttling in case of steep spikes in the calls? We probably don't want to queue thousands of requests per route. This will also come in handy in case we have problems scaling up or even creating the revision.

@mdemirhan, yes! We really need to throttle during normal service as well. I opened #602 to track this work.

vaikas · 2018-04-02T17:26:22Z

 func makeIstioRouteSpec(u *v1alpha1.Route, tt *v1alpha1.TrafficTarget, ns string, routes []RevisionRoute, domain string) istiov1alpha2.RouteRuleSpec {
+	destinationWeights := []istiov1alpha2.DestinationWeight{}
+	placeHolderK8SServiceName := controller.GetElaK8SServiceName(u)
 	// if either current or next is inactive, target them to proxy instead of


This comment seems a bit out of date now?

vaikas · 2018-04-02T17:29:35Z

+	// The experiment flag to turn on activator feature. If set to true, the traffic
+	// will always be directed to the activator.
+	enableActivatorExperiment = false
 )


Is the only way to enable this to recompile? Could we make this into a flag so that 'easy elafros' would allow folks to play with this feature.

vaikas · 2018-04-02T17:30:50Z

+		if err == nil {
+			return resp, nil
+		}
+		timeout = timeout + timeout


Should probably use an exponential backoff here?

vaikas · 2018-04-02T18:13:01Z

+	glog.Info("Found revision ", revision.GetName())
+	glog.Info("Start to proxy request...")
+	switch revision.Spec.ServingState {
+	case v1alpha1.RevisionServingStateActive:


Yeah, the Spec here is not what we want to key off and Revision status should indeed be what we use to determine the actual status of k8s service.

Use deployment for activator, instead of a pod Use a flag for activator feature flag enableActivatorExperiment Add appendHeaders blob to istio route rule definition

josephburnett · 2018-04-06T16:12:44Z

+)
+
+// FakeRetryingRoundTripper implements RetryingRoundTripperInterface
+type FakeRetryingRoundTripper struct{}


Let's rip out the RetryingRoundTripper. This was leftover from a previous 0-to-1 scaling strategy that used warm pods. It was necessary to probe for readiness through aggressive retries. But we have a better signal in service readiness and actual readiness checks.

To make sure the requests are forwarded only when the service is ready, we will establish a watch on the service to look for the ready condition. @vaikas-google was talking about this.

josephburnett · 2018-04-06T16:20:47Z

+	glog.Info("Found revision ", revision.GetName())
+	glog.Info("Start to proxy request...")
+	switch revision.Spec.ServingState {
+	case v1alpha1.RevisionServingStateActive:


@mdemirhan, yes! We really need to throttle during normal service as well. I opened #602 to track this work.

josephburnett · 2018-04-06T16:21:50Z

+		proxyRequest(w, r, serviceURL, a.retryingTripper)
+	case v1alpha1.RevisionServingStateReserve:
+		// The revision is inactive. Enqueue the request and activate the revision
+		glog.Info("the revision is inactive. Activating it and enqueuing reqfretryinguest")


"reqfretryinguest" ???

josephburnett · 2018-04-06T16:24:44Z

+		glog.Info("The revision is active. Forwarding request to service at ", serviceURL)
+		proxyRequest(w, r, serviceURL, a.retryingTripper)
+	case v1alpha1.RevisionServingStateReserve:
+		// The revision is inactive. Enqueue the request and activate the revision


We don't actually enqueue the request. We just proxy it immediately. If you don't want to setup the service watch now, open an issue to track that work and link it in here.

I would actually prefer that you left that for a separate issue so more than one person can work on this.

josephburnett · 2018-04-06T16:28:40Z

+		})
+
+	appendHeaders := make(map[string]string)
+	appendHeaders["revision"] = "configuration-example-00001"


You will need to remove hardcoded revisions before committing this.

I believe the plan is to use Istio's appendHeaders feature to have it explicitly communicate the destination. But that's broken, so for now, just lookup the necessary information to figure out which revision this was intended for (as a temporary hack).

mattmoor · 2018-04-19T18:42:07Z

+  name: ela-activator
+  namespace: ela-system
+spec:
+  replicas: 1


This feels like it should be replicated (maybe 2-3?) to avoid being a SPOF and dropping requests when a single replica is evicted.

I don't expect it for this change, but I'd like to better understand what our provisioning and autoscaling strategy for this will be longer term. Replicas in some proportion to cluster size feels right to me, but is a total WAG. Can we open an issue to track this? cc @josephburnett

mattmoor · 2018-04-19T19:34:59Z

 	endpoint := obj.(*corev1.Endpoints)
 	eName := endpoint.Name
 	namespace := endpoint.Namespace
+	if namespace == "ela-system" {


I don't understand why we need this?

mattmoor · 2018-04-19T19:44:15Z

+	if err != nil {
+		return
+	}
+	wi, err := a.elaClient.ElafrosV1alpha1().Revisions(revision.GetNamespace()).Watch(metav1.ListOptions{


I believe that this will result in us starting a watch per request, which still feels expensive (tho less so than our prior polling!).

I think that we should change the activator to create a single informer that watches all revisions and flushes the buffered requests for each revision as it becomes ready.

That said, I would be fine with checkpointing as-is, and following-up (open an issue to track), since it should be an implementation detail / optimization of this package.

Right now, it's one watch object per unique revision, since there is a pendingRequests map in process().
// Add a watch for each unique revision

It's natural to group requests per unique revision -- the revisions won't be ready at the same time.

You can still group requests by Revision and have a single watch over all Revisions.

mattmoor · 2018-04-19T19:47:59Z

+			doneCh:    make(chan bool),
+		}
+		a.chans.revisionRequestCh <- revRequest
+		<-revRequest.doneCh


This is a boolean channel, but you aren't checking what comes through?

As you suggested, changed to chan struct{}

mattmoor · 2018-04-19T19:48:33Z

+			doneCh:    make(chan bool),
+		}
+		a.chans.revisionRequestCh <- revRequest
+		<-revRequest.doneCh


Same as above.

mattmoor · 2018-04-19T20:18:45Z

+		case revToWatch := <-a.chans.watchCh:
+			go a.watchForReady(revToWatch)
+		case revToActivate := <-a.chans.activateCh:
+			go a.activate(revToActivate)


These are asynchronous, so process() could mutate the RevisionRequest while it is being processed, right?

Do you think I should use sync.Map?
https://golang.org/pkg/sync/#Map

The map isn't what's concurrently accessed, the element is.

mattmoor · 2018-04-19T20:21:29Z

+	activationDoneCh  chan (string)
+	revisionRequestCh chan (RevisionRequest)
+	watchCh           chan (string)
+}


I did a pass over this, and tried to grok enough of the flow to meaningfully review, but the volume of chan usage makes this particularly difficult to follow. I also have concerns around the thread safety and the potential for leaking go routines.

Consider looking in the golang sync package for abstractions that could be useful. cc @grantr may also have thoughts.

@josephburnett I'll defer to your judgement on whether we want this to go in as-is, but I'm concerned about the complexity of this. I can't confidently say that this has no races, deadlocks, leaks, nor that I'd be able to debug it if it fails, which concerns me for such a critical component. That said, I can appreciate a desire to checkpoint a functional feature, where the complexity is hidden inside of a reasonably defined library.

Even if this goes in as-is, we should seriously consider ways to refactor the internals here to better avoid channels (not entirely, but we can't need 6+), perhaps by better encapsulating parts of this logic in outlined datatypes or methods.

Ack. I would like this to go in as-is so that me and Nicole can start iterating on it. It is too complex and not production ready. But it's behind an experiment and it works enough to demo.

We will simplify this over the next two weeks as we fine tune the code and load test it.

mattmoor · 2018-04-19T20:22:48Z

+	return arr[0], arr[1], nil
+}
+
+func (a *Activator) getRevisionTargetURL(revision *v1alpha1.Revision) (string, error) {


FWIW, I might have had an easier time following the use of channels if a lot of these helper methods were split into another file in the package.

mattmoor · 2018-04-19T20:23:44Z

+	if err != nil {
+		glog.Errorf("Failed to parse target URL: %s. Error: %v", serviceURL, err)
+		http.Error(revRequest.w, "Failed to forward request.", http.StatusBadRequest)
+		return


It feels like these paths should also be signaling that they are done?

mattmoor · 2018-04-19T20:34:55Z

+
+	appendHeaders := make(map[string]string)
+	if len(u.Status.Traffic) > 0 {
+		appendHeaders[controller.GetRevisionHeaderName()] = u.Status.Traffic[0].RevisionName


Hmm... What if len(u.Status.Traffic) == 2?

I added a TODO for this. We will handle this later #693.

mattmoor · 2018-04-20T17:50:12Z

 	glog.Info("Started Activator")
-	go a.process()
+	quitCh := make(chan struct{})
+	go a.process(quitCh)


Why can't we pass stopCh here?

I was worried that the second usage of stopCh would introduce an error. If the process() popped the element out, Run doesn't know we have received the stop signal. Please let me know if that's not the case.

mattmoor · 2018-04-20T17:50:30Z

 	appendHeaders := make(map[string]string)
 	if len(u.Status.Traffic) > 0 {
+		// TODO: Deal with the case when the route has more than one traffic targets.
+		// https://github.com/elafros/elafros/issues/693


It feels worth noting inline that this depends on an Istio feature.

mattmoor · 2018-04-20T17:54:39Z

-	if len(svc.Spec.Ports) != 1 {
-		return "", fmt.Errorf("need just one port. Found %v ports", len(svc.Spec.Ports))
+	ip := endpoint.Subsets[0].Addresses[0].IP
+	port := endpoint.Subsets[0].Ports[0].Port


This change wholly eliminates any load balancing we'd get if the backing service scaled up, which concerns me. This too should have an issue tracking it.

mattmoor · 2018-04-20T18:11:58Z

-	proxy := httputil.NewSingleHostReverseProxy(target)
+func (a *Activator) proxyRequest(revRequest RevisionRequest, serviceURL *url.URL) {
+	glog.Infof("Sending a proxy request to %q", serviceURL)
+	proxy := httputil.NewSingleHostReverseProxy(serviceURL)


Your refactoring has highlighted that this is being called inside of a loop. I have no sense for how complex this constructor is, which worries me a little bit.

mattmoor · 2018-04-20T18:13:43Z

-func (a *Activator) process() {
-	pendingRequests := make(map[string][]RevisionRequest)
+func (a *Activator) process(quitCh chan struct{}) {
+	// TODO: https://golang.org/pkg/sync/#Map


IDK what this TODO means, since you're using it below?

mattmoor · 2018-04-20T18:14:01Z

-	pendingRequests := make(map[string][]RevisionRequest)
+func (a *Activator) process(quitCh chan struct{}) {
+	// TODO: https://golang.org/pkg/sync/#Map
+	var pendingRequests sync.Map //map[string][]RevisionRequest


remove commented code?

This comment is to show what type of map it is, for readability.

mattmoor · 2018-04-20T18:15:08Z

+		case revToWatch := <-a.chans.watchCh:
+			go a.watchForReady(revToWatch)
+		case revToActivate := <-a.chans.activateCh:
+			go a.activate(revToActivate)


The map isn't what's concurrently accessed, the element is.

mattmoor · 2018-04-20T18:16:26Z

+	if err != nil {
+		return
+	}
+	wi, err := a.elaClient.ElafrosV1alpha1().Revisions(revision.GetNamespace()).Watch(metav1.ListOptions{


You can still group requests by Revision and have a single watch over all Revisions.

mattmoor · 2018-04-20T18:27:07Z

+	activationDoneCh  chan (string)
+	revisionRequestCh chan (RevisionRequest)
+	watchCh           chan (string)
+}


@josephburnett I'll defer to your judgement on whether we want this to go in as-is, but I'm concerned about the complexity of this. I can't confidently say that this has no races, deadlocks, leaks, nor that I'd be able to debug it if it fails, which concerns me for such a critical component. That said, I can appreciate a desire to checkpoint a functional feature, where the complexity is hidden inside of a reasonably defined library.

Even if this goes in as-is, we should seriously consider ways to refactor the internals here to better avoid channels (not entirely, but we can't need 6+), perhaps by better encapsulating parts of this logic in outlined datatypes or methods.

mattmoor · 2018-04-20T18:28:43Z

 const (
 	requestQueueLength = 100
+	happyPath          = true
+	sadPath            = false


what are these? :)

google-prow-robot · 2018-04-20T20:19:57Z

@akyyy: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-elafros-bazel-test	`70f264a`	link	`/test pull-elafros-bazel-test`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

akyyy · 2018-04-20T22:30:10Z

/retest

mattmoor · 2018-04-20T22:54:34Z

/approve
/lgtm

google-prow-robot · 2018-04-20T22:54:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akyyy, mattmoor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [mattmoor]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

* Add activator for 0->1 case. The feature experiment flag is enableActivatorExperiment. By default it's false. When it's true, the traffic will be directed to activator-service. Right now activator uses the hard coded revision "configuration-example-00001". This will be addressed by knative#552. * update one comment * Add unit test Use deployment for activator, instead of a pod Use a flag for activator feature flag enableActivatorExperiment Add appendHeaders blob to istio route rule definition * remove retryingtripper.go * Add a couple of todo links * watch works * Use a queue and watch object for each unique revision * Use channels * activator with channels * fieldSelector * update names * Send watch ready event to watch channel * delete a binary * merge * add unit tests * change back * remove debug code * Update test since we allow configuration change based on revision change in this pr * address cr comments * add error handling channel * Use sync.Map for pendingRequests * Talk to endpoint directly * fix tests with existing endpoints * disable activator by default * delete extra comment * Change scaleToZeroThreshold back to 5 min * Update comments

* Add github action to sync upstream * Fix comment

akyyy requested review from josephburnett and vaikas March 9, 2018 22:22

google-prow-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 9, 2018

yanweiguo reviewed Mar 9, 2018

View reviewed changes

mattmoor suggested changes Mar 10, 2018

View reviewed changes

mattmoor suggested changes Mar 12, 2018

View reviewed changes

akyyy changed the title ~~activator for 0->1 case~~ [WIP] activator for 0->1 case Mar 12, 2018

google-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 12, 2018

mattmoor suggested changes Mar 14, 2018

View reviewed changes

josephburnett suggested changes Mar 15, 2018

View reviewed changes

akyyy changed the title ~~[WIP] activator for 0->1 case~~ activator for 0->1 case Mar 30, 2018

google-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2018

josephburnett approved these changes Mar 30, 2018

View reviewed changes

update one comment

ca921d1

mdemirhan reviewed Apr 2, 2018

View reviewed changes

vaikas reviewed Apr 2, 2018

View reviewed changes

akyyy added the WIP label Apr 2, 2018

akyyy changed the title ~~activator for 0->1 case~~ [WIP] activator for 0->1 case Apr 2, 2018

google-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 2, 2018

Add unit test

707ec26

Use deployment for activator, instead of a pod Use a flag for activator feature flag enableActivatorExperiment Add appendHeaders blob to istio route rule definition

akyyy requested a review from evankanderson as a code owner April 5, 2018 20:37

google-prow-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 5, 2018

josephburnett reviewed Apr 6, 2018

View reviewed changes

akyyy added 3 commits April 8, 2018 10:45

Merge remote-tracking branch 'upstream/master'

84a1c8c

remove retryingtripper.go

d4a9397

Add a couple of todo links

8ead2f0

akyyy changed the title ~~[WIP] activator for 0->1 case~~ activator for 0->1 case Apr 19, 2018

google-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 19, 2018

mattmoor suggested changes Apr 19, 2018

View reviewed changes

tcnghia mentioned this pull request Apr 20, 2018

Mark a revision ready at the right time #660

Closed

akyyy added 5 commits April 20, 2018 01:13

add error handling channel

a445917

Use sync.Map for pendingRequests

d4d9542

Talk to endpoint directly

8ab2981

fix tests with existing endpoints

0ee6a60

disable activator by default

95d35c6

google-prow-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 20, 2018

akyyy added 2 commits April 20, 2018 10:41

delete extra comment

d58b834

Change scaleToZeroThreshold back to 5 min

74e60b2

mattmoor reviewed Apr 20, 2018

View reviewed changes

akyyy added 2 commits April 20, 2018 12:43

Update comments

cdeaed5

Merge remote-tracking branch 'upstream/master'

ceafcde

mattmoor approved these changes Apr 20, 2018

View reviewed changes

Merge remote-tracking branch 'upstream/master'

cfc6d3d

akyyy removed the WIP label Apr 20, 2018

google-prow-robot assigned mattmoor Apr 20, 2018

google-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2018

google-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2018

google-prow-robot merged commit 09ee239 into knative:master Apr 20, 2018

tcnghia mentioned this pull request Oct 19, 2018

Add mdemirhan to /OWNERS #2264

Merged

nak3 added a commit to nak3/serving that referenced this pull request Jun 13, 2023

[SRVKS-1071] Add github action to sync upstream (knative#341)

650511a

* Add github action to sync upstream * Fix comment

Conversation

akyyy commented Mar 9, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akyyy Mar 10, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akyyy Mar 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akyyy Mar 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mattmoor left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

akyyy Mar 10, 2018 •

edited

Loading

akyyy Mar 12, 2018 •

edited

Loading

akyyy Mar 12, 2018 •

edited

Loading

akyyy Mar 30, 2018 •

edited

Loading