Define and require "ID stability" for agency_id, route_id, and stop_id fields by antrim · Pull Request #171 · google/transit

antrim · 2019-07-02T05:57:22Z

"Where ID stability is required, ID values must not be reused to describe a different entity than in the original usage. ID values for entities should also be maintained across datasets."

This change would allow applications to depend on more stable IDs.

For the sake of brevity, an "ID stability" term definition was added, and referenced in the definitions of ID fields.

ID stability required for agency_id, route_id, stop_id

flocsy · 2019-07-02T07:32:40Z

+1

skinkie · 2019-07-02T07:43:39Z

While I totally support this as a consumer, as a producer this just can't be achieved beyond best effort for trip_ids. Now it seems that @antrim understands this and did not add it there, but this also leaves a gap. I would suggest for trip_id: "ID stability strongly recommended (see term definitions)" to facilitate realtime matching across datasets.

prhod · 2019-07-02T08:04:35Z

+1 for this proposal and +1 for @skinkie's comment. trip_id collisions are hard to handle, and it's worse when realtime is available.

abyrd · 2019-07-02T08:22:06Z

Maybe it should be suggested that if trip_id stability cannot be guaranteed, the feed producer should aim to use totally new IDs instead of reusing ones from previous feed iterations. For example:

Feed version 1 contains trip IDs trip_A and trip_B. Feed version 2 contains the same two logical trips but adds a new third one. The preferred approach is to continue calling the first two trips trip_A and trip_B, and add a new ID like trip_C. If this cannot be achieved due to the nature of the producer software, the second best approach is to retire trip_A and trip_B, moving on to trip_C, trip_D, and trip_E to avoid false-positive matches of realtime data for trip_A and trip_B.

Of course if the example is too long for the main spec it could be explained in best practices.
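
The two approaches in the example above can be sketched in Python. This is only an illustration: the "logical trip key" used to recognise the same trip across versions, the `trip_N` naming, and the function name `assign_trip_ids` are assumptions made for the sketch, not anything defined by GTFS.

```python
from itertools import count


def assign_trip_ids(previous, current_keys, can_match=True):
    """Assign trip IDs for a new feed version.

    previous: dict mapping a (hypothetical) logical-trip key to the trip_id
              used in the previous feed version.
    current_keys: logical-trip keys present in the new version, in order.
    can_match: whether the producer software can recognise the same logical
               trip across versions.
    """
    used = set(previous.values())
    fresh = (f"trip_{n}" for n in count(1))

    def next_unused():
        # Skip every ID that already appeared in the previous version.
        for candidate in fresh:
            if candidate not in used:
                used.add(candidate)
                return candidate

    assigned = {}
    for key in current_keys:
        if can_match and key in previous:
            assigned[key] = previous[key]   # preferred: keep the old ID
        else:
            assigned[key] = next_unused()   # otherwise: never reuse an old ID
    return assigned
```

With two trips in the previous version and three in the new one, `can_match=True` keeps the two old IDs and adds one new ID, while `can_match=False` retires both old IDs and issues three new ones, so stale realtime data cannot match a different trip by accident.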

aababilov · 2019-07-02T08:34:48Z

+1 from Google

abyrd · 2019-07-02T08:47:02Z

This definition of "ID stability" is included in a table entitled "Field Types". This is the only item in the table which elaborates on a concept instead of defining a type. I notice that elsewhere in the document many fields have the type of "ID" which already appears in the Field Types table. A good solution might be to convert the "ID stability" item into a type definition for a "Stable ID", which extends (refines or narrows) the definition of an "ID". Then in field definitions, the type could just be listed as "Stable ID" without the extra text "ID stability is required (see term definitions)."

I also find that the text in the definition of "ID stability" is ambiguous. In the statement "ID values must not be reused to describe a different entity than in the original usage", I don't think it will be clear to the first time reader whether we're talking about reusing identifiers between different tables, between different successive versions of a single GTFS dataset, or between completely separate GTFS datasets from different producers. This is partly because the word "also" in the second sentence "ID values for entities should also be maintained across datasets" implies that the first sentence is not about maintaining values across datasets.

The definition of "dataset" in the same file introduces the concept of new versions of a dataset, with all successive versions of a dataset accessible at a single stable URL. Does "across datasets" then refer to multiple distinct datasets, possibly from different producers, which would be published at distinct URLs? Is this implying that the producers should coordinate to ensure that other producers do not reuse their IDs (e.g. using a national stop registry)?

gcamp · 2019-07-02T18:28:13Z

Can we have opinions of producers here? How difficult would it be for them to support?

Even for consumers, on the surface it's an obvious win, but how should we expect to use this new feature of the spec? We all know it's going to take years for all agencies to support this, which means most consumers will need to keep their old way of doing stable IDs for a very long time. On top of that, there's no way to know which version of the spec a producer follows.

Wouldn't it be better to create a new field stable_id? At least then consumers can expect that if that field is filled, we can count on stability. If not, we should maybe start thinking about a spec version number that's included in the GTFS to know which features are available (and allow backward-incompatible changes in the future).
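
A minimal sketch of how a consumer might use such a field, assuming a hypothetical optional `stable_id` column in stops.txt (no such column exists in GTFS today, and the function name `stop_keys` is made up for illustration):

```python
import csv
import io


def stop_keys(stops_txt):
    """Return (key, is_declared_stable) for each row of a stops.txt string.

    Uses the hypothetical stable_id column when it is present and filled,
    and falls back to stop_id with no stability guarantee otherwise.
    """
    reader = csv.DictReader(io.StringIO(stops_txt))
    keys = []
    for row in reader:
        stable = (row.get("stable_id") or "").strip()
        if stable:
            keys.append((stable, True))
        else:
            keys.append((row["stop_id"], False))
    return keys
```

A feed without the column yields only `False` flags, so the consumer would keep using its existing matching logic for that producer.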

skinkie · 2019-07-02T19:03:07Z

@gcamp line/route and stop stability (network related) is something that all major vendors support / require. Versioning of them is even something relatively new.


barbeau · 2019-07-02T19:28:23Z

I agree with @gcamp that for ID stability to be really useful we need an explicit indicator that the agency is actually providing stable IDs - this could be by versioning the spec and requiring all agencies supporting spec version > X to have stable IDs, or by providing new stable ID fields for all tables. Spec versioning implies that we expect all producers to eventually support stable IDs (at least for the tables we specify), as there may be other versioned features that an agency would like to adopt that would also require adopting stable IDs. Adding a new stable ID field is less aggressive and allows producers that can support this to support this, but wouldn't necessarily force the issue the way that a versioned spec would.

Change in definition approach: * "ID stability" -> "Stable ID" * Change in definition, introducing "version" rather than "dataset"

antrim · 2019-07-02T20:56:40Z

For the sake of further discussion:

In the latest commit, I took @abyrd's suggestion to define a "Stable ID" field which references dataset "versions" (following terminology used elsewhere in the Spec).
I currently propose, and am in favor of, writing ID stability for agency_id, route_id, and stop_id into the Spec as a core assumption. While not technically backwards-compatible, this follows the current assumption and practice of many (most?) producers and consumers; this is also defined in the GTFS Best Practices. We can't expect every feed producer to immediately provide ID stability, but this change seems the most effective way to steer feed producers toward ID stability over time.
In response to above, feed consumers would need some mechanism to track when ID stability is not maintained. Feed consumers might develop their own infrastructure to do this and/or we could add a field to declare stability (as @barbeau suggests).

tsherlockcraig · 2019-07-02T22:22:16Z

(Speaking for Trillium) I think trip_id should not be expected to be a Stable ID.

We support trip_id stability within our GTFS-producing software; however:

- What trip_id stability would mean seems unclear. Trips can change in a variety of ways between feeds/schedules of a service. The line between what is the same trip and what is a different trip couldn't easily be defined, but simultaneously probably shouldn't be left merely to the choice of the feed producer (as it in effect is for a route).
- Providing stable trip_ids is difficult within feeds that represent schedules both before and after a service change (because the same trip_id can't be attached to two service_id values).
- While I see the practicality of the issue around matching real-time data sets, requesting trip_id stability seems like a hack to get around one of the issues within GTFS data production and consumption systems: that it is anticipated that GTFS static feeds are only published/ingested on a timeframe of days or weeks, rather than minutes or hours. If the expectation were that the GTFS feed could be published and ingested more or less instantaneously, and the producer is using linked_datasets.txt to point to an appropriately linked trip updates GTFS-rt feed, it seems that there would be very few opportunities for the real-time feed and static feed to be out of alignment, and the trip_id could serve as nothing but an internal identifier within those two feeds at that moment in time.

+1 for PR as currently drafted. If trip_id were also made a Stable ID, then the vote would be a +0

tesobota · 2019-07-03T17:43:43Z

> @gcamp line/route and stop stability (network related) is something that all major vendors support / require. Versioning of them is even something relatively new.

I would strongly disagree that major vendors support stable ids. Route_ID specifically is auto-incremented with every new schedule period database... as are Trip_ID values. The only stable value is Stop_ID.

barbeau · 2019-07-03T18:19:12Z

@tesobota I know this is the case for others too. Can you say who your GTFS export vendor is?

lauramatson · 2019-07-05T19:24:54Z

At Metro Transit in Minneapolis/St. Paul, Minnesota, route_id and trip_id are auto-incremented with each new schedule period too. We use HASTUS.

I'm not sure I understand what "stability" means or would require. For instance, a stop_id won't be reused for a meaningfully different stop. But sometimes lat/long are corrected to better match the actual stop location. stop_name may be updated for clarity. As ADA pads and sidewalk networks are improved, wheelchair_boarding may change too. Would it be acceptable to make these updates without changing the stop_id?

skinkie · 2019-07-05T22:29:17Z

On Fri, 5 Jul 2019, lauramatson wrote:

> At Metro Transit in Minneapolis/St. Paul, Minnesota, route_id and trip_id are auto-incremented with each new schedule period too. We use HASTUS.

Then you should really use a better export OIG script. route_id should certainly not increment per schedule period.

> I'm not sure I understand what "stability" means or would require. For instance, a stop_id won't be reused for a meaningfully different stop. But sometimes lat/long are corrected to better match the actual stop location. stop_name may be updated for clarity. As ADA pads and sidewalk networks are improved, wheelchair_boarding may change too. Would it be acceptable to make these updates without changing the stop_id?

In Hastus (< 2018) you are not able to do network versioning. Hence stop_ids will remain the same. So yes, these updates will use the same stop_id (place/stop). The problem is the following. What if a stop is moved during a specific period, for example 2019-08-01 - 2019-08-14, and after (and until) that moment the stop is slightly moved around the corner? To facilitate this in Hastus you would likely have a new stop_id to work within the same schedule, with alternative links (the length between stops changes). Alternatively you would have a different schedule for only this period, and manually update the location upon export. A move like this would create a new stop_id. This sounds 'okayish', but if a rider had a subscription on a specific stop_id, they would not receive updates for the next stop_id. Hence it should be announced that the stop_id (regular) is temporarily replaced by stop_id (temporary). This kind of glue does not exist in GTFS or GTFS-RT at this moment.

antrim · 2019-07-05T22:46:08Z

Thanks, @lauramatson!

> I'm not sure I understand what "stability" means or would require. For instance, a stop_id won't be reused for a meaningfully different stop. But sometimes lat/long are corrected to better match the actual stop location. stop_name may be updated for clarity.

What a useful prompt. I can see cases where a stop_name and lat/lon are both updated but stop_id should be maintained.

Here are two hypothetical examples where a stable stop_id might be maintained:

stop locations are updated to be more accurate and the system's naming scheme is updated
a stop was moved from the near side to the far side of the intersection.

In both these cases, the stops stay the same in their relation to the network (i.e. the order of the stops within a trip stays the same).

Use cases related to stable IDs:

In both the above examples, we'd want for favorite/bookmarked stops in users' apps to be maintained.
The second example is a bit more complicated if the ID is being used to associate the stop record with another data element outside of the GTFS (say, an amenity at the stop). In that case a data consumer would likely want to have a mechanism set up to notice the stop has moved significantly and check the linkage.
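
One way a data consumer could notice that a stop with an unchanged stop_id has moved significantly is to compare coordinates between feed versions. The sketch below is illustrative only; the 50 m threshold and the function names are assumptions, not values from the spec or from any existing tool.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def moved_stops(old, new, threshold_m=50.0):
    """Return stop_ids present in both versions that moved more than threshold_m.

    old, new: dicts mapping stop_id -> (stop_lat, stop_lon).
    The default threshold is an arbitrary value chosen for this sketch.
    """
    return sorted(
        stop_id
        for stop_id in old.keys() & new.keys()
        if haversine_m(*old[stop_id], *new[stop_id]) > threshold_m
    )
```

A small coordinate correction stays below the threshold and keeps any external linkage untouched, while a near-side to far-side move would be flagged for review.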

Proposed next step:
I think we probably should collect more use cases to be able to justify this proposed change, understand its implications, and provide practice recommendations around when IDs should remain stable. Please post use cases on this thread!

antrim · 2019-07-05T22:47:05Z

Any other users of HASTUS out there besides @lauramatson at Metro Transit? Are route_id and trip_id auto-incremented in every implementation?

devadvance · 2019-07-05T23:10:02Z

Use case example: digital signage and kiosks

As a consumer of GTFS data, we depend on the stability of values for displaying accurate information on in-person displays. Generally, this takes the form of:

Non-interactive screens with alerts and arrivals/scheduled times
Interactive kiosks with alerts, arrivals/scheduled times, system maps, line maps, and more

Conceptually, we need to be able to say things like "the relevant transit information for X, Y, and Z routes/stops/trips should show up on A, B, or C screens."

Without stable values for stop_id and, ideally, route_id, it's substantially more difficult to correctly deliver relevant stop times, RT trip updates, alerts, etc. to devices placed in the built environment.
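
As a rough illustration of why this matters for signage, the sketch below checks a screen configuration against the stop_ids in a newly published feed. The configuration shape and the name `dangling_references` are hypothetical and do not describe any particular signage product.

```python
def dangling_references(screen_config, feed_stop_ids):
    """Map each screen to the configured stop_ids missing from the current feed.

    screen_config: dict mapping a screen name to a list of stop_ids it displays.
    feed_stop_ids: iterable of stop_ids found in the newly published feed.
    """
    current = set(feed_stop_ids)
    missing = {}
    for screen, stop_ids in screen_config.items():
        gone = [s for s in stop_ids if s not in current]
        if gone:
            missing[screen] = gone
    return missing
```

If stop_id values change between feed versions, every affected screen shows up in the result and has to be reconfigured by hand.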

skinkie · 2019-07-05T23:37:20Z

> Any other users of HASTUS out there besides @lauramatson at Metro Transit? Are route_id and trip_id auto-incremented in every implementation?

@antrim while we do not export GTFS from Hastus, but rather the national interface and NeTEx, our route_ids are stable. The trip_ids are not.

md_line_id_trp is the identifier for the line/route. It is user-defined, because different operators may have a custom mapping.
https://github.com/skinkie/hastus/blob/master/CompositeFrame/ServiceFrame/lines.ix
https://github.com/skinkie/hastus/blob/master/Configuratie.ix#L14

paulswartz · 2019-07-06T12:36:14Z

> Any other users of HASTUS out there besides @lauramatson at Metro Transit? Are route_id and trip_id auto-incremented in every implementation?

tl;dr @mbta also uses HASTUS for our bus and subway schedules: the route IDs and trip IDs are both incremented each schedule change, but we post-process the route IDs to keep them stable.

We get an export from HASTUS when the bus/subway schedule changes. (Other modes are managed separately). As a part of our GTFS processing before the file is released to the public, we convert the route IDs to something that's stable between changes. We use them across many different clients and don't want to confuse people by showing them a new route when it's the same number that they're used to: we use the IDs in places like the website where people might have direct URLs bookmarked.

Example HASTUS-exported routes.txt:

route_id,route_short_name,route_long_name,route_desc,route_type
01-1243,01,,,3

Example post-processed routes.txt:

route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color,route_sort_order,route_fare_class,line_id,listed_route
1,1,1,Harvard - Dudley via Massachusetts Avenue,Key Bus,3,,FFC72C,000000,50010,Local Bus,line-1701,

We don't do the same with the trip IDs. Those are also incremented for each schedule change and we pass those through unmodified.

skinkie · 2019-07-06T13:22:26Z

@paulswartz can you show us the part of the OIG that does this?

paulswartz · 2019-07-06T15:55:48Z

@skinkie it's not part of the OIG: the export we get from HASTUS in the first example has the auto-incremented route IDs. We have a set of Python scripts to handle the post-processing and merging with other modes: that's where the route IDs are changed.

skinkie · 2019-07-07T11:17:42Z

@paulswartz how are you exporting information from Hastus without an OIG?

paulswartz · 2019-07-08T15:13:57Z

Sorry for the confusion! We have an OIG to do the export, and we get route_ids like 01-1243, where 1243 is autoincremented for each schedule change. The Python scripts which process the HASTUS export convert that into the route_id 1, which matches what riders see on the buses.
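
The conversion described here can be sketched as follows. This is not the MBTA's actual script; it assumes the export format is always `<route number>-<schedule version>` as in the `01-1243` example, and the function name is made up for illustration.

```python
import re

# Assumed export format: "<route number>-<schedule version>", e.g. "01-1243".
_VERSIONED = re.compile(r"^(?P<route>.+)-(?P<version>\d+)$")


def stable_route_id(exported_id):
    """Drop the auto-incremented version suffix and any leading zeros."""
    match = _VERSIONED.match(exported_id)
    route = match.group("route") if match else exported_id
    # Keep a single "0" if the route number consists only of zeros.
    return route.lstrip("0") or "0"
```

IDs that do not end in a numeric suffix pass through with only leading zeros removed, so the function can be applied to every row of routes.txt.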

barbeau · 2019-07-18T15:09:16Z

To respond to @antrim's request for how ID stability impacts real applications, here are 3 ways it affects the OneBusAway app (I'm speaking specifically on the Android implementation, but iOS is similar):

1) Bookmarking favorite stops (stop_id)
2) Setting recurring arriving bus reminders (stop_id, trip_id)
3) Bookmarking favorite routes (including at stops) (stop_id, route_id, headsign)

For 1), we allow users to bookmark stops in the mobile app and use stop_id as the key to identify the stop. If the agency changes the stop_id, it breaks that bookmark, and as a result stop_ids that refer to the "same stop" from a rider's perspective should retain the same stop_id.

For 2), we allow users to set recurring reminders for when their bus will arrive on a particular route at a particular stop, based on real-time information. We currently use a trip_id, but this breaks when the schedule is updated and trip_ids change. In a perfect world trip_id would be stable as well, but I know there are more challenges here. It's on my TODO list to improve this to survive GTFS updates, which would include using route_id and trip_headsign (see below). Changes to stop_id also break this feature, so again, stop_id should be stable.

For 3), bookmarking favorite routes (including at stops), we allow users to pin arrivals for particular routes to the top of the list, either for all stops or at certain stops. Because of changing trip IDs (we learned from 2) above as this feature was implemented later), for this feature we use a combination of route_id and trip_headsign, and stop_id if the user chooses a particular stop. If the route_id and trip_headsign (and stop_id if used) change, this obviously breaks the favorite. So route_id and stop_id should also be stable. Ideally trip_headsign would also be stable when representing the same overall information, but again this is more complicated.
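
The composite key described for route favorites could look roughly like this. It is a sketch only, not OneBusAway's actual code; the class and function names are invented for illustration.

```python
from typing import NamedTuple, Optional


class FavoriteKey(NamedTuple):
    route_id: str
    trip_headsign: str
    stop_id: Optional[str] = None  # None means "at every stop"


def matches(favorite, arrival):
    """True if an arrival (a dict of GTFS field values) belongs to a saved favorite."""
    if favorite.route_id != arrival["route_id"]:
        return False
    if favorite.trip_headsign != arrival["trip_headsign"]:
        return False
    return favorite.stop_id is None or favorite.stop_id == arrival["stop_id"]
```

Because trip_id is not part of the key, the favorite survives a schedule change as long as route_id, trip_headsign, and (when set) stop_id keep their values.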

dbabramov · 2019-07-30T00:35:26Z

Echoing some of @barbeau's comment:

For us at Google, the situation where the same id values end up being reused for different entities presents a problem in detecting changes from one version of the feed to another.

Stop names and locations, route configurations/shapes/stop sequences routinely change between subsequent updates of the feed.
Wherever other properties change, stable ids allow us to distinguish between a new stop (route) vs a change to an existing one.

Some scenarios where this is important to our end users include starring stations or configuring commute. In these scenarios a stable link between the new and the old entities is important to retaining user preferences.
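
A minimal sketch of this kind of change detection, assuming IDs are stable between versions (this is an illustration, not Google's pipeline, and the function name is hypothetical):

```python
def diff_entities(old, new):
    """Compare two versions of a table keyed by ID.

    old, new: dicts mapping an ID (for example stop_id) to a dict holding the
              row's other fields.
    Returns (added, removed, changed) as sorted lists of IDs.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return added, removed, changed
```

If a producer reuses an ID for a different entity, that entity is reported as "changed" rather than as one removal plus one addition, which is exactly the ambiguity stable IDs are meant to avoid.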

botanize · 2019-08-09T18:14:58Z

@skinkie our OIG script (@metrotransit Minnesota) adds the route version number to the route id. It does this because occasionally we change some aspect of the route, typically the route_long_name. We publish a GTFS weekly that includes the next 6+ weeks of service. So at a pick boundary our GTFS has two full schedules in it.

Here's an example from our current schedule of a route that changes route_long_name across picks.

route_id	agency_id	route_short_name	route_long_name	route_desc	route_type	route_url	route_color	route_text_color
71-111	0	71	Little Canada - Edgerton - Concord - Inver Hills	NA	3	http://www.metrotransit.org/route/71		000000
71-112	0	71	Little Canada - Westminste - Concord - Inver Hills	NA	3	http://www.metrotransit.org/route/71		000000

While it's a fairly minor issue, if a customer plans a trip in the future on the next schedule and route_long_name has changed, we'd like them to see the new route_long_name. Likewise, we'd like them to see the current route_long_name for current service. The only way to do that is to version your route_ids.

Most of the time I think this is a fairly minor issue and we could use the route metadata associated with just the first value for any route_id. Using the same route as above our routes.txt entry for Route 71 would become this:

route_id	agency_id	route_short_name	route_long_name	route_desc	route_type	route_url	route_color	route_text_color
71	0	71	Little Canada - Edgerton - Concord - Inver Hills	NA	3	http://www.metrotransit.org/route/71		000000

The problem is I'm not sure if we can have HASTUS select only the first instance of each field, or if we have to do some kind of post processing the way @paulswartz describes at @mbta.

This is all the long way of saying that requiring route_id stability prevents an agency from changing any field in routes.txt for a GTFS that includes multiple schedules, which I believe is common practice.
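
The "use the first value for any route_id" idea could be done in post-processing along these lines. This is a sketch, not a HASTUS feature; it assumes versioned IDs of the form `<route>-<version>` and tab-separated input as in the examples above, and the function name is made up.

```python
import csv
import io


def collapse_versions(routes_txt, delimiter="\t"):
    """Keep the first row for each base route_id, dropping the '-<version>' suffix."""
    reader = csv.DictReader(io.StringIO(routes_txt), delimiter=delimiter)
    seen = {}
    for row in reader:
        # "71-111" -> "71"; an ID without a suffix is left unchanged.
        base = row["route_id"].rsplit("-", 1)[0]
        if base not in seen:
            seen[base] = {**row, "route_id": base}
    return list(seen.values())
```

The trade-off is the one described above: any metadata that differs in later versions, such as a changed route_long_name, is dropped from the published routes.txt.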

DenisMGiro · 2019-09-27T04:08:58Z

> @skinkie our OIG script (@metrotransit Minnesota) adds the route version number to the route id. It does this because occasionally we change some aspect of the route, typically the route_long_name. We publish a GTFS weekly that includes the next 6+ weeks of service. So at a pick boundary our GTFS has two full schedules in it.

> Here's an example from our current schedule of a route that changes route_long_name across picks.

> route_id	agency_id	route_short_name	route_long_name	route_desc	route_type	route_url	route_color	route_text_color
> 71-111	0	71	Little Canada - Edgerton - Concord - Inver Hills	NA	3	http://www.metrotransit.org/route/71		000000
> 71-112	0	71	Little Canada - Westminste - Concord - Inver Hills	NA	3	http://www.metrotransit.org/route/71		000000

> While it's a fairly minor issue, if a customer plans a trip in the future on the next schedule and route_long_name has changed, we'd like them to see the new route_long_name. Likewise, we'd like them to see the current route_long_name for current service. The only way to do that is to version your route_ids.

> Most of the time I think this is a fairly minor issue and we could use the route metadata associated with just the first value for any route_id. Using the same route as above our routes.txt entry for Route 71 would become this:

> route_id	agency_id	route_short_name	route_long_name	route_desc	route_type	route_url	route_color	route_text_color
> 71	0	71	Little Canada - Edgerton - Concord - Inver Hills	NA	3	http://www.metrotransit.org/route/71		000000

> The problem is I'm not sure if we can have HASTUS select only the first instance of each field, or if we have to do some kind of post processing the way @paulswartz describes at @mbta.

> This is all the long way of saying that requiring route_id stability prevents an agency from changing any field in routes.txt for a GTFS that includes multiple schedules, which I believe is common practice.

At GIRO, we are in agreement with botanize's statement. If the GTFS export is for a period in which a route changes, we have no other way than adding the route version to the route identifier in order to differentiate both versions of the route. To keep the route identifier stable, the export should be for a period in which no route changes. If the goal is just to identify the route for the public, the route short name should be stable.
As for stops, that should not be an issue since stops are global to the database.

stale · 2021-08-21T03:27:29Z

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2021-08-28T16:53:17Z

This pull request has been closed due to inactivity. Pull requests can always be reopened after they have been closed. See the Specification Amendment Process.

Define "ID stability", requirements

3ca5791

ID stability required for agency_id, route_id, stop_id

googlebot added the cla: yes label Jul 2, 2019

barbeau reviewed Jul 2, 2019

"Stable ID"

cb7f00f

Change in definition approach: * "ID stability" -> "Stable ID" * Change in definition, introducing "version" rather than "dataset"

gcamp mentioned this pull request Jul 26, 2019

Google's unofficial route_type 8 to 12 #174

Merged

stale bot added the "Status: Stale" label (issues and pull requests that have remained inactive for 30 calendar days or more) Aug 21, 2021

stale bot closed this Aug 28, 2021

gcamp mentioned this pull request Aug 24, 2022

Consuming applications not incorporating changes to fields when record id is unchanged #348

Open


17 participants
