
Define and require "ID stability" for agency_id, route_id, and stop_id fields#171

Closed
antrim wants to merge 2 commits into google:master from MobilityData:id-stability


@antrim
Contributor

antrim commented Jul 2, 2019

"Where ID stability is required, ID values must not be reused to describe a different entity than in the original usage. ID values for entities should also be maintained across datasets."

This change would allow applications to depend on more stable IDs.

For the sake of brevity, an "ID stability" term definition was added, and referenced in the definitions of ID fields.

ID stability required for agency_id, route_id, stop_id
@flocsy
Contributor

flocsy commented Jul 2, 2019

+1

@skinkie
Contributor

skinkie commented Jul 2, 2019

While I totally support this as a consumer, as a producer this just can't be achieved beyond best effort for trip_ids. It seems that @antrim understands this and did not add it there, but this also leaves a gap. I would suggest for trip_id: "ID stability strongly recommended (see term definitions) to facilitate realtime matching across datasets."

@prhod

prhod commented Jul 2, 2019

+1 for this proposal and +1 to @skinkie's comment. trip_id collisions are hard to handle, and it's worse when realtime is available.

@abyrd

abyrd commented Jul 2, 2019

Maybe it should be suggested that if trip_id stability cannot be guaranteed, the feed producer should aim to use totally new IDs instead of reusing ones from previous feed iterations. For example:

Feed version 1 contains trip IDs trip_A and trip_B. Feed version 2 contains the same two logical trips but adds a new third one. The preferred approach is to continue calling the first two trips trip_A and trip_B, and add a new ID like trip_C. If this cannot be achieved due to the nature of the producer software, the second best approach is to retire trip_A and trip_B, moving on to trip_C, trip_D, and trip_E to avoid false-positive matches of realtime data for trip_A and trip_B.

Of course if the example is too long for the main spec it could be explained in best practices.
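
The two approaches above can be sketched roughly as follows. This is a minimal illustration, not part of GTFS: the function name `assign_trip_ids`, the producer-internal "logical trip" keys, and the counter-based `trip_N` naming are all hypothetical assumptions.

```python
from itertools import count
from typing import Dict, Iterable, Optional, Set


def assign_trip_ids(
    logical_trips: Iterable[str],
    previous_ids: Optional[Dict[str, str]],
    published_ids: Set[str],
) -> Dict[str, str]:
    """Assign trip IDs for a new feed version (hypothetical helper).

    logical_trips: producer-internal keys for the trips in the new version.
    previous_ids: logical key -> trip_id from the previous version, or None
        when the producer cannot track trips across versions.
    published_ids: every trip_id ever published; it is updated in place so
        that fresh IDs never collide with earlier ones.
    """
    fresh = (f"trip_{n}" for n in count(1))

    def next_unused() -> str:
        # Skip any ID that was published before so it is never reused.
        for candidate in fresh:
            if candidate not in published_ids:
                published_ids.add(candidate)
                return candidate
        raise RuntimeError("unreachable: the ID generator is infinite")

    assigned = {}
    for key in logical_trips:
        if previous_ids is not None and key in previous_ids:
            # Preferred: the same logical trip keeps its ID.
            assigned[key] = previous_ids[key]
        else:
            # New trip, or stability is impossible: issue a brand-new ID.
            assigned[key] = next_unused()
    return assigned
```

Passing `previous_ids=None` models a producer that cannot track trips across versions: every trip then receives an ID that has never been published, which avoids false-positive realtime matches at the cost of losing continuity.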

@aababilov
Contributor

+1 from Google

@abyrd

abyrd commented Jul 2, 2019

This definition of "ID stability" is included in a table entitled "Field Types". This is the only item in the table which elaborates on a concept instead of defining a type. I notice that elsewhere in the document many fields have the type of "ID" which already appears in the Field Types table. A good solution might be to convert the "ID stability" item into a type definition for a "Stable ID", which extends (refines or narrows) the definition of an "ID". Then in field definitions, the type could just be listed as "Stable ID" without the extra text "ID stability is required (see term definitions)."

I also find that the text in the definition of "ID stability" is ambiguous. In the statement "ID values must not be reused to describe a different entity than in the original usage", I don't think it will be clear to the first time reader whether we're talking about reusing identifiers between different tables, between different successive versions of a single GTFS dataset, or between completely separate GTFS datasets from different producers. This is partly because the word "also" in the second sentence "ID values for entities should also be maintained across datasets" implies that the first sentence is not about maintaining values across datasets.

The definition of "dataset" in the same file introduces the concept of new versions of a dataset, with all successive versions of a dataset accessible at a single stable URL. Does "across datasets" then refer to multiple distinct datasets, possibly from different producers, which would be published at distinct URLs? Is this implying that the producers should coordinate to ensure that other producers do not reuse their IDs (e.g. using a national stop registry)?

@gcamp
Contributor

gcamp commented Jul 2, 2019

Can we have opinions of producers here? How difficult would it be for them to support?

Even for consumers, it's an obvious win on the surface, but how should we expect to use this new feature of the spec? We all know it's going to take years for all agencies to support this, which means most consumers will need to keep their old way of doing stable IDs for a very long time. On top of that, there's no way to know which version of the spec a producer follows.

Wouldn't it be better to create a new field stable_id? At least then consumers can expect that if that field is filled, we can count on stability. If not, we should maybe start thinking about a spec version number that's included in the GTFS to know which features are available (and allow backward-incompatible changes in the future).

@skinkie
Contributor

skinkie commented Jul 2, 2019

@gcamp line/route and stop stability (network related) is something that all major vendors support / require. Versioning of them is even something relatively new.

@barbeau
Contributor

barbeau commented Jul 2, 2019

I agree with @gcamp that for ID stability to be really useful we need an explicit indicator that the agency is actually providing stable IDs - this could be by versioning the spec and requiring all agencies supporting spec version > X to have stable IDs, or by providing new stable ID fields for all tables. Spec versioning implies that we expect all producers to eventually support stable IDs (at least for the tables we specify), as there may be other versioned features that an agency would like to adopt that would also require adopting stable IDs. Adding a new stable ID field is less aggressive and allows producers that can support this to support this, but wouldn't necessarily force the issue the way that a versioned spec would.

Change in definition approach:
* "ID stability" -> "Stable ID"
* Change in definition, introducing "version" rather than "dataset"
@antrim
Contributor Author

antrim commented Jul 2, 2019

For the sake of further discussion:

  • In the latest commit, I took @abyrd's suggestion to define a "Stable ID" field which references dataset "versions" (following terminology used elsewhere in the Spec).
  • I currently propose, and am in favor of, writing ID stability for agency_id, route_id, and stop_id into the Spec as a core assumption. While not technically backwards-compatible, this follows the current assumption and practice of many (most?) producers and consumers; this is also defined in the GTFS Best Practices. We can't expect every feed producer to immediately provide ID stability, but this change seems the most effective way to steer feed producers toward ID stability over time.
  • In response to above, feed consumers would need some mechanism to track when ID stability is not maintained. Feed consumers might develop their own infrastructure to do this and/or we could add a field to declare stability (as @barbeau suggests).

@tsherlockcraig

(Speaking for Trillium) I think trip_id should not be expected to be a Stable ID.

We support trip_id stability within our GTFS-producing software, however,

  1. What trip_id stability would mean seems unclear. Trips can change in a variety of ways between feeds/schedules of a service. The line between what is the same trip and what is a different trip couldn't easily be defined, but simultaneously probably shouldn't be left merely to the choice of the feed producer (as it in effect is for a route).
  2. providing stable trip_ids is difficult within feeds that represent schedules both before and after a service change (because the same trip_id can't be attached to two service_id values)
  3. while I see the practicality of the issue around matching real-time data sets, requesting trip_id stability seems like a hack to get around one of the issues within GTFS data production and consumption systems: that it is anticipated that GTFS static feeds are only published/ingested on a timeframe of days or weeks, rather than minutes or hours. If the expectation were that the GTFS feed could be published and ingested more or less instantaneously, and the producer is using linked_datasets.txt to point to an appropriately linked trip updates GTFS-rt feed, it seems that there would be very few opportunities for the real-time feed and static feed to be out of alignment, and the trip_id could serve as nothing but an internal identifier within those two feeds at that moment in time.

+1 for PR as currently drafted. If trip_id were also made a Stable ID, then the vote would be a +0

@tesobota

tesobota commented Jul 3, 2019

> @gcamp line/route and stop stability (network related) is something that all major vendors support / require. Versioning of them is even something relatively new.

I would strongly disagree that major vendors support stable ids. Route_ID specifically is auto-incremented with every new schedule period database... as are Trip_ID values. The only stable value is Stop_ID.

@barbeau
Contributor

barbeau commented Jul 3, 2019

@tesobota I know this is the case for others too. Can you say who your GTFS export vendor is?

@lauramatson

At Metro Transit in Minneapolis/St. Paul, Minnesota, route_id and trip_id are auto-incremented with each new schedule period too. We use HASTUS.

I'm not sure I understand what "stability" means or would require. For instance, a stop_id won't be reused for a meaningfully different stop. But sometimes lat/long are corrected to better match the actual stop location. stop_name may be updated for clarity. As ADA pads and sidewalk networks are improved, wheelchair_boarding may change too. Would it be acceptable to make these updates without changing the stop_id?

@skinkie
Contributor

skinkie commented Jul 5, 2019 via email

@antrim
Contributor Author

antrim commented Jul 5, 2019

Thanks, @lauramatson!

> I'm not sure I understand what "stability" means or would require. For instance, a stop_id won't be reused for a meaningfully different stop. But sometimes lat/long are corrected to better match the actual stop location. stop_name may be updated for clarity.

What a useful prompt. I can see cases where a stop_name and lat/lon are both updated but stop_id should be maintained.

Here are two hypothetical examples where a stable stop_id might be maintained:

  1. stop locations are updated to be more accurate and the system's naming scheme is updated
  2. a stop was moved from the near side to the far side of the intersection.

In both these cases, the stops stay the same in their relation to the network (i.e. the order of the stops within a trip stays the same).

Use cases related to stable IDs:

  • In both the above examples, we'd want for favorite/bookmarked stops in users' apps to be maintained.
  • The second example is a bit more complicated if the ID is being used to associate the stop record with another data element outside of the GTFS (say, an amenity at the stop). In that case a data consumer would likely want to have a mechanism set up to notice the stop has moved significantly and check the linkage.
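
The "notice the stop has moved significantly" check in the second use case could look something like the sketch below. It assumes a consumer holds two versions of stops.txt as dictionaries from stop_id to (stop_lat, stop_lon); the function names and the 30 m default threshold are illustrative assumptions, not anything defined by the spec.

```python
import math
from typing import Dict, List, Tuple

Stop = Tuple[float, float]  # (stop_lat, stop_lon) in decimal degrees


def haversine_m(a: Stop, b: Stop) -> float:
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6_371_000.0  # mean Earth radius in metres
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(h))


def moved_stops(old: Dict[str, Stop], new: Dict[str, Stop],
                threshold_m: float = 30.0) -> List[str]:
    """Return stop_ids present in both versions whose location moved
    by more than threshold_m (an arbitrary, assumed cutoff)."""
    return sorted(
        stop_id for stop_id in old.keys() & new.keys()
        if haversine_m(old[stop_id], new[stop_id]) > threshold_m
    )
```

A consumer could run this on each new feed version and queue the flagged stop_ids for a manual check of any externally linked data, such as amenities.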

Proposed next step:
I think we probably should collect more use cases to be able to justify this proposed change, understand its implications, and provide practice recommendations around when IDs should remain stable. Please post use cases on this thread!

@antrim
Contributor Author

antrim commented Jul 5, 2019

Any other users of HASTUS out there besides @lauramatson at Metro Transit? Are route_id and trip_id auto-incremented in every implementation?

@devadvance

Use case example: digital signage and kiosks

As a consumer of GTFS data, we depend on the stability of values for displaying accurate information on in-person displays. Generally, this takes the form of:

  • Non-interactive screens with alerts and arrivals/scheduled times
  • Interactive kiosks with alerts, arrivals/scheduled times, system maps, line maps, and more

Conceptually, we need to be able to say things like "the relevant transit information for X, Y, and Z routes/stops/trips should show up on A, B, or C screens."

Without stable values for stop_id and, ideally, route_id, it's substantially more difficult to correctly deliver relevant stop times, RT trip updates, alerts, etc. to devices placed in the built environment.

@skinkie
Contributor

skinkie commented Jul 5, 2019

> Any other users of HASTUS out there besides @lauramatson at Metro Transit? Are route_id and trip_id auto-incremented in every implementation?

@antrim while we do not export GTFS from Hastus, but rather the national interface and NeTEx, our route_ids are stable. The trip_ids are not.

md_line_id_trp is the identifier for the line/route. It is user defined, because for different operators there may be a custom mapping.
https://github.com/skinkie/hastus/blob/master/CompositeFrame/ServiceFrame/lines.ix
https://github.com/skinkie/hastus/blob/master/Configuratie.ix#L14

@paulswartz
Contributor

> Any other users of HASTUS out there besides @lauramatson at Metro Transit? Are route_id and trip_id auto-incremented in every implementation?

tl;dr @mbta also uses HASTUS for our bus and subway schedules: the route IDs and trip IDs are both incremented each schedule change, but we post-process the route IDs to keep them stable.

We get an export from HASTUS when the bus/subway schedule changes. (Other modes are managed separately). As a part of our GTFS processing before the file is released to the public, we convert the route IDs to something that's stable between changes. We use them across many different clients and don't want to confuse people by showing them a new route when it's the same number that they're used to: we use the IDs in places like the website where people might have direct URLs bookmarked.

Example HASTUS-exported routes.txt:

route_id,route_short_name,route_long_name,route_desc,route_type
01-1243,01,,,3

Example post-processed routes.txt:

route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color,route_sort_order,route_fare_class,line_id,listed_route
1,1,1,Harvard - Dudley via Massachusetts Avenue,Key Bus,3,,FFC72C,000000,50010,Local Bus,line-1701,

We don't do the same with the trip IDs. Those are also incremented for each schedule change, and we pass them through unmodified.

@skinkie
Contributor

skinkie commented Jul 6, 2019

@paulswartz can you show us the part of the OIG that does this?

@paulswartz
Contributor

@skinkie it's not part of the OIG: the export we get from HASTUS in the first example has the auto-incremented route IDs. We have a set of Python scripts to handle the post-processing and merging with other modes: that's where the route IDs are changed.

@skinkie
Contributor

skinkie commented Jul 7, 2019

@paulswartz how are you exporting information from Hastus without an OIG?

@paulswartz
Contributor

Sorry for the confusion! We have an OIG to do the export, and we get route_ids like 01-1243, where 1243 is autoincremented for each schedule change. The Python scripts which process the HASTUS export convert that into the route_id 1, which matches what riders see on the buses.
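
As a rough illustration of that conversion (not the actual MBTA scripts, which are not shown in this thread), a helper might look like the sketch below. It assumes every exported ID has the form `<route number>-<version>`; the function name `stable_route_id` is hypothetical.

```python
def stable_route_id(hastus_route_id: str) -> str:
    """Drop the schedule-version suffix and leading zeros.

    Assumes IDs look like '<route number>-<version>', e.g. '01-1243'.
    """
    base, sep, _version = hastus_route_id.rpartition("-")
    if not sep:
        # No version suffix; treat the whole value as the base.
        base = hastus_route_id
    # Strip leading zeros but keep a lone '0' intact.
    return base.lstrip("0") or "0"
```

With this helper, `01-1243` and a later `01-1244` would both map to `1`, so the published route_id stays the same across schedule changes.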

@barbeau
Contributor

barbeau commented Jul 18, 2019

To respond to @antrim's request for how ID stability impacts real applications, here are 3 ways it affects the OneBusAway app (I'm speaking specifically on the Android implementation, but iOS is similar):

  1. Bookmarking favorite stops (stop_id)
  2. Setting recurring arriving bus reminders (stop_id, trip_id)
  3. Bookmarking favorite routes (including at stops) (stop_id, route_id, headsign)

For 1), we allow users to bookmark stops in the mobile app and use stop_id as the key to identify the stop. If the agency changes the stop_id, it breaks that bookmark, and as a result stop_ids that refer to the "same stop" from a rider's perspective should retain the same stop_id:

image

For 2), we allow users to set recurring reminders for when their bus will arrive on a particular route at a particular stop, based on real-time information. We currently use a trip_id, but this breaks when the schedule is updated and trip_ids change. In a perfect world trip_id would be stable as well, but I know there are more challenges here. It's on my TODO list to improve this to survive GTFS updates, which would include using route_id and trip_headsign (see below). Changes to stop_id also break this feature, so again, stop_id should be stable.

For 3), bookmarking favorite routes (including at stops), we allow users to pin arrivals for particular routes to the top of the list, either for all stops or at certain stops. Because of changing trip IDs (we learned from 2) above as this feature was implemented later), for this feature we use a combination of route_id and trip_headsign, and stop_id if the user chooses a particular stop. If the route_id and trip_headsign (and stop_id if used) change, this obviously breaks the favorite. So route_id and stop_id should also be stable. Ideally trip_headsign would also be stable when representing the same overall information, but again this is more complicated.
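
To make the composite key in 3) concrete, here is a small sketch. It is not OneBusAway's actual code; the class name `RouteFavorite` and its `matches` method are hypothetical and only illustrate why a change to any of the three values breaks the favorite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteFavorite:
    """Key for a pinned route; stop_id is None when it applies to all stops."""
    route_id: str
    trip_headsign: str
    stop_id: Optional[str] = None

    def matches(self, route_id: str, trip_headsign: str, stop_id: str) -> bool:
        """True if an arrival at stop_id should be pinned by this favorite."""
        return (
            self.route_id == route_id
            and self.trip_headsign == trip_headsign
            and (self.stop_id is None or self.stop_id == stop_id)
        )
```

If a new feed version publishes the same route under a different route_id, `matches` returns False for every arrival and the user's favorite silently disappears.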

@dbabramov
Contributor

Echoing some of @barbeau's comment:

For us at Google, the situation where the same id values end up being reused for different entities presents a problem in detecting changes from one version of the feed to another.

Stop names and locations, route configurations/shapes/stop sequences routinely change between subsequent updates of the feed.
Wherever other properties change, stable ids allow us to distinguish between a new stop (route) vs a change to an existing one.

Some scenarios where this is important to our end users include starring stations or configuring commute. In these scenarios a stable link between the new and the old entities is important to retaining user preferences.
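
The kind of version-to-version comparison described here can be sketched as below. This is only an illustration under the assumption that each table is loaded as a dictionary keyed by its ID; it does not represent Google's pipeline, and the names `FeedDiff` and `diff_entities` are hypothetical.

```python
from typing import Dict, NamedTuple, Set


class FeedDiff(NamedTuple):
    added: Set[str]
    removed: Set[str]
    changed: Set[str]


def diff_entities(old: Dict[str, dict], new: Dict[str, dict]) -> FeedDiff:
    """Compare two versions of a table keyed by ID (e.g. stop_id).

    Only meaningful if IDs are stable: a reused ID would be reported as
    'changed' even though it now describes a different entity.
    """
    old_ids, new_ids = set(old), set(new)
    changed = {i for i in old_ids & new_ids if old[i] != new[i]}
    return FeedDiff(new_ids - old_ids, old_ids - new_ids, changed)
```

User preferences attached to IDs in `changed` can be carried over, while IDs in `removed` have no successor to link to.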

@botanize
Contributor

botanize commented Aug 9, 2019

@skinkie our OIG script (@metrotransit Minnesota) adds the route version number to the route id. It does this because occasionally we change some aspect of the route, typically the route_long_name. We publish a GTFS weekly that includes the next 6+ weeks of service. So at a pick boundary our GTFS has two full schedules in it.

Here's an example from our current schedule of a route that changes route_long_name across picks.

route_id agency_id route_short_name route_long_name route_desc route_type route_url route_color route_text_color
71-111 0 71 Little Canada - Edgerton - Concord - Inver Hills NA 3 http://www.metrotransit.org/route/71 000000
71-112 0 71 Little Canada - Westminste - Concord - Inver Hills NA 3 http://www.metrotransit.org/route/71 000000

While it's a fairly minor issue, if a customer plans a trip in the future on the next schedule and route_long_name has changed, we'd like them to see the new route_long_name. Likewise, we'd like them to see the current route_long_name for current service. The only way to do that is to version your route_ids.

Most of the time I think this is a fairly minor issue and we could use the route metadata associated with just the first value for any route_id. Using the same route as above our routes.txt entry for Route 71 would become this:

route_id agency_id route_short_name route_long_name route_desc route_type route_url route_color route_text_color
71 0 71 Little Canada - Edgerton - Concord - Inver Hills NA 3 http://www.metrotransit.org/route/71 000000

The problem is I'm not sure if we can have HASTUS select only the first instance of each field, or if we have to do some kind of post processing the way @paulswartz describes at @mbta.

This is all the long way of saying that requiring route_id stability prevents an agency from changing any field in routes.txt for a GTFS that includes multiple schedules, which I believe is common practice.
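
For reference, the "first value wins" post-processing mentioned above could be sketched like this. It is a hypothetical helper, not a HASTUS feature or any agency's actual script, and it assumes comma-separated routes.txt input with route IDs of the form `<route>-<version>`.

```python
import csv
import io
from typing import Dict, List


def collapse_route_versions(routes_csv: str) -> List[Dict[str, str]]:
    """Merge versioned rows such as '71-111' and '71-112' into one row
    with route_id '71', keeping the metadata of the first row seen."""
    first_seen: Dict[str, Dict[str, str]] = {}
    for row in csv.DictReader(io.StringIO(routes_csv)):
        # Assumed ID layout: '<route>-<version>'; IDs without '-' pass through.
        base_id = row["route_id"].rpartition("-")[0] or row["route_id"]
        if base_id not in first_seen:
            first_seen[base_id] = {**row, "route_id": base_id}
    return list(first_seen.values())
```

The trade-off is the one described above: any field that differs in the later version, such as route_long_name, is dropped from the published feed.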

@DenisMGiro

> @skinkie our OIG script (@metrotransit Minnesota) adds the route version number to the route id. It does this because occasionally we change some aspect of the route, typically the route_long_name. We publish a GTFS weekly that includes the next 6+ weeks of service. So at a pick boundary our GTFS has two full schedules in it.
>
> Here's an example from our current schedule of a route that changes route_long_name across picks.
>
> route_id agency_id route_short_name route_long_name route_desc route_type route_url route_color route_text_color
> 71-111 0 71 Little Canada - Edgerton - Concord - Inver Hills NA 3 http://www.metrotransit.org/route/71 000000
> 71-112 0 71 Little Canada - Westminste - Concord - Inver Hills NA 3 http://www.metrotransit.org/route/71 000000
>
> While it's a fairly minor issue, if a customer plans a trip in the future on the next schedule and route_long_name has changed, we'd like them to see the new route_long_name. Likewise, we'd like them to see the current route_long_name for current service. The only way to do that is to version your route_ids.
>
> Most of the time I think this is a fairly minor issue and we could use the route metadata associated with just the first value for any route_id. Using the same route as above our routes.txt entry for Route 71 would become this:
>
> route_id agency_id route_short_name route_long_name route_desc route_type route_url route_color route_text_color
> 71 0 71 Little Canada - Edgerton - Concord - Inver Hills NA 3 http://www.metrotransit.org/route/71 000000
>
> The problem is I'm not sure if we can have HASTUS select only the first instance of each field, or if we have to do some kind of post processing the way @paulswartz describes at @mbta.
>
> This is all the long way of saying that requiring route_id stability prevents an agency from changing any field in routes.txt for a GTFS that includes multiple schedules, which I believe is common practice.

At GIRO, we are in agreement with @botanize's statement. If the GTFS export covers a period in which a route changes, we have no other way than adding the route version to the route identifier in order to differentiate both versions of the route. To keep the route identifier stable, the export should cover a period in which no route changes. If the goal is just to identify the route for the public, the route short name should be stable.
As for stops, that should not be an issue since stops are global to the database.

@stale

stale bot commented Aug 21, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the "Status: Stale" label (issues and pull requests that have remained inactive for 30 calendar days or more) on Aug 21, 2021
@stale

stale bot commented Aug 28, 2021

This pull request has been closed due to inactivity. Pull requests can always be reopened after they have been closed. See the Specification Amendment Process.
