Define and require "ID stability" for agency_id, route_id, and stop_id fields#171
Define and require "ID stability" for agency_id, route_id, and stop_id fields#171antrim wants to merge 2 commits intogoogle:masterfrom
Conversation
ID stability required for agency_id, route_id, stop_id
|
+1 |
|
While I totally support this as consumer, as producer this just can't be achieved beyond best effort for trip_ids. Now it seems that @antrim understands this and did not added it there, but this also leaves a gap. I would suggest for trip_id: ID stability strongly recommended (see term definitions) to facilitate realtime matching across datasets. |
|
+1 for this proposal and +1 with @skinkie comment. trip_id collisions is hard to handle, and it's worse when realtime is available. |
|
Maybe it should be suggested that if trip_id stability cannot be guaranteed, the feed producer should aim to use totally new IDs instead of reusing ones from previous feed iterations. For example: Feed version 1 contains trip IDs Of course if the example is too long for the main spec it could be explained in best practices. |
|
+1 from Google |
|
This definition of "ID stability" is included in a table entitled "Field Types". This is the only item in the table which elaborates on a concept instead of defining a type. I notice that elsewhere in the document many fields have the type of "ID" which already appears in the Field Types table. A good solution might be to convert the "ID stability" item into a type definition for a "Stable ID", which extends (refines or narrows) the definition of an "ID". Then in field definitions, the type could just be listed as "Stable ID" without the extra text "ID stability is required (see term definitions)." I also find that the text in the definition of "ID stability" is ambiguous. In the statement "ID values must not be reused to describe a different entity than in the original usage", I don't think it will be clear to the first time reader whether we're talking about reusing identifiers between different tables, between different successive versions of a single GTFS dataset, or between completely separate GTFS datasets from different producers. This is partly because the word "also" in the second sentence "ID values for entities should also be maintained across datasets" implies that the first sentence is not about maintaining values across datasets. The definition of "dataset" in the same file introduces the concept of new versions of a dataset, with all successive versions of a dataset accessible at a single stable URL. Does "across datasets" then refer to multiple distinct datasets, possibly from different producers, which would be published at distinct URLs? Is this implying that the producers should coordinate to ensure that other producers do not reuse their IDs (e.g. using a national stop registry)? |
|
Can we have opinions of producers here? How difficult it would be for them to support. Even for consumer, on surface it's an obvious win, but how should we expect to use this new feature on the spec? We all know it's going to take years for all agencies to support this, which means most consumers will need to keep their old way of doing stable IDs for a very long time. On top of that, there's no way to know if a producers follow which version of the spec. Wouldn't it be better to create a new field stable_id? At least then consumers expect that if that field is filled, we can count on stability. If not, we should maybe start thinking about a spec version number that's included in the GTFS to know which feature are available (and allow backward incompatible change in the future). |
|
@gcamp line/route and stop stability (network related) is something that all major vendors support / require. Versioning of them is even something relatively new. |
|
I agree with @gcamp that for ID stability to be really useful we need an explicit indicator that the agency is actually providing stable IDs - this could be by versioning the spec and requiring all agencies supporting spec version > X to have stable IDs, or by providing new stable ID fields for all tables. Spec versioning implies that we expect all producers to eventually support stable IDs (at least for the tables we specify), as there may be other versioned features that an agency would like to adopt that would also require adopting stable IDs. Adding a new stable ID field is less aggressive and allows producers that can support this to support this, but wouldn't necessarily force the issue the way that a versioned spec would. |
Change in definition approach: * "ID stability" -> "Stable ID" * Change in definition, introducing "version" rather than "dataset"
|
For the sake of further discussion:
|
|
(Speaking for Trillium) I think trip_id should not be expected to be a Stable ID. We support trip_id stability within our GTFS-producing software, however,
+1 for PR as currently drafted. If trip_id were also made a Stable ID, then the vote would be a +0 |
I would strongly disagree that major vendors support stable ids. Route_ID specifically is auto-incremented with every new schedule period database... as are Trip_ID values. The only stable value is Stop_ID. |
|
@tesobota I know this is the case for others too. Can you say who your GTFS export vendor is? |
|
At Metro Transit in Minneapolis/St. Paul, Minnesota, route_id and trip_id are auto-incremented with each new schedule period too. We use HASTUS. I'm not sure I understand what "stability" means or would require. For instance, a stop_id won't be reused for a meaningfully different stop. But sometimes lat/long are corrected to better match the actual stop location. stop_name may be updated for clarity. As ADA pads and sidewalk networks are improved, wheelchair_boarding may change too. Would it be acceptable to make these updates without changing the stop_id? |
|
On Fri, 5 Jul 2019, lauramatson wrote:
At Metro Transit in Minneapolis/St. Paul, Minnesota, route_id and trip_id are auto-incremented with each new schedule period too. We use HASTUS.
Then you should really use a better export OIG script. route_id should
certaintly not increment per schedule period.
I'm not sure I understand what "stability" means or would require. For instance, a stop_id won't be reused for a meaningfully different stop. But sometimes lat/long are corrected to
better match the actual stop location. stop_name may be updated for clarity. As ADA pads and sidewalk networks are improved, wheelchair_boarding may change too. Would it be acceptable to
make these updates without changing the stop_id?
In Hastus (< 2018) you are not able to do network versioning. Hence
stop_id's will remain the same. So yes, these updates will use the same
stop_id (place/stop).
The problem is the following. What if a stop is moved during a specific
period for example 2019-08-01 - 2019-08-14 and after (and until) that
moment the stop is slightly moved around the corner.
To facilitate this in Hastus you would likely to have a new stop_id to
work within the same schedule, with alternative links (length between
stops change). Alternatively you would have a different schedule for only
this period, and manually update location upon export.
A move like this would create a new stop_id. This sounds 'okayish' but if
a rider would have a subscription on a specific stop_id, they would not
receive updates for the next stop_id. Hence it should be announced that
the stop_id (regular) is temporary replaced by stop_id (temporary).
This kind of glue does not exist in GTFS or GTFS-RT at this moment.
|
|
Thanks, @lauramatson!
What a useful prompt. I can see cases where a stop_name and lat/lon are both updated but stop_id should be maintained. Here are two hypothetical examples where a stable stop_id might be maintained:
In both these cases, the stops stay the same in their relation to the network (i.e. the order of the stops within a trip stays the same). Use cases related to stable IDs:
Proposed next step: |
|
Any other users of HASTUS out there besides @lauramatson at Metro Transit? Are route_id and trip_id auto-incremented in every implementation? |
|
Use case example: digital signage and kiosks As a consumer of GTFS data, we depend on the stability of values for displaying accurate information on in-person displays. Generally, this takes the form of:
Conceptually, we need to be able to say things like "the relevant transit information for X, Y, and Z routes/stops/trips should show up on A, B, or C screens." Without stable values for stop_id and, ideally, route_id, it's substantially more difficult to correctly deliver relevant stop times, RT trip updates, alerts, etc. to devices placed in the built environment. |
@antrim while we do not export GTFS from Hastus, but rather the national interface and NeTEx, our route_ids are stable. The trip_ids are not. md_line_id_trp is the identifier for the line/route, it is use defined, because for different operators, there may be a custom mapping. |
tl;dr @mbta also uses HASTUS for our bus and subway schedules: the route IDs and trip IDs are both incremented each schedule change, but we post-process the route IDs to keep them stable. We get an export from HASTUS when the bus/subway schedule changes. (Other modes are managed separately). As a part of our GTFS processing before the file is released to the public, we convert the route IDs to something that's stable between changes. We use them across many different clients and don't want to confuse people by showing them a new route when it's the same number that they're used to: we use the IDs in places like the website where people might have direct URLs bookmarked. Example HASTUS-exported routes.txt: Example post-processed routes.txt: We don't do the same with the trip IDs. Those are also for incremented for each schedule change and we pass those through unmodified. |
|
@paulswartz can you show us the part of the OIG that does this? |
|
@skinkie it's not part of the OIG: the export we get from HASTUS in the first example has the auto-incremented route IDs. We have a set of Python scripts to handle the post-processing and merging with other modes: that's where the route IDs are changed. |
|
@paulswartz how are you exporting information from Hastus without an OIG? |
|
Sorry for the confusion! We have an OIG to do the export, and we get route_ids like |
|
To respond to @antrim's request for how ID stability impacts real applications, here are 3 ways it affects the OneBusAway app (I'm speaking specifically on the Android implementation, but iOS is similar):
For 1), we allow users to bookmark stops in the mobile app and use stop_id as the key to identify the stop. If the agency changes the stop_id, it breaks that bookmark, and as a result stop_ids that refer to the "same stop" from a riders perspective should retain the same stop_id: For 2), we allow users to set recurring reminders for when their bus will arrive on a particular route at a particular stop, based on real-time information. We currently use a trip_id, but this breaks when the schedule is updated and trip_ids change. In a perfect world trip_id would be stable as well, but I know there are more challenges here. It's on my TODO list to improve this to survive GTFS updates, which would include using route_id and trip_headsign (see below). Changes to stop_id also breaks this feature, so again, stop_id should be stable. For 3), bookmarking favorite routes (including at stops), we allow users to pin arrivals for particular routes to the top of the list, either for all stops or at certain stops. Because of changing trip IDs (we learned from 2) above as this feature was implemented later), for this feature we use a combination of route_id and trip_headsign, and stop_id if the user chooses a particular stop. If the route_id and trip_headsign (and stop_id if used) change, this obviously breaks the favorite. So route_id and stop_id should also be stable. Ideally trip_headsign would also be stable when representing the same overall information, but again this is more complicated. |
|
Echoing some of @barbeau's comment: For us at Google, the situation where the same id values end up being reused for different entities presents a problem in detecting changes from one version of the feed to another. Stop names and locations, route configurations/shapes/stop sequences routinely change between subsequent updates of the feed. Some scenarios where this is important to our end users include starring stations or configuring commute. In these scenarios a stable link between the new and the old entities is important to retaining user preferences. |
|
@skinkie our OIG script (@metrotransit Minnesota) adds the route version number to the route id. It does this because occasionally we change some aspect of the route, typically the route_long_name. We publish a GTFS weekly that includes the next 6+ weeks of service. So at a pick boundary our GTFS has two full schedules in it. Here's an example from our current schedule of a route that changes
While it's a fairly minor issue, if a customer plans a trip in the future on the next schedule and route_long_name has changed, we'd like them to see the new route_long_name. Likewise, we'd like them to see the current route_long_name for current service. The only way to do that is to version your route_ids. Most of the time I think this is a fairly minor issue and we could use the route metadata associated with just the first value for any route_id. Using the same route as above our
The problem is I'm not sure if we can have HASTUS select only the first instance of each field, or if we have to do some kind of post processing the way @paulswartz describes at @mbta. This is all the long way of saying that requiring route_id stability prevents an agency from changing any field in |
At GIRO, we are in agreement with botanize statement. If the GTFS export is for a period where a route change, we have no other way than adding the route version to the route identifier in order to differenciate both versions of the route. To keep the route identifier stable, the export should be for a period where all routes do not change. If the goal is just to identify the route for the public, the route short name should be stable. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
This pull request has been closed due to inactivity. Pull requests can always be reopened after they have been closed. See the Specification Amendment Process. |

"Where ID stability is required, ID values must not be reused to describe a different entity than in the original usage. ID values for entities should also be maintained across datasets."
This change would allow applications to depend on more stable IDs.
For the sake of brevity, an "ID stability" term definition was added, and referenced in the definitions of ID fields.