RFC 110: EXIF metadata removal of Wagtail images#110
RFC 110: EXIF metadata removal of Wagtail images#110Stormheg wants to merge 6 commits intowagtail:mainfrom
Conversation
9f048e3 to
bacb604
Compare
There was a problem hiding this comment.
Briefly discussed during core team meeting today.
Questions raised for me to answer:
- When using Willow image optimizers, will the metadata stay intact or is it removed entirely?
- May need to work around that or document it as a limitation
Investigate more into why other CMS have not made GPS metadata stripping the default- Done, seems to boil down to backwards compatibility and shifting responsibility to users to enable.
bacb604 to
d4c5150
Compare
thibaudcolas
left a comment
There was a problem hiding this comment.
Nice! I’d love to see this in. I think it’d be important for the RFC to do more on configuration as this is probably the trickiest thing to get right here.
In particular I’m a bit worried the current approach would lead to lots of additions of this or that extra tag to strip over time. It might be harder initially but much simpler in the long run if we agreed to "strip everything except essential tags", and then individual sites could opt into keeping the 3-10 extra tags they need. Rather than us all constantly chasing privacy/performance improvements by adding this and that extra tag to the list over time.
text/110-exif-stripping.md
Outdated
|
|
||
| ## Impact on users of Wagtail | ||
|
|
||
| The author is not aware of any Wagtail users who actively rely on EXIF metadata (please make yourself and your use-cases known!), but some may. photography-focused sites are most likely to be affected; for them, defaults can be disabled. |
There was a problem hiding this comment.
I have worked with multiple museum / art sites where publishing licensing metadata within the files was a must. For example Tate in the UK - see this collection page, excerpt from exiftool on the main image:
Copyright Notice: (c) Spire Healthcare, (c) Noise Abatement Society, (c) Samaritans / Photo (c) Tate
Object Name: T02140
Headline: Scylla 1938 by Ithell Colquhoun 1906-1988
Caption-Abstract: Scylla 1938 Ithell Colquhoun 1906-1988 Purchased 1977 http://www.tate.org.uk/art/work/T02140There was a problem hiding this comment.
Thank you for this Thibaud! I have added Tate to the RFC as a known users and expanded the wording of the 'likely affected parties' to include art galleries.
There was a problem hiding this comment.
Note: the Tate website seems down at the moment, 'Error 503 first byte timeout - varnish cache server' so I cannot check the image in question myself.
Based on the tag names from the output you provided, this information is instead stored in IPTC tags. See this documentation page: https://exiftool.org/TagNames/IPTC.html
It might be wise for this RFC to call out this difference in metadata formats in the RFC. This RFC explicitly targets EXIF.
|
|
||
| ## Out of scope / future possibilities | ||
|
|
||
| This RFC opens the door to further possibilities, like showing EXIF data in the Wagtail admin interface, or allowing users to edit certain EXIF tags directly from Wagtail. However, these features are considered out of scope for this RFC and can be explored in future RFCs or feature proposals. |
text/110-exif-stripping.md
Outdated
| - DateTimeOriginal (when the photo was taken) | ||
| - (any other tags determined to be non-sensitive and useful - please suggest!) | ||
|
|
||
| Because this feature is much more rigorous in stripping EXIF data, it will not be the default configuration and must be explicitly enabled by site administrators. |
There was a problem hiding this comment.
I’m not sure I understand the rationale here. Some of this metadata strikes me as essential (without Orientation some images are displayed upside-down on Windows?). Some not at all (DateTimeOriginal).
There was a problem hiding this comment.
Mostly rewritten this when I rearranged the configuration sections.
This section explains what metadata to keep (you seem to be under the impression these tags are removed?) when used with the optional allowlist configuration option.
I've also added language explaining why it is not a good idea to make this the default (future tags that are important for proper display not in this allowlist, breaking forward compatibility)
Thanks to Thibaud Colas for the feedback.
46df729 to
42ba825
Compare
Thanks to Thibaud Colas for reporting.
Note how WordPress did not change defaults and defers to plugins and configuration options.
Thanks to Thibaud Colas for the feedback.
42ba825 to
ce40859
Compare
Ultimate file size savings is not what I'm going after with this RFC - the estimated savings of a 1 or 2 KB at most for the majority of images are just too negligible to justify this, since I see unnecessary risk involved with going the 'strip all unless told not to' route. There may be (future) tags not in Wagtail's allowlist that are important for some reason (proper display, or have legal implications for site developers if not kept) that Wagtail should keep. It would be much simpler for site developers if they wouldn't have to fix Wagtail's overzealous defaults. And it would be simpler for us Wagtail maintainers if we did not have to update Wagtails allowlist of tags because image display is broken otherwise. I'm not worried about the Wagtail team having to deal with 'this or that extra tag to strip' over time. We have provided a mechanism for people to customize this themselves, which provides an 'escape hatch'. We can accept contributions to improve Wagtail defaults as we see fit. Hope this convinces you? 😄 |
|
It seems to me that the question of whether to keep or strip by default comes down to two things:
It's hard to imagine what use-cases there might be for EXIF tags that have a meaningful effect on the image (things like orientation and colour profile) that haven't been established already. If any do come up, then it's reasonable to expect that there will be a lot of existing software that doesn't support it, so Wagtail won't be seen particularly "at fault" - and it's not guaranteed that keeping the tag from the source image intact will be enough to support it. Conversely, it's not hard to imagine phone or camera manufacturers coming up with new schemes for metadata representing the person taking the photo, or location, or the type of camera, so I think it's far more likely that any unknown EXIF tags are of the unwanted kind. As for the consequences of getting it wrong - in the worst case, stripping a tag that should have been kept means that the final image will display incorrectly in some way. As mentioned, this broken behaviour is likely to be shared by plenty of other software, and be fairly minor (no phone/camera manufacturer would want to introduce a new feature that breaks a whole load of existing software). Most likely, it will be something that we fix in a patch release (or tell users to update in their configuration) and move on with no drama. However, keeping a tag that identifies a user / location that should have been private could, in extreme cases, be a matter of life or death. (Hopefully in those cases Wagtail isn't their only guard against that, but as per the "swiss cheese" model of security it's better to have that layer in place than not.) For those reasons, I would lean towards stripping unknown tags by default. |
|
Yep, same reasoning to lean towards stripping more by default. In the privacy scenario we should err on the side of stripping more than needed. For the compatibility scenario, I assume it will be simple to review those requests as they come in. And for the legal scenario I expect the projects with those requirements will be able to either decide to change the config for their project or make a feature request. I suppose we also could review what similar software does for this – even if the CMS options you reviewed don’t have this turned on by default @Stormheg, presumably some of the implementations have had to make similar choices and possibly receive the kind of feedback we’re discussing here over time? |
Default behaviourI came here to recommend that the safer default choice of processor would be The argument in favour of
|
| Event | Likelihood | Saturation needed to be significant | Harm | Allowlist must change | Blocklist must change |
|---|---|---|---|---|---|
| New EXIF tag needed for correct rendering | ? | High | Low | Yes | No |
| New EXIF tag is PII violation | ? | Low | High | No | Yes |
A parallel in Django is the recommendation to use fields rather than exclude when creating a ModelForm: https://docs.djangoproject.com/en/5.2/topics/forms/modelforms/#selecting-the-fields-to-use. It's a failsafe, security-by-default, approach, even if it adds a bit of friction for developers.
|
|
||
| This will be documented in release notes as a breaking change affecting users that rely on EXIF metadata. The default behaviour (stripping GPS and camera info) will be noted on guide.wagtail.org to make editors aware of the existence of EXIF metadata and how Wagtail handles it. | ||
|
|
||
| Existing images are unaffected until renditions are regenerated (via `manage.py wagtail wagtail_update_image_renditions`). Earlier behaviour can be restored by disabling EXIF processing in settings. |
There was a problem hiding this comment.
You discuss this being a breaking change. This seems reasonable, given enough warning and deprecation. I recommend that the release notes, and documentation give clear instructions on how to retain the current behaviour. I presume this would be to add the setting:
WAGTAILIMAGES_EXIF_PROCESSORS = []
Rendered text