PoC: Metadata implementation#574
Conversation
|
This now can emit email and json metadata (still yet to be extensively tested) More decisions made:
Footnotes
|
|
I'm currently downloading a corpus of data from PyPI to test this PR with. There's a lot to download so it won't be done anytime soon, but testing with what I have so far results in 282469 I'm digging into why exactly those had left-over keys, so far the most common reason is just bad data that can't be correctly parsed due to the new line problem I mentioned. An example of a problem That bar character was added by me, the other lines in that description are just I think this shows the strength of being very careful about how we deserialize data and the However, this PR returns a and the "leftover" data strcture looks like: which shows that there was an error parsing the Footnotes
|
|
Here's another one, this one is subtle: That produces a with a left overs of: It looks at some point maturin was emitting Footnotes
|
|
More weird bad data that normally passes silently: Body removed for brevity: with leftovers Looks like that file was emitted with a stray new line after the This is what |
|
So far all of the metadata files with leftover data I've investigated are due to the the |
|
My latest push starts rewriting the The new design has the following properties:
The way this PR's If you create a Metadata class using the normal constructor Thus, when you create a In addition to that, it supports a number of alternate constructors: The We add a Thus, we ensure that on access or writing to a property, that property's data is always valid from the POV of the user, but since we're doing it lazily, it allows partial validation if needed. The last part isn't written, but the plan is to also add a set of Footnotes
|
|
Overall, this design gives a lot of power, while ensuring a lot of safety and makes several pieces easier to implement:
This let's us serve a lot of different use cases:
On the design side, we've got very clear separation of concerns:
|
brettcannon
left a comment
There was a problem hiding this comment.
I'm up for going with this general approach. Should we try to get the raw parts in first to keep the PRs small?
|
|
||
|
|
||
| @enum.unique | ||
| class DynamicField(enum.Enum): |
There was a problem hiding this comment.
I'm wondering if it is worth sticking with an enum or just with lowercase string literals for the metadata field names? Same goes for known/supported metadata versions.
| # | ||
| # However, we want to support validation to happen both up front | ||
| # and on the fly as you access attributes, and when using the | ||
| # on the fly validation, we don't want to validate anything else |
There was a problem hiding this comment.
| # on the fly validation, we don't want to validate anything else | |
| # on-the-fly validation, we don't want to validate anything else |
| # purpose of RawMetadata. | ||
| _raw: RawMetadata | ||
|
|
||
| # Likewise, we need a place to store our honest to goodness actually |
There was a problem hiding this comment.
| # Likewise, we need a place to store our honest to goodness actually | |
| # Likewise, we need a place to store our honest-to-goodness, actually |
| # validated metadata too, we could just store this in a dict, but | ||
| # this will give us better typing. |
There was a problem hiding this comment.
| # validated metadata too, we could just store this in a dict, but | |
| # this will give us better typing. | |
| # validated metadata, too. We could just store this in a dict, but | |
| # this will give us better typing. |
| v2_3 = "2.3" | ||
|
|
||
|
|
||
| class _ValidatedMetadata(TypedDict, total=False): |
There was a problem hiding this comment.
So would this class have a key for each piece of metadata that we are willing to perform conversions/validation on from raw metadata?
| def full_validate(self, value: V | None) -> None: | ||
| if value is not None: | ||
| self.validate(value) | ||
|
|
||
| @abc.abstractmethod | ||
| def validate(self, value: V) -> None: |
There was a problem hiding this comment.
Why the two functions? Is it just to avoid having to deal with the None case for typing purposes?
| dynamic: List[str] | ||
|
|
||
| # Metadata 2.3 - PEP 685 | ||
| # No new fields were added in PEP 685, just some edge case were |
There was a problem hiding this comment.
| # No new fields were added in PEP 685, just some edge case were | |
| # No new fields were added in PEP 685, just some edge cases were |
|
|
||
|
|
||
| _EMAIL_FIELD_ORDER = [ | ||
| # Always put the metadata version first, incase it ever changes how |
There was a problem hiding this comment.
| # Always put the metadata version first, incase it ever changes how | |
| # Always put the metadata version first, in case it ever changes how |
| # class, some light touch ups can make a massive different in usability. | ||
|
|
||
|
|
||
| _EMAIL_FIELD_MAPPING = { |
There was a problem hiding this comment.
This scares me that there's a typo somewhere, but we would probably find out pretty quickly, so my brain wanting to do this as a dict comprehension just needs to calm down. 😅
| # This might appear to be a mapping of the same key to itself, and in many cases | ||
| # it is. However, the algorithm in PEP 566 doesn't match 100% the keys chosen | ||
| # for RawMetadata, so we use this mapping just like with email to handle that. | ||
| _JSON_FIELD_MAPPING = { |
There was a problem hiding this comment.
Don't need a dict comprehension.
|
@dstufft what would you like to do to move this forward? Implement the raw stuff with tests first? Something else? |
|
To help @dstufft move this forward, I have started my own branch that takes Donald's email header parsing code and begins to add tests and docs (this is currently a WIP w/ appropriate attribution to Donald via |
|
I think this PoC has outlived it's usefulness now with the work @brettcannon has been doing, so going to close it. |
This is still a work in progress and hasn't even been tested at all yet to even see if it runs or anything.
However, I think it should correctly implement parsing the
METADATAandmetadata.jsonfiles correctly, with I think the most lenient parser that I could come up with that doesn't silently let bad data go unnoticed.Some general notes about decisions made here (so far anyways):
RawMetadatalargely matches the format ofmetadata.json, but has some deviations to make using it better.metadata.jsonhave really awkward lack of pluralization (e.g.metdata.jsonhasclassifer: list[str], butRawMetadatahasclassifiers: list[str].RawMetadataare optional regardless of what the core metadata spec says.Project-URLmetadata is represented as adict[str, str]notlist[str].parse_FORMATfunctions return a tuple ofRawMetdatathat represents everything that could be parsed, and adict[Any, Any]that represents what could not be.parse_FORMATfunctions take either abytesor astr, allowing callers to give us a bytes and we will do the right thing, or if they know that their document is broken in some way with a wrong encoding, they can decode it themselves before passing it in.RawMetadataeven enforces) pass through silently.I tried to comment through everything, but there are a lot of subtle situations that I believe this will handle about as good as possible:
Project-URL, which conceptually is a map but due to RFC 822 not supporting maps, is serialized as a list of strings.METADATArepeat uses of a key that does not allow multiple uses makes that key unparseable and pushes it into the second structure.METADATAfile from havingNameemitted twice with different contents, which if we just blindly pick one of them as "the" value, different systems may randomly pick the other one, leaving two systems to parse the same file with different results.METADATA, such that a file that is mostly utf8, but one field has been mojibaked can still parse the bulk of the file correctly 1.Anyways, tomorrow I'm going to actually test this, and try to get serialization back into
METADATAandmetadata.jsondone, which should be a lot easier and less finnicky.Footnotes
The stdlib email parser plus RFC 822 together makes this horrible to do, because all of the parsing methods that accept bytes just do a hardcoded
decode("ascii", "surrogateescape"), which means that anyone parsingMETADATAwith one of the byte interfaces from the email library is incorrectly parsing validMETADATAfiles that contain any utf8 characters that aren't also ascii characters. ↩