Additions to also make it show info and save info.

This is a follow-up to: https://github.com/FamilySearch/GEDCOM/discussions/630#discussioncomment-12872971

**1: What info I mean to collect:**

I have seen many times in the GEDCOM 7 discussions that members of the steering committee and others wondered which GEDCOMs are out there and what is inside them.
With these additions to the MagiKey program it might be possible to find that out, but the program has to be extended for it.
I imagine MagiKey reads GEDCOMs and creates one or more tables with info extracted from each GEDCOM.
Each time a GEDCOM is read, an entry is added to the tables, so they grow with each read.
The tables are internal and saved somewhere permanent.
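
To show what I mean by a table that grows with each read, here is a rough sketch in Python with SQLite. I don't know what MagiKey itself is written in or how it stores things, so the table name `gedcom_reads`, the columns and the function names are all made up for illustration.

```python
import sqlite3

def open_store(db_path):
    """Open (or create) the permanent store and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS gedcom_reads (
               id         INTEGER PRIMARY KEY,
               file_name  TEXT    NOT NULL,
               size_bytes INTEGER NOT NULL,
               read_at    TEXT    NOT NULL
           )"""
    )
    conn.commit()
    return conn

def record_read(conn, file_name, size_bytes, read_at):
    """Add one row for one GEDCOM read; returns the new row id."""
    cursor = conn.execute(
        "INSERT INTO gedcom_reads (file_name, size_bytes, read_at) VALUES (?, ?, ?)",
        (file_name, size_bytes, read_at),
    )
    conn.commit()
    return cursor.lastrowid
```

Other tables (extensions, header fields, etc.) could then point at the `id` of the read they belong to.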

I don't know if it is possible, but it would be handy if we could detect when the same file is read more than once, so that duplicates do not skew the results.
Maybe the filename, the date and time (to the second) of the file, etc. can be saved to detect that. And to further prevent double reads, maybe create a hash for each file and save that too.
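
A minimal sketch of the hash idea, in Python. SHA-256 is just my pick here, any stable hash would do, and the names `file_fingerprint` and `already_seen` are made up. A hash also catches the same file under a different name, which filename plus date would miss.

```python
import hashlib

def file_fingerprint(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large GEDCOMs do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def already_seen(path, seen_hashes):
    """Return True if this file was read before; otherwise remember it."""
    fingerprint = file_fingerprint(path)
    if fingerprint in seen_hashes:
        return True
    seen_hashes.add(fingerprint)
    return False
```

In the real program `seen_hashes` would of course come from the saved tables instead of a set in memory.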

Now the things I think might be interesting to save and to create overview reports from, for your internal use. But maybe users are interested too, and then the results should go online somewhere.

Length of the GEDCOM (in bytes or something), number of INDIs, number of FAMs, number of NOTEs, etc. There will be GEDCOMs (from a Dutch program) that never have any REPOs!!!

Date and time (to the second) the GEDCOM was read.
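
Counting the records could be as simple as the sketch below (Python, the function name `count_records` is made up). It only looks at level 0 lines and assumes the usual line layout of level, optional `@xref@`, tag.

```python
from collections import Counter

def count_records(lines):
    """Count level-0 records per tag (INDI, FAM, NOTE, REPO, ...)."""
    counts = Counter()
    for raw in lines:
        # Drop surrounding whitespace and a possible byte order mark.
        parts = raw.strip().lstrip("\ufeff").split()
        if len(parts) < 2 or parts[0] != "0":
            continue
        # "0 @I1@ INDI" has an xref before the tag, "0 HEAD" does not.
        if parts[1].startswith("@") and len(parts) > 2:
            tag = parts[2]
        else:
            tag = parts[1]
        counts[tag] += 1
    return counts
```

A `Counter` gives 0 for tags that never occur, so a file without any REPO simply reports `counts["REPO"] == 0`.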


1: Data about the GEDCOM, from the header.
I think all data found there can be saved without privacy problems, except maybe the note.
That way you know where all the info came from, and you could sort the data by the creating program and such.
And you could also see how many programs use PLAC.FORM in the header, for instance (this has also been mentioned in the discussions).
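
A sketch of pulling the header into a flat list of fields, leaving out the note (Python, the name `header_fields` and the `skip_tags` parameter are made up). It assumes the header is the first level 0 record, and if a path occurs more than once it just keeps the last value, which a real version would have to handle better.

```python
def header_fields(lines, skip_tags=("NOTE",)):
    """Return {"HEAD.SOUR": "...", "HEAD.PLAC.FORM": "...", ...} for the header."""
    fields = {}
    path = []
    in_head = False
    for raw in lines:
        parts = raw.strip().lstrip("\ufeff").split(" ", 2)
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        level = int(parts[0])
        if level == 0:
            if in_head:
                break  # the header record has ended
            in_head = parts[1] == "HEAD"
            path = ["HEAD"]
            continue
        if not in_head:
            continue
        tag = parts[1]
        value = parts[2] if len(parts) > 2 else ""
        # Keep the parents up to this level and append the current tag.
        path = path[:level] + [tag]
        if any(part in skip_tags for part in path):
            continue  # e.g. HEAD.NOTE and its CONT lines stay private
        fields[".".join(path)] = value
    return fields
```

Checking for PLAC.FORM is then just `"HEAD.PLAC.FORM" in header_fields(lines)`.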

2: Tag constructions. Are there TAG constructions that do not follow the spec, but contain the correct TAGs in the wrong sequence?

3: Submitters. Does a GEDCOM have a SUBM record, does it have more than one, etc. So not really its contents, as those are private, but just whether it has any and how many.

4: Extensions. Are there any, and which ones? Are they "explained" in the header? (That would give an idea of which programs really use this; I think that was also mentioned in the discussions.)
The link I started this post with might help a bit, as the first link in that post points to an already existing Excel file. Maybe that file could be used as a basis for figuring out what to save.

So it might be a long table, but to see which extensions are "in the wild", they should all be saved in a table (with a pointer to the file, to track which programs use which extensions).
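
A sketch of collecting the extensions (Python, the name `extension_usage` is made up). It treats every tag that starts with an underscore as an extension, and takes "explained in the header" to mean a GEDCOM 7 style `HEAD.SCHMA.TAG` line with the tag and a URI. Older files will not have that, so there everything ends up as undeclared.

```python
from collections import Counter

def extension_usage(lines):
    """Return (used counts, declared tag -> URI, sorted undeclared tags)."""
    used = Counter()
    declared = {}
    path = []
    for raw in lines:
        parts = raw.strip().lstrip("\ufeff").split(" ")
        if len(parts) < 2 or not parts[0].isdigit():
            continue
        level = int(parts[0])
        rest = parts[1:]
        if rest[0].startswith("@") and len(rest) > 1:
            rest = rest[1:]  # skip the xref of a record line
        tag = rest[0]
        path = path[:level] + [tag]
        if path == ["HEAD", "SCHMA", "TAG"] and len(rest) >= 3:
            declared[rest[1]] = rest[2]  # e.g. "2 TAG _FOO <uri>"
        elif tag.startswith("_"):
            used[tag] += 1
    undeclared = sorted(set(used) - set(declared))
    return used, declared, undeclared
```

Saved together with the creating program from the header, `used` is the long table I mean.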

5: Which numbering system is used for the entities? Something like I001, I002, ... S001, S002, etc., or just 000, 002.
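
A sketch of classifying the numbering (Python, the name `xref_styles` and the three style labels are made up, and the real list of styles would probably need to be longer).

```python
import re
from collections import Counter

# A level-0 record line with an identifier, e.g. "0 @I001@ INDI".
XREF_LINE = re.compile(r"^0 @([^@]+)@ (\S+)")

def xref_styles(lines):
    """Count, per record tag, which identifier style is used."""
    styles = Counter()
    for raw in lines:
        match = XREF_LINE.match(raw.strip().lstrip("\ufeff"))
        if not match:
            continue
        xref, tag = match.groups()
        if re.fullmatch(r"\d+", xref):
            style = "digits only"
        elif re.fullmatch(r"[A-Za-z]+\d+", xref):
            style = "letter prefix + digits"
        else:
            style = "other"
        styles[(tag, style)] += 1
    return styles
```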


**2: The conversion doc.**
Maybe the output can be a Doc or a PDF, or the user can choose between the two.

The usual things, like date and time of the run, the GEDCOM name and other info from its header, etc.
Number of entities, split into INDI, REPO, etc. (so users can check it against what their own program told them).

The doc should mention whether the program changed anything in the GEDCOM (correcting structures and such), together with the entity number so users can check.
It should also mention any illegal TAGs and such, again with the entity number.

**_So as a list:_**
Repaired lines, with entity number. Maybe also what was changed into what.
Invalid TAGs (not according to the spec), with entity number and the correction made.
Invalid TAG locations (not according to the spec), with entity number and the correction made.
Invalid names (not according to the spec), with entity number and the correction made.
Invalid DATEs (not according to the spec), with entity number and the correction made.
Invalid PLACes (not according to the spec), with entity number and the correction made.

And anything else you might encounter.
(These all come from the program I use, when it imports)

Very important:
It should not stop after 100 errors. Maybe give the user a choice, but it must be possible to get all errors.
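
A sketch of how the findings for the doc could be collected so that the limit is optional (Python, the names `Finding` and `Report` and their fields are made up). By default nothing is cut off. With a limit set, it still counts what was left out, so the doc can say so.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Finding:
    xref: str            # entity number, e.g. "@I12@"
    kind: str            # e.g. "invalid DATE"
    original: str        # the line as found in the GEDCOM
    corrected: str = ""  # the line after repair, empty if not repaired

@dataclass
class Report:
    max_findings: Optional[int] = None  # None means: keep everything
    findings: List[Finding] = field(default_factory=list)
    suppressed: int = 0

    def add(self, finding):
        """Store a finding, or only count it once the optional limit is hit."""
        if self.max_findings is not None and len(self.findings) >= self.max_findings:
            self.suppressed += 1
            return
        self.findings.append(finding)
```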


It would be great if you have test files from other programs, but I don't know if you have those. Otherwise maybe ask Albert for some files, and others who are developers of certain programs.


What I mean to say is: think carefully about what might be interesting to save. If you have to add things later, there may already be GEDCOM files that were read, and their results will not have the new things saved.

I don't know what the GEDCOM people are interested in, so what I type here are just ideas.

Once the program is ready, it should be available in a place where people look or search. And maybe "advertise" it a bit on the forums of other programs.
If it's hidden too much, it will not really collect enough data.

That is it so far. If I think of more, I will add it.
(I'll send a PM with screenshots later)

