Hi colleagues! Is there a common strategy for identifying XML-based formats? Should we try to identify the default namespace?

(Question triggered by a strange result brought by Siegfried: it identifies an OWL RDF/XML file as a teach2000 file in Wikidata (wikidata.org/wiki/Q105851165) because of a wrong signature from TrID...)

Mmmhhhh... I think I should ask this in the talk page of the Wikiproject Informatics...

Precision: the signature info from TrID is not "wrong", just not precise enough: many XML-based formats are identified only by the pattern "3C" at BOF (the opening chevron).
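To illustrate why a bare "3C" at BOF is not precise enough, here is a minimal Python sketch. The function name and the sample byte strings are made up for illustration; this is not how TrID or Siegfried is implemented, only the matching rule described above.

```python
def matches_3c_at_bof(data: bytes) -> bool:
    """Return True if the first byte is 0x3C ('<'), i.e. the pattern "3C" at BOF."""
    return data[:1] == b"\x3c"


# Hypothetical samples: an OWL RDF/XML root, an unrelated XML document, and HTML.
owl_rdf = (
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    b'xmlns:owl="http://www.w3.org/2002/07/owl#"/>'
)
other_xml = b"<anything/>"
html = b"<html></html>"

for name, sample in [("owl_rdf", owl_rdf), ("other_xml", other_xml), ("html", html)]:
    # Every sample matches, so the pattern alone cannot tell the formats apart.
    print(name, matches_3c_at_bof(sample))
```

Any file that starts with an opening chevron matches, so the signature only says "probably markup", not which XML-based format it is.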

This reminds me that Asger Askov Blekinge once suggested something along these lines in this 2011 (!) blog post:

openpreservation.org/blogs/new

I don't think this idea was ever really followed up or implemented in any format ID tools. Doesn't really answer your question, but might be interesting for historical reasons?

@bitsgalore Oh, nice!!! Seems a better solution (an alternative would be to treat XML files like any other binary format - for METS, for example, we could look for the <structMap> element, which is the only mandatory element - not anymore in METS 2 though 😭 ).

@anj @BertrandCaron @bitsgalore @apachetika I've been unsuccessful at starting a few conversations around improving what you're getting from Wikidata. These docs I wrote might help with some aspects of that though.

github.com/richardlehane/siegf

@anj @BertrandCaron @bitsgalore @apachetika as for XML schema based identification, PRONOM can totally accommodate this up to a point, and does for some XML. Planets Core Registry 2, IIRC, went further and added a new DB schema to store this information for this purpose.

@anj @BertrandCaron @bitsgalore @apachetika for binary identification you have small nuances to look out for (double or single quoted attributes for example). So parsing and validating the data more deliberately is perhaps less seductive and more likely to create more accurate results for XML with good schema docs.
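A small Python sketch of the quoting nuance mentioned above. The byte signature here is a naive, made-up one (not an actual PRONOM signature); the METS namespace is used only as an example. The point is that a fixed byte sequence misses the single-quoted variant, while a parser reports the same root for both.

```python
import xml.etree.ElementTree as ET

# Naive, hypothetical byte signature: assumes a double-quoted attribute value.
NAIVE_SIGNATURE = b'xmlns="http://www.loc.gov/METS/"'


def naive_match(data: bytes) -> bool:
    """Byte-level test: is the exact signature present anywhere in the data?"""
    return NAIVE_SIGNATURE in data


def parsed_root_tag(data: bytes) -> str:
    """Parse the document and return the root tag in '{namespace}local' form."""
    return ET.fromstring(data).tag


double_quoted = b'<mets xmlns="http://www.loc.gov/METS/"/>'
single_quoted = b"<mets xmlns='http://www.loc.gov/METS/'/>"

print(naive_match(double_quoted), naive_match(single_quoted))
print(parsed_root_tag(double_quoted) == parsed_root_tag(single_quoted))
```

The trade-off is that parsing requires well-formed input, whereas a byte signature can still match a damaged or truncated file.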

@anj @BertrandCaron @bitsgalore @apachetika other than that, if you can correct the Wikidata record it'll be a massive help. Happy to discuss Wikidata issues more in the New Year. I touched upon it at iPRES and believe it should be in the panels released on video soon.

I have a "small" shopping list of Wikidata issues. But no funding or professional time to commit to doing much more about it.

github.com/ross-spencer/WikiDP

@beet_keeper @anj @bitsgalore @apachetika
Thx Ross! But PRONOM adopts different strategies (e.g., nationalarchives.gov.uk/pronom searches for the name of the root element while nationalarchives.gov.uk/pronom looks for the namespace). Should this be homogenized, or do these strategies depend on each format's structure, which would mean the variation is acceptable?

@BertrandCaron @anj @bitsgalore @apachetika good questions. I personally think there are too many different ways of writing an XML document, which leaves the user exposed when relying on PRONOM-style binary signatures. But if going the binary identification route, I'd be in favor of landing on a standard strategy that at least minimizes false negatives, e.g. taking into account snippets (i.e. no header), variable-position namespace attrs, quotations, etc.

@BertrandCaron @anj @bitsgalore @apachetika If a service like PRONOM had such a standard to refer to for those signatures, then anything that fell outside of the realms of possibility could/would also be recorded in the record's metadata somewhere, e.g. "we can't write signature (X) like this, because the format has (Y) properties that make it difficult or impossible".

@beet_keeper @anj @bitsgalore @apachetika
OK. If we could come up with 1) a generic pattern for XML-based signatures and 2) systematic referencing of the namespace in the item, that would accommodate both approaches.

For 1), should we adopt an arbitrary offset from BOF to search for the namespace and root element name?
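One way to picture option 1) is the Python sketch below: look only at the first N bytes, skip the XML declaration, comments and DOCTYPE, then read the first start tag and its namespace declaration, accepting either quote style. The window size, the names and the regular expressions are all assumptions made for illustration, not an existing PRONOM, DROID or Siegfried pattern. It assumes an ASCII-compatible encoding (so not UTF-16) and does not cope with a DOCTYPE that has an internal subset.

```python
import re

MAX_OFFSET = 4096  # arbitrary search window from BOF (an assumption)

# XML declaration / processing instructions, comments, and simple DOCTYPEs.
NOISE = re.compile(rb"<\?.*?\?>|<!--.*?-->|<!DOCTYPE[^>]*>", re.S)
# First start tag: optional prefix, local name, then the raw attribute text.
ROOT_TAG = re.compile(rb"<([A-Za-z_][\w.\-]*:)?([A-Za-z_][\w.\-]*)([^>]*)>")


def sniff_root(data: bytes, max_offset: int = MAX_OFFSET):
    """Return (namespace or None, local name) of the root element, or None."""
    head = NOISE.sub(b"", data[:max_offset])
    match = ROOT_TAG.search(head)
    if match is None:
        return None
    prefix = match.group(1)[:-1] if match.group(1) else b""
    local, attrs = match.group(2), match.group(3)
    # The declaration to look for depends on whether the root is prefixed.
    attr = b"xmlns:" + prefix if prefix else b"xmlns"
    ns = re.search(
        rb"(?:^|\s)" + re.escape(attr) + rb"\s*=\s*([\"'])(.*?)\1", attrs, re.S
    )
    namespace = ns.group(2).decode("utf-8", "replace") if ns else None
    return namespace, local.decode("ascii")


sample = (
    b'<?xml version="1.0"?>\n<!-- a comment -->\n'
    b'<rdf:RDF xmlns:owl="http://www.w3.org/2002/07/owl#" '
    b"xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>"
)
print(sniff_root(sample))
```

Because the search is unanchored within the window, a snippet with no header is handled the same way as a full document; anything whose root starts beyond the window is simply not identified.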

@beet_keeper @anj @bitsgalore @apachetika
In terms of processes, maybe we should stick with a strict approach of identification, which considers the root namespace extraction as part of the characterization step!
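As a rough illustration of root namespace extraction as a characterization step (a generic sketch, not BnF's actual tooling), Python's standard library parser can report the root element as soon as its start tag has been read. The function name is made up.

```python
import io
import xml.etree.ElementTree as ET


def root_qname(data: bytes):
    """Return (namespace or None, local name) of the root element.

    Returns None if the data cannot be parsed as XML up to the root start tag.
    """
    try:
        # Only "start" events are requested; the first one is the root element.
        for _event, elem in ET.iterparse(io.BytesIO(data), events=("start",)):
            tag = elem.tag
            if tag.startswith("{"):
                namespace, local = tag[1:].split("}", 1)
                return namespace, local
            return None, tag
    except ET.ParseError:
        return None
    return None


doc = (
    b'<?xml version="1.0"?>'
    b'<mets xmlns="http://www.loc.gov/METS/"><structMap/></mets>'
)
print(root_qname(doc))
```

In this split, identification would only say "XML", and the namespace and root name would be recorded afterwards as properties of the file.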

(Which is what we do at BnF, btw.)

In any case, there is work to do. But I wanted to coordinate, in particular about the way we deal with partial data from TrID!

@BertrandCaron @anj @bitsgalore @apachetika from a DROID perspective, and possibly Siegfried's, you need one anchor at the beginning, so something like <?xml. Then there's a lot of flexibility in what one can do after.

I wonder if we should try to discuss it on Wikidata somewhere? (or the PRONOM group?)

One Wikidata page could be a new topic here: wikidata.org/wiki/Property:P41

It might help define the property better?

@beet_keeper @anj @bitsgalore @apachetika
Probably both!
We can bring the question to the next drop-in session (though I'll need half a gallon of coffee and aspirin to stay focused in such a discussion 😄 !)

@beet_keeper @anj @bitsgalore @apachetika
dpconline.org/news/pronom-drop
Every two Thursdays, 4pm ET, 5 CET, next should be on December 22nd but probably few people will show up... I'll probably participate on January 5th.

And there is another timeslot every month for Australasia.

@beet_keeper @anj @bitsgalore @apachetika I see that the Wikiproject:Informatics has more than 50 participants, so that it can't be pinged. So it may be better to post the question in its talk page: wikidata.org/wiki/Wikidata_tal.
If I can find some time this week I will launch the discussion (but feel free to do it anytime!)

@BertrandCaron @anj @bitsgalore @apachetika I saw someone replied!! I have tried adding to the discussion. It's not the easiest platform to type on, so I may need to correct/add to some of my questions/thoughts. But I had a bit of time to help get things moving, hopefully.

@BertrandCaron @anj @bitsgalore @apachetika also, many thanks Bertrand. I'm glad to see someone opening up the discussions on Wikidata for this.

@anj @BertrandCaron @bitsgalore @apachetika

Y, no one looks at the results. LOL...

But seriously, over on #ApacheTika, our xml classifier looks only at the root node.

@BertrandCaron You might also want to check out this (which was partially inspired by Asger's post):

bitsgalore.org/2011/07/11/impr

There's also a link to a demo Python script. No idea if it still works because it's *ancient*, but possibly of some use. At the time I suggested it might be worth a try to implement this in Fido, but I don't think this ever happened.

@BertrandCaron
Depending on your example, it is hard to tell. From a quick test in DROID, it comes out as fmt/101.
