Oh, apparently the Open Preservation Foundation is working on a validator for OpenDocument Format spreadsheets. Looks intriguing 📎 👀 https://openpreservation.org/events/the-experimental-kitchen-a-demonstrative-webinar/
It was fascinating talking to Leontien Talboom and Mark Bell about their work exploring computational access to the UK Government Web Archive, you can read the first part of our discussion here: https://blog.nationalarchives.gov.uk/exploring-computational-access-to-the-uk-government-web-archive-part-1/
The logic seems to be:
1) we don’t yet have a full understanding of the complexities of human language understanding;
2) large language models are also very complex, and therefore
3) you can’t prove they’re NOT doing basically the same thing, I am very smart
I wrote an FAQ in my hash collisions repository. Nothing too complex but it's long overdue: I was asked these questions *very* often.
Academic library vendor news
Clarivate - newish owner of ProQuest and therefore Ex Libris, among others - is on track to make a 50% increase in profit this year compared to 2021.
Most of that appears to be from ProQuest subscriptions. So, universities are paying to buy back research literature and data they created in the first place.
I'm quite pleased with this essay about openness being about community. Would love to hear what you think.
(I should say that while it's just been published it was written a few months ago before Mastodon took off!)
It works by “Querying an OutbackCDX service, and using fastparquet to build up a copy of the data in Apache Parquet files. Then using DuckDB to query those files using SQL.”
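To make the shape of that pipeline a bit more concrete, here is a rough sketch of the "index records in, SQL out" idea. It is not the tool's actual code. The sample lines are made up, the 11-field CDX layout is an assumption, and the helper names (`parse_cdx`, `load_rows`, `count_by_mimetype`) are hypothetical. To keep it runnable with only the standard library, Python's built-in `sqlite3` stands in for the fastparquet + DuckDB step.

```python
import sqlite3

# Made-up sample records in an assumed 11-field CDX layout:
# urlkey timestamp original mimetype status digest redirect robotflags length offset filename
SAMPLE_CDX = """\
uk,gov,example)/ 20200101120000 http://example.gov.uk/ text/html 200 AAAA - - 1024 0 a.warc.gz
uk,gov,example)/about 20200102120000 http://example.gov.uk/about text/html 200 BBBB - - 2048 1024 a.warc.gz
uk,gov,example)/logo.png 20200103120000 http://example.gov.uk/logo.png image/png 200 CCCC - - 512 3072 a.warc.gz
uk,gov,example)/old 20200104120000 http://example.gov.uk/old text/html 404 DDDD - - 256 3584 a.warc.gz
"""

FIELDS = ["urlkey", "timestamp", "original", "mimetype", "status",
          "digest", "redirect", "robotflags", "length", "offset", "filename"]


def parse_cdx(text):
    """Split space-separated CDX lines into tuples, skipping malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = line.split(" ")
        if len(parts) != len(FIELDS):
            continue
        rows.append(tuple(parts))
    return rows


def load_rows(rows):
    """Load parsed rows into an in-memory table (stand-in for the Parquet files)."""
    con = sqlite3.connect(":memory:")
    # Column names are quoted because some (e.g. "offset") are SQL keywords.
    columns = ", ".join(f'"{name}" TEXT' for name in FIELDS)
    con.execute(f"CREATE TABLE cdx ({columns})")
    placeholders = ", ".join("?" for _ in FIELDS)
    con.executemany(f"INSERT INTO cdx VALUES ({placeholders})", rows)
    return con


def count_by_mimetype(con):
    """Count successful captures per MIME type using plain SQL."""
    return con.execute(
        "SELECT mimetype, COUNT(*) AS n FROM cdx "
        "WHERE status = '200' GROUP BY mimetype ORDER BY n DESC, mimetype"
    ).fetchall()


con = load_rows(parse_cdx(SAMPLE_CDX))
print(count_by_mimetype(con))
```

In the setup described in the quote, the rows would come from an OutbackCDX query rather than a string, and the same kind of `SELECT` would presumably run in DuckDB directly over the Parquet files instead of an in-memory table.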
ClueWeb22: 10 Billion Web Documents with Rich Information. (arXiv:2211.15848v1 [cs.IR]) http://arxiv.org/abs/2211.15848
Twitter! 🐥☠️ Archiving! 💾🗄️
As interest builds in archiving parts of Twitter, it’s a good time to check out Documenting the Now, a set of tools and guiding policies for archiving social media ethically, with guidance by communities at risk. It’s a years-long project led by archivists, who are specially trained to understand the ethical and practical implications of preservation. https://www.docnow.io
Scholar is built on an open, editable bibliographic catalog: https://fatcat.wiki
Most of the records are automatically imported from our wonderful upstream sources, but any human can directly submit corrections and additions through the web interface or API. These submissions are then reviewed in the open before merging. The entire catalog is versioned and can be downloaded in bulk or synchronized using a "changelog" feed.
You can learn more about editing at:
New blog post:
Archive your Tweets with Tweetback
Tweetback is built with @eleventy and I do think #Eleventy plays a special role here. Eleventy is a production ready, stable site generator that now has very concrete public proof of many projects with ~50,000 page builds (and even one in there with >118,000 pages—hi @nhoizey)
Tech lead for the UK Web Archive - data miner, digital preserver, entropy buster, partial physicist, geek.
Hometown is adapted from Mastodon, a decentralized social network with no ads, no corporate surveillance, and ethical design.