Oh, apparently the Open Preservation Foundation is working on a validator for Open Document Format Spreadsheets, looks intriguing 📎 👀 openpreservation.org/events/th

Does anyone have an example of a Python script that connects to a Globus collection and returns a list of folders?

It was fascinating talking to Leontien Talboom and Mark Bell about their work exploring computational access to the UK Government Web Archive, you can read the first part of our discussion here: blog.nationalarchives.gov.uk/e

The logic seems to be:
1) we don’t yet have a full understanding of the complexities of human language understanding;
2) large language models are also very complex, and therefore
3) you can’t prove they’re NOT doing basically the same thing, I am very smart

I wrote an FAQ in my hash collisions repository. Nothing too complex but it's long overdue: I was asked these questions *very* often.

*Latest addition to the community* the digipres chart-busting track "Wheel Out the Digital Dark Age Klaxon" ... special edition release with background story, flac and mp4 files. Check it out here:

Academic library vendor news 

Clarivate - newish owner of ProQuest and therefore Ex Libris, among others - is on track to make a 50% increase in profit this year compared to 2021.

Most of that appears to be from ProQuest subscriptions. So, universities are paying to buy back research literature and data they created in the first place.


Is there anything like "the basics of #IIIF for newbies"? There's a good chance I'll be taking part in a #digitisation project in #Spain and I'd like them to implement this framework.
Any #help or pointers are welcome.

I'm quite pleased with this essay about openness being about community. Would love to hear what you think.


(I should say that while it's just been published it was written a few months ago before Mastodon took off!)

It works by “Querying an OutbackCDX service, and using fastparquet to build up a copy of the data in Apache Parquet files. Then using DuckDB to query those files using SQL.”

Here’s a little experiment in putting URL index data into a form you can run SQL queries on… github.com/anjackson/cdx-db
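
To make the "CDX index as SQL" idea concrete, here is a rough sketch. It is not the cdx-db code: it swaps in Python's built-in sqlite3 for the fastparquet + DuckDB stack so it runs with no extra installs. It also assumes the index arrives as space-separated 11-field CDX lines. The column names, the helper functions and the sample records are all made up for illustration.

```python
import sqlite3

# Assumed 11-column CDX layout; these column names are illustrative only.
CDX_FIELDS = [
    "urlkey", "timestamp", "original", "mimetype", "statuscode",
    "digest", "redirect", "robotflags", "record_length", "record_offset",
    "filename",
]

# Made-up sample records standing in for the output of a CDX index service.
SAMPLE_LINES = [
    "org,example)/ 20220101000000 http://example.org/ text/html 200 DIGEST1 - - 1234 0 a.warc.gz",
    "org,example)/about 20220102000000 http://example.org/about text/html 200 DIGEST2 - - 2345 1234 a.warc.gz",
    "org,example)/gone 20220103000000 http://example.org/gone text/html 404 DIGEST3 - - 345 3579 a.warc.gz",
]


def parse_cdx_line(line):
    """Split one CDX line into a tuple, checking the field count."""
    parts = line.split()
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    return tuple(parts)


def load_cdx(lines):
    """Load CDX lines into an in-memory SQLite table called 'cdx'."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(f"{name} TEXT" for name in CDX_FIELDS)
    conn.execute(f"CREATE TABLE cdx ({columns})")
    placeholders = ", ".join("?" for _ in CDX_FIELDS)
    rows = [parse_cdx_line(line) for line in lines if line.strip()]
    conn.executemany(f"INSERT INTO cdx VALUES ({placeholders})", rows)
    return conn


def captures_per_status(conn):
    """Count captures per HTTP status code with plain SQL."""
    query = (
        "SELECT statuscode, COUNT(*) FROM cdx "
        "GROUP BY statuscode ORDER BY statuscode"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = load_cdx(SAMPLE_LINES)
    for status, count in captures_per_status(connection):
        print(status, count)
```

The same GROUP BY query should carry over more or less unchanged once the rows live in Parquet files and a columnar engine does the querying, which is where the approach described above pays off at web-archive scale.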

ClueWeb22 is released! It’s a dataset of 10 billion web pages in WARC format with an accompanying paper.

Read more: idf.social/@ArxivIR/1094302830

Project page: lemurproject.org/clueweb22.php

ClueWeb22: 10 Billion Web Documents with Rich Information. (arXiv:2211.15848v1 [cs.IR]) http://arxiv.org/abs/2211.15848

Twitter! 🐥☠️ Archiving! 💾🗄️ 

As interest builds in archiving parts of Twitter, it’s a good time to check out Documenting the Now, a set of tools and guiding policies for archiving social media ethically, with guidance by communities at risk. It’s a years-long project led by archivists, who are specially trained to understand the ethical and practical implications of preservation. https://www.docnow.io

Scholar is built on an open, editable bibliographic catalog: https://fatcat.wiki

Most of the records are automatically imported from our wonderful upstream sources, but any human can directly submit corrections and additions through the web interface or API. These submissions are then reviewed in the open before merging. The entire catalog is versioned and can be downloaded in bulk or synchronized using a "changelog" feed.

You can learn more about editing at:

New blog post:

Archive your Tweets with Tweetback


Tweetback is built with @eleventy and I do think #Eleventy plays a special role here. Eleventy is a production-ready, stable site generator that now has very concrete public proof of many projects with ~50,000 page builds (and even one in there with >118,000 pages, hi @nhoizey)

