Here's an interesting discussion about why it's difficult to archive Facebook, at least with current web archiving tech.

FB's user interface is driven by HTTP POSTs, so the usual web archiving crawlers (which typically discover URLs and GET them) won't work. Archiving bots or tools like Webrecorder that load and interact with the DOM have better luck recording.

But even Webrecorder has trouble with playback, because it needs to determine which archived response is appropriate for a particular interaction in the browser. The usual record lookup fails because the index it uses is URL-based, and all the POST URLs are the same.
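To make that concrete, here's a toy sketch (names and structure are mine, not Webrecorder's) of why an index keyed only on URL can't disambiguate POSTs:

```python
# Toy URL-keyed index: every POST to the same GraphQL endpoint
# collides on one key, so at playback time there's no way to tell
# which recorded response goes with which interaction.
index = {}

def record(url, post_body, response):
    # post_body is ignored entirely -- that's the problem
    index[url] = response  # later records clobber earlier ones

record("https://www.facebook.com/api/graphql/", '{"op": "feed"}', "feed page 1")
record("https://www.facebook.com/api/graphql/", '{"op": "comments"}', "comment list")

# Replay: both interactions look up the same key and get the last response.
print(index["https://www.facebook.com/api/graphql/"])  # "comment list"
```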

In the post Ilya mentions "fuzzy matching" which seems like a technique for looking up the correct response using the URL *and* the POST body. But the POST body could be altered at playback time in the browser by JavaScript. So a match might not be found--hence the need for a "fuzzy" match. At least that's how it seems to me...


@edsu yep that’s about it. TBH I think it’s likely deliberate obfuscation - they don’t like crawlers and they want everyone to get to FB through FB.

@anj nice, thanks Andy! Is the fuzzy matching kind of like a levenshtein distance type of thing? I guess I could take a peek in the code...

@edsu I’m not 100% sure TBH - lemme know what you find! 🙂

@anj it looks like fuzzy matching was built for HTTP GETs first. Here are a bunch of rules that get compiled in:

Maybe I'm missing it or looking in the wrong place, but I don't actually see one for FB's GraphQL.
