@andrewjbtw Nice looking character though! 🔊
found the original hand-encoding in a set of photos provided by the donor
@andrewjbtw This looks like an amazing test case to code up for the bag utilities. Do you have a hex code for that encoding?
@nkrabben ran into similar issues again today. The problem with these situations is if I pipe 'ls' into xxd, I get different hex for the different filename representations.
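A minimal sketch of why the hex can differ for names that look identical, assuming the difference is Unicode normalization (composed versus decomposed forms). The helper name `hex_bytes` and the sample letter are illustrative, not taken from the thread's actual files.

```python
import unicodedata

def hex_bytes(name: str) -> str:
    """Return the UTF-8 bytes of a string as space-separated hex."""
    return name.encode("utf-8").hex(" ")

# The same visible letter (o with diaeresis) in two normalization forms.
composed = unicodedata.normalize("NFC", "\u00f6")    # one code point, U+00F6
decomposed = unicodedata.normalize("NFD", "\u00f6")  # 'o' followed by U+0308

print(hex_bytes(composed))    # c3 b6
print(hex_bytes(decomposed))  # 6f cc 88
```

Both strings render the same in most terminals, which is why the mismatch only shows up in a hex dump.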
@andrewjbtw I tried this against bagit-python (filename%2F.txt). Doesn't seem to trip it up, but still experimenting to make sure I'm making it terrible enough.
@nkrabben are you moving files between filesystems and operating systems? That's where I'm seeing the filenames become inconsistent with the manifest lines.
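One way to check for that kind of inconsistency is sketched below, assuming you already have the manifest paths and the on-disk paths as two lists of strings. The function name `normalization_only_mismatches` is hypothetical and is not part of bagit-python or any other bag utility.

```python
import unicodedata

def normalization_only_mismatches(manifest_paths, disk_paths):
    """Find manifest paths that are missing on disk byte-for-byte
    but match an on-disk path once both are normalized to NFC."""
    def nfc(s):
        return unicodedata.normalize("NFC", s)

    disk_exact = set(disk_paths)
    disk_by_nfc = {nfc(p): p for p in disk_paths}

    mismatches = []
    for path in manifest_paths:
        if path not in disk_exact and nfc(path) in disk_by_nfc:
            mismatches.append((path, disk_by_nfc[nfc(path)]))
    return mismatches

# Manifest written with a composed name, disk holding the decomposed one.
manifest = ["data/B\u00f6ll.txt", "data/plain.txt"]
on_disk = ["data/Bo\u0308ll.txt", "data/plain.txt"]
result = normalization_only_mismatches(manifest, on_disk)
```

Here `result` holds one pair: the manifest path and the differently encoded on-disk path it corresponds to.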
@nkrabben I'll try to put together a sample that causes the issue, with an example problem path, like starting on a Mac and copying to a Samba share on a Linux server
@andrewjbtw I just hacked it together the way I do control character problems (ctrl+v ctrl+(letter) in bash), but will try again once I have access to my Windows machine again.
@nkrabben try maybe non-ASCII letters? Yesterday I found the main issue was an o with an umlaut.
@andrewjbtw Could it have anything to do with the default character encoding of your OS? It reminds me of this. https://medium.com/on-archivy/invisible-defaults-and-perceived-limitations-processing-the-juan-gelman-files-4187fdd36759
I don't fully understand what caused the problem. I think it was filenames generated on a system with CP-1252 character encoding and then being badly parsed by the Debian distro underlying BitCurator.
@nkrabben It's like that but this time entirely within UTF-8. I used convmv to change from NFD to NFC, neither of which I'd heard of before.
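For readers who have not met NFD and NFC either, here is a rough Python sketch of the same idea. It does not reproduce convmv itself; it only lists which names would change under NFC, as a dry run. The function name `nfc_renames` is made up for illustration.

```python
import unicodedata

def nfc_renames(names):
    """Return (old, new) pairs for names that are not already in NFC."""
    pairs = []
    for name in names:
        normalized = unicodedata.normalize("NFC", name)
        if normalized != name:
            pairs.append((name, normalized))
    return pairs

# 'o' + combining diaeresis (NFD) would be renamed to the single
# code point U+00F6 (NFC); the plain ASCII name is left alone.
pairs = nfc_renames(["Bo\u0308ll.txt", "notes.txt"])
```

In practice the input could come from something like `os.listdir()`, and the actual rename would be a separate, deliberate step after reviewing the pairs.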
@nkrabben but the weird character and others like it that started this thread aren't covered by that, probably because they're not real language characters, so convmv didn't help there.
@andrewjbtw Now reading wiki on unicode normalization
"However, they are not injective (they map different original glyphs and sequences to the same normalized sequence) and thus also not bijective (can't be restored)."
Yep, I'm in too deep.
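A small example of what "not injective" means in that quote: two different code points that normalize to the same result, so the original cannot be recovered afterwards. The characters chosen here (the angstrom sign and A with ring above) are a standard illustration, not ones from this thread.

```python
import unicodedata

angstrom_sign = "\u212b"  # ANGSTROM SIGN
a_with_ring = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE

# The inputs are different code points...
different_inputs = angstrom_sign != a_with_ring

# ...but NFC maps both to U+00C5, so the distinction is lost.
same_after_nfc = (
    unicodedata.normalize("NFC", angstrom_sign)
    == unicodedata.normalize("NFC", a_with_ring)
)

print(different_inputs, same_after_nfc)  # True True
```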
@andrewjbtw Have you looked at the Unicode glyphs that might help? https://www.fileformat.info/info/unicode/char/1f508/index.htm If you can copy that line of text to clipboard can you paste it to a hex editor or other text editor to get away from the OS codepage?
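If a hex editor is not handy, a sketch like the following can list the code points and UTF-8 bytes of pasted text. The function name `describe` is hypothetical, and U+1F508 is used only because it is the character in the link above.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, UTF-8 hex) for each character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch, "<unnamed>"),  # fallback for unnamed characters
            ch.encode("utf-8").hex(" "),
        ))
    return rows

for row in describe("\U0001F508"):
    print(row)
```

Control characters have no Unicode name, so they show up with the `<unnamed>` fallback, which can itself be a useful hint.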