140 points by jamesponddotco a month ago | 15 comments
  • culi a month ago
    I find Wikidata to be perfect for aggregating identifiers. I mostly work with species names, and it's perfect for getting the iNaturalist, GBIF, Open Tree of Life, Catalogue of Life, etc. identifiers all in one query.

    I haven't tried it for books. I imagine it's not sufficiently complete to serve as a backbone, but a quick look at an example book gives me the IDs for OpenLibrary, LibraryThing, Goodreads, Bing, and even niche stuff like the National Library of Poland MMS ID.

    https://www.wikidata.org/wiki/Q108922801
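
    Side note: for a single known item you don't even need SPARQL. The Special:EntityData endpoint returns every claim as plain JSON; a minimal Python sketch (P212 is ISBN-13, P648 Open Library, P675 Google Books):

      import requests

      # Fetch every claim for the example book as plain JSON.
      qid = "Q108922801"
      url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
      entity = requests.get(url, timeout=30).json()["entities"][qid]

      # External identifiers are ordinary claims on the item.
      for prop in ("P212", "P648", "P675"):
          for claim in entity["claims"].get(prop, []):
              print(prop, claim["mainsnak"]["datavalue"]["value"])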

    • bbor a month ago
      Cracks me up that OP is trying Anna's Archive before Wikidata, NGL! Both great sources, though.

      I recently (a year ago... wow) dipped my toe into the world of library science through Wikidata, and was shocked at just how complex it is. OP's work looks really solid, but I hope they're aware of how mature the field is!

      For illustration, here are just the book-relevant ID sources I focused on from Wikidata:

        ARCHIVERS: 
        Library of Congress Control Number    `P1144` (173M)
        Open Library                          `P648`  (39M)
        Online Computer Library Center        `P10832` (10M)
        German National Library               `P227`  (44M)
        Smithsonian Institution               `P7851` (155M)
        Smithsonian Digital Ark               `P9473` (3M)
        U.S. Office of Sci. & Tech. Info.     `P3894`
      
        PUBLISHERS:
        Google Books                          `P675`  (1M)
        Project Gutenberg                     `P2034` (70K)
        Amazon                                `P5749`
      
        CATALOGUERS:
        International Standard Book Number    `P212`
        Wikidata                              `P8379` (115B)
        EU Knowledge Graph                    `P11012`
        Factgrid Database                     `P10787` (0.4M)
        Google Knowledge Graph                `P2671` (500B)
      • jamesponddotco a month ago
        Not gonna lie, I didn't even know Wikidata existed until now. I'll look into it today and create a ticket for a new extractor.

        Thanks for letting me know!

      • jamesponddotco a month ago
        Mind pointing me to an example book in Wikidata? I managed to find a few, but not when searching by ISBN, which makes them hard to find.

        Unless you mean the fact that you can find a book's identifiers for several different websites in there, in which case I did find that.

        • culi 21 days ago
          You can search by ISBN or any other property of an item. You have to use a SPARQL query: https://query.wikidata.org/

          I believe there's also a REST API somewhere if you really hate SPARQL
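
          For example, a minimal Python sketch against the public endpoint (the ISBN is an arbitrary placeholder; note that Wikidata stores ISBN-13s hyphenated, so exact-string matching can be finicky):

            import requests

            # Find items whose ISBN-13 (property P212) matches.
            query = """
            SELECT ?item ?itemLabel WHERE {
              ?item wdt:P212 "978-0-00-000000-0" .
              SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
            }"""
            resp = requests.get(
                "https://query.wikidata.org/sparql",
                params={"query": query, "format": "json"},
                headers={"User-Agent": "isbn-lookup-sketch/0.1 (example)"},
                timeout=30,
            )
            for row in resp.json()["results"]["bindings"]:
                print(row["item"]["value"], row["itemLabel"]["value"])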

  • karlding a month ago
    Do you handle books with no ISBN?

    I've recently acquired some photo books that don't appear to have any ISBN but are listed on WorldCat, have OCLC numbers, and are catalogued in the Japanese National Diet Library. Not sure if they actually lack ISBNs or if I just haven't been able to find them, but from what I gathered it's quite common for self-published books.

    • jamesponddotco a month ago
      No, right now you need an ISBN to search for a book. At a later date I'll implement search by title or author, which should help with this use case.
  • wizzwizz4 a month ago
    Please ensure that your database keeps track of whence data was obtained, and when. It's exceptionally frustrating when automated data ingesting systems overwrite manually-corrected data with automatically-generated wrong data: keeping track of provenance is a vital step towards keeping track of authoritativeness.
    • jamesponddotco a month ago
      We don't support POST, PATCH, and whatnot yet, so I haven't taken that into account.

      Still need to figure out how this will work, though.

      • fc417fc802 a month ago
        Since you support merging fields you likely would want to track provenance (including timestamp) on a per-field basis. Perhaps via an ID for the originating request.

        Although I would suggest that rather than merge (and discard) on initial lookup it might be better to remember each individual request. That way when you inevitably decide to fix or improve things later you could also regenerate all the existing records. If the excess data becomes an issue you can always throw it out later.

        I say all this because I've been frustrated by the quantity of subtle inaccuracies encountered when looking things up with these services in the past. Depending on the work sometimes the entries feel less like authoritative records and more like best effort educated guesses.
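
        For illustration, the shape I have in mind (a sketch, all names made up):

          from dataclasses import dataclass
          from datetime import datetime

          @dataclass
          class SourceRecord:
              source: str          # e.g. "openlibrary", "isbndb"
              request_id: str      # ties the record to the originating lookup
              fetched_at: datetime
              payload: dict        # raw upstream response, kept verbatim

          def merge(records: list[SourceRecord]) -> dict:
              """Rebuild the merged view from the raw records, so everything
              can be regenerated whenever the merge logic improves."""
              merged: dict = {}
              for rec in sorted(records, key=lambda r: r.fetched_at):
                  for field, value in rec.payload.items():
                      # Per-field provenance stored next to the value itself.
                      merged[field] = {
                          "value": value,
                          "source": rec.source,
                          "request_id": rec.request_id,
                          "fetched_at": rec.fetched_at.isoformat(),
                      }
              return merged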

        • jamesponddotco a month ago
          I’ll definitely discuss this with Drew, as he’s the one working on the database refactor. Thank you for the feedback!
          • wizzwizz4 a month ago
            In my experience, designing a database schema capable of being really pedantic about where everything comes from is a pain, but not having done so is worse. As a compromise, storing a semi-structured audit log can work: it'll be slow to consult, but that's miles better than having nothing to consult, and you can always create cached views later.
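
            Even something as crude as an append-only JSONL log works as a starting point (a sketch):

              import json, time

              # Append-only, semi-structured audit log: slow to consult, but
              # it records what changed, from where, and when.
              def log_change(path: str, field: str, old, new, source: str) -> None:
                  entry = {"ts": time.time(), "field": field,
                           "old": old, "new": new, "source": source}
                  with open(path, "a") as f:
                      f.write(json.dumps(entry) + "\n")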
  • apublicfrog a month ago
    Wow. I don't have any use for this personally, but your post is really well presented, detailed and sourced. I hope it goes well!
    • jamesponddotco a month ago
      Thanks! It's a compilation of several random comments I made for a few months, haha.
  • cmaury a month ago
    Are you able to pull upcoming titles? All I want is a weekly/monthly list of upcoming books by authors I've read, and I've not been able to find it or to build it.
    • culi a month ago
      A couple of years ago I looked for a similar service and failed to find one. I did, however, find an incredible podcast network called New Books Network, where they interview authors about their new books. It's a massive network broken down into categories that can get pretty niche, everything from "Digital Humanities" to "Diplomatic History" to "Critical Theory". Episodes appear in multiple categories, so broad categories like "Science" also exist.

      https://newbooksnetwork.com/subscribe

      It's definitely biased towards academia, which I personally see as a pro, not a con.

    • jamesponddotco a month ago
      If the data is present in one of the extractors, yes, but I think only Amazon and similar stores keep this kind of data right now. We don't have an extractor for Amazon yet.

      After v1.0.0 is out I plan to add the ability to add books manually to the database, at which point we'll be able to start improving the database without relying on third-party services.

    • mackatsol a month ago
      You can get an RSS feed from https://bookfeed.io with authors you want to track. Been using it for years at this point :-)
  • wredcoll a month ago
    I applaud the effort, but last time I tried this, the major issue was the sheer amount of book data only available from amazon.com, and scraping that is tedious, to put it mildly.
    • jamesponddotco a month ago
      Hardcover and ISBNDB have a good amount of data, with Hardcover being excellent for getting good covers and genres.

      I'm hoping Goodreads and Anna's Archive will help fill in the gaps, especially since Anna's Archive has gigantic database dumps available[1].

      [1]: https://todo.sr.ht/~pagina394/librario/12

      • culi a month ago
        You should also consider OpenLibrary and LibraryThing. Both have good coverage on Wikidata, which also aggregates identifiers.

        In fact, now that I think about it, you could also contribute your work to Wikidata. I don't see ISBNdb IDs on Wikidata, so you could write a script to make those contributions. Then anyone else using Wikidata for this sort of thing can benefit from your work.

        • jamesponddotco a month ago
          I haven’t created a ticket for OpenLibrary yet, but it’s on my mental todo list. I’ll create the ticket for multiple new extractors today.

          I'd love to help improve other services. I plan on charging for Librario at some point, but I'll offer a free version and free API keys for projects like Calibre and others.

          At least that’s the plan.

        • zozbot234 a month ago
          ISBNdb seems to use ISBN-13 as its unique identifier, which is just Property:P212 on Wikidata.
  • krick a month ago
    Does it handle languages other than English? I remember trying out some APIs like that for some tasks, and while I managed to find titles in English somewhat successfully, any other languages (be it the original title, or a translation of some fairly well-known book) were basically inaccessible.
    • jamesponddotco a month ago
      I only tested English and Brazilian Portuguese so far, and Brazilian Portuguese worked, with translator information included.
      • fc417fc802 a month ago
        You're most likely to run into issues with non-Latin languages, particularly pictograms and the associated schemes for interpreting them in a context-sensitive manner. Substring search, for example, is likely to be broken in my experience.
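
        A minimal illustration in Python (the two strings render identically but differ in code points):

          import unicodedata

          # "ガ" as one precomposed code point vs. "カ" plus a combining voicing mark.
          composed = "\u30ac"          # ガ
          decomposed = "\u30ab\u3099"  # カ + ゛

          print(composed == decomposed)  # False: the code points differ
          print(composed in decomposed)  # False: naive substring search misses it
          # Unicode normalization makes them compare equal:
          nfc = unicodedata.normalize
          print(nfc("NFC", composed) == nfc("NFC", decomposed))  # True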
  • ksr a month ago
    Nice, I might try your API for my ISBN extractor / formatter at https://github.com/infojunkie/isbn-info.js

    Right now, I use node-isbn https://www.npmjs.com/package/node-isbn which mostly works well but is getting long in the tooth.

  • eichin a month ago
    Tried throwing a batch of known-to-be-in-Amazon ISBNs through (from a recent "export my data", so even if they're old, Amazon fundamentally knows them). Got 500s for a handful of the first hundred, then a bunch of 502/503s (so, single-threaded, but part of the HN hug of death, sorry!)

    (Only the first 4 or so were JSON errors; the rest were HTML-from-nginx, if that matters.)

    • jamesponddotco a month ago
      No hug of death, the server is sitting at 3% CPU usage under current load; it seems someone found a bug that triggered a panic, and systemd failed to restart the service because the PID file wasn't removed. Fixed now, should be back online :)
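
      For the curious, the usual way to avoid stale-PID-file problems is to let systemd supervise the process directly instead of using Type=forking with a PIDFile=; a hypothetical unit sketch (not our actual config, binary path and arguments made up):

        [Service]
        # systemd tracks the main process itself; no PID file involved.
        Type=simple
        # Hypothetical binary path and arguments:
        ExecStart=/usr/local/bin/librario serve
        # Restart automatically if the process panics:
        Restart=on-failure
        RestartSec=1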
  • ocdtrekkie a month ago
    Library of Congress data seems like a huge omission especially for something named after a librarian. ;) It is a very easy API to consume too.
    • jamesponddotco a month ago
      I haven't looked into it yet because I assumed the current extractors already pulled that information, but it's on my list of future extractors!
  • moritzruth a month ago
    What do you think about BookBrainz?

    https://bookbrainz.org/

    • jamesponddotco a month ago
      First time I'm seeing it, to be honest, but it looks interesting. I do plan on having a UI for Librario (built a few mockups yesterday[1][2][3]), and I think the idea is similar, but BookBrainz looks bigger in scope.

      I could add them as an extractor, I suppose :thinking:

      [1]: https://i.cpimg.sh/pexvlwybvbkzuuk8.png

      [2]: https://i.cpimg.sh/eypej9bshk2udtqd.png

      [3]: https://i.cpimg.sh/6iw3z0jtrhfytn2u.png

      • nmstoker a month ago
        This is great: the service itself, and the fact that you're extending it and considering a UI.

        Personally I would go with option 2, as the colour from the covers beats the anaemic feel of 1, and it seems more original than the search-with-grid-below layout of 3.

        • jamesponddotco a month ago
          Glad you liked the idea!

          Number two is what my wife and I prefer too, and likely what's going to be chosen in the end.

    • WillAdams a month ago
      Doesn't seem to have a very complete dataset --- the first book I thought to look for, Hal Clement's _Space Lash_ (originally published as _Small Changes_), is absent, and I didn't see the later collection _Music of Many Spheres_ either:

      https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...

  • mehdi1964 a month ago
    Nice approach! Merging metadata from multiple sources is tricky, especially handling conflicts like titles and covers. Curious how you plan to handle scalability as your database grows—caching helps, but will the naive field strategies hold with thousands of books?
    • jamesponddotco a month ago
      Right now the merging happens on the fly and the result is cached. In the future I imagine the finished merge will be saved as JSON to the database, depending on which is more expensive: the merging or a database call.

      Merging on the fly kinda works for the future too, for when the data or the merging process changes.
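
      In pseudocode-ish Python, the idea is roughly this (illustrative only, not the actual implementation; source names and priorities are made up):

        # Naive field-priority merge with a cache in front.
        PRIORITY = ["hardcover", "isbndb", "openlibrary"]  # made-up order

        cache: dict[str, dict] = {}

        def get_book(isbn: str, extractors: dict) -> dict:
            if isbn in cache:
                return cache[isbn]
            responses = {name: fetch(isbn) for name, fetch in extractors.items()}
            merged: dict = {}
            for source in reversed(PRIORITY):             # lowest priority first,
                merged.update(responses.get(source) or {})  # so higher wins
            cache[isbn] = merged                          # cache the finished merge
            return merged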

      No idea what the future will hold. The idea is to pre-warm the database after the schema has been refactored, and once we have thousands of books from that, I’ll know for sure what to do next.

      TLDR, there is a lot of “think and learn” as I go here, haha.

  • zvr a month ago
    Would it be possible to use a SQLite file instead of a PostgreSQL instance? Or do you rely on some specific PostgreSQL functionality?
    • jamesponddotco a month ago
      No, I decided pretty early on to make it database-specific instead of more generic, so we do use some PostgreSQL features right now, like their UUIDv7 generation.

      But once the database refactor is done, I wouldn’t say no to a patch that made the service database agnostic.

  • omederos a month ago
    502 Bad Gateway :|
    • jamesponddotco a month ago
      It seems someone found a bug that triggered a panic, and systemd failed to restart the service because the PID file wasn't removed. Fixed now, should be back online :)
  • sijirama a month ago
    hella hella cool

    good luck