37 points by PaulHoule 9 days ago | 5 comments
  • visarga5 days ago
    I usually annotate my chunks with title, summary, keywords and multiple levels of hierarchical topics. More recently I thought about annotating intent, user values and tactics, especially for debate-related text and LLM chats. So I would annotate state (the summary of the content itself), values (intent and user values) and policy (tactics), taking inspiration from RL.

    The idea of detecting frames and using them to tease out the implicit meaning from text is quite nice. It seems there is a lot more to discover about using LLMs prior to RAG. Text is like code, you can't know what it does until you run it, and in this case, until you annotate it. For example "10+10" won't embed close to "20". And "The fifth letter in this string" won't retrieve "f" by embedding similarity.
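The annotation scheme above can be sketched as a preprocessing step before embedding. This is a minimal, hypothetical sketch: `annotate()` stands in for an LLM call and is stubbed with fixed strings so the structure is runnable; the field names follow the state/values/policy idea from the comment.

```python
def annotate(chunk: str) -> dict:
    """Stub for an LLM call that returns annotations for one chunk."""
    return {
        "title": "Untitled",
        "summary": chunk[:60],    # state: what the content itself says
        "keywords": [],           # retrieval hooks
        "values": "neutral",      # intent and user values
        "policy": "informative",  # tactics
    }

def enrich_chunk(chunk: str) -> str:
    """Prepend annotations so they get embedded together with the text."""
    meta = annotate(chunk)
    header = " | ".join(f"{k}: {v}" for k, v in meta.items())
    return f"[{header}]\n{chunk}"

enriched = enrich_chunk("10+10 equals 20, a basic arithmetic fact.")
```

The enriched string, not the raw chunk, would then be passed to the embedding model, so that an explicit "20" in the summary can surface a "10+10" chunk that raw similarity would miss.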

    • nthingtohide5 days ago
      I have this very stupid question and I haven't seen it answered anywhere.

      Let's say an LLM, while training, is ingesting a very long book.

      The name of the author of the book would appear at the very beginning.

      During inference, how does the LLM determine that the last chapter of the book was written by so-and-so author, and hence that that chunk should be near that author's style?

      • visarga5 days ago
        The information is probably lost. Training on very long inputs is expensive, so we usually train on short inputs and only send longer ones at the end. The author's name would therefore not appear together with all the chunks, unless it appears on every page in a header, and assuming those headers are not stripped out first.
        • nthingtohide4 days ago
          I am wondering whether the same techniques we use to augment chunks for RAG could also be used to augment training data with rich metadata: a preprocessing step that ensures metadata-rich text for creating foundation models.
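Such a preprocessing step might look like the sketch below: document-level metadata (author, title) is carried into every training chunk, so the association survives even when training contexts are short. All names and the chunking-by-characters scheme are illustrative assumptions, not anything from the paper.

```python
def chunk_with_metadata(text: str, author: str, title: str,
                        size: int = 200) -> list[str]:
    """Split text into fixed-size chunks, each prefixed with document metadata."""
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    return [f"[author: {author} | title: {title}]\n{c}" for c in chunks]

chunks = chunk_with_metadata("word " * 300, "Jane Doe", "Example Book")
```

With this, "Jane Doe" co-occurs with the last chapter's chunks just as it does with the first page, instead of only at the start of the book.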
      • KTibow5 days ago
        It might not directly learn it, but it could still identify it if it's quoted by others or if there's enough text of the same style with the author's name as context.
  • simonw5 days ago
    I had to learn what a "frame" is to understand this. https://framenet.icsi.berkeley.edu/ is useful (FrameNet is the collection of 1,000 frames used in the paper).

    An example of a frame is an "Event" - https://framenet.icsi.berkeley.edu/fnReports/data/frameIndex... - where:

    > An Event takes place at a Place and Time.

    So if you're extracting frames from a piece of text, that's one of the concepts you might be trying to identify - along with what the place and time are.
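A frame instance like that Event example could be represented as below. This is a minimal sketch: the element names (Place, Time) follow FrameNet's Event frame, but `extract_event()` is a stub returning hard-coded spans; a real system would use a frame-semantic parser or an LLM for the extraction.

```python
from dataclasses import dataclass

@dataclass
class FrameInstance:
    frame: str                # e.g. "Event"
    elements: dict[str, str]  # frame element -> text span it was filled from

def extract_event(sentence: str) -> FrameInstance:
    """Stub: a real extractor would identify the frame and fill its elements."""
    return FrameInstance(
        frame="Event",
        elements={"Place": "Paris", "Time": "in 1900"},
    )

inst = extract_event("The exposition took place in Paris in 1900.")
```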

  • tsunego5 days ago
    Cool paper, but the results are unsurprising, since almost anything benefits from RAG.
    • 0xdeadbeefbabe5 days ago
      Also, grep is really exciting with lots of data.
  • bttrpll4 days ago
    Love FrameNet. Awesome to see it's alive and in-use.
  • andrewmutz5 days ago
    From the paper:

    > Frames ... are conceptual structures that capture the semantic and syntactic relationships underlying language. They are helpful in providing a structured semantic context for understanding relationships between entities, enabling tasks like Machine Reading Comprehension and Information Extraction to be more accurate and contextually aware.