32 points by zachperkitny · 5 hours ago · 2 comments
  • bobajeff · 4 hours ago
    I had to look up what KDL is and what `Functional Source License, Version 1.1, ALv2 Future License` is.

    So KDL is like another JSON or YAML. FSL-1.1-ALv2 is an almost-but-not-quite open source license that becomes available under a true open source license after two years. It's meant to prevent free-loading by companies, or something like that. Sounds fine to me, actually.

    • zachperkitny · 3 hours ago
      Effectively, it's not meant to restrict people from using it, even commercially; it's just there to protect my own commercial interests in the project.

      KDL is more than just another JSON or YAML. It's node-based: its output in libraries is effectively an AST, and its use cases are open-ended.
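
      For example, a small KDL document (node names here are just illustrative) parses into a tree of named nodes, each with optional arguments, key=value properties, and children:

        package {
          name "tadpole"
          version "0.1.0"
          scripts {
            build "tsc"
          }
        }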

  • zachperkitny · 5 hours ago
    Hello!

    I wanted to share my recent project: Tadpole. It is a custom DSL built on top of KDL specifically for web scraping and browser automation. I wanted there to be a standardized way of writing scrapers and reusing existing scraper logic. This was my solution.

    Why?

        Abstraction: Simulating realistic human behavior (Bézier curves, easing) through high-level composed actions.
        Zero Config: Import and share scraper modules directly via Git, bypassing NPM/registry overhead.
        Reusability: Actions and evaluators can be composed through slots to create more complex workflows.
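
    To sketch what slot-based reuse looks like (the node names here are hypothetical, for illustration only): a shared extraction action can expose a slot that callers fill with their own child actions, so the same wrapper supports different extraction logic:

      // hypothetical wrapper action exposing a `row` slot
      my_module.extract_rows extract_to="results" {
        row {
          my_module.extract_title
        }
      }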
    
    
    Example

    This is a fully working example; @tadpole/cli is published on npm:

    tadpole run redfin.kdl --input '{"text": "Seattle, WA"}' --auto --output output.json

      import "modules/redfin/mod.kdl" repo="github.com/tadpolehq/community"
    
      main {
        new_page {
          redfin.search text="=text"
          wait_until
          redfin.extract_from_card extract_to="addresses" {
            address {
              redfin.extract_address_from_card
            }
          }
        }
      }
    
    
    Roadmap

    Planned for 0.2.0

        Control Flow: Add maybe (effectively try/catch) and loop (while {}, do {})
        DOMPick: Used to select elements by index
        DOMFilter: Used to filter elements using evaluators
        More Evaluators: Type casting, regex, exists
        Root Slots: Support for top level dynamic placeholders
        Error Reporting: More robust error reporting
        Logging: More consistent logging from actions and a log action in the global registry
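
    As a purely speculative sketch of the planned control flow (the maybe/loop keywords are from the roadmap above; action names like click and exists are placeholders, not shipped actions):

      maybe {
        click selector=".cookie-banner .accept"
      }
      loop {
        while { exists selector=".next-page" }
        click selector=".next-page"
      }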
    
    0.3.0

        Piping: Allow different files to chain input/output.
        Outputs: Complex output sinks to databases, S3, Kafka, etc.
        DAGs: Use directed acyclic graphs to create complex crawling scenarios and parallel compute.
    
    GitHub Repository: https://github.com/tadpolehq/tadpole

    I've also created a community repository for sharing scraper logic: https://github.com/tadpolehq/community

    Feedback would be greatly appreciated!

    • bobajeff · 3 hours ago
      I like the idea of a DSL for scraping, but my scrapers do more than extract text. I also download files (and monitor download progress) and intercept images (checking for partial or failed-to-load images). So it seems my use case isn't really covered by this.
      • zachperkitny · 3 hours ago
        Thanks for the idea, actually! It's difficult to cover every use case in the 0.1.0 release, but I'll take this into account. Downloading files/images could likely be abstracted into an HTTP source, and the data sources could be merged in some way.