Any thoughts? Should I default to what's in the product title instead of the unit count? Not sure the best way to combat this.
answer_initial = llm(prompt=prompt, site=site)  # JSON with the answer plus whatever the heuristic checks need.
heuristic_results = heuristics(answer_initial)  # rule-based checks.
answer_final = llm(prompt=prompt, site=site, answer=answer_initial)
mark_for_review = ...  # basically a bunch of hard-coded rules I add to flag possible failures for review.
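A slightly fuller sketch of that flow, in case it helps. Everything here is an assumption for illustration: `count_balls`, the `ball_count` JSON key, the 1 to 144 sanity range, and the `llm` argument (any function that takes a prompt string and returns a JSON string). Swap in whatever client and rules you actually use.

```python
import json
from typing import Callable


def heuristics(answer: dict) -> list:
    """Rule-based checks; returns a list of problems (empty means it passed)."""
    problems = []
    balls = answer.get("ball_count")
    # Hypothetical sanity range: anything outside 1..144 balls looks suspicious.
    if not isinstance(balls, int) or not 1 <= balls <= 144:
        problems.append("ball_count missing or outside 1..144")
    return problems


def count_balls(listing: dict, llm: Callable[[str], str]) -> dict:
    """Two LLM passes with a rule-based check in between."""
    prompt = (
        'Reply with JSON like {"ball_count": 12} for this listing:\n'
        + json.dumps(listing)
    )
    answer_initial = json.loads(llm(prompt))
    heuristic_results = heuristics(answer_initial)

    # Second pass: show the model its first answer and any failed checks.
    second_prompt = (
        prompt
        + "\nPrevious answer: " + json.dumps(answer_initial)
        + "\nFailed checks: " + json.dumps(heuristic_results)
    )
    answer_final = json.loads(llm(second_prompt))

    # Anything that still fails the rules gets queued for a human.
    mark_for_review = bool(heuristics(answer_final))
    return {"answer": answer_final, "mark_for_review": mark_for_review}
```

Passing the model in as a plain callable keeps the rules testable without a network call; a stub that returns a fixed JSON string is enough to exercise the review flag.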
You can use an extremely small/cheap model for something like this -- granite 4.0 micro works fine for me, 3.3 8b did as well, both run on my MacBook. YMMV / try different models and see how it goes.
See also: toilet paper sheet count comparisons.
Consider the top four most expensive golf balls on your current list:
TaylorMade 2021 TP5x (3+1 Box) 4DZ Golf Ball Pack, White: uses "4DZ" in the title; unit count in product specs is 48.0.
Bridgestone Golf Tour B RXS Quadfecta: nothing in the title; unit count in product specs is 4.0. This one shows "4 dozen" in a different spot than the other balls.
TaylorMade Golf 2024 TP5 Golf Balls 3+1 Box Four Dozen: "Four Dozen" in the title; unit count in product specs is 1.0, but it has 4.0 dozen in the same div as the Bridgestone balls.
Srixon Z Star Yellow Golf Balls - Buy 2 DZ Get 1 DZ Free: the title shows "Buy 2 DZ Get 1 DZ Free", which is represented as 2+1 or 3+1 in other data. Unit count in product specs is 1.0.
In that extremely limited sample, product weight is a pretty good signal that the unit count is flawed, though that only works in comparison to other listings. I wonder if you could do a multi-pass approach: sort the data first, then do a unit count versus weight check to find outliers, and then start working through the titles? You’ll still end up digging through a lot of edge cases, and that won’t be much fun, but a multi-pass approach would at least give you some insight into those weird edge cases.
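Here is a rough sketch of the weight pass, assuming each listing has been scraped into a dict with `asin`, `weight_g` and `unit_count` keys. Those field names, the placeholder ASINs, the weights and the 2x tolerance are all made up for illustration. It compares each listing's weight per claimed unit against the median and flags the ones that are far off, which only works if most listings have a sane unit count to begin with.

```python
from statistics import median


def flag_weight_outliers(listings, tolerance=2.0):
    """Return ASINs whose weight per claimed unit is far from the median."""
    per_unit = [item["weight_g"] / item["unit_count"] for item in listings]
    typical = median(per_unit)
    flagged = []
    for item, weight in zip(listings, per_unit):
        ratio = weight / typical
        # Flag anything more than `tolerance` times heavier or lighter per unit.
        if ratio > tolerance or ratio < 1 / tolerance:
            flagged.append(item["asin"])
    return flagged


# Hypothetical sample: three consistent listings and two suspicious ones.
SAMPLE = [
    {"asin": "ASIN-A", "weight_g": 600, "unit_count": 12},
    {"asin": "ASIN-B", "weight_g": 1200, "unit_count": 24},
    {"asin": "ASIN-C", "weight_g": 2400, "unit_count": 48},
    {"asin": "ASIN-D", "weight_g": 2400, "unit_count": 4},   # really 4 dozen?
    {"asin": "ASIN-E", "weight_g": 2400, "unit_count": 1},   # really 4 dozen?
]

print(flag_weight_outliers(SAMPLE))
```

The flagged ASINs are the ones worth sending on to the title pass or to manual review.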
I'm thinking I could just start with any listing where unit count = 1 and take a pass at those first. I haven't looked yet, but I'm guessing single unit counts are almost always inconsistent with the actual number of golf balls.
If you click on the link you’ll see it says 4 dozen, 48 balls, on the box. And not just one.
I think your toggle idea is a good one though, and I'll look to implement that. I can see how some people might want that.
Yeah. I'm trying to figure out how to combat these inconsistencies. Right now, I have some manual overrides, but not sure it's sustainable to keep manually overriding inconsistent listings.
The submitter's description does touch on this a bit: the Amazon product description quantity for these items is "1"...
And it gets more complicated for the ones that are 2 dozen, plus 1 dozen 'free'...
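For the title side of this, a small regex pass covers the patterns seen in the four example titles in this thread. This is only a sketch: `balls_from_title` is a hypothetical helper, the word list stops at six, and it assumes that a bare "3+1" means dozens and that "Buy 2 DZ Get 1 DZ Free" ships 3 dozen. Titles with no quantity at all come back as `None` so they can fall through to another check.

```python
import re

WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6}
NUM = r"(\d+|" + "|".join(WORDS) + r")"
DOZEN = r"(?:dz|doz|dozen)\b"

BUY_GET = re.compile(rf"buy\s+{NUM}\s*{DOZEN}\s+get\s+{NUM}\s*{DOZEN}")
N_DOZEN = re.compile(rf"\b{NUM}\s*{DOZEN}")
PLUS = re.compile(r"\b(\d+)\s*\+\s*(\d+)\b")


def _to_int(token):
    return WORDS[token] if token in WORDS else int(token)


def balls_from_title(title):
    """Best-effort ball count from a listing title, or None if unclear."""
    text = title.lower()
    match = BUY_GET.search(text)
    if match:
        # "buy 2 dz get 1 dz free" -> 2 + 1 dozen
        return 12 * (_to_int(match.group(1)) + _to_int(match.group(2)))
    match = N_DOZEN.search(text)
    if match:
        # "4dz", "four dozen"
        return 12 * _to_int(match.group(1))
    match = PLUS.search(text)
    if match:
        # Assumption: a bare "3+1" counts dozens.
        return 12 * (int(match.group(1)) + int(match.group(2)))
    return None


for title in [
    "TaylorMade 2021 TP5x (3+1 Box) 4DZ Golf Ball Pack, White",
    "Srixon Z Star Yellow Golf Balls - Buy 2 DZ Get 1 DZ Free",
    "Bridgestone Golf Tour B RXS Quadfecta",
]:
    print(title, "->", balls_from_title(title))
```

The order of the checks matters: the buy/get pattern has to run before the plain dozen pattern, otherwise the Srixon title would be read as 2 dozen.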
The disk prices site does the exact same thing but the product is digital storage hardware. They made $50k from referrals to Amazon in 2024.
The disk prices site frustrates me because it illustrates so directly the costs imposed on the US by the current tariffs. I was able to get a 14 TB disk from there in December 2024 for $90, and now the cheapest is $220.
For determining the number of balls, I had an idea, but I'm not sure how well it’d fit in. Could you feed the listing title, unit count, and description into an LLM with a basic “figure out how many balls are in this listing and make sure that number makes sense with the price” prefix prompt, and then store that number with the ASIN? One LLM call per product should be pretty low cost, and it could automate a bunch of repetitive manual work.
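A minimal sketch of the "one call per product, stored with the ASIN" part, using SQLite as the store. The table name, the `ask_llm` callable (anything that takes the listing text and returns an integer ball count) and the per-ball price bounds are assumptions for illustration, not real thresholds.

```python
import sqlite3
from typing import Callable, Optional


def price_is_plausible(price: float, balls: int,
                       low: float = 0.25, high: float = 8.0) -> bool:
    """Hypothetical bounds on price per ball; tune against real listings."""
    return balls > 0 and low <= price / balls <= high


def get_ball_count(conn: sqlite3.Connection, asin: str, listing_text: str,
                   price: float, ask_llm: Callable[[str], int]) -> Optional[int]:
    """Return the cached count for an ASIN, calling the LLM only on a miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ball_counts "
        "(asin TEXT PRIMARY KEY, balls INTEGER)"
    )
    row = conn.execute(
        "SELECT balls FROM ball_counts WHERE asin = ?", (asin,)
    ).fetchone()
    if row is not None:
        return row[0]

    balls = ask_llm(listing_text)
    if not price_is_plausible(price, balls):
        # Implausible answers are not stored; the caller can flag for review.
        return None

    conn.execute("INSERT INTO ball_counts VALUES (?, ?)", (asin, balls))
    conn.commit()
    return balls
```

Keeping the price check outside the prompt means a bad answer never lands in the cache, and the same listing can be retried or reviewed by hand later.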