Flaky tests usually come from either race conditions (waiting for the wrong thing) or environmental differences. Adding AI vision on top often adds another layer of flakiness - now you're debugging "why did the model misread this button" on top of "why did the test time out."
For mocking external services specifically, tools like MailHog (email) or a mock OAuth provider tend to be more reliable than screenshot-based approaches. The determinism matters: the test asserts on the actual message or token instead of on how a page happened to render.
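As a rough sketch of the MailHog route: the test polls MailHog's HTTP API until the expected email shows up, instead of sleeping for a fixed time. This assumes the `GET /api/v2/messages` response shape with `items[].Content.Headers`. The names `findMessage` and `waitForEmail`, the injected `fetchJson` parameter, and the timeout values are made up for illustration, so adjust them to your setup.

```typescript
// Only the parts of MailHog's /api/v2/messages response that this sketch reads.
interface MailHogMessage {
  Content: { Headers: Record<string, string[]>; Body: string };
}
interface MailHogResponse {
  items: MailHogMessage[];
}

// Pure lookup: first message addressed to `recipient` whose subject contains `subjectPart`.
function findMessage(
  res: MailHogResponse,
  recipient: string,
  subjectPart: string
): MailHogMessage | undefined {
  return res.items.find((m) => {
    const to = (m.Content.Headers["To"] ?? []).join(",");
    const subject = (m.Content.Headers["Subject"] ?? []).join(" ");
    return to.includes(recipient) && subject.includes(subjectPart);
  });
}

// Poll until the message arrives or the deadline passes (no fixed sleep).
// `fetchJson` is injected (hypothetical parameter) so the HTTP client is swappable.
async function waitForEmail(
  fetchJson: (url: string) => Promise<unknown>,
  recipient: string,
  subjectPart: string,
  baseUrl: string = "http://localhost:8025", // MailHog's default HTTP port
  timeoutMs: number = 10000,
  intervalMs: number = 250
): Promise<MailHogMessage> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = (await fetchJson(`${baseUrl}/api/v2/messages`)) as MailHogResponse;
    const found = findMessage(res, recipient, subjectPart);
    if (found) return found;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`No email to ${recipient} matching "${subjectPart}" within ${timeoutMs}ms`);
}
```

In a test you'd pass something like `(url) => fetch(url).then((r) => r.json())` as `fetchJson`, then assert on the returned message body (e.g. pull the reset link out of it). Clearing the MailHog inbox between tests keeps old messages from matching.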
That said, if you genuinely need to test against production-like visual state, Playwright's screenshot comparison (toHaveScreenshot) combined with proper wait strategies (wait for a specific element or state, not a fixed sleep) has gotten pretty solid. The visual regression approach catches layout bugs that functional tests miss.
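Something like the following is roughly what that looks like in practice. It's a sketch, not a drop-in: the `/dashboard` route (which assumes `baseURL` is set in your Playwright config), the heading name, the `timestamp` test id, and the 1% diff tolerance are all placeholders.

```typescript
import { test, expect } from '@playwright/test';

test('dashboard layout matches baseline', async ({ page }) => {
  await page.goto('/dashboard'); // hypothetical route; relies on baseURL in the config

  // Wait for something that proves the page is actually ready, not a fixed sleep.
  await expect(page.getByRole('heading', { name: 'Dashboard' })).toBeVisible();

  // Late-loading web fonts are a common source of pixel diffs.
  await page.evaluate(async () => {
    await document.fonts.ready;
  });

  await expect(page).toHaveScreenshot('dashboard.png', {
    animations: 'disabled',                // freeze CSS animations and transitions
    mask: [page.getByTestId('timestamp')], // hide content that changes on every run
    maxDiffPixelRatio: 0.01,               // placeholder tolerance, tune per project
  });
});
```

The first run writes the baseline image and later runs compare against it. Baselines are stored per browser and platform, so generating them in the same environment as CI (a container, for example) avoids the environmental-difference problem all over again.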