There's also 2D RoPE for ViT, but I don't know exactly how it works.
This is an incredibly impressive result, but I'm wary of using metrics like FID to evaluate it. I can take a high-res image, downsample it, run the method on the downsampled copy, and measure performance directly: what percentage of pixels were correctly restored? Instead they're using metrics like FID, which (while useful for purely generative techniques) seem a little vague for this purpose.
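To make the pixel-level check concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the function names (`downsample`, `upsample_nearest`, `pixel_accuracy`, `psnr`) are hypothetical, the random image is a placeholder for a real high-res photo, and nearest-neighbour upsampling is only a stand-in for the method being discussed, which is not reproduced here.

```python
import numpy as np


def downsample(img: np.ndarray, factor: int) -> np.ndarray:
    """Block-average an HxWxC uint8 image by an integer factor."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    # Crop so both sides divide evenly, then average each factor x factor block.
    cropped = img[: h2 * factor, : w2 * factor].astype(np.float64)
    blocks = cropped.reshape(h2, factor, w2, factor, c)
    return blocks.mean(axis=(1, 3)).round().astype(np.uint8)


def upsample_nearest(img: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in for the restoration method: nearest-neighbour upsampling."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)


def pixel_accuracy(reference: np.ndarray, restored: np.ndarray, tol: int = 0) -> float:
    """Fraction of pixels whose every channel is within `tol` of the reference."""
    diff = np.abs(reference.astype(np.int16) - restored.astype(np.int16))
    return float((diff.max(axis=-1) <= tol).mean())


def psnr(reference: np.ndarray, restored: np.ndarray) -> float:
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10 * np.log10(255.0 ** 2 / mse))


# Placeholder "high-res" image; a real evaluation would load actual photos.
rng = np.random.default_rng(0)
hi_res = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

low_res = downsample(hi_res, 4)
restored = upsample_nearest(low_res, 4)  # swap in the real method here

print("exact-match pixels:", pixel_accuracy(hi_res, restored))
print("within +/-8 levels:", pixel_accuracy(hi_res, restored, tol=8))
print("PSNR (dB):", psnr(hi_res, restored))
```

The `tol` parameter is there because an exact-match criterion is very strict for 8-bit images; PSNR is included as a common reference-based alternative. Neither is claimed to be what the paper should have reported, only the kind of paired measurement the comment above has in mind.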
Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient
instructions to faithfully reproduce the main experimental results, as described in
supplemental material?
Answer: [No]
Justification: Although we have answered “No” for now, we intend to release the code and
models to enable the reproducibility of our main experimental results, pending approval
from the legal department. This temporary status reflects our commitment to open access
once all necessary permissions are secured.