  • saznlamsal 3 hours ago
    Hello HN, I’m the author of Ghost Engine.

    I built this to challenge the assumption that we are strictly bound by the “Memory Wall.” My hypothesis was that modern consumer silicon (like Apple M-series) has enough spare compute to procedurally decompress weights faster than it could stream the full-precision weights from RAM.

    The Architecture: It uses a "Predator-Prey" method:

    Predators: We identify and preserve the high-magnitude outliers (the "Alpha" weights) in FP16.

    Prey: The remaining weights are compressed into ternary masks {-1, 0, 1} and a block-wise scalar.

    Reconstruction: A bitwise kernel reconstructs the layer in L2 cache during the forward pass (a sketch of the full round trip follows below).
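    For anyone who wants the mechanics, here is a minimal NumPy sketch of the split and the reconstruction. The names (ghost_compress, alpha_frac, block_size) and the 1% outlier fraction are my own placeholders for illustration, not the repo's actual API or settings; the repo itself does this in MLX.

        import numpy as np

        def ghost_compress(W, alpha_frac=0.01, block_size=64):
            # Predators: keep the top alpha_frac of weights by magnitude, exactly, in FP16.
            flat = W.astype(np.float32).ravel()
            k = max(1, int(alpha_frac * flat.size))
            alpha_idx = np.argpartition(np.abs(flat), -k)[-k:]
            alpha_vals = flat[alpha_idx].astype(np.float16)

            # Prey: zero out the outliers, then quantize the rest to {-1, 0, +1} per block.
            prey = flat.copy()
            prey[alpha_idx] = 0.0
            pad = (-prey.size) % block_size
            blocks = np.pad(prey, (0, pad)).reshape(-1, block_size)
            scales = np.mean(np.abs(blocks), axis=1, keepdims=True)      # block-wise scalar
            ternary = np.where(np.abs(blocks) > 0.5 * scales,
                               np.sign(blocks), 0.0).astype(np.int8)     # ternary mask
            return alpha_idx, alpha_vals, ternary, scales.astype(np.float16), W.shape, pad

        def ghost_decompress(alpha_idx, alpha_vals, ternary, scales, shape, pad):
            # What the bitwise kernel does on the fly: scale the mask, splice the outliers back.
            flat = (ternary.astype(np.float32) * scales.astype(np.float32)).ravel()
            flat = flat[:flat.size - pad] if pad else flat
            flat[alpha_idx] = alpha_vals.astype(np.float32)
            return flat.reshape(shape)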

    Results: I validated this on Llama-3-8B (Layer 20, SwiGLU).

    Compression: ~3.0 bits per weight (effective).

    Fidelity: 0.915 Cosine Similarity (weights) / 0.912 (outputs).

    Size: Brings an 8B model down to ~3GB.
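    Rough back-of-envelope for where ~3 bits per weight and ~3 GB come from. The outlier fraction, block size, and index width below are my assumptions for illustration, not measured numbers from the repo:

        # Illustrative storage budget per weight, assuming ~1% FP16 outliers
        # and one FP16 scale per 64-weight block:
        ternary_bits = 2.0                   # {-1, 0, +1} packed as 2-bit codes
        scale_bits   = 16.0 / 64             # 0.25 bits/weight for block scales
        outlier_bits = 0.01 * (16 + 32)      # FP16 value + 32-bit index for ~1% of weights
        bits_per_weight = ternary_bits + scale_bits + outlier_bits   # ~2.7, i.e. ~3 effective

        size_bytes = 8e9 * bits_per_weight / 8
        print(size_bytes / 1e9)              # ~2.7 GB here; at 3.0 bits/weight exactly, 3.0 GB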

    The repo is currently a "Proof of Engine" in Python/MLX. The math works, but to realize the theoretical throughput (125 t/s), I am working on porting the decompression kernels to Metal/CUDA; the core unpack step they would run is sketched below.
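    The hot loop of that port is essentially unpacking 2-bit ternary codes and fusing the scale multiply. A pure-Python stand-in for that inner step (pack_ternary / unpack_ternary are illustrative names, and 2-bits-per-weight packing is an assumption about the storage format):

        import numpy as np

        def pack_ternary(t):
            # Map {-1, 0, +1} -> {0, 1, 2} and pack four 2-bit codes per byte.
            # Assumes len(t) is a multiple of 4.
            codes = (t + 1).astype(np.uint8).reshape(-1, 4)
            return (codes[:, 0] | (codes[:, 1] << 2) |
                    (codes[:, 2] << 4) | (codes[:, 3] << 6)).astype(np.uint8)

        def unpack_ternary(packed):
            # The shift/mask pattern a Metal/CUDA thread would run, then subtract 1
            # to get back to {-1, 0, +1} before multiplying by the block scale.
            shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
            codes = (packed[:, None] >> shifts) & 0b11
            return codes.astype(np.int8).ravel() - 1

    The point is that only the packed bytes cross the memory bus; the expansion back to FP16 happens in registers/cache, which is the bandwidth trade described above.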

    Happy to answer questions about the compression logic or the "Predator" selection algorithm!