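    Negative Logits (logit, token)
    Tokens are shown in single quotes; a leading space inside the quotes is part of the token.
      -1.10  ' out'
      -1.09  ' one'
      -1.08  ' because'
      -1.06  'álbum' (Spanish/Portuguese: album)
      -0.99  ' inoxid'
      -0.96  'بيل'
      -0.95  ' userManager'
      -0.94  ' evidently'
      -0.94  ' anivers'
      -0.94  ' ristoranti' (Italian: restaurants)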
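    Positive Logits (logit, token)
    Tokens are shown in single quotes; ↵ denotes a newline character.
      1.37  ' három' (Hungarian: three)
      1.30  ' begrenzt' (German: limited)
      1.28  ' thre'
      1.27  ' két' (Hungarian: two)
      1.24  '↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵'
      1.23  ' nieuwe' (Dutch: new)
      1.22  ' ,"'
      1.21  '↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵'
      1.21  '↵↵↵↵↵↵↵↵↵↵↵↵↵'
      1.20  ' ___________'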
    Activation density: 0.010%

    No known activations.