    Negative Logits
    Token            Logit
    " even"          -1.08
    " especially"    -1.07
    " not"           -1.04
    " vicin"         -1.03
    " first"         -0.98
    " huwa"          -0.97
    " secretly"      -0.96
    " only"          -0.95
    "配慮"           -0.91
    " how"           -0.91
    Positive Logits
    Token               Logit
    (no token shown)    1.03
    " trám"             0.91
    "Similar"           0.91
    "—¡"                0.88
    " Auflösung"        0.87
    " Estas"            0.86
    " két"              0.85
    (no token shown)    0.85
    " Posteriormente"   0.85
    "asily"             0.83
    Activation Density: 0.004%

    No Known Activations