INDEX
    Explanations

    describes actions or states

    Negative Logits
                1.45
        U       1.16
        F       1.12
        AL      1.09
        O       1.06
        }       1.00
        },      0.98
        У       0.96
        ER      0.95
        }$      0.93
    Positive Logits
        س       1.16
        ти      1.07
        िया      1.05
                0.98
        は何    0.97
        товые   0.94
        ឱ្យ      0.93
        те      0.93
        सिया     0.93
        ルス    0.91
    Act Density 0.181%

    No Known Activations