    Negative Logits
    Token           Logit
    "ͅ"             -0.85
    " Begin"        -0.81
    " Tiên"         -0.80
    "שְׁ"            -0.79
    (blank in source)  -0.76
    "zło"           -0.75
    " Oats"         -0.75
    " Pozna"        -0.75
    "स्ट"           -0.75
    " incorrect"    -0.75
    Positive Logits
    Token           Logit
    "Questão"       0.86
    "👷"            0.77
    " θα"           0.71
    " Dash"         0.70
    "考えると"      0.68
    " Gou"          0.68
    "EDWARD"        0.68
    "ilia"          0.68
    "くださった"    0.67
    " these"        0.67
    Activation Density: 0.029%

    No Known Activations