INDEX
    Explanations

    expressions of affection and admiration

    Negative Logits
    zelf
    -0.16
    agua
    -0.15
    indo
    -0.14
    obia
    -0.14
    الا
    -0.13
    earer
    -0.13
    antes
    -0.13
    boro
    -0.13
    lette
    -0.13
     arguably
    -0.13
    Positive Logits
     how
    0.32
    how
    0.23
     كيف
    0.20
     cómo
    0.20
    怎么
    0.19
    如何
    0.19
     hearing
    0.19
    eeee
    0.19
     rằng
    0.18
     everything
    0.18
    Act Density 0.038%

    No Known Activations