INDEX
    Explanations

    proper nouns, particularly names of authors or book titles

    Negative Logits
    " كومونز"             -0.75
    "Poznám"             -0.69
    "pueden"             -0.65
    " ricev"             -0.64
    "enzuela"            -0.62
    " ModelExpression"   -0.60
    " pinak"             -0.59
    "después"            -0.59
    "algunos"            -0.57
    " himo"              -0.57
    Positive Logits
    " subgoals"   0.56
    " inappro"    0.51
    " extré"      0.51
    " sokak"      0.51
    " célé"       0.49
    " desnuda"    0.49
    " Jr"         0.49
    " Hitam"      0.49
    " (@"         0.48
    " vecteur"    0.48
    Activation Density: 0.308%

    No Known Activations