INDEX
    Explanations

    references to friendliness and positive social interactions

    Negative Logits
    "ngo"       -0.17
    "stin"      -0.15
    "lage"      -0.14
    "åŁŁ"       -0.14
    "238"       -0.14
    "ilion"     -0.14
    "edBy"      -0.14
    "orial"     -0.14
    "sf"        -0.14
    "à¥Īà¤ľ"    -0.14
    Positive Logits
    "/lo"        0.17
    " towards"   0.17
    " toward"    0.16
    "nature"     0.16
    " faces"     0.16
    "udge"       0.16
    " nature"    0.16
    "/help"      0.16
    " tone"      0.15
    "-faced"     0.15
    Activation Density: 0.073%

    No Known Activations