INDEX
    Explanations
    NEGATIVE LOGITS
    "мою"           -0.83
    " freezes"      -0.79
    " recupero"     -0.75
    "زوج"           -0.75
    "正直"          -0.73
    " Sympathi"     -0.71
    "ดง"            -0.70
    " vĩnh"         -0.69
    " alluminio"    -0.69
    "trainable"     -0.68
    POSITIVE LOGITS
    " safety"       2.61
    " security"     2.44
    " sécurité"     2.16
    " Security"     2.11
    " Safety"       2.08
    "Safety"        1.97
    "safety"        1.96
    "security"      1.94
    " seguridad"    1.92
    " SAFETY"       1.84
    Activation Density: 0.011%

    No Known Activations