INDEX
    Explanations

    occurrences of manipulation or deception tactics

    Negative Logits
    Token           Logit
    " gesteld"      -0.48
    "seamnă"        -0.46
    " nivelul"      -0.46
    " geïsole"      -0.42
    "zeiro"         -0.42
    " inhoud"       -0.41
    " iub"          -0.41
    " zijne"        -0.40
    " betrekking"   -0.40
    " gogh"         -0.40
    Positive Logits
    Token           Logit
    " trick"        0.99
    " tricks"       0.92
    " STRATEGY"     0.88
    " strategy"     0.83
    " Tricks"       0.81
    " Trick"        0.80
    " cunning"      0.79
    " STRATEG"      0.78
    "trick"         0.78
    " strateg"      0.78
    Activation Density 0.533%

    No Known Activations