Explanations
occurrences of manipulation or deception tactics
Negative Logits
Token         Logit
gesteld       -0.48
seamnă        -0.46
nivelul       -0.46
geïsole       -0.42
zeiro         -0.42
inhoud        -0.41
iub           -0.41
zijne         -0.40
betrekking    -0.40
gogh          -0.40
Positive Logits
Token         Logit
trick         0.99
tricks        0.92
STRATEGY      0.88
strategy      0.83
Tricks        0.81
Trick         0.80
cunning       0.79
STRATEG       0.78
trick         0.78
strateg       0.78
Activations Density 0.533%
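
An activation density figure like the one above is commonly read as the share of sampled token positions on which the feature fires. The page does not say how the value was computed, so the following is only a minimal sketch under that assumption; the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" on a token when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1,000 token positions, 5 of which activate the feature.
sample = [0.0] * 995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.500%"
```

Under this reading, a density of 0.533% would mean the feature is active on roughly 5 of every 1,000 sampled tokens.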