INDEX
Explanations
strong, negative actions or criticisms
instances of verbal aggression or confrontation in text
Negative Logits
Notting    -0.74
orderly    -0.70
Transfer   -0.68
Genius     -0.67
Transform  -0.65
stable     -0.62
sterdam    -0.59
Alive      -0.59
safest     -0.58
kj         -0.58
Positive Logits
accusing     1.00
against      0.91
leveled      0.84
jab          0.81
accuses      0.81
criticizing  0.81
critiques    0.80
critics      0.79
insults      0.79
against      0.78
Activations Density: 0.203%