Explanations
phrases indicating excessive behavior or overreach
Negative Logits
token     logit
ussen     -0.16
iola      -0.15
stery     -0.15
rella     -0.15
dif       -0.14
yh        -0.14
.pref     -0.14
orget     -0.13
lectic    -0.13
ierge     -0.13
Positive Logits
token         logit
extreme       0.60
extremes      0.54
Extreme       0.53
Extreme       0.47
extrem        0.38
excess        0.35
excessive     0.34
extremism     0.34
極            0.29
extremists    0.28
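Lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and sorting the resulting per-token scores. The page does not say how its values were computed, so the following is only a minimal sketch under that assumption. The names `top_logit_tokens`, `decoder_dir`, `unembed`, and `vocab` are hypothetical, and the data is random toy data, not this feature's weights.

```python
import numpy as np

def top_logit_tokens(decoder_dir, unembed, vocab, k=10):
    """Return the k most negative and k most positive (token, logit) pairs.

    Assumption: decoder_dir has shape (d_model,) and unembed has shape
    (d_model, vocab_size); both names are illustrative, not from the page.
    """
    logits = decoder_dir @ unembed            # one score per vocabulary token
    order = np.argsort(logits)                # ascending: most negative first
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy data only: random weights and a placeholder vocabulary.
rng = np.random.default_rng(0)
vocab = [f"tok{i}" for i in range(50)]
unembed = rng.normal(size=(8, 50))
decoder_dir = rng.normal(size=8)

negative, positive = top_logit_tokens(decoder_dir, unembed, vocab, k=3)
for token, value in positive:
    print(f"{token:10s}{value: .2f}")
```

Under this reading, a positive value means the feature pushes the model toward predicting that token when it is active, and a negative value means it pushes away from it.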
Activations Density 0.231%
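An activation density is usually read as the fraction of sampled tokens on which the feature fires at all. The page does not define the term, so the sketch below assumes that reading. The function name `activation_density` and the `threshold` parameter are hypothetical, and the array is synthetic, sized so that it happens to reproduce the reported figure.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold.

    Assumption: a token counts as active when its activation is strictly
    greater than `threshold` (0.0 by default).
    """
    acts = np.asarray(activations, dtype=float)
    return float(np.mean(acts > threshold))

# Synthetic example: 231 active tokens out of 100,000 sampled tokens.
acts = np.zeros(100_000)
acts[:231] = 1.5

density = activation_density(acts)
print(f"{density:.3%}")
```

A density this low would mean the feature is silent on the vast majority of tokens, which is typical of a narrowly selective feature.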