INDEX
Explanations
phrases indicating strong emotions or opinions, often in a negative context
instances of rants and violent outbursts
Negative Logits
undai        -0.88
metics       -0.86
tarians      -0.75
ierrez       -0.72
rity         -0.71
cius         -0.71
liv          -0.68
Orchestra    -0.66
Assistance   -0.66
tarian       -0.65
Positive Logits
tir          0.86
vengeance    0.84
atical       0.82
against      0.78
quit         0.76
spree        0.74
rampage      0.73
AGA          0.73
¯¯¯¯         0.73
rage         0.71
Activations Density 0.087%