INDEX
Explanations
references to evil or malevolent entities and their influence on society
Negative Logits
924      -0.15
nder     -0.15
allis    -0.15
succ     -0.15
anas     -0.14
aver     -0.14
599      -0.14
Metro    -0.14
Bod      -0.14
oin      -0.13
Positive Logits
targeting    0.18
eldo         0.14
ục           0.14
ëĵĿ          0.14
/problems    0.14
ulen         0.14
control      0.14
inger        0.14
ä¸ģ          0.13
ublik        0.13
Activations Density 0.418%
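
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of token positions on which the feature's activation is above zero; the function name, the threshold parameter, and the sample data are all hypothetical and for illustration only:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    The dashboard itself does not define the term.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1000 token positions, 4 of them with a nonzero activation.
sample = [0.0] * 996 + [0.7, 1.2, 0.3, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints "0.400%"
```

Under this reading, a density of 0.418% would mean the feature fires on roughly 4 of every 1000 token positions.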