INDEX
Explanations
references to specific companies or organizations as well as words related to negative social behaviors
references to companies and antisemitism
Negative Logits
Nile      -0.66
åĮ        -0.66
tails     -0.65
isen      -0.63
eering    -0.61
ogy       -0.60
heads     -0.59
Worlds    -0.59
itarian   -0.58
arty      -0.58
Positive Logits
aurus     1.25
creen     1.23
ystem     1.20
earch     1.17
erver     1.16
ocial     1.12
ullivan   1.10
hiba      1.09
cript     1.05
omething  1.04
Activations Density 0.057%