Explanations
References to controversial social and political issues, particularly those involving race and exploitation.
Negative Logits
Token       Logit
philosoph   -0.15
ognito      -0.14
brag        -0.14
blas        -0.14
Walton      -0.14
Philosoph   -0.13
irez        -0.13
TED         -0.13
Modified    -0.13
buffered    -0.13
Positive Logits
Token       Logit
convenient  0.16
åζéĢł       0.16
nict        0.15
cad         0.15
playbook    0.15
handy       0.15
grievances  0.14
ichert      0.14
emot        0.14
icontrol    0.14
Activations Density: 0.248%