INDEX
Explanations
mentions of news sources and social media links
Negative Logits
ÙĦس                -0.17
io                 -0.16
hd                 -0.15
abling             -0.15
meme               -0.15
-Smith             -0.15
ÏģÏį               -0.14
hydr               -0.14
gap                -0.14
268                -0.14
Positive Logits
isas               0.20
Arg                0.16
Mirror             0.16
bosses             0.16
Crime              0.16
boss               0.15
abcdefghijklmnop   0.15
Crime              0.15
Mirror             0.15
crime              0.15
Activations Density 0.027%