INDEX
Explanations
references to social media, specifically Twitter
Negative Logits
igham    -0.15
gings    -0.14
ONO      -0.14
otts     -0.14
769      -0.14
bum      -0.14
aland    -0.14
Hentai   -0.13
splash   -0.13
000      -0.13
Positive Logits
ÑĢеб     0.16
pic      0.16
THREAD   0.15
Tweet    0.15
FACT     0.14
(blank)  0.14
edn      0.14
pic      0.14
Tweet    0.14
amen     0.14
Activations Density 0.002%