INDEX
Explanations
Twitter URLs
instances of the letter 't' or related tokens
Negative Logits
ĪĴ            -0.93
Dane          -0.86
Pigs          -0.71
Karma         -0.71
Desmond       -0.70
Decay         -0.70
Dull          -0.69
Wonderland    -0.69
tracts        -0.66
Corpus        -0.66
Positive Logits
youtube       0.99
(no token)    0.88
(no token)    0.82
etsy          0.81
gallery       0.80
ileaks        0.79
yp            0.78
cher          0.77
orah          0.76
github        0.76
Activations Density 0.033%