Explanations
specific words in a text, focusing on words rather than the context or structure of the sentences
Negative Logits
DERR  -0.96
roxy  -0.83
taboola  -0.77
©¶æ¥µ  -0.75
psey  -0.75
abama  -0.72
ersen  -0.72
rero  -0.71
ahon  -0.70
Democr  -0.70
Positive Logits
mith  1.17
sworth  1.08
ptr  0.93
ifier  0.89
press  0.82
diction  0.80
phrases  0.79
uttered  0.79
words  0.79
processor  0.78
Activation Density: 0.040%