INDEX
Explanations
single uppercase letters or acronyms in a particular context
capital letters or proper nouns
Negative Logits
)=(         -0.79
hire        -0.78
bent        -0.73
xon         -0.73
negie       -0.72
REDACTED    -0.67
come        -0.64
crim        -0.64
orio        -0.64
bring       -0.63
Positive Logits
umps        0.94
ucks        0.85
ippers      0.82
oses        0.79
ixture      0.78
enses       0.77
umper       0.76
leep        0.75
agging      0.75
oots        0.74
Activations Density: 0.149%