INDEX
Explanations
pronouns 'it' and 'he' in sentences
the end of the document
Negative Logits
hips      -0.60
itaire    -0.55
paran     -0.51
funer     -0.50
dding     -0.49
-----     -0.49
Friendly  -0.49
izable    -0.49
Priv      -0.49
idon      -0.49
Positive Logits
zbollah   0.86
self      0.86
unes      0.83
chy       0.80
chwitz    0.79
alian     0.76
iner      0.76
asca      0.75
anium     0.72
seems     0.72
Activations Density 0.240%