INDEX
Explanations
proper nouns, particularly names and titles
Negative Logits
Token       Logit
faſt        -0.69
erop        -0.65
againſt     -0.65
itſelf      -0.64
iſt         -0.63
↵↵          -0.63
uſe         -0.62
preſent     -0.61
abstracta   -0.60
myſelf      -0.60
Positive Logits
Token       Logit
G           1.08
getM        1.00
setH        0.99
getB        0.98
W           0.98
O           0.97
S           0.97
M           0.96
getP        0.96
P           0.94
Activations Density: 0.859%