INDEX
Explanations
proper nouns, particularly names like "Andre" with varying levels of activation
occurrences of the name "Andre."
Negative Logits
inct      -0.96
ulhu      -0.80
manship   -0.74
lishing   -0.73
stakes    -0.71
ointed    -0.68
plain     -0.67
yrinth    -0.65
lied      -0.65
lished    -0.65
Positive Logits
tti       1.14
essen     0.91
byss      0.82
cats      0.78
Andre     0.78
Gord      0.72
Paste     0.71
XIII      0.69
aic       0.68
idis      0.67
Activations Density: 0.023%