INDEX
Explanations
proper nouns, specifically names of people
mentions of the name "Alex."
Negative Logits
Token          Logit
purpose        -0.70
enegger        -0.70
recy           -0.66
final          -0.65
discouraging   -0.64
liness         -0.64
coded          -0.64
%%             -0.63
manship        -0.63
draft          -0.62
Positive Logits
Token          Logit
iev            0.89
Anton          0.85
illo           0.84
inia           0.82
Koz            0.81
Alexander      0.80
andra          0.80
iants          0.79
azines         0.78
anian          0.78
Activations Density 0.012%
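
As a rough illustration of what a density figure like 0.012% could mean, the sketch below assumes that density is the fraction of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the toy data are all assumptions made for illustration; none of them come from the dashboard itself.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of tokens where the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 3 of 25,000 tokens.
toy = [0.0] * 24_997 + [1.3, 0.4, 2.1]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.012%
```

Under this reading, a density of 0.012% corresponds to roughly one active token in every 8,000 or so, which fits a feature that responds only to a narrow set of inputs such as specific personal names.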