Explanations
proper nouns representing people
references to individuals or pronouns indicating people
Negative Logits
Token          Logit
feature        -0.71
features       -0.67
stem           -0.63
packed         -0.62
preparation    -0.61
enc            -0.61
primitive      -0.61
ends           -0.60
depress        -0.60
storage        -0.59
Positive Logits
Token          Logit
who            3.74
whose          2.47
Who            1.81
WHO            1.79
whom           1.78
who            1.72
how            1.49
where          1.43
Who            1.41
WHO            1.38
Activations Density: 0.017%
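
For readers unfamiliar with the density figure, it is usually read as the fraction of token positions on which the feature fires at all. The sketch below shows that calculation under stated assumptions: the function name `activation_density`, the zero threshold, and the toy activation list are all hypothetical and are not taken from this dashboard or from any specific library.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a feature counts as "active" on a token when its activation
    is strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 2 of which activate the feature.
toy = [0.0] * 10_000
toy[17] = 1.3
toy[4242] = 0.4

density = activation_density(toy)
print(f"{density:.3%}")
```

On this reading, a density of 0.017% would correspond to the feature firing on roughly 17 of every 100,000 tokens.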