Explanations
pronouns or nouns referring to people
references to people and their actions or experiences
Negative Logits
Token       Logit
Eleven      -0.75
Around      -0.67
wikipedia   -0.66
amaz        -0.65
ogue        -0.64
Greatest    -0.64
Deg         -0.63
Dayton      -0.62
icion       -0.62
aign        -0.61
Positive Logits
Token       Logit
knew        1.00
lacked      1.00
didn        0.98
've         0.95
feared      0.93
forgot      0.93
got         0.91
couldn      0.89
didnt       0.89
hadn        0.89
Activations Density: 0.137%