INDEX
Explanations
personal pronouns and words related to personal experiences and beliefs
references to personal experiences and feelings related to everyday life
Negative Logits
lyak            -0.77
gart            -0.68
odan            -0.61
reciproc        -0.59
unsurprisingly  -0.56
RAFT            -0.55
majorities      -0.55
winner          -0.55
YN              -0.55
proxy           -0.54
Positive Logits
pires           0.95
imaginable      0.95
except          0.93
except          0.88
エル            0.86
abilia          0.82
EVER            0.77
ever            0.75
ゴ              0.72
including       0.72
Activations Density 0.166%