Explanations
references to the speaker or first-person perspective
Negative Logits
partName      -0.70
vati          -0.69
irlf          -0.68
motivations   -0.66
motiv         -0.66
aturday       -0.65
motivating    -0.63
leys          -0.62
srfAttach     -0.61
motivated     -0.61
Positive Logits
paraph        1.01
forget        0.92
typo          0.80
forgot        0.79
forgetting    0.79
dunno         0.71
swear         0.70
mean          0.69
LV            0.68
ours          0.67
Activations Density 0.572%