Explanations
reflections and thoughts about personal experiences and feelings
Negative Logits
onn     -0.15
rone    -0.15
ime     -0.14
uc      -0.14
hap     -0.14
алом    -0.14
els     -0.13
nell    -0.13
von     -0.13
ith     -0.13
Positive Logits
:       0.32
‘       0.31
oh      0.23
ok      0.23
“       0.22
hey     0.21
`       0.21
«       0.21
'       0.20
Oh      0.20
Activations Density 0.263%