INDEX
Explanations
dialogue indicating agreement or acknowledgment
Negative Logits
burn      -0.18
irie      -0.15
rient     -0.15
ouver     -0.14
pring     -0.14
Vent      -0.14
.opens    -0.14
ilter     -0.14
ourn      -0.13
tụ        -0.13
Positive Logits
osto         0.17
ographics    0.15
edral        0.15
183          0.15
signals      0.15
grow         0.14
sat          0.14
889          0.14
auty         0.13
371          0.13
Activations Density: 0.048%