INDEX
Explanations
phrases that indicate reasons, consequences, and benefits or harms
Negative Logits
Token     Logit
cloned    -0.16
εί        -0.15
alk       -0.14
Chow      -0.14
pand      -0.14
embed     -0.14
Proxy     -0.14
ted       -0.14
Grü       -0.14
Kaz       -0.13
Positive Logits
Token     Logit
asl       0.16
getti     0.16
irable    0.15
leo       0.15
emain     0.15
;o        0.15
>tag      0.15
sogar     0.14
.cv       0.14
jte       0.14
Activations Density: 0.220%