INDEX
Explanations
phrases involving negation or refusal
Negative Logits
sov         -0.16
ieux        -0.15
eva         -0.15
wy          -0.15
wan         -0.15
ries        -0.15
aeper       -0.14
قق          -0.14
hazi        -0.14
geois       -0.14
Positive Logits
oundary      0.15
じ           0.14
ject         0.14
UpdatedAt    0.14
NavItem      0.14
립           0.14
dfd          0.13
penal        0.13
Blades       0.13
ایر          0.13
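Lists like the two above are commonly produced by projecting a feature's decoder direction onto the model's unembedding matrix and keeping the most positive and most negative entries. The sketch below assumes that method. It is not confirmed by this dashboard, and the function names `logit_effects` and `top_and_bottom` and the toy shapes and values are hypothetical.

```python
# Hypothetical sketch: rank vocabulary tokens by the direct effect of one
# feature's decoder direction on the output logits (decoder_dir @ W_U).
# Names, shapes, and values are illustrative assumptions only.
from typing import List, Sequence, Tuple


def logit_effects(
    decoder_dir: Sequence[float],        # feature decoder vector, length d_model
    unembed: Sequence[Sequence[float]],  # W_U stored as d_model rows x vocab columns
) -> List[float]:
    """Return one logit contribution per vocabulary token."""
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(
    effects: Sequence[float], tokens: Sequence[str], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most positive and the k most negative (token, effect) pairs."""
    pairs = list(zip(tokens, effects))
    positive = sorted(pairs, key=lambda p: p[1], reverse=True)[:k]
    negative = sorted(pairs, key=lambda p: p[1])[:k]
    return positive, negative


# Toy example: d_model = 2, vocabulary of 4 tokens.
tokens = ["a", "b", "c", "d"]
decoder_dir = [1.0, -0.5]
unembed = [
    [0.2, -0.1, 0.0, 0.3],
    [0.4, 0.2, -0.2, 0.0],
]
effects = logit_effects(decoder_dir, unembed)
pos, neg = top_and_bottom(effects, tokens, k=2)
print("positive:", pos)
print("negative:", neg)
```

In the toy run, token "d" gets the largest boost and "b" the largest suppression. The real lists above would come from the full vocabulary with k = 10.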
Activation density: 0.082%
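A density of 0.082% means the feature is active on roughly 82 of every 100,000 tokens. The sketch below assumes density is the fraction of token positions with a strictly positive activation. That is a common convention, but this dashboard does not state it. The helper `activation_density` and the toy activations are hypothetical.

```python
# Hypothetical sketch: activation density as the fraction of token positions
# where a feature's activation is strictly positive (an assumed convention).
from typing import Sequence


def activation_density(activations: Sequence[float]) -> float:
    """Fraction of token positions on which the feature fires (activation > 0)."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > 0)
    return firing / len(activations)


# Toy example: the feature fires on 2 of 8 tokens.
acts = [0.0, 1.3, 0.0, 0.0, 0.0, 0.7, 0.0, 0.0]
print(f"{activation_density(acts):.3%}")
```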