Explanations
phrases that refer to specific cases or instances
Negative Logits
Token      Logit
lie        -0.17
wag        -0.15
egas       -0.15
arrants    -0.15
king       -0.15
sb         -0.15
sel        -0.14
fuse       -0.14
Kingdom    -0.14
ikh        -0.14
Positive Logits
Token      Logit
case       0.16
ulary      0.15
rowave     0.15
isphere    0.14
ipline     0.14
Sharper    0.14
uais       0.14
coon       0.14
-Mart      0.14
zik        0.14
Activations Density: 0.089%
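
The density figure is commonly read as the share of token positions on which the feature fires at all. The sketch below illustrates that reading only; the page does not say how its value was computed, so the function name `activation_density`, the zero threshold, and the sample data are all assumptions made for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds threshold.

    Assumption: a position counts as "active" when its activation is
    strictly greater than `threshold` (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: 10,000 token positions, 9 of which are active.
sample = [0.0] * 9991 + [1.3] * 9
print(f"{activation_density(sample):.3f}%")
```

A density well below 1%, as reported here, would mean the feature is silent on almost every token, which is typical of a sparse feature.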