INDEX
Explanations
words related to implications or suggestions
words related to implications or suggesting conclusions
Negative Logits
eret       -0.76
"},{"      -0.74
foot       -0.68
surfing    -0.67
ryu        -0.67
vre        -0.66
igo        -0.65
erate      -0.64
SEA        -0.64
meter      -0.63
Positive Logits
impl        3.96
impl        2.22
Impl        1.60
Impl        1.32
collapse    1.16
unravel     1.07
collapsing  1.05
expl        1.05
expl        1.04
collapses   0.95
Activations Density 0.019%