INDEX
Explanations
phrases indicating causation or reasoning
Negative Logits
ÂĿ            -0.16
ka            -0.16
nees          -0.15
/browse       -0.15
reau          -0.14
edium         -0.14
ãģŁãĤģãģ®    -0.13
ur            -0.13
cca           -0.13
yaw           -0.13
Positive Logits
reasons       0.26
lack          0.23
being         0.22
sheer         0.20
its           0.19
how           0.18
proximity     0.17
ximity        0.17
limited       0.17
fears         0.17
Activations Density 0.066%