INDEX
Explanations
phrases indicating causation or reasoning
Negative Logits
avad              -0.18
ÏĦÎŃ              -0.16
ettes             -0.15
scopes            -0.15
nox               -0.15
.override         -0.15
ancode            -0.15
artial            -0.15
.djangoproject    -0.14
hausen            -0.14
Positive Logits
745               0.19
965               0.17
797               0.16
_icons            0.15
964               0.15
zed               0.15
ared              0.15
ARED              0.15
815               0.14
verbatim          0.14
Activations Density: 0.082%