INDEX
Explanations
No Explanations Found
Negative Logits
found      -0.92
haven      -0.81
duct       -0.79
ldon       -0.77
iHUD       -0.76
sonian     -0.76
tex        -0.75
cit        -0.71
netflix    -0.70
article    -0.69
Positive Logits
typing          0.64
memor           0.63
badges          0.62
verification    0.61
reporting       0.61
verbs           0.60
Friendship      0.59
courage         0.58
metic           0.58
copying         0.58
Activations Density: 0.000%
No Known Activations
This feature has no known activations.