Explanations
No Explanations Found
Negative Logits
Token       Logit
kindly      -0.69
DEBUG       -0.67
ourage      -0.66
brave       -0.66
handed      -0.66
ear         -0.65
gotten      -0.65
gaps        -0.64
ACL         -0.64
loopholes   -0.64
Positive Logits
Token       Logit
psc         0.77
̶           0.73
tumblr      0.73
OUP         0.68
JD          0.67
Mond        0.66
UA          0.66
oa          0.66
ITE         0.65
lez         0.64
Activation Density: 0.000%
No Known Activations
This feature has no known activations.