INDEX
Explanations
No Explanations Found
Negative Logits
advis             -0.71
volunt            -0.70
phia              -0.69
Kron              -0.64
scen              -0.63
alike             -0.63
Classification    -0.63
beware            -0.63
cardinal          -0.62
itness            -0.62
Positive Logits
kill      0.76
ih        0.71
ayed      0.71
afe       0.70
ragon     0.70
aha       0.70
borgh     0.70
paying    0.68
weet      0.67
func      0.66
Activations Density 0.000%
No Known Activations
This feature has no known activations.