Explanations
No Explanations Found
Negative Logits
sqor        -0.80
habi        -0.76
halftime    -0.71
merga       -0.67
capit       -0.65
antibody    -0.64
membr       -0.64
ortmund     -0.63
clos        -0.62
antibodies  -0.62
Positive Logits
Variable    0.74
Dying       0.74
xp          0.69
Paper       0.66
Activ       0.62
Cy          0.62
Vanity      0.62
401         0.61
tc          0.61
rek         0.60
Activations Density: 0.000%
No Known Activations
This feature has no known activations.