INDEX
Explanations
No Explanations Found
Negative Logits
haar          -0.90
mble          -0.86
ptoms         -0.84
llah          -0.84
schild        -0.84
gaard         -0.79
restricted    -0.78
beit          -0.76
reau          -0.74
burst         -0.74
Positive Logits
policy        1.17
policies      1.00
Policy        0.84
Policies      0.77
policy        0.73
Mayo          0.69
Doodle        0.67
Barbie        0.66
Ike           0.66
Ellison       0.66
Activations Density: 0.000%
No Known Activations
This feature has no known activations.