INDEX
Explanations
No Explanations Found
Negative Logits
Token     Logit
ILA       -0.94
DRAG      -0.84
helicop   -0.79
paio      -0.76
olkien    -0.73
FontSize  -0.71
PDATE     -0.69
enthusi   -0.68
querque   -0.68
ACLU      -0.67
Positive Logits
Token     Logit
mouth      0.81
hof        0.69
pg         0.68
Goods      0.67
ady        0.67
Samar      0.63
psy        0.62
actor      0.61
ory        0.60
itary      0.60
Activation Density: 0.000%
No Known Activations
This feature has no known activations.