Explanations
No Explanations Found
Negative Logits
censorship   -0.69
spam         -0.69
xx           -0.65
mob          -0.64
ffect        -0.64
backdoor     -0.63
interven     -0.62
oldown       -0.61
ction        -0.60
llah         -0.60
Positive Logits
Atl          0.93
Parables     0.93
Palest       0.90
angan        0.72
eport        0.71
FontSize     0.70
Orth         0.69
Catalan      0.66
vec          0.66
beit         0.66
Activations Density: 0.000%
No Known Activations
This feature has no known activations.