Explanations
No Explanations Found
Negative Logits
ãĥĨ        -0.71
ibr        -0.70
stream     -0.69
ood        -0.63
Cornell    -0.63
['         -0.63
yi         -0.62
Â          -0.62
ãĥı        -0.62
ãĥ«        -0.60
Positive Logits
sacrific   0.84
challeng   0.83
comr       0.82
contrace   0.82
insula     0.81
comprom    0.80
Palestin   0.78
compe      0.77
condem     0.77
worldly    0.76
Activations Density 0.000%
No Known Activations
This feature has no known activations.