INDEX
Explanations
normalization of harmful speech and behavior
Negative Logits
t               1.09
d               0.92
nem             0.83
g               0.82
lük             0.82
lots            0.80
لي              0.77
-               0.77
(               0.76
normal          0.75
Positive Logits
normalized      1.16
normalization   1.11
normalised      1.05
normalizing     1.03
Normalize       1.03
normalize       1.01
ط               0.96
zione           0.93
normalize       0.89
Normalized      0.86
Activations Density: 0.013%