Explanations
terms related to societal structures, power dynamics, and social values
Negative Logits
-0.66
[])
-0.59
******/
-0.54
[]){
-0.53
"]));
-0.53
()])
-0.53
'][]
-0.52
[])
-0.51
[]
-0.51
,:),
-0.50
Positive Logits
always
1.00
always
0.90
siempre
0.85
alone
0.82
needn
0.81
often
0.79
usually
0.79
всегда
0.76
itself
0.75
と聞
0.75
Activations Density 0.737%