INDEX
Explanations
phrases that relate to community or social groups
Negative Logits
strides          -0.64
incidental       -0.64
*/(              -0.63
ambers           -0.62
naire            -0.60
onne             -0.59
rums             -0.59
assertions       -0.58
absorb           -0.58
probabilities    -0.58
Positive Logits
ours             0.76
rica             0.75
Humanity         0.70
Israel           0.68
Mine             0.68
hers             0.68
tnc              0.67
Charity          0.66
mine             0.65
Ju               0.65
Activations Density: 0.032%