INDEX
Explanations
negative stereotypes, especially important
Negative Logits
radix    0.48
LE       0.44
LE       0.43
LEON     0.41
ARIES    0.41
ezers    0.40
рад      0.40
ERICK    0.40
rada     0.38
le       0.38
Positive Logits
idey        0.40
Ideally     0.38
hani        0.38
issant      0.38
единен      0.37
ideally     0.36
ровании     0.36
intimate    0.36
bey         0.36
виправи     0.36
Activations Density: 0.001%