INDEX
Explanations
positive attributes and states
Negative Logits
Token    Logit
R        0.51
lf       0.48
הז       0.43
CT       0.41
N        0.39
K        0.39
Z        0.39
S        0.39
Y        0.38
J        0.38
Positive Logits
Token    Logit
on       0.50
in       0.49
对待     0.45
while    0.44
ѝ        0.42
enough   0.41
eness    0.41
selama   0.40
kwenye   0.40
على      0.40
Activations Density: 0.144%