INDEX
Explanations
percentage values
NEGATIVE LOGITS
sw        -0.58
no        -0.55
dating    -0.55
forth     -0.54
taller    -0.53
cons      -0.52
office    -0.51
Japan     -0.51
sent      -0.50
long      -0.50
POSITIVE LOGITS
%.        3.79
%).       2.64
%,        2.62
%:        2.50
%;        2.48
%-        2.06
%         2.03
%"        1.90
%),       1.85
%]        1.79
Activation density: 0.006%