INDEX
Explanations
references to advantages or positive outcomes
Negative Logits
Lw        -0.72
שוליים    -0.68
dymyr     -0.66
betical   -0.63
Geiger    -0.63
yty       -0.60
userDao   -0.60
ráp       -0.58
mkdir     -0.58
stom      -0.58
Positive Logits
benefits  2.29
Benefits  2.05
benefit   2.01
benefits  1.89
Benefits  1.88
Benefit   1.85
Benefit   1.85
benefit   1.83
BENEFITS  1.80
BENEFITS  1.69
Activations Density: 0.072%