Explanations
phrases indicating justification or excuses for behavior
Negative Logits
kj              -0.15
ÙĤÙī            -0.14
.fhir           -0.14
ersiz           -0.14
eous            -0.14
aired           -0.14
verbatim        -0.14
trú             -0.14
.AnchorStyles   -0.13
eec             -0.13
Positive Logits
measure         0.29
respect         0.28
accounts        0.27
sense           0.27
regards         0.27
stretch         0.26
extent          0.25
respects        0.25
measures        0.25
degree          0.24
Activations Density 0.048%