Explanations
phrases indicating "in other words."
Negative Logits
Token       Logit
apego       -0.67
yip         -0.66
atism       -0.64
overcame    -0.61
avorite     -0.59
outweigh    -0.58
icides      -0.58
Always      -0.57
iste        -0.56
achelor     -0.56
Positive Logits
Token           Logit
words           1.11
worldly         1.06
words           1.04
respects        0.94
wise            0.86
contexts        0.85
word            0.83
instances       0.80
circumstances   0.78
areas           0.77
Activations Density: 0.016%
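
The density figure above is most naturally read as the fraction of sampled tokens on which this feature is active. The following is a minimal sketch under that assumption; the page does not state the exact threshold or token sample it uses. The function name `activation_density` and the sample values are hypothetical, and the sample is constructed only to illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires at all; the dashboard's actual definition may differ.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample (not real data): 100,000 tokens, feature active on 16.
sample = [0.0] * 99_984 + [1.0] * 16
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low would mean the feature fires on roughly one or two tokens in every ten thousand, which fits a feature tied to a narrow phrase pattern such as "in other words."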