INDEX
Explanations
phrases introducing alternative or clarifying information
expressions that reference alternative explanations or viewpoints
Negative Logits
yip        -0.63
sson       -0.62
overcame   -0.61
iste       -0.60
akery      -0.60
iesta      -0.59
onis       -0.59
gets       -0.58
finally    -0.58
Accept     -0.58
Positive Logits
worldly        1.39
words          1.05
respects       0.91
wise           0.85
words          0.84
contexts       0.83
vein           0.81
manner         0.79
jurisdictions  0.79
areas          0.78
Activations Density: 0.025%