Explanations
phrases indicating clarification or explanation
conditional statements and qualifications in the text
Negative Logits
awoken        -0.76
anded         -0.72
accompan      -0.65
handled       -0.62
fed           -0.61
footed        -0.61
figured       -0.59
oru           -0.57
voic          -0.57
ļéĨĴ          -0.57
Positive Logits
necessarily    0.88
exaggeration   0.86
anymore        0.81
exagger        0.76
lightly        0.71
dissu          0.71
imilar         0.71
azes           0.69
discouraged    0.69
anything       0.68
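Logit lists like the two above are commonly produced by projecting a feature's decoder direction through the model's unembedding matrix and then ranking the per-token scores. The page does not state how its values were computed, so treat that as an assumption. The sketch below shows only the projection-and-ranking step with plain Python lists. The names `logit_effects` and `top_k_tokens`, the row-per-token unembedding layout, and the toy vocabulary are all hypothetical and are not taken from the dashboard.

```python
def logit_effects(decoder_dir, unembed):
    """Dot the feature's decoder direction with each token's unembedding row.

    Assumption: `unembed` is a list of rows, one per vocabulary token,
    each of length d_model (a hypothetical layout chosen for illustration).
    """
    return [sum(d * w for d, w in zip(decoder_dir, row)) for row in unembed]


def top_k_tokens(effects, vocab, k):
    """Return (most positive, most negative) k tokens by logit effect."""
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]                 # smallest scores first
    positive = list(reversed(ranked))[:k]  # largest scores first
    return positive, negative


if __name__ == "__main__":
    # Toy 2-dimensional model with a 4-token vocabulary (made-up numbers).
    direction = [1.0, -1.0]
    unembed = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0], [0.25, -0.5]]
    vocab = ["a", "b", "c", "d"]
    pos, neg = top_k_tokens(logit_effects(direction, unembed), vocab, 2)
    print("positive:", pos)
    print("negative:", neg)
```

Under this reading, a token such as "necessarily" sits at the top of the positive list because its unembedding row points most nearly along the feature's decoder direction, so the feature pushes the model toward predicting it.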
Activation Density: 0.414%
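Activation density is usually reported as the percentage of sampled tokens on which the feature's activation is above zero. The page does not specify the token sample or the firing threshold behind 0.414%, so the `threshold` parameter and the function name `activation_density` below are hypothetical. This is a minimal sketch of the calculation under that assumption.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold`.

    Assumption: a feature "fires" when its activation is strictly greater
    than the threshold (0.0 by default).
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


if __name__ == "__main__":
    # One of four toy tokens fires, giving a density of 25%.
    print(activation_density([0.0, 1.2, 0.0, 0.0]))
```

Read this way, a density of 0.414% means the feature is active on roughly 4 of every 1,000 tokens, which makes it a fairly sparse feature.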