INDEX
Explanations
specific numbers or numerical information in the text
parentheses and similar punctuation
Negative Logits
Token         Logit
hid           -0.85
resur         -0.80
habit         -0.74
haunted       -0.74
tongue        -0.73
hiding        -0.73
personality   -0.72
bro           -0.72
breed         -0.71
firsthand     -0.70
Positive Logits
Token          Logit
excluding      1.89
including      1.68
average        1.66
minimum        1.64
approximately  1.62
depending      1.58
meaning        1.52
adjusted       1.50
total          1.49
according      1.48
Activations Density: 0.126%