INDEX
Explanations
capitalized names indicating a specific individual
end-of-text tokens
Negative Logits
unpre      -0.67
Crimean    -0.62
cers       -0.61
cardio     -0.61
Glory      -0.60
outgoing   -0.59
minded     -0.59
Staples    -0.57
sers       -0.57
Kuro       -0.56
Positive Logits
ombie      1.51
ombies     1.48
ERO        1.39
ebra       1.28
odiac      1.27
imbabwe    1.27
ooming     1.16
oom        1.14
ealous     1.11
hao        1.09
Activations Density: 0.028%