Explanations
proper nouns related to sports, entertainment, and journalism
Negative Logits
category    -0.67
igators     -0.66
Newtown     -0.65
undreds     -0.64
κ           -0.60
Seym        -0.60
ousands     -0.59
ãĥ¼ãĥĨ      -0.58
iosyncr     -0.58
Mehran      -0.57
Positive Logits
loves         0.96
's            0.94
acknowledges  0.93
hates         0.90
admits        0.89
knows         0.89
enjoys        0.88
concedes      0.87
vs            0.87
wrote         0.86
Activations Density 0.231%