INDEX
Explanations
explicit mentions of "specific" things or topics
phrases indicating specificity in context
Negative Logits
Token       Logit
rican       -0.72
OWER        -0.71
mol         -0.67
Feldman     -0.65
911         -0.64
Springer    -0.64
Dinosaur    -0.62
http        -0.61
Neighbor    -0.61
Hilton      -0.61
Positive Logits
Token       Logit
ities       1.07
ally        1.01
ivity       0.92
itarian     0.91
arily       0.90
iveness     0.88
ivities     0.88
ality       0.83
iations     0.83
atively     0.82
Activations Density 0.007%
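
For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed, assuming it means the fraction of sampled tokens on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and are not taken from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of sampled tokens on which
    the feature fires at all; the dashboard's exact definition may differ.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 7 of 100,000 sampled tokens.
sample = [0.0] * 99_993 + [1.5] * 7
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.007%
```

Under that assumption, a density of 0.007% corresponds to roughly 7 active tokens per 100,000 sampled.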