Explanations
mentions of locations or entities in sentences
references to specific places, events, or classifications
Negative Logits
dan     -0.93
663     -0.83
VII     -0.82
vi      -0.81
VE      -0.81
665     -0.80
uce     -0.80
Area    -0.79
463     -0.79
vit     -0.78
Positive Logits
Bung        0.79
catentry    0.77
Back        0.77
Hicks       0.77
Mug         0.76
funnel      0.76
Haley       0.76
Gree        0.75
oola        0.75
Jung        0.75
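The negative and positive logit lists above rank vocabulary tokens by how strongly the feature pushes their output logits down or up. Dashboards of this kind commonly derive such lists by projecting a feature's decoder direction through the model's unembedding matrix and sorting the result. The sketch below assumes that setup; `decoder_dir`, `W_U`, and `vocab` are hypothetical names, and the source does not say whether the displayed values are raw or normalized.

```python
def logit_effects(decoder_dir, W_U):
    """Project a feature's decoder direction (length d_model) through an
    unembedding matrix W_U (d_model rows x d_vocab columns).
    Assumption: this is how the per-token logit effects are obtained."""
    d_model = len(decoder_dir)
    d_vocab = len(W_U[0])
    return [
        sum(decoder_dir[i] * W_U[i][j] for i in range(d_model))
        for j in range(d_vocab)
    ]


def top_bottom_tokens(effects, vocab, k=10):
    """Return the k most negative and k most positive (token, effect) pairs."""
    ranked = sorted(range(len(effects)), key=lambda j: effects[j])
    negative = [(vocab[j], effects[j]) for j in ranked[:k]]
    positive = [(vocab[j], effects[j]) for j in reversed(ranked[-k:])]
    return negative, positive


# Toy example with a 2-dimensional residual stream and a 3-token vocabulary.
effects = logit_effects([1.0, -1.0], [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])
neg, pos = top_bottom_tokens(effects, ["a", "b", "c"], k=1)
print(neg, pos)
```

Sorting the full vocabulary once and slicing both ends keeps the negative and positive lists consistent with each other.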
Activation Density: 0.319%
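Activation density is usually reported as the share of sampled token positions on which the feature is active, so 0.319% corresponds to roughly 3 active positions per 1,000 tokens. A minimal sketch, assuming "active" means an activation strictly above a hypothetical `threshold` (default 0.0); the source does not state the token sample or threshold actually used.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`.
    Assumption: density counts strictly-positive activations by default."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# 319 active positions out of 100,000 sampled tokens.
sample = [1.0] * 319 + [0.0] * (100_000 - 319)
print(activation_density(sample))
```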