INDEX
Explanations
references to a specific location ("Aber") within the context of a news article
Negative Logits
Token     Logit
atform    -0.88
iers      -0.81
TPS       -0.79
ipeg      -0.79
ership    -0.77
iets      -0.77
ingham    -0.75
ivity     -0.73
ergic     -0.72
iating    -0.71
Positive Logits
Token     Logit
ration    1.14
rant      1.00
rations   0.92
deen      0.86
thur      0.85
ansas     0.82
rants     0.78
rious     0.76
rated     0.73
odied     0.71
Activations Density: 0.046%