INDEX
Explanations
references to locations or places
repeated mentions of specific locations or places
Negative Logits
pedal         -0.67
Ratio         -0.61
Arab          -0.61
Rod           -0.61
decap         -0.60
indoctr       -0.60
list          -0.59
determined    -0.59
arming        -0.59
lament        -0.59
Positive Logits
oa            3.80
uu            1.80
ua            1.33
owa           1.21
oji           1.08
Gaga          1.00
ui            0.97
aho           0.95
oj            0.94
anta          0.94
Activations Density 0.007%
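
A minimal sketch of how a density figure like the one above could be computed, assuming "activations density" means the fraction of sampled token positions on which the feature fires (activation above zero). The function name, the threshold parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a feature "fires" on a token when its activation is
    strictly greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: the feature fires on 1 of 10,000 sampled tokens.
sample = [0.0] * 9_999 + [2.5]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.007% would correspond to the feature firing on roughly 7 of every 100,000 tokens.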