INDEX
Explanations
mentions of specific locations or institutional names
references to geographical regions and countries
Negative Logits
ecause          -0.68
ruining         -0.62
taboo           -0.60
hindsight       -0.60
stripping       -0.59
boosting        -0.59
narrowing       -0.57
experimenting   -0.57
theless         -0.56
sparing         -0.56
Positive Logits
.;      1.47
.''.    1.32
.).     1.26
.</     1.20
.,      1.17
.:      1.15
.       1.10
.}      1.08
./      1.07
;       1.07
Activations Density 0.463%