Explanations
personal names and decisions within informative texts
phrases indicating personal opinions or decisions
Negative Logits
.*        -0.82
.</       -0.71
manac     -0.70
.)        -0.60
.).       -0.60
.�        -0.59
().       -0.58
*.        -0.56
†         -0.56
Sample    -0.56
Positive Logits
warr          0.65
secondly      0.64
indemn        0.63
destro        0.62
alright       0.61
realise       0.61
apologise     0.60
hindsight     0.60
counselling   0.59
Secondly      0.59
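The two lists above rank vocabulary tokens by how strongly this feature pushes their logits down or up. Below is a minimal sketch of how such lists can be produced. It assumes, as is common in SAE dashboards, that each token's score is the dot product of the feature's decoder direction with that token's unembedding vector. The function names `logit_effects` and `top_logits` and the toy vectors are hypothetical, not taken from any particular tool.

```python
from typing import Dict, List, Sequence, Tuple


# Hypothetical helper; assumes score(token) = decoder_dir . unembedding[token].
def logit_effects(decoder_dir: Sequence[float],
                  unembed: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Score every token by the dot product of the feature's decoder
    direction with that token's unembedding vector."""
    return {tok: sum(d * u for d, u in zip(decoder_dir, vec))
            for tok, vec in unembed.items()}


def top_logits(effects: Dict[str, float], k: int = 10
               ) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, score) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by score
    negative = ranked[:k]           # most negative first
    positive = ranked[::-1][:k]     # most positive first
    return negative, positive


# Toy two-dimensional example (hypothetical vectors, not real model weights).
toy_unembed = {
    "secondly": [0.64, 0.10],
    "alright": [0.61, 0.00],
    ".*": [-0.82, 0.30],
}
neg, pos = top_logits(logit_effects([1.0, 0.0], toy_unembed), k=1)
print(neg)  # [('.*', -0.82)]
print(pos)  # [('secondly', 0.64)]
```

With a real model, `decoder_dir` would be the feature's row of the decoder matrix and `unembed` would cover the full vocabulary, and `k=10` would give lists of the length shown above.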
Activation Density 0.975%
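Activation density is the share of token positions on which the feature fires at all. A value of 0.975% means it is active on just under 1 in 100 tokens. Below is a minimal sketch of the computation. It assumes a position counts as active when its activation is strictly greater than zero. The function name `activation_density`, its `threshold` parameter and the sample data are hypothetical.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:  # assumption: "active" means strictly above threshold
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


# Hypothetical sample: 39 firing positions out of 4,000 tokens.
sample = [1.0] * 39 + [0.0] * 3961
print(activation_density(sample))  # 0.975
```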