INDEX
Explanations
Words indicating causation or reasoning
Follows first-person pronouns and discourse markers
Negative Logits
(„    -0.71
(“    -0.67
(‘    -0.65
“...  -0.61
Vám   -0.61
„     -0.60
“[    -0.60
‘     -0.59
(!)   -0.58
‘‘    -0.56
Positive Logits
uh       0.83
Uh       0.71
,}       0.70
um       0.68
,]       0.68
yeah     0.67
Portale  0.66
Okay     0.66
And      0.65
Yeah     0.65
Activations Density 0.104%