INDEX
Explanations
references to male and female pronouns, indicating discussions about specific individuals
pronoun followed by verb
Negative Logits
featureID          -0.72
ロウィン            -0.65
IntoConstraints    -0.60
müſſen             -0.58
<=",               -0.57
メンテナ            -0.56
ddelwed            -0.56
Weiſe              -0.56
<unused14>         -0.56
<unused41>         -0.56
Positive Logits
UnusedPrivate      0.43
enderror           0.40
Tembelea           0.35
tamment            0.32
AddTagHelper       0.32
flashdata          0.32
Nachfolger         0.32
tartalomajánló     0.32
nakalista          0.30
Predecesor         0.28
Activations Density 0.009%