Explanation: references to family members and parental figures
Negative logits
Token       Logit
itself      -0.20
ATUS        -0.15
ROME        -0.14
WARE        -0.14
本          -0.13
egot        -0.13
ôme         -0.13
asil        -0.13
箱          -0.13
ubi         -0.13
Positive logits
Token       Logit
大人         0.18
-in          0.17
/gr          0.17
/legal       0.15
ilit         0.15
lessness     0.15
remar        0.14
ovny         0.14
ondo         0.14
dek          0.14
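The page does not say how these two lists were produced. A common approach in SAE feature dashboards, assumed here, is to take the dot product of the feature's decoder direction with each token's unembedding vector and keep the most negative and most positive results. The pure-Python sketch below illustrates that assumption only; `logit_effects`, `top_and_bottom`, and the toy vectors are hypothetical and are not this dashboard's code.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def logit_effects(decoder_dir: Vector, unembed: Dict[str, Vector]) -> Dict[str, float]:
    """Dot each token's unembedding vector with the feature's decoder direction.

    Assumption (hypothetical layout): `unembed` maps a token string to its
    column of W_U, and `decoder_dir` is the feature's row of W_dec, both of
    length d_model.
    """
    return {
        token: sum(d * u for d, u in zip(decoder_dir, column))
        for token, column in unembed.items()
    }


def top_and_bottom(
    effects: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and the k most positive (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda item: item[1])
    negative = ranked[:k]                   # ascending: most negative first
    positive = list(reversed(ranked))[:k]   # descending: most positive first
    return negative, positive


if __name__ == "__main__":
    # Toy 2-dimensional "model" with four tokens; all numbers are made up.
    decoder_dir = [1.0, -0.5]
    unembed = {
        "a": [0.2, 0.0],
        "b": [0.0, 0.4],
        "c": [0.3, -0.2],
        "d": [-0.1, 0.1],
    }
    negative, positive = top_and_bottom(logit_effects(decoder_dir, unembed), k=2)
    print([tok for tok, _ in negative])  # ['b', 'd']
    print([tok for tok, _ in positive])  # ['c', 'a']
```

With a real model the dictionary would cover the whole vocabulary, and `k=10` would match the ten tokens listed on each side above.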
Activation density: 0.060% (share of tokens on which the feature fires)
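Activation density is usually read as the fraction of sampled tokens on which the feature's activation is above zero. The sketch below assumes that definition; `activation_density` and its `threshold` parameter are hypothetical. At 0.060%, the feature fires on roughly 6 of every 10,000 tokens.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation is strictly above `threshold`.

    Assumption: a token counts as active when its activation > 0, the usual
    convention for SAE dashboards, though this page does not state it.
    """
    total = 0
    active = 0
    for value in activations:
        total += 1
        if value > threshold:
            active += 1
    # Guard against an empty sample instead of dividing by zero.
    return active / total if total else 0.0


if __name__ == "__main__":
    # Toy sample: 6 active tokens out of 10,000.
    sample = [0.0] * 9994 + [1.3] * 6
    print(f"{activation_density(sample):.3%}")  # 0.060%
```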