INDEX
Explanations
instances of interactions with strangers
occurrences of the word "stranger."
Negative Logits
rity      -0.83
aeda      -0.82
prus      -0.81
rix       -0.78
erb       -0.76
erenn     -0.75
chwitz    -0.72
amina     -0.72
REE       -0.71
inion     -0.71
Positive Logits
stranger   0.90
liness     0.84
strangers  0.83
ishly      0.79
worldly    0.74
Colossus   0.71
Reincarn   0.68
whom       0.67
Stranger   0.66
Tuls       0.65
Activations Density: 0.007%