INDEX
Explanations
mentions of the word "sister" at high activations
references to siblings, specifically sisters
Negative Logits
Token        Logit
veyard       -0.74
ered         -0.68
ustomed      -0.67
Frames       -0.67
ocalypse     -0.63
tarians      -0.63
ankind       -0.62
upuncture    -0.61
urations     -0.61
atility      -0.61
Positive Logits
Token        Logit
hood         1.11
sister       0.94
hips         0.92
heses        0.89
sisters      0.79
hesis        0.79
folk         0.79
fax          0.75
aunt         0.73
Sister       0.73
Activations Density: 0.013%