INDEX
Explanations
names of family members or relations
mentions of family relationships and their associated emotional contexts
Negative Logits
obin        -0.73
verage      -0.61
iral        -0.60
ocial       -0.60
actual      -0.60
idates      -0.59
vals        -0.59
pected      -0.58
lust        -0.58
Coverage    -0.58
Positive Logits
etc         0.83
76561       0.75
Weld        0.66
aka         0.65
FontSize    0.64
ModLoader   0.63
Mehran      0.63
umbn        0.63
whose       0.63
ĪĴ          0.62
Activations Density 0.474%