Explanations
names of individuals or personalities
proper names of individuals, particularly those referenced in statements
Negative Logits
sexual        -0.69
excludes      -0.68
impair        -0.63
fulfilling    -0.61
conformity    -0.61
grooming      -0.59
tumblr        -0.59
hath          -0.58
derailed      -0.58
damaged       -0.57
Positive Logits
è¦ļéĨĴ        0.78
veland        0.75
sarcast       0.75
Ital          0.74
gloom         0.73
¿½            0.72
rhet          0.72
Ĭ±            0.68
Azerb         0.67
anca          0.67
Activations Density 0.282%