Explanations
references to human roles and occupations, specifically in arts and media contexts
Negative Logits
Token   Logit
t       -0.46
ti      -0.37
ez      -0.36
eer     -0.36
tim     -0.36
tors    -0.35
tur     -0.35
tin     -0.35
ted     -0.34
tor     -0.34
Positive Logits
Token   Logit
rier    0.27
rr      0.26
ship    0.24
ra      0.23
de      0.23
riers   0.23
riage   0.22
ium     0.22
iginal  0.22
red     0.21
Activations Density: 0.346%