INDEX
Explanations
conversational phrases and comments directed at the audience
Negative Logits
Token       Logit
Woman       -0.15
woman       -0.15
vet         -0.14
newcomer    -0.14
zel         -0.14
Ihrer       -0.14
fucking     -0.14
man         -0.13
cox         -0.13
ponent      -0.13
Positive Logits
Token       Logit
folks       0.51
guys        0.42
fol         0.39
Fol         0.38
ladies      0.33
everybody   0.32
everyone    0.32
folk        0.30
friends     0.30
Guys        0.28
Activations Density: 0.155%