INDEX
Explanations
the word "call" followed by a high positive activation
instances of the word "call."
Negative Logits
istg      -0.84
bilt      -0.81
embr      -0.70
inth      -0.69
bourne    -0.67
olitics   -0.66
ynski     -0.64
cffff     -0.61
ipeg      -0.61
inh       -0.60
Positive Logits
backs      1.01
igraph     0.98
call       0.95
phas       0.86
bullshit   0.83
calling    0.81
911        0.81
Calling    0.80
bluff      0.78
oused      0.76
Activations Density 0.053%
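
For readers unfamiliar with the density figure, the following is a minimal sketch of how such a number could be computed. It assumes that "activation density" means the fraction of tokens on which the feature's activation is above zero; the page itself does not state the definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and serve only as illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density = (number of active tokens) / (total tokens).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 tokens, 5 of which activate the feature.
sample = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.050%
```

Under this assumed definition, a density of 0.053% would correspond to the feature firing on roughly 1 token in every 1,900.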