INDEX
Explanations
instances where someone is blocked or unblocked on social media platforms like Twitter
instances of the word "block" and its variations
Negative Logits
subp        -0.73
mortar      -0.64
warr        -0.63
vapor       -0.62
capsule     -0.62
ppa         -0.60
chief       -0.59
appropri    -0.59
princ       -0.58
poster      -0.57
Positive Logits
ances       0.97
ables       0.94
enged       0.93
ible        0.89
able        0.87
ーク        0.85
hemy        0.83
ers         0.80
ement       0.77
zee         0.76
Activations Density: 0.026%