Explanations
references to a specific entity or term "WP" with varying activations
references to a specific entity or group identified as "WP"
Negative Logits
thus       -0.83
龍喚士     -0.82
ante       -0.82
Reviewer   -0.79
thous      -0.75
hips       -0.74
erald      -0.73
taboola    -0.72
angelo     -0.72
tes        -0.71
Positive Logits
WP          1.29
WP          1.28
olicy       1.05
Beg         0.81
FFER        0.74
witz        0.73
wordpress   0.72
LP          0.71
ITCH        0.71
ctive       0.71
Activations Density 0.005%