INDEX
Explanations
phrases indicating alignment or conformity
phrases indicating alignment or conformity to standards or rules
Negative Logits
livest          -0.65
Tacoma          -0.61
iens            -0.59
odor            -0.59
itch            -0.58
icides          -0.58
sacrific        -0.57
laun            -0.57
vertisements    -0.56
strugg          -0.56
Positive Logits
with            0.75
arity           0.72
anthrop         0.68
vein            0.67
With            0.67
favour          0.66
omsky           0.63
cise            0.62
WITH            0.62
llah            0.62
Activations Density: 0.053%