INDEX
Explanations
phrases indicating alignment or agreement
phrases that indicate alignment or agreement with policies or standards
Negative Logits
©¶æ¥µ      -0.86
chief      -0.77
quer       -0.77
hops       -0.73
orah       -0.73
Chat       -0.71
chat       -0.71
agg        -0.71
raq        -0.64
borgh      -0.64
Positive Logits
regards    1.12
regard     1.10
respect    0.97
impunity   0.87
stood      0.86
draw       0.82
rium       0.80
drawn      0.75
intact     0.71
dignity    0.69
Activations Density 0.111%