Explanations
words related to publishing and releasing information
Negative Logits
Token       Logit
rail        -0.87
usa         -0.81
cone        -0.75
ft          -0.70
ombat       -0.70
ï¸          -0.68
restling    -0.67
usp         -0.67
uay         -0.66
avery       -0.66
Positive Logits
Token          Logit
information    1.13
anything       1.03
ulate          1.01
transcripts    0.99
confidential   0.96
truthful       0.96
excerpts       0.95
details        0.94
inaccurate     0.94
updates        0.94
Activations Density: 0.185%
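
A density figure like the one above is usually read as the fraction of sampled token positions on which the feature fires. The sketch below shows that calculation under this assumption; the page itself does not say how the figure was computed. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical. The sample is synthetic and chosen only to land on the same percentage.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active (activation strictly greater than the threshold).
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic sample (not real data): 20,000 token positions, 37 of them active.
sample = [0.0] * 20_000
for i in range(37):
    sample[i * 500] = 1.0

density = activation_density(sample)
print(f"Activations Density: {density:.3%}")
```

A strict `>` comparison against zero fits features that sit behind a ReLU, where inactive positions are exactly zero; a small positive threshold may be more appropriate if activations are noisy.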