Current GenAI trends in Info Retrieval & Publishing

Catherine Gracey

Today's Focus

1. Searching / Information Retrieval

Differences between search tools
Considerations for using GenAI to search
Best practices

2. Publishing

Existing policies on GenAI use for authors
How publishers are using GenAI
Best practices

1. Information Retrieval

Searching for information is the primary way in which people are using GenAI [a]

In many cases, this is sufficient, but there are also some issues with relying on GenAI for information.

Before I explain, let's rewind and explore some of the different tools for finding scholarly information.

AI & Information Retrieval

Traditional Databases

Search using keywords
Results are only returned if the keyword appears in the text
No 'judgement' from the system on what's relevant; it shows every match

"artificial intelligence" AND "diagnosis"

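As a rough illustration of the keyword behaviour described for traditional databases, here is a minimal sketch of a Boolean AND search. The records are invented for the example, and real databases add field searching, truncation, and controlled vocabularies that this toy leaves out.

```python
# Toy Boolean keyword search: a record matches only if every phrase
# appears verbatim. The records below are invented for illustration.
RECORDS = [
    "Artificial intelligence for the diagnosis of skin lesions",
    "Machine learning to detect disease in chest X-rays",
    "Artificial intelligence in library cataloguing",
]

def boolean_and_search(records, phrases):
    """Return records containing ALL phrases (case-insensitive, exact match)."""
    lowered = [p.lower() for p in phrases]
    return [r for r in records if all(p in r.lower() for p in lowered)]

# Equivalent to the query: "artificial intelligence" AND "diagnosis"
results = boolean_and_search(RECORDS, ["artificial intelligence", "diagnosis"])
for record in results:
    print(record)
```

Only the first record is returned. The second record is arguably on topic, but it never uses the exact keywords, so a strict keyword search does not surface it.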
Search Engines

Search using natural language
Results that contain similar words are returned due to Machine Learning
Results are pre-sorted by perceived relevance

*The word "diagnosis" doesn't actually appear in the result, but ML is used to determine that it is about diagnosis
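To make the contrast with keyword search concrete, here is a minimal sketch of relevance ranking. The `CONCEPTS` table is a hand-made, hypothetical stand-in for what a real search engine learns from data, and the documents are invented; actual ranking systems are far more complex and are not public.

```python
# Toy relevance ranking. CONCEPTS is a hand-made stand-in for learned
# word similarity; the words and documents are invented for illustration.
CONCEPTS = {
    "diagnosis": "detect", "diagnose": "detect", "detect": "detect",
    "identify": "detect",
    "ai": "ai", "artificial": "ai", "intelligence": "ai",
    "machine": "ai", "learning": "ai",
}

DOCS = [
    "Machine learning to detect disease in chest X-rays",
    "A history of library cataloguing",
    "Artificial intelligence in library cataloguing",
]

def concepts(text):
    """Map each known word to its concept; unknown words are ignored."""
    words = text.lower().replace("-", " ").split()
    return {CONCEPTS[w] for w in words if w in CONCEPTS}

def rank(query, docs):
    """Sort documents by how many query concepts they share, best first."""
    q = concepts(query)
    scored = [(len(q & concepts(d)), d) for d in docs]
    ordered = sorted(scored, key=lambda pair: -pair[0])
    return [d for score, d in ordered if score > 0]

ranked = rank("can AI help with diagnosis", DOCS)
for doc in ranked:
    print(doc)
```

The top result never contains the word "diagnosis", yet it ranks first because the system treats "detect" as similar. The user sees the ordering, but not the reasoning behind it.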

LLMs

NOT retrieving external sources
Generating output based on training data alone
Done by predicting which word should come next
Prone to hallucinations (plausible-sounding but false output)
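Next-word prediction can be sketched with a very small word-pair model. This is a deliberate simplification: real LLMs use neural networks trained on enormous corpora, and the training text below is invented. The point is only that the output comes from patterns in the training data, with no source being looked up.

```python
from collections import Counter, defaultdict

# Tiny invented "training data"; real models train on vastly more text.
TRAINING_TEXT = (
    "ai can support diagnosis . "
    "ai can support research . "
    "ai can support diagnosis ."
)

def train_bigrams(text):
    """Count which word follows which in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def generate(follows, start, length):
    """Repeatedly append the most frequent next word."""
    output = [start]
    for _ in range(length):
        candidates = follows.get(output[-1])
        if not candidates:
            break
        output.append(candidates.most_common(1)[0][0])
    return " ".join(output)

model = train_bigrams(TRAINING_TEXT)
sentence = generate(model, "ai", 3)
print(sentence)
```

The model produces a fluent phrase because that word sequence was common in its training text, not because it checked whether the statement is true.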

Retrieval-Augmented Generation (RAG)

Supplements LLMs with an external search
Results that contain similar words are returned due to Machine Learning
Outputs can be traced to specific sources
Now being incorporated into tools that were previously LLM-only (like ChatGPT), but the sources they can search vary widely
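The retrieve-then-generate flow can be sketched as follows. Everything here is a hypothetical stand-in: `SOURCES` is an invented three-item collection, simple word overlap replaces the machine-learned similarity a real tool would use, and `generate_answer` is a placeholder for the language model call.

```python
# Toy RAG pipeline. SOURCES, the word-overlap scoring, and the
# generate_answer stub are hypothetical stand-ins for a real search
# index and language model.
SOURCES = {
    "Source A": "Deep learning models detect pneumonia in chest X-rays.",
    "Source B": "Library catalogues benefit from consistent metadata.",
    "Source C": "AI tools can support the diagnosis of skin cancer.",
}

def words(text):
    """Lowercase a text and split it into a set of words."""
    return set(text.lower().rstrip(".?").split())

def retrieve(question, sources, top_k=2):
    """Step 1: find the sources that best match the question."""
    q = words(question)
    scored = [(len(q & words(text)), label) for label, text in sources.items()]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: -pair[0])
    return [label for _, label in scored[:top_k]]

def build_prompt(question, labels, sources):
    """Step 2: place the retrieved passages, with labels, before the question."""
    passages = "\n".join(f"[{label}] {sources[label]}" for label in labels)
    return f"Use only these sources:\n{passages}\n\nQuestion: {question}"

def generate_answer(prompt, labels):
    """Step 3 placeholder: a real model would read `prompt`; this only cites."""
    return "Answer based on: " + ", ".join(labels)

question = "How can AI support diagnosis?"
labels = retrieve(question, SOURCES)
prompt = build_prompt(question, labels, SOURCES)
answer = generate_answer(prompt, labels)
print(answer)
```

Because the labels travel with the passages, the answer can point back to a specific source. Which collection sits behind `retrieve` determines how trustworthy those sources are.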

Scopus AI (an Academic Example)

Implications

	Databases	Search Engines	LLMs	RAG
Ease of search	Skill required	Easy	Easy	Easy
Ease of interpretation	Skill required	Moderate	Easy	Easy
Transparency	High	Moderate	Low	Low or Moderate
Reliability	High	Low	Low	Moderate

Why do transparency & reliability matter in information retrieval?

By handing off autonomy to these systems, we introduce the possibility of censorship & bias

If a system decides that something isn't relevant, isn't search engine optimized, or is controversial, it simply doesn't show it to you, so you could be getting only half the story

Censorship:

Aaron Tay on Bluesky

Heavy Bias/political agendas:

[a]

Best Practices

1 Look for original sources

Use tools that have RAG capabilities and that search scholarly sources. Otherwise, you might be trusting a Reddit user.

2 Verify claims

Make sure the original source actually says what the GenAI output claims it does (have hallucinations or misrepresentations occurred?)

3 Cite human authors!

Please don't cite OpenAI or Microsoft; cite the human authors who wrote the linked work (think of the h-index!!)
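One small part of the "verify claims" step can be automated: checking whether a passage that a GenAI tool presents as a quotation really appears in the source. The sketch below assumes you already have the source text as a string; the sample text is invented. It only catches verbatim quotes, so paraphrased claims still need a human to read the original.

```python
import re

def normalise(text):
    """Lowercase and collapse punctuation and spacing differences."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def quote_appears(quote, source_text):
    """True if the quoted words appear, in order, as whole words in the source."""
    return f" {normalise(quote)} " in f" {normalise(source_text)} "

# Invented source text for illustration only.
SOURCE_TEXT = (
    "In this invented abstract, the model reached 90% accuracy on the test set."
)

genuine = quote_appears("the model reached 90% accuracy", SOURCE_TEXT)
altered = quote_appears("the model reached 99% accuracy", SOURCE_TEXT)
print(genuine, altered)
```

A failed check is a prompt to go and read the source, not proof of a hallucination; the quote may simply come from a different section or version of the work.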

2. Publishing

AI-generated content has been showing up in the published literature for a while now

However, different publishers/journals have different policies, some indicating GenAI should be used very minimally or not at all

The current situation

The hype begins...

[a]

AI as "Author"

Accountability
Reliability
Credit for humans
Copyright?*

[a]

[b]

[a]

Tortured Phrases

[a]

Generally GenAI...

[a]

Can't be listed as an author
Should not be cited as an information source (cite the original source)
Use should be acknowledged in a statement
Should not be used for peer review (major privacy issues!)

At the same time, publishers are embracing GenAI

[a]

[b]

Publishers are selling your work to AI Companies

AI companies are paying publishers for access to published work to train their AI models:
- Microsoft paid Taylor & Francis $10 million [a]
- An unnamed company paid Wiley $23 million (with another $21 million coming soon) [b]
Many of these authors likely paid to get their work published (via APCs)

If publishers hold exclusive rights to your work, they have the authority to license it for various uses, including AI training, and financially benefit from these deals.

– Dede Dawson, 2024

Review Publishing/Copyright Agreements

Publishers can do this because authors have signed away the exclusive rights to their work in many cases
Authors did not explicitly consent to the sale of their work to AI companies, but had signed their rights away
There is very limited (or no) ability to opt out as authors

Best Practices

1 Check your target journal BEFORE starting

You don't want to finish an article only to realize you've accidentally broken a policy

2 Document everything

If you are accused of using AI in a way you weren't supposed to, it helps to have proof-of-work documents at the ready

3 Stay informed about your rights as an author

Get a second set of eyes on documents you're asked to sign

Avoid deceptive practices

[a]

[b]

Thank You!

Questions?