Current GenAI trends in Info Retrieval & Publishing

Catherine Gracey

Today's Focus

1. Searching / Information Retrieval

  • Differences between different search tools
  • Considerations for using GenAI to search
  • Best practices

2. Publishing

  • Existing policies on GenAI use for authors
  • How publishers are using GenAI
  • Best practices

1. Information Retrieval

Searching for information is the primary way in which people are using GenAI [a]

In many cases, this is sufficient, but there are also some issues with relying on GenAI for information.

Before I explain, let's rewind and explore some of the different tools for finding scholarly information.

AI & Information Retrevial

  • Search using keywords
  • Results are only returned if keyword appears in text
  • No 'judgement' from system on what's relevant, it shows it all

Traditional Databases

"artificial intelligence" AND "diagnosis"

Search Engines

  • Search using natural language
  • Results that contain similar words are returned due to Machine Learning
  • Results are pre-sorted by perceived relevance

*the word diagnosis doesn't actually appear, but ML is used to determine that this is about diagnosis

  • NOT retrieving external sources
  • Generating output based on training data alone
  • Done by predicting which word should come next
  • Riddled with hallucinations

LLMs

Retrevial Augmented Generation (RAG)

  • Supplements LLMs with an external search
  • Results that contain similar words are returned due to Machine Learning
  • Outputs can be traced to specific sources
  • Now being incorporated into tools that were just previously LLMs (like ChatGPT), but the sources they can search vary widely

ScopusAI (an Academic Example)

Implications

Databases Search Engines LLMs RAG
Ease of search Skill required Easy Easy Easy
Ease of interpretation Skill required Moderate Easy Easy
Transparency High Moderate Low Low or Moderate
Reliability  High Low  Low Moderate

Why do transparency & reliability matter in information retrevial?

By handing off autonomy to these systems, we introduce the possibility for censorship & bias

If a system determines something isn't relevant, or search engine optomized, or is controversial, it simply doesn't present it to you, meaning you could only be getting half the story

Censorship:

Heavy Bias/political agendas:

[a]

Best Practices

Use tools that have RAG capabilities, and search Scholarly sources 

Otherwise you might be trusting a Reddit user

1

Look for original sources

Make sure the original source contains what the GenAI output is saying it does (have hallucinations or misrepresentations occurred?)

2

Verify claims

Please don't cite OpenAI or Microsoft - cite the human authors who wrote the linked work (think of the h-index!!)

3

Cite human authors!

2. Publishing

AI-generated content has been showing up in the published literature for a while now

 

However, different publishers/journals have different policies, some indicating GenAI should be used very minimally or not at all

The current situation

The hype begins...

[a]

AI as "Author"

  • Accountability
  • Reliability
  • Credit for humans
  • Copyright?*

[a]

[b]

[b]

[a]

Tortured Phrases

[a]

Generally GenAI...

[a]

  • Can't be listed as an author
  • Should not be cited as an information source (cite the original source)
  • Use should be acknowledged in a statement 
  • Should not be used for peer-review (major! privacy issues)

At the same time, publishers are embracing GenAI

[a]

[b]

Publishers are selling your work to AI Companies

 

  • For access to their data to train their AI models with:

    • Microsoft paid Taylor & Francis $10 million [a]

    • An unnamed company paid Wiley $23 million (with another $21 million coming soon) [b]

  • Many of these authors likely paid to get their work published (via APCs)

If publishers hold exclusive rights to your work, they have the authority to license it for various uses, including AI training, and financially benefit from these deals.

– Dede Dawson, 2024

Review Publishing/Copyright Agreements

 

  • Publishers can do this because authors have signed away the exclusive rights to their work in many cases

  • Authors did not explicitly consent to the sale of their work to AI companies, but had signed their rights away

  • There is very limited (or no) ability to opt out as authors

Best Practices

You don't want to finish an article only to realize you've accidentally broken a policy

1

Check your target journal BEFORE starting

If you are accused of using AI in a way you weren't supposed to, it can be helpful to have proof of work documents at the ready

2

Document everything

Get a second set of eyes on documents you're asked to sign

3

Stay informed about your rights as an author

Avoid deceptive practices

[a]

[b]

Thank You!

Questions?