Troubleshooting using Jenkins Script Console

Troubleshooting using Jenkins Script Console

Marco Davalos

Software Developer

Platform Engineering Support Team

Alvaro Iradier

Lead Architect

Slaves

???????

Jobs

Jenkins features a nice Groovy script console which allows one to run arbitrary Groovy scripts within the Jenkins master runtime or in the runtime on agents.

Jenkins' Documentation

def agents = Jenkins.instance.nodes.size()
println "${agents} Agents"

def jobs = Jenkins.instance.getAllItems(Job.class).size()
println "${jobs} Jobs"

Slaves

18.000

Jobs

java.nio.file.FileSystemException: 
No space left on device

/var/jenkins_home
|
|_ config.xml
|
|_ jobs
| |_ [JOB_NAME]
| | |_ config.xml
| | |_ builds
| |  |_ 1
| |  |_ 2
| |  |_ 3
| |  |_ ...
| |_ ...
|
|_ plugins
|
|_ ...

Encourage developers to use "Discard old build" option (and pray)

Solutions

Remove unused plugins

Get more disk space (only if needed)

and...

import jenkins.model.Jenkins
import hudson.model.Job

int MAX_BUILDS = 5 // max builds to keep

Jenkins.instance.getAllItems(Job.class).each { job ->

    job.builds.drop(MAX_BUILDS).findAll {

        !it.keepLog &&
        !it.building &&
        it != job.lastStableBuild &&
        it != job.lastSuccessfulBuild &&
        it != job.lastUnstableBuild &&
        it != job.lastUnsuccessfulBuild

    }.each { build ->
        build.delete()
    }
}

builds_cleaner.groovy

Requirements

- Schedulable execution
- Don't wait until it's on fire to clean-up

CRUMB=$(curl --user $USER:$TOKEN \   # CSRF protection
    '$JENKINS_URL/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,":",//crumb)')

curl -v --user $USER:$TOKEN -H "$CRUMB" --data-urlencode \
    "script=$(cat scripts/example.groovy)" $JENKINS_URL/scriptText

Sending an HTTP POST request to /script or /scriptText endpoint

HTTP Request

java.nio.file.FileSystemException: 
/opt/jenkins/workspace/BestPipelineEver: 
No space left on device

java.nio.file.FileSystemException: 
/opt/jenkins/workspace/BestPipelineEver: 
No space left on device

java.nio.file.FileSystemException: 
/opt/jenkins/workspace/BestPipelineEver: 
No space left on device

java.nio.file.FileSystemException: 
/opt/jenkins/workspace/BestPipelineEver: 
No space left on device

java.nio.file.FileSystemException: 
/opt/jenkins/workspace/BestPipelineEver: 
No space left on device

java.nio.file.FileSystemException: 
/opt/jenkins/workspace/BestPipelineEver: 
No space left on device

java.nio.file.FileSystemException: 
/opt/jenkins/workspace/BestPipelineEver: 
No space left on device

/home/jenkins/agent
|
|_ workspace
| |_ [JOB_NAME]
|
|_ sharedspace
|
|_ tools
|
|_ remoting
| |_ jarCache
| |_ logs     
| |_ ...
|
|_ ...

Requirements

- Connectivity to all the slaves located in different datacenters
- Operating System Agnostic
- Schedulable execution

How to execute it?

Executing on the master , but run code on any other node programmatically

Send an HTTP POST request to /script or /scriptText endpoint

Using Jenkins web interface in Script Console in any node status page for executing on a target node (manually)

HTTP Request


def failingNodes = []

String OFFLINE_CAUSE = 'Temporally offline for maintenance action.'

Jenkins.instance.nodes.each { node ->
    def computer = node.toComputer()
    if (computer.channel == null || computer.offline) {
        println "INFO: Skipping $node.name slave: computer.channel is null or slave is offline"
        return
    }

    try {
        computer.setTemporarilyOffline(true, new OfflineCause.ByCLI(OFFLINE_CAUSE))
        computer.waitUntilOffline()
        computer.interrupt()

        println 'INFO: Deleting rootpath content'
        node.rootPath.deleteContents()
    } catch(exception) {
        failingNodes << node.name
    } finally {
        computer.setTemporarilyOffline(false, null)
    }
}

if (failingNodes) {
  return "ERROR: Cleaning failed in: $failingNodes"
}

return 'INFO: Script was successfully executed'

slaves_cleaner.groovy

void executeShellCommandsOnNode(channel, String ...shellCommands) {
  shellCommands.each { shellCommand ->
    groovyCommand = """
    command = ['/bin/sh', '-c'] + \"${shellCommand}\"
    println command
    ProcessBuilder builder = new ProcessBuilder(command)
    builder.redirectErrorStream(true)
    Process process = builder.start()
    process.waitFor()
    println process.in.text
    """
    println RemotingDiagnostics.executeGroovy(groovyCommand, channel)
  }
}

if (node.labelString.contains('docker')) {
    if (node.toComputer().unix) {
        println 'INFO: Pruning Docker containers/images'

        String pruneContainers = 'docker container prune -f'
        String pruneImages = 'docker image prune -f'
        executeShellCommandsOnNode(node.channel, pruneContainers, pruneImages)
    }
}

waiting for next executor...

waiting for next executor...

waiting for next executor...

waiting for next executor...

waiting for next executor...

waiting for next executor...

waiting for next executor...

Solutions

Add more executors to slaves
- Parallel executors?
- Host capacity
Add more slaves ($$$)
What's wrong with "Stuck" jobs:
- Long running
- Slowness or network problems
- waiting user input
- Slave deadlock
- ...

Stuck Jobs? Kill'em all

How to abort "stuck Jobs"

Build-Timeout plugin
Configuration per project
- + Configuration Slicing plugin
- Schedule reapply config
Might kill long-running (but non malfunctioning) jobs

Bad approach... let's detect Stuck jobs and analyze them

import groovy.json.JsonOutput

def result = [:]
def stuckBuilds = []
int MAX_MINUTES_RUNNING = 90

String jenkinsUrl = Jenkins.instance.getRootUrl()
Jenkins.instance.nodes.findAll {
  it.getAssignedLabels().find {
      it.name == 'public' 
    } && !it.toComputer().offline
  }.each { node ->
    def computer = node.toComputer()
    computer.allExecutors.findAll { it.busy }.each { executor ->
      int minutes = (executor.elapsedTime / 1000) / 60
      if (minutes >= MAX_MINUTES_RUNNING) {
      try {
        stuckBuilds << "${jenkinsUrl}${executor.currentExecutable.getUrl()}"
      } catch (Exception ex) {
        stuckBuilds << "Unknown error executor ${executor} - Exception ${ex}"
      }
    }
  }
}
result.hasStuckBuilds = stuckBuilds.size() > 0
if (result.hasStuckBuilds) {
  result.information = "The below builds have been running for more than ${MAX_MINUTES_RUNNING} minutes.\n"
  result.information = result.information + stuckBuilds.join('\n')
}

return JsonOutput.toJson(result)

Why JSON?

#!/bin/bash
authKey="Basic *****"
CRUMB=$(curl -s -H "authorization: $authKey" 'http://localhost:8080/jenkins/crumbIssuer/api/xml?xpath=concat(//crumbRequestField,":",//crumb)')
RESPONSE=$(curl -s -H "authorization: $authKey" -H "$CRUMB" --data-urlencode "script=$(cat <<"EOF"
import groovy.json.JsonOutput

...

result.hasStuckBuilds = stuckBuilds.size() > 0
if (result.hasStuckBuilds) {
  result.information = "The below builds have been running for more than ${MAX_MINUTES_RUNNING} minutes.\n"
  result.information = result.information + stuckBuilds.join('\n')
}

return JsonOutput.toJson(result)
EOF
)" http://localhost:8080/jenkins/scriptText)
if [ $? != 0 ]
then
    echo 'Error requesting to Jenkins'
    exit 2
fi

RESPONSE=$(echo $RESPONSE | sed "s/Result: //")
HAS_STUCK_BUILDS=$(echo $RESPONSE | jq '.hasStuckBuilds')
if [ $HAS_STUCK_BUILDS == 'true' ]
then
    echo $RESPONSE | jq -r '.information'
    exit 2
else
    echo 'No stuck builds'
    exit 0
fi

Icinga integration via shell script & jq

Loading Jenkins is taking ages...

Hi, any issues with Jenkins?

Please guys I need Jenkins to deploy in prod!

I can't connect to Jenkins!

Jenkins is unresponsive

is Jenkins dying?

Jenkins is so slow!

After some days running...

Jenkins is slow

Then it is slower

Becomes nearly unusable

Totally unresponsive

JMX Connection, check Heap

Hudson, we have a problem

No memory? Let's increase heap. 6GB, 8GB, 10GB...

Check logs... nothing.

Check GC Logs... Bingo !

After Jenkins restart...

Several hours later...

Suspicious: memory leaks that slowly increase heap memory usage, leaking objects that cannot be released by the GC

God Bless Prometheus & JMX Exporter

We need to discover what is leaking: Heap Dumps

Cron job

jcmd $(pidof java) GC.heap_dump /var/log/jenkins/heap-dump-$(date +%y%m%d-%H%M%S)

Very big heap dumps! Need some time to analyse. Meanwhile:

We set some alerts + preventive restart

if min(java_lang_Memory_HeapMemoryUsage_used{server="jenkinshost"}) > 6GB during last 12h

ALERT!

MTA (https://www.eclipse.org/mat/) helps us identify leakage

https://issues.jenkins-ci.org/browse/JENKINS-51057

GuavaPubSubBus + SSE-gateway plugin used by BlueOcean to deliver events to web browser and update UI.

After some investigation, we determine that the leak is caused by class https://github.com/jenkinsci/sse-gateway-plugin/blob/master/src/main/java/org/jenkinsci/plugins/ssegateway/sse/EventDispatcher.java which will queue events that cannot be delivered into a retryQueue event. This events will never be delivered unless the session is reopened, causing a memory leakage. Even more, when the session is destroyed, the dispatcher.unsubscribeAll(); method is called, but the retryQueue is not cleared. I am not sure, but apparently the dispatcher is still references as outer class of ChannelSubscriber in the same file, that is still in the list of subscribers of GuavaPubSubBus. So even when HTTPSession is destroyed, the memory leak will still be there.

Cleanup script from JENKINS-51057

// First of, due to groovy and private field in super classes, we need to
// change visibility of retryQueue so that we can use reflection instead...
def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
retryQueueField.setAccessible(true)

def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
println("There are ${sessions.size()} sessions...")
def numEventsRemoved = 0
def numEventDispatchersPurged = 0
sessions.each{id, session->
  def eventDispatchers = session.sessionData._attributes['org.jenkinsci.plugins.ssegateway.sse.EventDispatcher']
  if (eventDispatchers) {
    println("Sessions ${id} has a ssegateway EventDispatcher map...")
    eventDispatchers.findAll{k, v -> k.startsWith('jenkins-blueocean-core-js')}.each { dispatcherId, dispatcher ->
      def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
      if (retryQueue.isEmpty()) {
        println("  Found one EventDispatcher, '${dispatcherId}', it has no retry events.")
      } else {
        def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
        if (oldestAge > 300) {
          println("  Found one EventDispatcher, '${dispatcherId}', its oldest retry event is ${oldestAge} seconds old, it contains ${retryQueue.size()} retry events, removing events and unsubscribing.")
          numEventsRemoved += retryQueue.size()
          numEventDispatchersPurged++
          dispatcher.unsubscribeAll()
          retryQueue.clear()
     	 } else {
        println("  Found one EventDispatcher, its oldest retry event is ${oldestAge} seconds old, so sparing it for now...")
      	}
      }
    }
  }
}

println("Removed ${numEventsRemoved} retry events from ${numEventDispatchersPurged} EventDispatchers!")

But we need to test if it does work

Alternative: unofficial plugin fork with different event retry handling

Let's publish events on the PubSubBus
(currently only used by sse-gateway and blueocean)

import org.jenkinsci.plugins.pubsub.*;

PubsubBus bus = PubsubBus.getBus();

def numOfEventsPerJob = 10
def totalEvents = 0

Jenkins.instance.getAllItems(Job.class).each{ 
  for (int i = 0; i < numOfEventsPerJob; i++) {
    totalEvents += numOfEventsPerJob
	message = new JobMessage(it).setEventName(Events.JobChannel.job_run_started);
	bus.publish(message);    
  }
}

println "Done. Inserted " + totalEvents

Several days later... things are a bit better, but still leaking

Heap Dumps still show RetryEvents... WTF?

def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
println("There are ${sessions.size()} sessions...")
def numEventsRemoved = 0
def numEventDispatchersPurged = 0
sessions.each{id, session->
   ...
}

But HTTP sessions are closed, released, and RetryQueue is still there... were is it referenced from?

GuavaPubSubBus.subscribers: HashMap<ChannelSubscriber>

import org.jenkinsci.plugins.pubsub.PubsubBus;
import org.jenkinsci.plugins.ssegateway.sse.*;

def dryRun = false
this.bus = PubsubBus.getBus();

// change visibility of retryQueue so that we can use reflection instead...
def retryQueueField = org.jenkinsci.plugins.ssegateway.sse.EventDispatcher.getDeclaredField('retryQueue')
retryQueueField.setAccessible(true)

def dispatcherCount = 0
def dispatchersList = []

//Build a list of EventDispatchers in all existing HTTP sessions
println "DISPATCHERS IN HTTP SESSIONS"
println "----------------------------"
def sessions = Jenkins.instance.servletContext.this$0._sessionHandler._sessionCache._sessions
sessions.each { id, session ->
  def eventDispatchers = EventDispatcherFactory.getDispatchers(session)
  if (eventDispatchers) {
    eventDispatchers.each { dispatcherId, dispatcher ->
      dispatchersList.add(dispatcherId)
      def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
      if (retryQueue.peek() != null) {
        def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
        println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
      } else {
        println "Dispatcher: " + dispatcher.getClass().getName() + " - " + dispatcher.id + " with no retryEvents"
      }      
    }
  }
}

println "There are " + dispatchersList.size() + " dispatchers in HTTP sessions"
println ""

//Find all subscribers in bus
println "DISPATCHERS IN PUBSUBBUS"
println "------------------------"
this.bus.subscribers.any { channelSubscriber, guavaSubscriber ->
  if (channelSubscriber.getClass().getName().equals('org.jenkinsci.plugins.ssegateway.sse.EventDispatcher$SSEChannelSubscriber')) {
    dispatcherCount++
    def dispatcher = channelSubscriber.this$0

    def retryQueue = retryQueueField.get(dispatcher) // Need to use reflection since retryQueue is private in super class...
    if (retryQueue.peek() != null) {
      def oldestAge = (System.currentTimeMillis() - retryQueue.peek().timestamp)/1000
      println "Dispatcher: " + dispatcher.id + " with " + retryQueue.size() + " events, oldest is " + oldestAge + " seconds."      
      if (oldestAge > 300) {
      	println "  Clearing retryQueue with " + retryQueue.size() + " events"
        if (!dryRun) {
          retryQueue.clear()
          dispatcher.unsubscribeAll()
          try {
            dispatcher.stop()
          } catch (Exception ex) {
            println "  !! Exception stopping AsynDispatcher"
          }
        } else {
          println "  Ignoring, dryrun"
        }
      }
    } else {
	    println "Dispatcher: " + dispatcher.id + " with no retryEvents"
    }

    if (dispatchersList.indexOf(dispatcher.id) < 0) {
      println "  Dispatcher is not in HTTP Sessions. Clearing"
      if (!dryRun) {
        dispatcher.unsubscribeAll()
        try {
          dispatcher.stop()
        } catch (Exception ex) {
          println "  !! Exception stopping AsynDispatcher"
        }
      } else {
        println "  Ignoring, dryrun"
      }
    }
  }
}

println ""
println "Dispatchers in PubSubBus: " + dispatcherCount

Heap dumps... no traces of RetryEvents!

However, some leak persists

From the heap dumps, we see other leak suspects related to slave channels. We need to investigate further

Nothing is perfect

Executing a script via HTTP request allows you only handle an error programmatically

Can't re-use code

Can't execute a script via HTTP request on a specific agent

Can't pass parameters

Thank you!

Questions?