Labs Server Admin Log
Projects are listed in order of most recently updated.
Also see the recent changes for nova resources (atom).
To log a message in #wikimedia-labs, use the following format: !log <project> <message>
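For example, a hypothetical entry for the tools project could look like this:
!log tools restarted webservicemonitor on tools-services-02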
Contents
- 1 Nova_Resource:Tools/SAL
- 1.1 2016-05-08
- 1.2 2016-05-05
- 1.3 2016-04-28
- 1.4 2016-04-24
- 1.5 2016-04-11
- 1.6 2016-04-06
- 1.7 2016-04-05
- 1.8 2016-04-04
- 1.9 2016-03-30
- 1.10 2016-03-28
- 1.11 2016-03-27
- 1.12 2016-03-18
- 1.13 2016-03-11
- 1.14 2016-03-02
- 1.15 2016-02-29
- 1.16 2016-02-28
- 1.17 2016-02-26
- 1.18 2016-02-25
- 1.19 2016-02-24
- 1.20 2016-02-22
- 1.21 2016-02-19
- 1.22 2016-02-18
- 1.23 2016-02-16
- 1.24 2016-02-12
- 1.25 2016-02-05
- 1.26 2016-02-03
- 1.27 2016-01-31
- 1.28 2016-01-30
- 1.29 2016-01-29
- 1.30 2016-01-28
- 1.31 2016-01-27
- 1.32 2016-01-26
- 1.33 2016-01-25
- 1.34 2016-01-23
- 1.35 2016-01-21
- 1.36 2016-01-12
- 1.37 2016-01-11
- 1.38 2016-01-09
- 1.39 2016-01-08
- 1.40 2015-12-30
- 1.41 2015-12-29
- 1.42 2015-12-28
- 1.43 2015-12-23
- 1.44 2015-12-22
- 1.45 2015-12-21
- 1.46 2015-12-20
- 1.47 2015-12-18
- 1.48 2015-12-16
- 1.49 2015-12-12
- 1.50 2015-12-10
- 1.51 2015-12-07
- 1.52 2015-12-06
- 1.53 2015-12-04
- 1.54 2015-12-02
- 1.55 2015-12-01
- 1.56 2015-11-25
- 1.57 2015-11-20
- 1.58 2015-11-17
- 1.59 2015-11-16
- 1.60 2015-11-03
- 1.61 2015-11-02
- 1.62 2015-10-26
- 1.63 2015-10-11
- 1.64 2015-10-09
- 1.65 2015-10-06
- 1.66 2015-10-02
- 1.67 2015-10-01
- 1.68 2015-09-30
- 1.69 2015-09-29
- 1.70 2015-09-28
- 1.71 2015-09-25
- 1.72 2015-09-24
- 1.73 2015-09-23
- 1.74 2015-09-16
- 1.75 2015-09-15
- 1.76 2015-09-14
- 1.77 2015-09-13
- 1.78 2015-09-11
- 1.79 2015-09-08
- 1.80 2015-09-07
- 1.81 2015-09-03
- 1.82 2015-09-02
- 1.83 2015-09-01
- 1.84 2015-08-31
- 1.85 2015-08-30
- 1.86 2015-08-29
- 1.87 2015-08-27
- 1.88 2015-08-26
- 1.89 2015-08-25
- 1.90 2015-08-24
- 1.91 2015-08-20
- 1.92 2015-08-19
- 1.93 2015-08-18
- 1.94 2015-08-17
- 1.95 2015-08-15
- 1.96 2015-08-14
- 1.97 2015-08-13
- 1.98 2015-08-12
- 1.99 2015-08-11
- 1.100 2015-08-04
- 1.101 2015-08-03
- 1.102 2015-08-01
- 1.103 2015-07-30
- 1.104 2015-07-29
- 1.105 2015-07-28
- 1.106 2015-07-27
- 1.107 2015-07-19
- 1.108 2015-07-11
- 1.109 2015-07-10
- 1.111 July 6
- 1.112 July 2
- 1.113 June 29
- 1.114 June 21
- 1.115 June 19
- 1.116 June 10
- 1.117 June 8
- 1.118 June 7
- 1.119 June 5
- 1.120 June 2
- 1.121 May 29
- 1.122 May 28
- 1.123 May 27
- 1.124 May 23
- 1.125 May 22
- 1.126 May 20
- 1.127 May 19
- 1.128 May 18
- 1.129 May 15
- 1.130 May 14
- 1.131 May 10
- 1.132 May 5
- 1.133 May 4
- 1.134 May 2
- 1.135 May 1
- 1.136 April 30
- 1.137 April 29
- 1.138 April 28
- 1.139 April 25
- 1.140 April 24
- 1.141 April 23
- 1.142 April 20
- 1.143 April 18
- 1.144 April 17
- 1.145 April 16
- 1.146 April 15
- 1.147 April 14
- 1.148 April 13
- 1.149 April 12
- 1.150 April 11
- 1.151 April 10
- 1.152 April 9
- 1.153 April 8
- 1.154 April 7
- 1.155 April 5
- 1.156 April 4
- 1.157 April 3
- 1.158 April 2
- 1.159 April 1
- 1.160 March 31
- 1.161 March 30
- 1.162 March 29
- 1.163 March 28
- 1.164 March 26
- 1.165 March 25
- 1.166 March 24
- 1.167 March 23
- 1.168 March 22
- 1.169 March 21
- 1.170 March 15
- 1.171 March 13
- 1.172 March 11
- 1.173 March 9
- 1.174 March 7
- 1.175 March 6
- 1.176 March 2
- 1.177 March 1
- 1.178 February 28
- 1.179 February 27
- 1.180 February 24
- 1.181 February 16
- 1.182 February 13
- 1.183 February 1
- 1.184 January 29
- 1.185 January 27
- 1.186 January 19
- 1.187 January 16
- 1.188 January 15
- 1.189 January 11
- 1.190 January 8
- 1.191 December 23
- 1.192 December 22
- 1.193 December 19
- 1.194 December 17
- 1.195 December 12
- 1.196 December 11
- 1.197 December 8
- 1.198 December 7
- 1.199 December 2
- 1.200 November 26
- 1.201 November 25
- 1.202 November 24
- 1.203 November 22
- 1.204 November 17
- 1.205 November 16
- 1.206 November 15
- 1.207 November 14
- 1.208 November 13
- 1.209 November 12
- 1.210 November 7
- 1.211 November 6
- 1.212 November 5
- 1.213 November 4
- 1.214 November 1
- 1.215 October 30
- 1.216 October 27
- 1.217 October 26
- 1.218 October 24
- 1.219 October 23
- 1.220 October 14
- 1.221 October 11
- 1.222 October 4
- 1.223 October 2
- 1.224 September 28
- 1.225 September 25
- 1.226 September 17
- 1.227 September 15
- 1.228 September 13
- 1.229 September 12
- 1.230 September 8
- 1.231 September 5
- 1.232 September 4
- 1.233 September 2
- 1.234 August 23
- 1.235 August 21
- 1.236 August 15
- 1.237 August 14
- 1.238 August 12
- 1.239 August 2
- 1.240 August 1
- 1.241 July 24
- 1.242 July 21
- 1.243 July 18
- 1.244 July 16
- 1.245 July 15
- 1.246 July 14
- 1.247 July 13
- 1.248 July 12
- 1.249 July 11
- 1.250 July 10
- 1.251 July 9
- 1.252 July 8
- 1.253 July 6
- 1.254 July 5
- 1.255 July 4
- 1.256 July 3
- 1.257 July 2
- 1.258 July 1
- 1.259 June 30
- 1.260 June 29
- 1.261 June 28
- 1.262 June 21
- 1.263 June 20
- 1.264 June 16
- 1.265 June 15
- 1.266 June 13
- 1.267 June 10
- 1.268 June 3
- 1.269 June 2
- 1.270 May 27
- 1.271 May 25
- 1.272 May 23
- 1.273 May 22
- 1.274 May 20
- 1.275 May 16
- 1.276 May 14
- 1.277 May 13
- 1.278 May 10
- 1.279 May 9
- 1.280 May 6
- 1.281 April 28
- 1.282 April 27
- 1.283 April 24
- 1.284 April 20
- 1.285 April 13
- 1.286 April 12
- 1.287 April 11
- 1.288 April 10
- 1.289 April 8
- 1.290 April 4
- 1.291 March 30
- 1.292 March 29
- 1.293 March 28
- 1.294 March 21
- 1.295 March 20
- 1.296 March 5
- 1.297 March 4
- 1.298 March 3
- 1.299 March 1
- 1.300 February 28
- 1.301 February 27
- 1.302 February 25
- 1.303 February 23
- 1.304 February 21
- 1.305 February 20
- 1.306 February 19
- 1.307 February 18
- 1.308 February 14
- 1.309 February 13
- 1.310 February 12
- 1.311 February 11
- 1.312 February 10
- 1.313 February 9
- 1.314 February 6
- 1.315 February 4
- 1.316 January 31
- 1.317 January 30
- 1.318 January 28
- 1.319 January 25
- 1.320 January 24
- 1.321 January 23
- 1.322 January 21
- 1.323 January 20
- 1.324 January 16
- 1.325 January 15
- 1.326 January 14
- 1.327 January 10
- 1.328 January 9
- 1.329 January 8
- 1.330 January 7
- 1.331 January 6
- 1.332 January 1
- 1.333 December 27
- 1.334 December 23
- 1.335 December 21
- 1.336 December 19
- 1.337 December 17
- 1.338 December 14
- 1.339 December 4
- 1.340 December 1
- 1.341 November 25
- 1.342 November 24
- 1.343 November 14
- 1.344 November 13
- 1.345 November 3
- 1.346 November 1
- 1.347 October 23
- 1.348 October 20
- 1.349 October 15
- 1.350 October 10
- 1.351 September 23
- 1.352 September 11
- 1.353 August 24
- 1.354 August 23
- 1.355 August 22
- 1.356 August 20
- 1.357 August 19
- 1.358 August 16
- 1.359 August 15
- 1.360 August 11
- 1.361 August 10
- 1.362 August 6
- 1.363 August 5
- 1.364 August 2
- 1.365 August 1
- 1.366 July 31
- 1.367 July 30
- 1.368 July 29
- 1.369 July 25
- 1.370 July 20
- 1.371 July 19
- 1.372 July 10
- 1.373 July 5
- 1.374 July 3
- 1.375 July 2
- 1.376 July 1
- 1.377 June 30
- 1.378 June 26
- 1.379 June 25
- 1.380 June 24
- 1.381 June 19
- 1.382 June 17
- 1.383 June 16
- 1.384 June 15
- 1.385 June 14
- 1.386 June 13
- 1.387 June 11
- 1.388 June 10
- 1.389 June 9
- 1.390 June 8
- 1.391 June 7
- 1.392 June 5
- 1.393 June 4
- 1.394 June 3
- 1.395 June 2
- 1.396 June 1
- 1.397 May 31
- 1.398 May 30
- 1.399 May 29
- 1.400 May 28
- 1.401 May 27
- 1.402 May 24
- 1.403 May 23
- 1.404 May 22
- 1.405 May 21
- 1.406 May 19
- 1.407 May 14
- 1.408 May 10
- 1.409 May 9
- 1.410 May 6
- 1.411 May 4
- 1.412 May 2
- 1.413 May 1
- 1.414 April 27
- 1.415 April 26
- 1.416 April 25
- 1.417 April 24
- 1.418 April 23
- 1.419 April 19
- 1.420 April 15
- 1.421 April 11
- 2 Server Admin Log
- 3 Nova_Resource:Rcm.cac/SAL
- 4 Nova_Resource:Tools.wikibugs/SAL
- 4.1 2016-05-06
- 4.2 2016-05-05
- 4.3 2016-04-22
- 4.4 2016-04-21
- 4.5 2016-04-15
- 4.6 2016-04-11
- 4.7 2016-04-08
- 4.8 2016-04-01
- 4.9 2016-03-29
- 4.10 2016-03-28
- 4.11 2016-02-18
- 4.12 2016-02-16
- 4.13 2016-02-11
- 4.14 2016-02-08
- 4.15 2016-01-07
- 4.16 2015-12-21
- 4.17 2015-12-07
- 4.18 2015-12-03
- 4.19 2015-11-04
- 4.20 2015-10-28
- 4.21 2015-10-21
- 4.22 2015-10-07
- 4.23 2015-10-06
- 4.24 2015-09-27
- 4.25 2015-09-24
- 4.26 2015-09-23
- 4.27 2015-09-22
- 4.28 2015-09-19
- 4.29 2015-09-18
- 4.30 2015-09-16
- 4.31 2015-09-07
- 4.32 2015-09-04
- 4.33 2015-09-03
- 4.34 2015-09-02
- 4.35 2015-08-28
- 4.36 2015-08-25
- 4.37 2015-08-19
- 4.38 2015-08-04
- 4.39 2015-07-30
- 4.40 2015-07-28
- 4.41 July 2
- 4.42 June 15
- 4.43 June 10
- 4.44 June 9
- 4.45 June 5
- 4.46 May 25
- 4.47 May 19
- 4.48 May 2
- 4.49 May 1
- 4.50 April 29
- 4.51 April 24
- 4.52 April 21
- 4.53 April 20
- 4.54 April 18
- 4.55 April 13
- 4.56 April 7
- 4.57 April 1
- 4.58 March 30
- 4.59 March 23
- 4.60 March 18
- 4.61 March 16
- 4.62 March 13
- 4.63 March 11
- 4.64 March 10
- 4.65 March 9
- 4.66 March 8
- 4.67 March 3
- 4.68 February 28
- 4.69 February 24
- 4.70 February 22
- 4.71 February 20
- 4.72 February 18
- 4.73 February 17
- 4.74 February 16
- 4.75 February 13
- 4.76 February 11
- 4.77 February 7
- 4.78 February 6
- 4.79 February 5
- 4.80 February 3
- 4.81 February 2
- 4.82 January 30
- 4.83 January 28
- 4.84 January 22
- 4.85 January 19
- 4.86 January 15
- 4.87 January 14
- 4.88 January 12
- 4.89 January 11
- 4.90 January 10
- 4.91 January 9
- 4.92 January 8
- 4.93 January 7
- 4.94 January 5
- 4.95 December 31
- 4.96 December 22
- 4.97 December 18
- 4.98 December 17
- 4.99 December 16
- 4.100 December 10
- 4.101 December 9
- 4.102 December 4
- 4.103 December 1
- 4.104 November 29
- 4.105 November 25
- 4.106 November 24
- 4.107 November 18
- 4.108 October 11
- 4.109 September 24
- 4.110 August 19
- 4.111 July 1
- 4.112 May 22
- 4.113 April 30
- 4.114 April 28
- 4.115 April 27
- 5 Nova_Resource:Tools.heritage/SAL
- 6 Release Engineering/SAL
- 7 Nova_Resource:Tools.admin/SAL
- 8 Nova_Resource:Mobile/SAL
- 9 Nova_Resource:Redirects/SAL
- 10 Nova_Resource:Math/SAL
Nova_Resource:Tools/SAL
2016-05-08
- 07:06 YuviPanda: restarted admin tool
2016-05-05
- 13:11 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/280652/ on puppetmaster
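A cherry-pick like the one above is usually applied on the project's self-hosted puppetmaster along these lines (the checkout path, repository and patchset number are assumptions, not taken from the log entry):
cd /var/lib/git/operations/puppet        # assumed location of the puppet checkout
PS=1                                     # patchset number, illustrative
git fetch https://gerrit.wikimedia.org/r/operations/puppet refs/changes/52/280652/$PS
git cherry-pick FETCH_HEAD               # keep the change applied locally until it is merged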
2016-04-28
- 04:15 YuviPanda: delete half of the trusty webservice jobs
- 04:00 YuviPanda: deleted all precise webservice jobs, waiting for webservicemonitor to bring them back up
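One way to bulk-delete webservice jobs as described above (a sketch only, not necessarily the command that was used; the queue pattern and header-skipping filter are illustrative):
# list jobs on the precise (12xx) lighttpd webgrid nodes, drop the two header lines, delete the rest;
# webservicemonitor then restarts each tool's webservice from its service.manifest
qstat -u '*' -q 'webgrid-lighttpd@tools-webgrid-lighttpd-12*' | awk 'NR>2 {print $1}' | xargs -r qdel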
2016-04-24
- 12:22 YuviPanda: force deleted job 5435259 from pbbot per PeterBowman
2016-04-11
- 14:20 andrewbogott: moving tools-bastion-mtemp to labvirt1009
2016-04-06
- 15:20 bd808: Removed local hack for T131906 from tools-puppetmaster-01
2016-04-05
- 21:24 bd808: Committed local hack on tools-puppetmaster-01 to get elasticsearch working again
- 21:02 bd808: Forcing puppet runs to fix elasticsearch
- 20:39 bd808: Elasticsearch processes down. Looks like a prod puppet change that needs tweaking for tool labs
2016-04-04
- 19:43 YuviPanda: new bastion!
- 19:15 chasemp: reboot tools-bastion-05
2016-03-30
- 15:50 andrewbogott: rebooting tools-proxy-01 in hopes of clearing some bad caches
2016-03-28
- 20:51 yuvipanda: lifted RAM quota from 900Gigs to 1TB?!
- 20:30 chasemp: change perms on grant files from create-dbusers to chmod 400 and chattr +i
2016-03-27
- 17:40 scfc_de: tools-webgrid-generic-1405, tools-webgrid-lighttpd-1411, tools-web-static-01, tools-web-static-02: "apt-get install cloud-init" and accepted changes for /etc/cloud/cloud.cfg (users: + default; cloud_config_modules: + ssh-import-id, + puppet, + chef, + salt-minion; system_info/package_mirrors/arches[i386, amd64]/search/primary: + http://%(region)s.clouds.archive.ubuntu.com/ubuntu/).
2016-03-18
- 15:47 chasemp: had to kill stalkboten as it was logging constant errors filling logs to the tune of hundreds of gigs
- 15:36 chasemp: cleanup huge log collection for broken bot: /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs# rm -fR SpamBotLog.log\.*
2016-03-11
- 20:57 mutante: reverted font changes - puppet runs recovering
- 20:37 mutante: more puppet issues due to font dependencies on trusty, on it
- 19:39 mutante: should a tools-exec server be influenced by font packages on an mw appserver?
- 19:39 mutante: fixed puppet runs on tools-exec (gerrit 276792)
2016-03-02
- 14:56 chasemp: qdel 3956069 and 3758653 for abusing auth
2016-02-29
- 21:49 scfc_de: tools-exec-1218: rm -f /usr/local/lib/nagios/plugins/check_eth to work around "Got passed new contents for sum" (https://tickets.puppetlabs.com/browse/PUP-1334).
- 21:20 scfc_de: tools-exec-1209: rm -f /var/lib/puppet/state/agent_catalog_run.lock (no Puppet process running, probably from the reboots).
- 20:58 scfc_de: Ran "dpkg --configure -a" on all instances.
- 13:50 scfc_de: Deployed jobutils/misctools 1.10.
2016-02-28
- 20:08 bd808: Removed unwanted NFS mounts from tools-elastic-01.tools.eqiad.wmflabs
2016-02-26
- 19:08 bd808: Upgraded Elasticsearch on tools-elastic-0[123] to 1.7.5
2016-02-25
- 21:43 scfc_de: Deployed jobutils/misctools 1.9.
2016-02-24
- 19:46 chasemp: runonce deployed for https://gerrit.wikimedia.org/r/#/c/272891/
2016-02-22
- 15:55 andrewbogott: redirecting tools-login.wmflabs.org to tools-bastion-05
2016-02-19
- 15:58 chasemp: rerollout tools nfs shaping pilot for sanity in anticipation of formalization
- 09:21 _joe_: killed cluebot3 instance on tools-exec-1207, writing 20 M/s to the error log
- 00:50 yuvipanda: failover services to services-02
2016-02-18
- 20:37 yuvipanda: failover proxy back to tools-proxy-01
- 19:46 chasemp: repool labvirt1003 and depool labvirt1004
- 18:19 chasemp: draining nodes from labvirt1001
2016-02-16
- 21:33 chasemp: reboot of bastion-1002
2016-02-12
- 19:56 chasemp: nfs traffic shaping pilot round 2
2016-02-05
- 22:01 chasemp: throttle some vm nfs write speeds
- 16:49 scfc_de: find /data/project/wikidata-edits -group ssh-key-ldap-lookup -exec chgrp tools.wikidata-edits \{\} + (probably a remnant of the work on ssh-key-ldap-lookup last summer).
- 16:45 scfc_de: Removed /data/project/test300 (uid/gid 52080; none of them resolves, no databases, just an unmodified pywikipedia clone inside).
2016-02-03
- 03:00 YuviPanda: upgraded flannel on all hosts running it
2016-01-31
- 20:01 scfc_de: tools-webgrid-generic-1405: Rebooted via wikitech; rebooting via "shutdown -r now" did not seem to work.
- 18:51 bd808: tools-elastic-01.tools.eqiad.wmflabs console shows blocked tasks, possible kernel bug?
- 18:49 bd808: tools-elastic-01.tools.eqiad.wmflabs not responsive to ssh or Elasticsearch requests; rebooting via wikitech interface
- 13:32 hashar: restarted qamorebot
2016-01-30
- 06:38 scfc_de: tools-webgrid-generic-1405: Rebooted for load ~ 175 and lots of processes stuck in D.
2016-01-29
- 21:25 YuviPanda: restarted image-resize-calc manually, no service.manifest file
2016-01-28
- 15:02 scfc_de: tools-cron-01: Rebooted via wikitech as "shutdown -r now" => "@sbin/plymouthd --mode=shutdown" => "/bin/sh -e /proc/self/fd/9" => "/bin/sh /etc/init.d/rc 6" => "/bin/sh /etc/rc6.d/S20sendsigs stop" => "sync" stuck in D. *argl*
- 14:56 scfc_de: tools-cron-01: Rebooted due to high number of processes stuck in D and load >> 100.
- 14:54 scfc_de: tools-cron-01: HUPped 43 wikitrends/refresh.sh processes, though a lot of the processes seem to be stuck in D, so I'll reboot this instance.
- 14:50 scfc_de: tools-cron-01: HUPped 85 processes /usr/lib/php5/sessionclean.
2016-01-27
- 23:07 YuviPanda: removed all members of templatetiger, added self instead, removed active shell sessions
- 20:24 chasemp: master stop, truncate accounting log to accounting.01272016, master start
- 19:34 chasemp: master start grid master
- 19:23 chasemp: stopped master
- 19:11 YuviPanda: depooled tools-webgrid-1405 to prep for restart, lots of stuck processes
- 18:29 valhallasw`cloud: job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 .
- 18:26 valhallasw`cloud: messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|[email protected] reports running job (2551539.1/master) in queue "[email protected]" that was not supposed to be there - killing". SSH'ing there to investigate
- 18:24 valhallasw`cloud: 'sleep' test job also seems to work without issues
- 18:23 valhallasw`cloud: no errors in log file, qstat works
- 18:23 chasemp: master sge restarted post dump and restart for jobs db
- 18:22 valhallasw`cloud: messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016'
- 18:20 chasemp: master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
- 18:19 valhallasw`cloud: dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M
- 18:17 valhallasw`cloud: SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
- 18:14 chasemp: grid master stopped
- 00:56 scfc_de: Deployed admin/www bde15df..12a3586.
2016-01-26
- 21:28 YuviPanda: qstat -u '*' | grep E | awk '{print $1}' | xargs -L1 qmod -cj
- 21:16 chasemp: reboot tools-exec-1217.tools.eqiad.wmflabs
2016-01-25
- 20:30 YuviPanda: switched over cron host to tools-cron-01, manually copied all old cron files from tools-submit to tools-cron-01
- 19:06 chasemp: kill python merge/merge-unique.py tools-exec-1213 as it seemed to be overwhelming nfs
- 17:07 scfc_de: Deployed admin/www at bde15df2a379c33edfb8350afd2f0c7186705a93.
2016-01-23
- 15:49 scfc_de: Removed remnant send_puppet_failure_emails cron entries except from unreachable hosts sacrificial-kitten, tools-worker-06 and tools-worker-1003.
2016-01-21
- 22:24 YuviPanda: deleted tools-redis-01 and -02 (are on 1001 and 1002 now)
- 21:13 YuviPanda: repooled exec nodes on labvirt1010
- 21:08 YuviPanda: gridengine-master started, verified shadow hasn't started
- 21:00 YuviPanda: stop gridengine master
- 20:51 YuviPanda: repooled exec nodes on labvirt1007 (correction to the last message)
- 20:51 YuviPanda: repooled exec nodes on labvirt1006
- 20:39 YuviPanda: failover tools-static to tools-web-static-01
- 20:38 YuviPanda: failover tools-checker to tools-checker-01
- 20:32 YuviPanda: depooled exec nodes on 1007
- 20:32 YuviPanda: repooled exec nodes on 1006
- 20:14 YuviPanda: depooled all exec nodes in labvirt1006
- 20:11 YuviPanda: repooled exec nodes on 1005
- 19:53 YuviPanda: depooled exec nodes on labvirt1005
- 19:49 YuviPanda: repooled exec nodes from labvirt1004
- 19:48 YuviPanda: failed over proxy to tools-proxy-01 again
- 19:31 YuviPanda: depooled exec nodes from labvirt1004
- 19:29 YuviPanda: repooled exec nodes from labvirt1003
- 19:13 YuviPanda: depooled instances on labvirt1003
- 19:06 YuviPanda: re-enabled queues on exec nodes that were on labvirt1002
- 19:02 YuviPanda: failed over tools proxy to tools-proxy-02
- 18:46 YuviPanda: drained and disabled queues on all nodes on labvirt1002
- 18:38 YuviPanda: restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead
2016-01-12
- 09:48 scfc_de: tools-checker-01: Removed exim paniclog (OOM).
2016-01-11
- 22:19 valhallasw`cloud: reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30
- 22:12 YuviPanda: restarted gridengine master again
- 22:07 valhallasw`cloud: set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
- 22:05 valhallasw`cloud: set maxujobs back to 0, but doesn't help
- 21:57 valhallasw`cloud: reset to 7:30
- 21:57 valhallasw`cloud: that cleared the measure, but jobs still not starting. Ugh!
- 21:56 valhallasw`cloud: set job_load_adjustments_decay_time = 0:0:0
- 21:45 YuviPanda: restarted gridengine master
- 21:43 valhallasw`cloud: qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
- 21:42 valhallasw`cloud: resetting to 0:7:30, as it's not having the intended effect
- 21:41 valhallasw`cloud: currently 353 jobs in qw state
- 21:40 valhallasw`cloud: that's load_adjustment_decay_time
- 21:40 valhallasw`cloud: temporarily sudo qconf -msconf to 0:0:1
- 19:59 YuviPanda: Set maxujobs (max concurrent jobs per user) on gridengine to 128
- 17:51 YuviPanda: kill all queries running on labsdb1003
- 17:20 YuviPanda: stopped webservice for quentinv57-tools
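For reference, the parameters tuned above (maxujobs, job_load_adjustments, load_adjustment_decay_time) all live in the grid engine scheduler configuration; a minimal sketch of how they are inspected and edited, with the values mentioned in the entries:
qconf -ssconf        # show the current scheduler configuration
sudo qconf -msconf   # open it in an editor; the relevant lines look like:
#   maxujobs                     128
#   job_load_adjustments         np_load_avg=0.50
#   load_adjustment_decay_time   0:7:30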
2016-01-09
- 21:07 valhallasw`cloud: moved tools-checker/208.80.155.229 back to tools-checker-01
- 21:02 andrewbogott: rebooting tools-checker-01 as it is unresponsive.
- 13:12 valhallasw`cloud: tools-worker-1002. is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.
2016-01-08
- 19:46 chasemp: couldn't get into tools-mail-01 at all and it seemed borked so I rebooted
- 17:23 andrewbogott: killing tools.icelab as per https://wikitech.wikimedia.org/wiki/User_talk:Torin#Running_queries_on_tools-dev_.28tools-bastion-02.29
2015-12-30
- 04:06 YuviPanda: delete all webgrid jobs to start with a clean slate
- 03:54 YuviPanda: qmod -rj all tools in the continuous queue, they are all orphaned
- 02:39 YuviPanda: remove lbenedix and ebekebe from tools.hcclab
- 00:40 YuviPanda: restarted master on grid-master
- 00:40 YuviPanda: copied and cleaned out spooldb
- 00:10 YuviPanda: reboot tools-grid-shadow
- 00:08 YuviPanda: attempt to stop shadowd
- 00:03 YuviPanda: attempting to start gridengine-master on tools-grid-shadow
- 00:00 YuviPanda: kill -9'd gridengine master
2015-12-29
- 23:31 YuviPanda: rebooting tools-grid-master
- 23:22 YuviPanda: restart gridengine-master on tools-grid-master
- 00:18 YuviPanda: shut down redis on tools-redis-01
2015-12-28
- 22:34 chasemp: attempt to unmount nfs volumes on tools-redis-01 to debug but it hangs (I am on console and see root at console hang on login)
- 22:31 YuviPanda: disable NFS on tools-redis-1001 and 1002
- 21:32 YuviPanda: disable puppet on tools-redis-01 and -02
- 21:27 YuviPanda: created tools-redis-1001
2015-12-23
- 21:21 YuviPanda: deleted tools-worker-01 to -05, creating tools-worker-1001 to 1005
- 21:19 valhallasw`cloud: tools-proxy-01: umount /home /data/project /data/scratch /public/dumps
- 19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
- 18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
- 18:40 valhallasw`cloud: scratch that, first going to eat dinner
- 18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
- 14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
- 10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel
2015-12-22
- 18:30 YuviPanda: rescheduling all webservices
- 18:17 YuviPanda: failed over active proxy to proxy-01
- 18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
- 01:42 YuviPanda: rebooting tools-worker-08
2015-12-21
- 18:44 YuviPanda: reboot tools-proxy-01
- 18:31 YuviPanda: failover proxy to tools-proxy-02
2015-12-20
- 00:00 YuviPanda: tools-worker-08 stuck again :|
2015-12-18
- 15:16 andrewbogott: rebooting locked up host tools-exec-1409
2015-12-16
- 23:14 andrewbogott: rebooting tools-exec-1407, unresponsive
- 22:48 YuviPanda: run qmod -c '*' to clear error state on gridengine
- 21:28 andrewbogott: deleted tools-docker-registry-01
- 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup
2015-12-12
- 10:08 YuviPanda: restarted cron on tools-submit
2015-12-10
- 12:47 valhallasw`cloud: broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01.
2015-12-07
- 13:46 Coren: The new grid masters are happy, killing the old ones (-shadow, -master)
- 10:46 YuviPanda: restarted nscd on tools-proxy-01
2015-12-06
- 10:29 YuviPanda: did webservice start on tool 'derivative', was missing service.manifest
2015-12-04
- 19:33 Coren: switching master role to tools-grid-master
- 04:42 yuvipanda: disabled puppet on tools-puppetmaster-01 because everything sucks
- 04:09 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/256618 to tools-puppetmaster-01
2015-12-02
- 18:29 Coren: switching gridmaster activity to tools-grid-shadow
- 05:13 yuvipanda: increased security groups quota to 50 because why not
2015-12-01
- 21:07 yuvipanda: added bd808 as admin
- 21:01 andrewbogott: deleted tool/service group tools.test300
2015-11-25
- 15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002
2015-11-20
- 22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
- 21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
- 21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
- 21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
- 21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
- 20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
- 20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
- 20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
- 20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
- 19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
- 19:25 Coren: -lighttpd-1403 wants a restart.
- 19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
- 18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
- 18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
- 18:10 Coren: disabling puppet on the grid nodes listed at https://phabricator.wikimedia.org/P2337 so that the /tmp change in https://gerrit.wikimedia.org/r/#/c/252506/ do not apply early and break services
2015-11-17
- 19:39 YuviPanda: created tools-worker-03 to be k8s worker node
- 19:34 YuviPanda: blanked 'realm' for tools-bastion-01 to figure out what happens
2015-11-16
- 20:44 PlasmaFury: switch over the proxy to tools-proxy-01
- 17:38 PlasmaFury: deleted tools-webgrid-lighttpd-1412 for https://phabricator.wikimedia.org/T118654
2015-11-03
- 03:59 scfc_de: tools-submit, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411: Removed exim paniclog (OOM).
2015-11-02
- 22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
- 22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
- 22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
- 19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
- 19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs
2015-10-26
- 20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts
2015-10-11
- 22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from attempting to be started. It set itself up for nodejs but didn't actually have any code
2015-10-09
- 22:47 yuvipanda: kill NFS on tools-puppetmaster-01 with https://wikitech.wikimedia.org/wiki/Hiera:Tools/host/tools-puppetmaster-01
- 14:37 Coren: Beginning rotation of execution nodes to apply fix for T106170
2015-10-06
- 04:35 yuvipanda: created tools-puppetmaster-02 as hot spare
2015-10-02
- 17:30 scfc_de: tools-webgrid-lighttpd-1402: Removed exim paniclog (OOM).
2015-10-01
- 23:38 yuvipanda: actually rebooting tools-worker-02, had actually rebooted -01 earlier #facepalm
- 23:20 yuvipanda: rebooting tools-worker-02 to pickup new kernel
- 23:10 yuvipanda: failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel
- 22:58 yuvipanda: rebooted tools-proxy-02 to pick up new kernel
2015-09-30
- 07:12 yuvipanda: deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now
- 06:40 yuvipanda: migrated webproxy to tools-proxy-01
2015-09-29
- 12:08 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).
2015-09-28
- 15:24 Coren: rebooting tools-shadow after mount option changes.
2015-09-25
- 16:02 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).
2015-09-24
- 14:06 scfc_de: tools-exec-1201: Restarted grid engine exec for T109485.
- 13:56 scfc_de: tools-master: Restarted grid engine master for T109485.
2015-09-23
- 18:22 valhallasw`cloud: here = https://etherpad.wikimedia.org/p/74j8K2zIob
- 18:22 valhallasw`cloud: experimenting with https://github.com/jordansissel/fpm on tools-packages, and manually installing packages for that. Noting them here.
2015-09-16
- 17:33 scfc_de: Removed python-tools-webservice from precise-tools as apparently old version of tools-webservice.
- 01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
- 01:17 YuviPanda: attempting to move to kubernetes
2015-09-15
- 01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.
2015-09-14
- 23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.
2015-09-13
- 20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
- 20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).
2015-09-11
- 14:54 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).
2015-09-08
- 08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated. Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
- 08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
- 08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
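The aptly workflow implied by the entries above is roughly the following (the exact flags and the published distribution name are assumptions, not copied from the log):
# add every package from the shared deb directory to the local repo, then refresh the published repo
aptly repo add precise-tools /data/project/.system/deb-precise/
aptly publish update precise-tools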
2015-09-07
- 18:49 valhallasw`cloud: ran sudo mount -o remount /data/project on tools-static-01, which also solved the issue, so skipping the reboot
- 18:47 valhallasw`cloud: switched static webserver to tools-static-02
- 18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
- 17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master
2015-09-03
- 07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
- 07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
- 07:07 valhallasw`cloud: err, is empty.
- 07:07 valhallasw`cloud: puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!
2015-09-02
- 15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
- 13:58 valhallasw`cloud: rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load
- 13:55 valhallasw`cloud: restarted gridengine_exec on tools-exec-1403
- 13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py. Rescheduled that job.
- 13:16 YuviPanda: deleted all jobs of ralgisbot
- 13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
- 12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles
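For context on the 15:01 entry: -M sets the mail recipient for grid engine job notifications. A sketch of such a qsub call (address, mail flag and script name are illustrative, not copied from the tool's crontab):
qsub -M tools.sdbot@tools.wmflabs.org -m a job.sh   # -m a: send mail if the job is aborted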
2015-09-01
- 21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
- 16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
- 15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
- 06:23 valhallasw`cloud: seems to have worked. SGE :(
- 06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
- 06:08 valhallasw`cloud: e.g. "queue instance "[email protected]" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
- 06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
- 06:06 valhallasw`cloud: investigating SGE issues reported on irc/email
2015-08-31
- 23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
- 21:21 valhallasw`cloud: webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
- 21:20 valhallasw`cloud: restarted webservicemonitor
- 21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
- 21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
- 21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
- 21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
- 20:29 valhallasw`cloud: |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
- 20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
- 20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
- 20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
- 19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
- 19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
- 19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues
- 07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
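The mass rescheduling described above amounts to a loop along these lines (a sketch; the job-id file and the 5-second interval come from the entries, the rest is illustrative):
# reschedule each webgrid job id from the prepared list, pausing between jobs to avoid overloading the scheduler
while read jobid; do
    sudo qmod -rj "$jobid"
    sleep 5
done < /home/valhallaw/webgrid_jobs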
2015-08-30
- 13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
- 13:20 valhallasw`cloud: disabling 503 error page
2015-08-29
- 04:09 scfc_de: Disabled queue [email protected] (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".
2015-08-27
- 15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again
2015-08-26
- 01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start. If it goes berserk, please service bigbrothermonitor stop.
2015-08-25
- 20:23 scfc_de: tools-webgrid-generic-1405: killall mpt-statusd.
- 14:58 YuviPanda: pooled in two new instances for the precise exec pool
- 14:45 YuviPanda: reboot tools-exec-1221
- 14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
- 14:18 YuviPanda: pooled in tools-webgrid-generic-1405
- 10:16 YuviPanda: created tools-webgrid-generic-1405
- 10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
- 09:59 YuviPanda: created tools-exec-1220 and -1221
2015-08-24
- 16:37 valhallasw`cloud: more processes were started, so added a talk page message on User:Coet (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
- 16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
- 16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01
2015-08-20
- 18:44 valhallasw`cloud: both are now at 3dbbc87
- 18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
- 18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
- 18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
- 17:06 valhallasw`cloud: wait, what timezone is this?!
2015-08-19
- 10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406
2015-08-18
- 15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
- 14:42 scfc_de: tools-webgrid-lighttpd-1411: Killed mpt-statusd (T104779).
- 13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
- 13:55 valhallasw`cloud: no, wait, that's tools-webgrid-lighttpd-1411.eqiad.wmflabs, not the actual host tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs. We should fix that dns mess as well.
- 13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
- 13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
- 13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using
for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done
- 08:37 valhallasw`cloud: sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs, tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
- 08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
- 08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
- 08:21 valhallasw`cloud: still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
- 08:20 valhallasw`cloud: sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs
- 08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
- 08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
- 08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> [email protected] added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
- 08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
- 08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
- 08:00 valhallasw`cloud: running puppet agent -tv again
- 07:55 valhallasw`cloud: argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd
- 07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in
- 07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README
- 07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
- 07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
- 07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
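Pieced together from the entries above (newest first), registering the new node with the grid went roughly like this; treat it as a sketch rather than the exact commands run:
# register the exec host from its puppet-generated config file
sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
# add the host to the @webgrid hostgroup so its queue instances exist (opens an editor)
sudo qconf -mhgrp @webgrid
# enable the queue instances on the host, then start the exec daemon there
sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
sudo service gridengine-exec start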
2015-08-17
- 19:00 scfc_de: tools-checker-01, tools-exec-1410, tools-exec-catscan, tools-redis-01, tools-redis-02, tools-web-static-01, tools-webgrid-lighttpd-1406, tools-webproxy-02: Remounted /public/dumps (T109261).
- 16:17 andrewbogott: disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02
- 15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
- 14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
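Disabling and re-enabling queues around a hypervisor reboot, as logged above, is typically done with qmod (a sketch; the host is just one from the list above and the exact invocation is not recorded in the log):
sudo qmod -d "*@tools-exec-1205.eqiad.wmflabs"   # stop new jobs from landing on the node
sudo qmod -e "*@tools-exec-1205.eqiad.wmflabs"   # put it back in rotation after the reboot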
2015-08-15
- 05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
- 05:10 andrewbogott: suspending tools-exec-gift, just for a moment...
2015-08-14
- 17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
- 15:20 andrewbogott: Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
- 14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
2015-08-13
- 18:51 valhallasw`cloud: which was resolved by scfc earlier
- 18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid
Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists))
- 18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
- 16:44 andrewbogott: disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
- 14:48 andrewbogott: and tools-webgrid-lighttpd-1408
- 14:48 andrewbogott: rescheduling (and in some cases killing) jobs on tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405
2015-08-12
- 16:05 andrewbogott: depooling tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
- 15:20 valhallasw`cloud: re-enabling queues on restarted hosts
- 14:41 andrewbogott: forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410
2015-08-11
- 18:17 andrewbogott: depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow
2015-08-04
- 13:43 scfc_de: Fixed owner of ~tools.kasparbot/error.log (T99576).
2015-08-03
- 19:13 andrewbogott: deleted tools-static-01
2015-08-01
- 18:09 andrewbogott: depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork
- 16:54 scfc_de: tools-webgrid-lighttpd-1407: Removed exim paniclog (OOM).
2015-07-30
- 15:00 andrewbogott: rebooting tools-bastion-01 aka tools-login
- 14:46 scfc_de: tools-webgrid-lighttpd-1408, tools-webgrid-lighttpd-1409: Removed exim paniclog (OOM).
- 02:53 scfc_de: "webservice uwsgi-python start" for blogconverter.
- 02:40 scfc_de: qdel 545479 (hazard-bot, "release=trusty-quiet", stuck since July 9th).
- 02:39 scfc_de: qdel 301895 (projanalysis, "release=trust", stuck since July 1st).
- 02:38 scfc_de: tools-webgrid-generic-1401, tools-webgrid-generic-1402, tools-webgrid-generic-1403: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
- 01:41 scfc_de: tools-webgrid-lighttpd-1406: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
2015-07-29
- 23:43 andrewbogott: draining, rebooting tools-webgrid-lighttpd-1408
- 20:11 andrewbogott: rebooting tools-webgrid-lighttpd-1404
- 19:58 scfc_de: tools-*: sudo rmdir /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}
2015-07-28
- 17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
- 17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
- 17:16 valhallasw`cloud: disabled queue "[email protected]"
- 02:07 YuviPanda: removed pacct files from tools-bastion-01
2015-07-27
- 21:27 valhallasw`cloud: turned off process accounting on tools-login while we try to find the root cause of phab:T107052:
accton off
2015-07-19
- 01:51 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).
2015-07-11
- 00:01 mutante: fixing puppet runs on tools-webgrid-* via salt
2015-07-10
- 23:59 mutante: fixing puppet runs on tools-exec via salt
- 20:09 valhallasw`cloud: it took three of us, but adminbot is updated!
July 6
- 09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)
July 2
- 17:07 valhallasw`cloud: can't login to tools-mailrelay-01., probably because puppet was disabled for too long. Deleting instance.
- 16:12 valhallasw`cloud: I mean tools-bastion-01
- 16:12 valhallasw`cloud: stopping puppet on tools-login and tools-mail to check for changes in deploying https://gerrit.wikimedia.org/r/#/c/205914/
June 29
- 17:29 YuviPanda: failed over tools webproxy to tools-webproxy-02
June 21
- 18:57 scfc_de: tools-precise-dev: apt-get purge python-ldap3 (the previous fix for "Cache has broken packages, exiting" didn't work).
- 16:39 scfc_de: tools-precise-dev: apt-get clean ("Cache has broken packages, exiting").
- 16:33 scfc_de: tools-submit: Removed exim4 paniclog (OOM).
June 19
- 15:07 YuviPanda: remounting /data/scratch
June 10
- 11:52 YuviPanda: tools-trusty be gone
June 8
- 16:31 YuviPanda: added Nova Tools Bot as admin, for automated nova API access
June 7
- 17:05 YuviPanda: killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS
June 5
- 17:44 YuviPanda: migrate tools-shadow to labvirt1002
June 2
- 18:34 Coren: rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
- 16:27 YuviPanda: cleaned out /etc/hosts file on tools-shadow
- 16:20 Coren: switching back to tools-master
- 16:10 YuviPanda: restart nscd on tools-submit
- 15:54 Coren: Switching names for tools-exec-1401
- 15:43 Coren: adding the "new" exec nodes (aka, current nodes with new names)
- 14:34 YuviPanda: turned off dnsmasq for toollabs
- 13:54 Coren: adding new-style names for submit hosts
- 13:53 YuviPanda: moved tools-master / shadow to designate
- 13:52 Coren: new-style names for gridengine admin hosts added
- 13:28 Coren: sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
- 13:23 Coren: stracing the shadowd to see what's up; master is down as expected.
- 13:17 Coren: killing the sge_qmaster to test failover
- 12:56 YuviPanda: switched labs webproxies to designate, forcing puppet run and restarting nscd
May 29
- 13:39 YuviPanda: tools-redis-01 is redis master now
- 13:35 YuviPanda: enable puppet on all hosts, redis move-around completed
- 13:01 YuviPanda: recreating tools-redis-01 and -02
- 12:52 YuviPanda: disable puppet on all toollabs hosts for tools-redis update
- 12:27 YuviPanda: created two redis instances (tools-redis-01 and tools-redis-02), beginning to set up stuff
May 28
- 12:22 wm-bot: petrb: inserted some local IP's to hosts file
- 12:15 wm-bot: petrb: shutting nscd off on tools-master
- 12:14 wm-bot: petrb: test
- 11:28 petan: syslog is full of these May 28 11:27:36 tools-master nslcd[1041]: [81823a] <group=550> error writing to client: Broken pipe
- 11:25 petan: rebooted tools-master in order to try fix that network issues
May 27
- 20:10 LostPanda: disabled puppet on tools-shadow too
- 19:46 LostPanda: echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster haaail someone?
- 19:10 YuviPanda: reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master
- 18:58 YuviPanda: rebooting tools-master after switchover failed and it cannot seem to do DNS
May 23
- 19:56 scfc_de: tools-webgrid-lighttpd-1410: Removed exim4 paniclog (OOM).
May 22
- 20:37 yuvipanda: deleted and depooled tools-exec-07
May 20
- 20:09 yuvipanda: transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
- 20:01 yuvipanda: enabling puppet on all hosts
- 20:01 yuvipanda: tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
- 19:56 yuvipanda: copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
- 19:54 yuvipanda: copy cleaned up hosts file to /etc/hosts on tools-precise-dev
- 19:54 yuvipanda: enabled puppet on tools-precise-dev
- 19:33 yuvipanda: disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
- 06:21 yuvipanda: killed a bunch of webservice jobs stuck in dRr state
May 19
- 21:06 yuvipanda: failed over services to tools-services-02, -01 was refusing to start some webservices with permission denied errors for setegid
- 20:16 yuvipanda: qdel -f for all webservice jobs that were in dr state
- 20:12 yuvipanda: force killed croptool webservice
May 18
- 01:36 yuvipanda: created new tools-checker-01, applying role and provisioning
- 01:32 yuvipanda: killed tools-checker-01 instance, recreating
May 15
- 12:06 valhallasw: killed those perl scripts; kmlexport's lighttpd is also using excessive memory (5%), so restarting that
- 12:01 valhallasw: webgrid-lighttpd-1402 puppet failure caused by major memory usage; tools.kmlexport is running heavy perl scripts
- 00:27 yuvipanda: cleared graphite data for /var/* mounts on tools-redis
May 14
- 21:53 valhallasw: shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
- 21:11 valhallasw: forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
- 03:29 yuvipanda: drained, depooled and deleted tools-exec-15
May 10
- 22:08 yuvipanda: created tools-precise-dev instance
- 09:28 yuvipanda: cleared and depooled tools-exec-02 and -13. only job running was deadlocked for a long, long time (week)
- 05:47 scfc_de: tools-submit: Removed paniclog (OOM) and stopped apache2.
May 5
- 18:50 Betacommand: helperbot (WP:AIV bot) was running logged out and its owner is MIA; Coren killed the job from 1204 and commented out the crontab
May 4
- 21:24 yuvipanda: reboot tools-submit, was stuck
May 2
- 10:21 yuvipanda: drained all the old webgrid nodes, pooled in all the new webgrid nodes! POTATO!
- 10:13 yuvipanda: cleaned out webgrid jobs from tools-webgrid-03
- 10:12 yuvipanda: pooled tools-webgrid-lighttpd-{06-10}
- 08:56 yuvipanda: drained and deleted tools-webgrid-01
- 07:31 yuvipanda: depooled and deleted tools-webgrid-{01,02}
- 07:31 yuvipanda: disabled catmonitor task / cron, was heavily using an sqlite db on NFS
- 06:56 yuvipanda: pooled tools-webgrid-generic-{01-04}
- 03:44 yuvipanda: drained and deleted old trusty webgrid tools-webgrid-{05-07}
- 02:13 yuvipanda: created tools-webgrid-lighttpd-12{01-05} and tools-webgrid-generic-14{01-04}
- 01:59 yuvipanda: created tools-webgrid-lighttpd-14{01-10}
- 01:58 yuvipanda: increased tools instance quota
May 1
- 03:55 YuviKTM: depooled and deleted tools-exec-20
- 03:54 YuviKTM: killed final job in tools-exec-20 (9911317), decommissioning node
April 30
- 19:33 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
- 19:31 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
- 06:30 YuviKTM: added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those
- 06:13 YuviKTM: allocating new floating IPs for the new instances, because IRC bots need them.
- 05:42 YuviKTM: disabled and drained tools-exec-1{1-5} of continuous jobs
- 05:40 YuviKTM: pooled in tools-exec-121{1-9}
- 05:39 YuviKTM: rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly
- 05:39 YuviKTM: created new instances tools-exec-121{1-9} as precise
- 05:39 YuviKTM: killed tools-dev, nobody still ssh’d in, no crontabs
- 05:39 YuviKTM: depooled exec-{06-10}, rejigged jobs to newer nodes
- 05:39 YuviKTM: delete tools-exec-10, was out of jobs
- 04:28 YuviKTM: deleted tools-exec-09
- 04:27 YuviKTM: depooled tools-exec-09.eqiad.wmflabs
- 04:23 YuviKTM: repooled tools-exec-1201 is all good now
- 04:19 YuviKTM: rejuggle jobs again in trustyland
- 04:14 YuviKTM: repooled tools-exec-09, apt troubles fixed
- 04:08 YuviKTM: depooled tools-exec-09, apt troubles
- 04:04 YuviKTM: pooled tools-exec-1408 and tools-exec-1409
- 04:00 YuviKTM: pooled tools-exec-1406 and 1407
- 03:58 YuviKTM: pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now
- 03:54 YuviKTM: tools-exec-03 and -04 have been deleted a long time ago
- 03:53 YuviKTM: depooled tools-exec-03 / 04
- 03:31 YuviKTM: depooled and deleted tools-exec-12 had nothing on it
- 03:28 YuviKTM: deleted tools-exec-21 to 24, one task still running on tools-exec
- 03:24 YuviKTM: disabled and drained continuous tasks off tools-exec-20 to tools-exec-24
- 03:18 YuviKTM: pooled tools-exec-1403, 1404
- 03:13 YuviKTM: pooled tools-exec-1402
- 03:07 YuviKTM: pooled tools-exec-1405
- 03:04 YuviKTM: pooled tools-exec-1401
- 02:53 YuviKTM: created tools-exec-14{06-10}
- 02:14 YuviKTM: created tools-exec-14{01-05}
- 01:09 YuviPanda: killing local copy of python-requests, there seems to be a newer version in prod
April 29
- 19:33 valhallasw`cloud: re-created tools-mailrelay-01 with precise: Nova_Resource:I-00000bca.eqiad.wmflabs
- 19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
- 19:28 YuviPanda: recreated tools-static-02
- 19:11 YuviPanda: failed over tools-static to tools-static-01
- 14:47 andrewbogott: deleting tools-exec-04
- 14:44 Coren: -exec-04 drained; removed from queues. Rest well, old friend.
- 14:41 Coren: disabled -exec-04 (going away)
- 02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
- 02:27 YuviPanda: created tools-exec-12{01-10}
April 28
- 21:41 andrewbogott: shrinking tools-master
- 21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
- 21:32 andrewbogott: shrinking tools-redis
- 21:28 YuviPanda: attempting to failover gridengine to tools-shadow
- 21:27 andrewbogott: shrinking tools-submit
- 21:21 YuviPanda: backed up crontabs onto NFS (see the sketch after this block)
- 21:18 andrewbogott: shrinking tools-webproxy-02
- 21:14 andrewbogott: shrinking tools-static-01
- 21:11 andrewbogott: shrinking tools-exec-gift
- 21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
- 21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
- 21:01 YuviPanda: failover tools-static to tools-static-02
- 20:53 andrewbogott: stopping, shrinking, restarting tools-shadow
- 20:43 andrewbogott: stopping, shrinking, starting tools-static-02
- 20:39 valhallasw`cloud: created tools-mailrelay-01 Nova_Resource:I-00000bac.eqiad.wmflabs
- 20:26 YuviPanda: failed over tools-services to services-01
- 18:11 Coren: reenabled -webgrid-generic-02
- 18:05 Coren: reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02
- 17:44 Coren: -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained
- 14:04 Coren: reenable -exec-11 for jobs.
- 13:55 andrewbogott: stopping tools-exec-11 for a resize experiment
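A sketch of the kind of crontab backup mentioned in the 21:21 entry, assuming root on the submit host; the NFS target directory is hypothetical:
    sudo sh -c 'cp -a /var/spool/cron/crontabs/* /data/project/.system/crontab-backup/'   # one file per user; destination name made up for illustration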
April 25
- 01:32 YuviPanda: deleted tools-static, tools-static-01 has taken over
- 01:02 YuviPanda: deleted tools-login, tools-bastion-01 has been running for long enough
April 24
- 16:29 Coren: repooled -exec-02, -08, -12
- 16:05 Coren: -exec-02, -08 and -12 draining
- 15:54 Coren: reenabled tools-exec-07, -10 and -11 after reboot of host
- 15:41 Coren: -exec-03 goes away for good.
- 15:31 Coren: draining -exec-03 to ease migration
- 13:43 Coren: draining tools-exec-07,10,11 to allow virt host reboot
April 23
- 22:41 YuviPanda: disabled *@tools-exec-09
- 22:40 YuviPanda: add tools-exec-09 back to @general
- 22:38 YuviPanda: take tools-exec-09 out of the @general group (qconf sketch after this block)
- 20:53 YuviPanda: restart bigbrother
- 20:28 YuviPanda: restarted nscd on tools-login and tools-dev
- 20:22 valhallasw`cloud: removed "10.68.16.4 tools-webproxy tools.wmflabs.org" from /etc/hosts
- 13:17 andrewbogott: beginning migration of tools instances to labvirt100x hosts
- 01:00 YuviPanda: good bye tools-login.eqiad.wmflabs
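A sketch of how a node is taken out of and put back into the @general host group (per the 22:38/22:40 entries), assuming standard qconf attribute editing:
    qconf -shgrp @general                                                  # list current members
    qconf -dattr hostgroup hostlist tools-exec-09.eqiad.wmflabs @general   # take the node out
    qconf -aattr hostgroup hostlist tools-exec-09.eqiad.wmflabs @general   # put it back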
April 20
- 13:38 scfc_de: tools-mail: Removed paniclog and killed superfluous exim.
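A sketch of this recurring tools-mail clean-up, assuming the stock Debian exim4 layout; the extra process has to be picked out by hand:
    pgrep -fl exim4                        # look for more than one queue-runner daemon
    sudo kill <pid-of-extra-exim>          # keep only the instance started by the init script
    sudo rm -f /var/log/exim4/paniclog     # clear the paniclog so monitoring stops alerting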
April 18
- 20:09 YuviPanda: sysctl vm.overcommit_memory=1 on tools-redis to allow it to bgsave again (see the sketch after this block)
- 19:52 valhallasw`cloud: tools-redis unresponsive (T96485); rebooting
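A sketch of making the overcommit change above persistent, assuming a sysctl.d layout; the file name is made up:
    sudo sysctl vm.overcommit_memory=1                                                  # immediate: lets redis fork for BGSAVE even under memory pressure
    echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/60-redis-overcommit.conf   # so the setting survives a reboot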
April 17
- 01:48 YuviPanda: disable puppet on live webproxy (-01) to apply firewall changes to -02
April 16
- 20:57 Coren: -webgrid-08 drained, rebooting
- 20:46 Coren: -webgrid-03 repooled, depooling -webgrid-08
- 20:45 Coren: -webgrid-03 drained, rebooting
- 20:38 Coren: -webgrid-03 depooled
- 20:38 Coren: -webgrid-02 repooled
- 20:35 Coren: -webgrid-02 drained, rebooting
- 20:33 Coren: -webgrid-02 depooled
- 20:32 Coren: -webgrid-01 repooled
- 20:06 Coren: -webgrid-01 drained, rebooting.
- 19:56 Coren: depooling -webgrid-01 for reboot
- 14:37 Coren: rebooting -master
- 14:29 Coren: rebooting -mail
- 14:22 Coren: rebooting -shadow
- 14:22 Coren: -exec-15 repooled
- 14:19 Coren: -exec-15 drained, rebooting.
- 13:46 Coren: -exec-14 repooled. That's it for general exec nodes.
- 13:44 Coren: -exec-14 drained, rebooting.
April 15
- 21:06 Coren: -exec-10 repooled
- 20:55 Coren: -exec-10 drained, rebooting
- 20:49 Coren: -exec-07 repooled.
- 20:47 Coren: -exec-07 drained, rebooting
- 20:43 Coren: -exec-06 requeued
- 20:41 Coren: -exec-06 drained, rebooting
- 20:15 Coren: repool -exec-05
- 20:10 Coren: -exec-05 drained, rebooting.
- 19:56 Coren: -exec-04 repooled
- 19:52 Coren: -exec-04 drained, rebooting.
- 19:41 Coren: disabling new jobs on remaining (exec) precise instances
- 19:32 Coren: repool -exec-02
- 19:30 Coren: draining -exec-04
- 19:29 Coren: -exec-02 drained, rebooting
- 19:28 Coren: -exec-03 rebooted, requeueing
- 19:26 Coren: -exec-03 drained, rebooting
- 18:50 Coren: dequeuing tools-exec-03 whilst waiting for -02 to drain.
- 18:43 Coren: tools-exec-01 back sans idmap, returning to pool
- 18:40 Coren: tools-exec-01 drained of jobs; rebooting
- 18:39 YuviPanda: disabled puppet on running webproxy, tools-webproxy-01
- 18:25 Coren: disabled -exec-01 and -exec-02 to new jobs.
April 14
- 13:13 scfc_de: tools-submit: Removed exim paniclog (OOM doom).
- 13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
April 13
- 21:11 YuviPanda: restart portgranter on all webgrid nodes
April 12
- 10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
April 11
- 21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
- 02:15 YuviPanda: rebooted tools-submit, was not responding
April 10
- 07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
- 05:20 YuviPanda: delete the tomcat node finally :D
April 9
- 23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
- 23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
- 08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
April 8
- 13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
- 12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
- 09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.
April 7
- 07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
April 5
- 10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
April 4
- 22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
- 08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
- 08:49 scfc_de: Added webservice to .bigbrotherrc for the admin tool (see the sketch after this block).
- 03:35 scfc_de: Deployed jobutils/misctools 1.5 (T91954).
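A sketch of the .bigbrotherrc change in the 08:49 entry, assuming bigbrother's convention of one command to keep alive per line; the path follows from the admin tool's home directory:
    echo 'webservice' >> /data/project/admin/.bigbrotherrc   # bigbrother should now restart the admin webservice if it dies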
April 3
- 22:55 scfc_de: Removed empty cgi-bin directories.
- 20:35 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
April 2
- 20:07 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
- 20:06 scfc_de: tools-submit: Removed exim paniclog (OOM).
- 01:25 YuviPanda: created tools-bastion-02
April 1
- 00:14 scfc_de: tools-webgrid-03: Rebooted, was stuck on console input when unable to mount NFS on boot (per wikitech console output).
March 31
- 14:02 Coren: rebooting tools-submit
- 07:07 YuviPanda: moved tools.wmflabs.org to tools-webproxy-01
- 07:02 YuviPanda: reboot tools-webgrid-03 and tools-exec-03
- 00:21 andrewbogott: temporarily shutting ‘toolsbeta-pam-sshd-motd-test’ down to conserve resources. It can be restarted any time.
March 30
- 22:53 Coren: resyncing project storage with rsync
- 22:40 Coren: reboot tools-login
- 22:30 Coren: also bastion2
- 22:28 Coren: reboot bastion1 so users can log in
- 21:49 Coren: rebooting dedicated exec nodes.
- 21:49 Coren: rebooting tools-submit
- 17:27 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
March 29
- 19:30 scfc_de: tools-submit: Restarted bigbrother for T90384.
March 28
- 19:42 YuviPanda: created tools-exec-20
March 26
- 21:24 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
March 25
- 16:49 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
March 24
- 16:03 scfc_de: tools-login: Removed exim paniclog (entries from Sunday).
- 15:51 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
March 23
- 21:23 scfc_de: tools-login, tools-dev, tools-trusty: Now actually disabled role::labs::bastion per T93661 :-).
- 21:08 scfc_de: tools-login, tools-dev, tools-trusty: role::labs::bastion is still enabled due to T93663.
- 20:57 scfc_de: tools-login, tools-dev, tools-trusty: Disabled role::labs::bastion per T93661.
- 03:02 andrewbogott: wiped out atop.log on tools-dev because /var was filling up
March 22
- 23:08 scfc_de: qconf -ah tools-bastion-01.eqiad.wmflabs
- 23:07 scfc_de: for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
- 23:07 yuvipanda: copied /etc/hosts into place on tools-bastion-01
March 21
- 16:18 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
March 15
- 22:38 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
March 13
- 16:23 YuviPanda: cleaned out / on tools-trusty
March 11
- 04:28 YuviPanda: tools-redis is back now, as trusty and hopefully slightly more fortified
- 04:14 YuviPanda: kill tools-redis instance, upgrade to trusty while it is down anyway
- 03:56 YuviPanda: restarted redis server, it had OOM-killed
March 9
- 11:02 scfc_de: Deleted probably outdated proxy entry for tool wp-signpost and restarted webservice.
- 10:22 scfc_de: Deleted obsolete proxy entries without webservice for tools bracketbot/herculebot/extreg-wos/pirsquared/searchsbl/translate/yifeibot.
- 10:11 scfc_de: Restarted webservices for tools blahma/catmonitor/catscan2/contributions-summary/eagleeye/imagemapedit/jackbot/tb-dev/vcat/wikihistory/xtools-ec (cf. T91939).
- 08:27 scfc_de: qmod -cq [email protected] (OOM of two jobs in the past).
March 7
- 12:17 scfc_de: Moved obsolete packages that are installed on no instance at all from /data/project/.system/deb to ~tools.admin/archived-packages.
March 6
- 07:46 scfc_de: Set role::labs::tools::toolwatcher for tools-login.
- 07:43 scfc_de: Deployed jobutils/misctools 1.4.
March 2
- 09:53 YuviPanda: added ananthrk to project
- 08:41 YuviPanda: delete tools-uwsgi-01
- 08:11 YuviPanda: delete tools-uwsgi-02 because https://phabricator.wikimedia.org/T91065
March 1
- 15:11 YuviPanda|brb: pooled in tools-webgrid-07 to lighty webgrid, moving some tools off -05 and -06 to relieve pressure
February 28
- 07:51 YuviPanda: create tools-webgrid-07
- 01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
- 01:00 Coren: Also, that was -webgrid-05
- 00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.
February 27
- 17:53 YuviPanda: increased quota to 512G RAM and 256 cores
- 15:33 Coren: Switched back to -master. I'm making a note here: great success.
- 15:27 Coren: Gridengine master failover test part three; killing the master with -9
- 15:20 Coren: Gridengine master failover test part deux - now with verbose logs
- 15:10 YuviPanda: created tools-webgrid-generic-02
- 15:10 YuviPanda: increase instance quota to 64
- 15:10 Coren: Master restarted - test not successful.
- 14:50 Coren: testing gridengine master failover starting now (see the sketch after this block)
- 08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well
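A rough sketch of what the failover test above exercises, assuming Debian's default gridengine paths; not the exact procedure used:
    cat /var/lib/gridengine/default/common/act_qmaster   # shows which host currently owns the qmaster role
    # Killing sge_qmaster on tools-master stops its heartbeat; after a timeout, sge_shadowd on tools-shadow
    # starts its own sge_qmaster and rewrites act_qmaster, so clients follow the new master automatically.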
February 24
- 18:33 Coren: tools-submit not recovering well from outage, kicking it.
- 17:58 YuviPanda: rebooting *all* webgrid jobs on toollabs
February 16
- 02:31 scfc_de: rm -f /var/log/exim4/paniclog.
February 13
- 18:01 Coren: tools-redis is dead, long live tools-redis
- 17:48 Coren: rebuilding tools-redis with moar ramz
- 17:38 legoktm: redis on tools-redis is OOMing?
- 17:26 marktraceur: restarting grrrit-wm because it's not behaving
February 1
- 10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards.
- 07:51 YuviPanda: cleared error state of stuck queues (see the sketch after this block)
- 06:41 YuviPanda: ran chmod +xw manually on /var/run/lighttpd on webgrid-05; need to investigate why it was necessary
- 05:47 YuviPanda: completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
- 05:37 YuviPanda: added tools-webgrid-06 as trusty webnode, operational now
- 04:52 YuviPanda: migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
- 04:10 YuviPanda: widar moved to trusty
- 03:01 YuviPanda: ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.
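A sketch of how stuck queues are usually found and cleared (per the 07:51 entry), assuming plain gridengine tooling; the queue name is illustrative:
    qstat -f -explain E                            # list queue instances in error state and the reason
    qmod -cq 'task@tools-exec-06.eqiad.wmflabs'    # clear the error flag once the cause is understood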
January 29
- 17:26 YuviPanda: reschedule all tomcat jobs
January 27
- 23:27 YuviPanda: qdel -f 7662482 7661111 for Merlissimo
January 19
- 20:51 YuviPanda: because valhallasw is nice
- 10:34 YuviPanda: manually started tools-webgrid-generic-01
- 09:48 YuviPanda: restarted tools-webgrid-03
- 08:42 scfc_de: qmod -cq {continuous,mailq,task}@tools-exec-{06,10,11,15}.eqiad.wmflabs
- 08:36 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog and killed second exim (belated SAL amendment).
January 16
- 22:11 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.
January 15
- 22:10 YuviPanda: created instance tools-webgrid-generic-01
January 11
- 06:38 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.
January 8
- 07:40 YuviPanda: increase memory limit for autolist from 4G to 7G
December 23
- 06:00 YuviPanda: tools-uwsgi-01 randomly went to SHUTOFF state, rebooting from virt1000
December 22
- 07:43 YuviPanda: increased RAM and Cores quota for tools
December 19
- 16:38 YuviPanda: puppet disabled on tools-webproxy because urlproxy.lua is handhacked to remove stupid syntax errors that got merged.
- 12:00 YuviPanda|brb: created tools-static, static http server
- 07:07 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
December 17
- 22:38 YuviPanda: touched /data/project/repo/Packages so tools-webproxy stops complaining about that not existing and never running apt-get
December 12
- 14:08 scfc_de: Ran Puppet on all hosts to fix puppet-run issue.
December 11
- 07:58 YuviPanda: rebooted tools-login, wasn’t responsive.
December 8
- 00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts for tools-webproxy; otherwise puppet fails because ec2id thinks we’re not in labs: hostname -d is empty, since we set /etc/hosts to resolve the IP directly to tools-webproxy
December 7
- 06:31 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
- 06:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
December 2
- 21:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
- 21:30 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 26
- 19:26 YuviPanda: created tools-webgrid-05 on trusty to set up a working webnode for trusty
November 25
- 06:53 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 24
- 14:02 YuviPanda: rebooting tools-login, OOM'd
- 02:51 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 22
- 19:05 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
November 17
- 20:40 YuviPanda: cleaned out /tmp on tools-login
November 16
- 21:31 matanya: back to normal
- 21:27 matanya: "Could not resolve hostname bastion.wmflabs.org"
November 15
- 07:24 YuviPanda|zzz: move coredumps from tools-webgrid-04 to /home/yuvipanda
November 14
- 20:23 YuviPanda: cleared out coredumps on tools-webgrid-01 to free up space
- 18:26 YuviPanda: cleaned out core dumps on tools-webgrid
- 16:55 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM).
November 13
- 21:11 YuviPanda: disable puppet on tools-dev to check shinken
- 21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
- 20:38 YuviPanda: didn't actually stop puppet, need more patches
- 20:38 YuviPanda: stopping puppet on tools-dev to test shinken
- 15:30 scfc_de: tools-exec-06, tools-webgrid-01: rm -f /var/tmp/core/*.
- 13:31 scfc_de: tools-exec-09, tools-exec-11, tools-exec-13, tools-exec-14, tools-exec-15, tools-webgrid-02, tools-webgrid-04: rm -f /var/tmp/core/*.
November 12
- 22:07 StupidPanda: enabled puppet on tools-exec-07
- 21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
- 21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
- 20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken
November 7
- 13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).
November 6
- 13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).
November 5
- 19:15 mutante: exec nodes have p7zip-full now
- 10:07 YuviPanda: cleaned out pacct and atop logs on tools-login
November 4
- 19:50 mutante: apt-get clean on tools-login, and gzipped some logs
November 1
- 12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).
October 30
- 14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
- 06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp
October 27
- 16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
- 15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.
October 26
- 12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
- 12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
- 12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.
October 24
- 20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006
October 23
- 22:55 Coren: reboot tools-shadow, upstart seems hosed
October 14
- 23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07
October 11
- 15:31 andrewbogott: rebooting tools-master, stab in the dark
- 06:01 YuviPanda: restarted gridengine-master on tools-master
October 4
- 18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b
October 2
- 17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
- 17:32 andrewbogott: removing ryan lane from tools admins, because his email in ldap is defunct and I get bounces every time something goes wrong in tools
September 28
- 14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3
September 25
- 14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now
September 17
- 21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap
September 15
- 11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work
September 13
- 20:52 yuvipanda: cleaned out rotated log files on tools-webproxy
September 12
- 21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase
September 8
- 16:08 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM @ 2014-09-07 15:13:59)
September 5
- 22:22 scfc_de: Deleted stale nginx entries for "rightstool" and "svgcheck"
- 22:20 scfc_de: Stopped 12 webservices for tool "meta" and started one
- 18:50 scfc_de: geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"
September 4
- 19:47 lokal-profil: local-heritage: Renamed two Swedish tables
September 2
- 04:31 scfc_de: "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076
August 23
- 17:44 scfc_de: qmod -cq task@tools-exec-07 (job #2796555, "11 : before job")
August 21
- 20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils
August 15
- 16:45 legoktm: fixed grrrit-wm
- 16:36 legoktm: restarting grrrit-wm
August 14
- 22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
August 12
- 03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again
August 2
- 16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
- 16:36 scfc_de: Manually rerouted mails for [email protected]
August 1
- 22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")
July 24
- 20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
- 14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts
July 21
- 18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
- 18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
- 18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again
July 18
- 14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
- 13:24 scfc_de: Made tools-webgrid-04 a grid submit host
- 12:58 scfc_de: Made tools-webgrid-03 a grid submit host
July 16
- 22:41 YuviPanda: reloaded nginx on tools-webproxy to pick up https://gerrit.wikimedia.org/r/#/c/146466/3
- 15:18 scfc_de: replagstats OOMed four hours after start on May 6th; with ganglia.wmflabs.org down, not restarting
- 15:14 scfc_de: Restarted toolhistory with 350 MBytes; OOMed June 1st
July 15
- 11:31 scfc_de: Started webservice for sulinfo; stopped at 2014-06-29 18:31:04
July 14
- 20:40 andrewbogott: on tools-login
- 20:39 andrewbogott: manually deleted /var/lib/apt/lists/lock, forcing apt to update
July 13
- 13:13 scfc_de: tools-exec-13: Moved /var/log around, reboot, iptables-restore & reenabled queues
- 13:11 scfc_de: tools-exec-12: Moved /var/log around, reboot & iptables-restore
July 12
- 17:57 scfc_de: tools-exec-11: Stopping apache2 service; no clue how it got there
- 17:53 scfc_de: tools-exec-11: Moved log files around, rebooted, restored iptables and reenabled queue ("qmod -e {continuous,task}@tools-exec-11...")
- 13:00 scfc_de: tools-exec-11, tools-exec-13: qmod -r continuous@tools-exec-1[13].eqiad.wmflabs in preparation of reboot
- 12:58 scfc_de: tools-exec-11, tools-exec-13: Disabled queues in preparation of reboot
- 11:58 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: mkdir -m 2750 /var/log/exim4 && chown Debian-exim:adm /var/log/exim4; I'll file a bug why the directory wasn't created later
July 11
- 11:59 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: cp -f /data/project/.system/hosts /etc/hosts
July 10
- 20:35 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: iptables-restore /data/project/.system/iptables.conf
- 16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
- 01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup
July 9
- 23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
- 23:09 YuviPanda: created tools-exec-13 with precise
- 23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
- 23:07 YuviPanda: created tools-exec-12
- 23:06 YuviPanda: created tools-exec-11
- 19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
- 14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
- 13:33 scfc_de: tools-exec-cyberbot: Freed 402398 inodes ...
- 12:50 scfc_de: tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step
- 12:40 scfc_de: tools-exec-cyberbot: Root partition has run out of inodes
- 12:34 scfc_de: tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that
- 12:28 YuviPanda: cleaned out old diamond archive logs on tools-master
- 12:28 YuviPanda: cleaned out old diamond archive logs on tools-webgrid-04
- 12:25 YuviPanda: cleaned out old diamond archive logs from tools-exec-08
July 8
- 20:57 scfc_de: tools-exec-gift: Puppet hangs due to "apt-get update" not finishing in time; manual runs of the latter take forever
- 19:52 scfc_de: tools-exec-wmt, tools-shadow: Removed stale Puppet lock files and reran manually (handy: "sudo find /var/lib/puppet/state -maxdepth 1 -type f -name agent_catalog_run.lock -ls -ok rm -f \{\} \; -exec sudo puppet agent apply -tv \;")
- 18:09 scfc_de: tools-webgrid-03, tools-webgrid-04: killall -TERM gmond (bug #64216)
- 17:57 scfc_de: tools-exec-08, tools-exec-09, tools-webgrid-02, tools-webgrid-03: Removed stale Puppet lock files and reran manually
- 17:26 scfc_de: tools-tcl-test: Rebooted because system said so
- 17:04 YuviPanda: webservice start on tools.meetbot since it seemed down
- 14:55 YuviPanda: cleaned out old diamond archive logs on tools-webproxy
- 13:39 scfc_de: tools-login: rm -f /var/log/exim4/paniclog ("daemon: fork of queue-runner process failed: Cannot allocate memory")
July 6
- 12:09 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged
July 5
- 22:36 YuviPanda: cleared diamond archive logs on a bunch of machines, submitted patch to get rid of archive logs
- 22:17 YuviPanda: changed grid scheduling config, set weight_priority to 0.1 from 0.0 for https://bugzilla.wikimedia.org/show_bug.cgi?id=67555
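A sketch of how that scheduler knob is changed, assuming plain gridengine tooling:
    qconf -ssconf | grep weight_priority   # show the current scheduler configuration value
    qconf -msconf                          # opens the scheduler config in $EDITOR; set weight_priority to 0.100000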
July 4
- 08:51 scfc_de: tools-exec-08 (some hours ago): rm -f /var/log/diamond/* && restart diamond
- 00:02 scfc_de: tools-master: rm -f /var/log/diamond/* && restart diamond
July 3
- 16:59 Betacommand: Coren: It may take a while though; what the catscan queries were blocking is a DDL query changing the schema, and that pauses replication.
- 16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
- 14:37 Betacommand: replication for enwiki is halted; current lag is at 9876
July 2
- 00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
- 00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats
July 1
- 23:09 legoktm: tools-pywikibot started the webservice, don't know why it wasn't running
- 21:08 scfc_de: Reset queues in error state again
- 17:51 YuviPanda: tools-exec-04 removed stale pid file and force puppet run
- 16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
- 15:54 YuviPanda: tools-exec-02 removed stale puppet pid file, forcing run
- 15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
- 15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
- 01:53 YuviPanda: tools-exec-10 applied biglogs, moved logs around, killed some old diamond logs
- 01:41 YuviPanda: tools-exec-03 restarted diamond, atop, exim4, ssh to pick up new log partition
- 01:40 YuviPanda: tools-exec-03 applied biglogs, moved logs around, killed some old diamond logs
- 01:34 scfc_de: tools-exec-03, tools-exec-10: Removed /var/log/diamond/diamond.log, restarted diamond and bzip2'ed /var/log/diamond/*.log.2014*
June 30
- 22:10 YuviPanda: ran webservice start for enwp10
- 22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
- 22:01 YuviPanda: removed stale lockfile for puppet, forcing run
- 19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
- 17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
- 17:27 YuviPanda: created tools-webgrid-03 and added it to the queue
June 29
- 19:45 scfc_de: magnustools: "webservice start"
- 18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead
June 28
- 21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy
June 21
- 20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki
June 20
- 21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
- 14:47 scfc_de: Restarted webservice for mono; cf. bug #64219
June 16
- 23:50 scfc_de: Shut down diamond services and removed log files on all hosts
June 15
- 17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, and so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non preallocating version is 'not meant for production', so putting on hold for now
- 16:50 scfc_de: qmod -cq [email protected]
- 16:48 scfc_de: tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
- 16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)
June 13
- 22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to login before. /var/log/auth.log had no record of their attempts, but now seems to work. straange
June 10
- 21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
- 21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var
June 3
- 17:50 Betacommand: Brief network outage. Cause: not clearly determined yet; we aborted the investigation to roll back and restore service. As far as we can tell, there is something subtly wrong with the LACP switch configuration.
June 2
- 20:15 YuviPanda: create instance tools-trusty-test to test nginx proxy on trusty
- 19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though. iipsrv.fcgi however has TMPDIR set as planned.
May 27
- 18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot to /etc/hosts so that host resolution works
- 10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
- 10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
- 07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log
May 25
- 14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors
May 23
- 14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
- 14:10 andrewbogott: applying role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors
May 22
- 02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted (see the sketch after this block).
- 02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
- 01:46 scfc_de: hazard-bot: Disabled minutely cron job github-updater
- 01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html"
- 01:12 scfc_de: tools-mail: /var is full
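A generic sketch of what the biglogs-style move involves (normally applied via the puppet role rather than by hand); the volume group name and size are placeholders, the spare /dev/vdb is per the labs images:
    pvcreate /dev/vdb && vgcreate <vg> /dev/vdb
    lvcreate -L 8G -n logs <vg> && mkfs.ext4 /dev/<vg>/logs
    mount /dev/<vg>/logs /mnt && rsync -a /var/log/ /mnt/ && umount /mnt
    mount /dev/<vg>/logs /var/log    # plus an fstab entry so it survives reboots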
May 20
- 18:34 YuviPanda: back to homerolled nginx 1.5 on proxy, newer versions causing too many issues
May 16
- 17:01 scfc_de: tools-webgrid-02: rm -f /tmp/core (tools.misc2svg, May 13 06:10, 3861106688)
May 14
- 16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
- 00:23 Betacommand: 503's related to bug 65179
May 13
- 20:36 YuviPanda: restarting redis on tools-webproxy fixed 503s
- 20:36 valhallasw: redis failed, causing tools-webproxy to throw 503's
- 19:09 marktraceur: Restarted grrrit because it had a stupid nick
May 10
- 14:50 YuviPanda: upgraded nginx to 1.7.0 on tools-webproxy to get SPDY/3.1
May 9
- 13:16 scfc_de: Cleared error state of queues {continuous,mailq,task}@tools-exec-06 and webgrid-lighttpd; no obvious or persistent causes
May 6
- 19:31 scfc_de: replagstats fixed; Ganglia graphs are now under the virtual host "tools-replags"
- 17:53 scfc_de: Don't think replagstats is really working ...
- 16:40 scfc_de: Moved ~scfc/bin/replagstats to ~tools.admin/bin/ and enabled as a continuous job (cf. also bug #48694).
April 28
- 11:51 YuviPanda: pywikibugs Deployed bf1be7b
April 27
- 13:34 scfc_de: Restarted webservice for geohack and moved {access,error}.log to {access,error}.log.1
April 24
- 23:39 YuviPanda: restarted grrrit-wm, not greg-g. greg-g does not survive restarts and hence care must be taken to make sure he is not.
- 23:38 YuviPanda: restarted greg-g after cherry-picking aec09a6 for auth of IRC bot
- 23:33 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/129610
- 13:07 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (relay_domains bug)
April 20
- 14:27 scfc_de: tools-redis: Set role::labs::lvm::mnt and $lvm_mount_point=/var/lib, moved the data around and rebooted
- 14:08 scfc_de: tools-redis: /var is full
- 08:59 legoktm: grrrit-wm: 2014-04-20T08:28:15.889Z - error: Caught error in redisClient.brpop: Redis connection to tools-redis:6379 failed - connect ECONNREFUSED
- 08:48 legoktm: Your job 438884 ("lolrrit-wm") has been submitted
- 08:47 legoktm: [01:28:28] * grrrit-wm has quit (Remote host closed the connection)
April 13
- 14:20 scfc_de: Restarted webservice for wikihistory to see if the change to PHP_FCGI_MAX_REQUESTS increases reliability
- 14:17 scfc_de: tools-webgrid-01, tools-webgrid-02: Set PHP_FCGI_MAX_REQUESTS to 500 in /usr/local/bin/lighttpd-starter per http://redmine.lighttpd.net/projects/1/wiki/docs_performancefastcgi#Why-is-my-PHP-application-returning-an-error-500-from-time-to-time
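The tweak above boils down to an exported environment variable in the (shell) starter script; a one-line sketch, with the rest of the script unchanged:
    export PHP_FCGI_MAX_REQUESTS=500   # each php-cgi worker exits and is respawned after 500 requests, limiting leak build-up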
April 12
- 23:51 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("unknown named domain list "+relay_domains"")
April 11
- 16:21 scfc_de: tools-login: Killed -HUP process consuming 2.6 GByte; cf. wikitech:User talk:Ralgis#Welcome to Tool Labs
April 10
- 18:20 scfc_de: tools-webgrid-01, tools-webgrid-02: "kill -HUP" all php-cgis that are not (grand-)children of lighttpd processes
April 8
- 05:06 Ryan_Lane: restart nginx on tools-proxy-test
- 05:03 Ryan_Lane: upgraded libssl on all nodes
April 4
- 15:48 Coren: Moar powar!!1!one: added two exec nodes (-09 -10) and one webgrid node (-02)
- 11:11 scfc_de: Set /data/project/.system/config/wikihistory.workers to 20 on apper's request
March 30
- 18:16 scfc_de: Removed empty directories /data/project/{d930913,sudo-test{,-2},testbug{,2,3}}: Corresponding service groups don't exist (anymore)
- 18:13 scfc_de: Removed /data/project/backup: Only empty dynamic-proxy backup files of January 3rd and earlier
March 29
- 10:14 wm-bot: petrb: disabled 1 job in cron in -login of user tools.tools-info which was killing login server
March 28
- 11:53 wm-bot: petrb: did the same on -mail server (removed /var/log/exim4/paniclog) so that we don't get spam every day
- 11:51 wm-bot: petrb: removed content of /var/log/exim4/paniclog
- 11:49 wm-bot: petrb: disabled default vimrc which everybody hates on -login
March 21
- 16:50 scfc_de: tools-login: pkill -u tools.bene (OOM)
- 16:13 scfc_de: rmdir /home/icinga (totally empty, "drwxr-xr-x 2 nemobis 50383 4096 Mar 17 16:42", perhaps artifact of mass migration?)
- 15:49 scfc_de: sudo cp -R /etc/skel /home/csroychan && sudo chown -R csroychan.wikidev /home/csroychan; that should close [[bugzilla:62132]]
- 15:15 scfc_de: sudo cp -R /etc/skel /home/annabel && sudo chown -R annabel.wikidev /home/annabel
- 15:14 scfc_de: sudo chown -R torin8.wikidev /home/torin8
March 20
- 18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]
March 5
- 13:57 wm-bot: petrb: test
March 4
- 22:35 wm-bot: petrb: uninstalling it from -login too
- 22:32 wm-bot: petrb: uninstalling apache2 from tools-dev it has nothing to do there
March 3
- 19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make system useable and finish upgrade
- 19:17 wm-bot: petrb: upgrading all packages on webserver-02
- 19:15 petan: rebooting webserver-01 which is totally dead
- 19:07 wm-bot: petrb: restarting apache on webserver-02 it complains about OOM but the server has more than 1.5g memory free
- 19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
- 16:44 scfc_de: tools-webserver-03: Apache was swamped by request for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
- 12:54 scfc_de: tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory.
- 12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
- 12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.
March 1
- 03:42 Coren: disabled puppet in pmtpa tool labs
February 28
- 14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
- 00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"
February 27
- 15:28 scfc_de: chmod g-w ~fsainsbu/.forward
February 25
- 22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.
February 23
- 20:46 scfc_de: morebots: labs HUPped to reconnect to IRC
February 21
- 17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
- 15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either
February 20
- 21:23 scfc_de: tools-login: Disabled crontab for local-rezabot and left a message at User talk:Reza#Running bots on tools-login, etc. (fa:بحث_کاربر:Reza1615 is write-protected)
- 20:15 scfc_de: tools-login: Disabled crontab for local-chobot and left a message at ko:사용자토론:ChongDae#Running bots on tools-login, etc.
- 10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
- 10:30 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
- 10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch
February 19
- 20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
- 19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127
February 18
- 11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
- 11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
- 10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
- 10:37 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
February 14
- 23:54 legoktm: restarting grrrit-wm since it disappeared
- 08:19 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
February 13
- 13:11 scfc_de: Deleted old job of user veblenbot stuck in error state
- 13:08 scfc_de: Deleted old jobs of user v2 stuck in error state
- 10:49 scfc_de: tools-login: Commented out local-shuaib-bot's crontab with a pointer to Tools/Help
February 12
- 07:51 wm-bot: petrb: removed /data/project/james/adminstats/wikitools per request from james on irc
February 11
- 15:47 scfc_de: Restarted webservice for geohack
- 13:02 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
- 13:00 scfc_de: Killed -HUP local-hawk-eye-bot's jobs; one was hanging with a stale NFS handle on tools-exec-05
February 10
- 23:16 Coren: rebooting webproxy (braindead autofs)
February 9
- 18:14 legoktm: restarting grrrit-wm, it keeps joining and quitting
- 04:27 legoktm: rebooting grrrit-wm - https://gerrit.wikimedia.org/r/#/c/112308
February 6
- 22:50 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/111889
February 4
- 20:38 legoktm: restarting grrrit-wm: 'Send mediawiki/extension/Thanks to -corefeatures' https://gerrit.wikimedia.org/r/111257
January 31
- 03:43 scfc_de: Cleaned up all exim queues
- 01:26 scfc_de: chmod g-w ~{bgwhite,daniel,euku,fale,henna,hydriz,lfaraone}/.forward (test: sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -ls)
January 30
- 21:48 scfc_de: chmod g-w ~fluff/.forward
- 21:40 scfc_de: local-betabot: Added "-M" option to crontab's qsub call and rerouted queued mail (freeze, exim -Mar, exim -Mmd, thaw; sketched after this block)
- 18:33 scfc_de: tools-exec-04: puppetd --enable (apparently disabled sometime around 2014-01-16?!)
- 17:25 scfc_de: tools-exec-06: mv -f /etc/init.d/nagios-nrpe-server{.dpkg-dist,} (nagios-nrpe-server didn't start because start-up script tried to "chown icinga" instead of "chown nagios")
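The freeze/reroute/thaw recipe from the 21:40 entry, sketched for a single queued message; the message id and addresses are placeholders:
    exim -Mf <msg-id>                   # freeze the message
    exim -Mar <msg-id> <new-address>    # add the corrected recipient
    exim -Mmd <msg-id> <old-address>    # mark the old recipient as already delivered
    exim -Mt <msg-id>                   # thaw so the next queue run delivers it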
January 28
- 04:27 scfc_de: tools-webproxy: Blocked Phonifier
January 25
- 05:37 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (OOM)
January 24
- 01:07 scfc_de: tools-db: Removed /var/lib/mysql2, set expire_logs_days to 1 day
- 00:11 scfc_de: tools-db: and restarted mysqld
- 00:11 scfc_de: tools-db: Moved 4.2 GBytes of the oldest binlogs to /var/lib/mysql2/
January 23
- 19:24 legoktm: restarting grrrit-wm now https://gerrit.wikimedia.org/r/#/c/109116/
- 19:23 legoktm: ^ was for grrrit-wm
- 19:23 legoktm: re-committed password to local repo, not sure why that wasn't committed already
January 21
- 17:41 scfc_de: tools-exec-09: iptables-restore /data/project/.system/iptables.conf
January 20
- 07:02 andrewbogott: merged a lint patch to the gridengine module. Should be a noop
January 16
- 17:11 scfc_de: tools-exec-09: "iptables-restore /data/project/.system/iptables.conf" after reboot
January 15
- 13:36 scfc_de: After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19 : before writing exit_status")
- 13:27 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
- 08:54 andrewbogott: rebooted tools-exec-09
- 08:32 andrewbogott: rebooted tools-db
January 14
- 15:10 scfc_de: tools-login: pkill -u local-mlwikisource: Freed 1 GByte of memory
- 14:58 scfc_de: tools-login: Disabled local-mlwikisource's crontab with explanation
- 13:57 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (out of memory errors on 2014-01-10)
January 10
- 10:41 legoktm: grrrit-wm: restarting https://gerrit.wikimedia.org/r/106670
- 09:00 legoktm: grrrit-wm: setting up #mediawiki-feed, https://gerrit.wikimedia.org/r/106555
January 9
- 18:26 legoktm: rebased grrrit-wm on origin/master since fetching gerrit was failing
- 18:21 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/#/c/106501/
January 8
- 13:44 scfc_de: Cleared error states of continuous@tools-exec-05, task@tools-exec-05, task@tools-exec-09
January 7
- 18:59 scfc_de: tools-login, tools-mail: rm -f /var/log/exim4/paniclog (apparently some artifacts of the LDAP failure)
January 6
- 14:06 YuviPanda: deleted instance tools-mc, didn't know it had come back from the dead
January 1
- 13:24 scfc_de: tools-exec-02, tools-master, tools-shadow, tools-webserver-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
- 11:27 scfc_de: tools-webserver-01, tools-webserver-01: rm -f /var/log/exim4/paniclog; out of memory errors
- 11:18 scfc_de: Emptied /{data/project,home}/.snaplist as the snapshots themselves are not available
December 27
- 07:39 legoktm: grrrit-wm restart didn't really work.
- 07:38 legoktm: restarting grrrit-wm, for some reason it reconnected and lost its cloak
December 23
- 18:30 marktraceur: restart grrrit-wm for subbu
December 21
- 06:50 scfc_de: tools-exec-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
December 19
- 17:22 marktraceur: deploying grrrit config change
December 17
- 23:19 legoktm: rebooted grrrit-wm with new config stuffs
December 14
- 18:13 marktraceur: restarting grrrit-wm to fix its nickname
- 13:17 scfc_de: tools-exec-08: Purged packages libapache2-mod-suphp and suphp-common (probably remnants from when the host was misconfigured as a webserver)
- 13:09 scfc_de: tools-dev, tools-login, tools-mail, tools-webserver-01, tools-webserver-02: rm /var/log/exim4/paniclog (mostly out of memory errors)
December 4
- 22:15 Coren: tools-exec-01 rebooted to fix the autofs issue; will return to rotation shortly.
- 16:33 Coren: rebooting webproxy with new kernel settings to help against the DDOS
December 1
- 14:05 Coren: underlying virtualization hardware rebooted; tools-master and friends coming back up.
November 25
- 21:03 YuviPanda: created tools-proxy-test instance to play around with the dynamicproxy
- 12:16 wm-bot: petrb: deswapping -login (swapoff -a && swapon -a)
November 24
- 07:19 paravoid: disabled crontab for user avocato on tools-login, see above
- 07:17 paravoid: pkill -u avocato on tools-login, multiple /home/avocato/pywikipedia/redirect.py DoSing the bastion
November 14
- 09:12 ori-l: Added aude to lolrrit-wm maintainers group
November 13
- 22:36 andrewbogott: removed 'imagescaler' class from tools-login because that class hasn't existed for a year, which is before that instance even existed, so what the heck?
November 3
- 16:49 ori-l: grrrit-wm stopped receiving events. restarted it; didn't help. then restarted gerrit-to-redis, which seems to have fixed it.
November 1
- 16:11 wm-bot: petrb: restarted terminator daemon on -login to sort out memory issues caused by heavy mysql client by elbransco
October 23
- 15:19 Coren: deleted tools-tyrant and tools-exec-cyberbot (cleanup of obsoleted instances)
October 20
- 18:52 wm-bot: petrb: everything looks better
- 18:51 wm-bot: petrb: restarting apache server on tools-webproxy
- 18:49 wm-bot: petrb: installed links on -dev and going to investigate what is wrong with apaches, documentation, Coren, please update it
October 15
- 21:03 Coren: labs-login rebooted to fix the ownership/take issue with success.
October 10
- 09:49 addshore: tools-webserver-01 is getting a 500 Internal Server Error again
September 23
- 06:44 YuviPanda: remove unpuppetized install of openjdk-6 packages causing problems in -dev (for bug: 54444)
- 05:15 legoktm: logging a log to test the log logging
- 05:13 legoktm: logging a log to test the log logging
September 11
- 09:39 wm-bot: petrb: started toolwatcher
August 24
- 18:00 wm-bot: petrb: freed 1600mb of ram by killing yasbot processes on -login
- 17:59 wm-bot: petrb: killing all python processes of yasbot on -login; this bot needs to run on the grid, -login is constantly getting OOM because of it
August 23
- 12:17 wm-bot: petrb: test
- 12:15 wm-bot: petrb: making pv from /dev/vdb on new nodes
- 11:49 wm-bot: petrb: syncing packages of -login with exec nodes
- 11:48 petan: someone installed firefox on exec nodes, should investigate / remove
August 22
- 01:24 scfc_de: tools-webserver-03: Installed python-oursql
August 20
- 23:00 scfc_de: Opened port 3000 for intra-Labs traffic in execnode security group for YuviPanda's proxy experiments
August 19
- 09:52 wm-bot: petrb: deleting fatestwiki tool, requested by creator
August 16
- 00:16 scfc_de: tools-exec-01 doesn't come up again even after repeated reboots
August 15
- 15:14 scfc_de: tools-webserver-01: Simplified /usr/local/bin/php-wrapper
- 14:31 scfc_de: tools-webserver-01: "dpkg --configure -a" on apt-get's advice
- 14:24 scfc_de: chmod 644 ~magnus/.forward
- 03:07 scfc_de: tools-webproxy: Temporarily serving 403s to AhrefsBot/bingbot/Googlebot/PaperLiBot/TweetmemeBot/YandexBot until they reread robots.txt
- 02:02 scfc_de: robots.txt: "Disallow: /"
August 11
- 03:14 scfc_de: tools-mc: Purged memcached
August 10
- 02:36 scfc_de: Disabled terminatord on tools-login and tools-dev
- 02:24 scfc_de: chmod g-w ~whym/.forward
August 6
- 19:26 scfc_de: Set up basic robots.txt to exclude Geohack to see how that affects traffic
- 02:09 scfc_de: tools-mail: Enabled rudimentary Ganglia monitoring in root's crontab
August 5
- 20:32 scfc_de: chmod g-w ~ladsgroup/.forward
August 2
- 23:45 scfc_de: tools-dev: Installed dialog for testing
August 1
- 19:57 scfc_de: Created new instance tools-redis with redis_maxmemory = "7GB"
- 19:56 scfc_de: Added redis_maxmemory to wikitech Puppet variables
July 31
- 10:50 HenriqueCrang: ptwikis: added graph with mobile edits
July 30
- 19:08 scfc_de: tools-webproxy: Purged popularity-contest and ubuntu-standard
- 07:32 wm-bot: petrb: deleted local-addbot jobs
- 02:01 scfc_de: tools-webserver-01: Symlinked /usr/local/bin/{job,jstart,jstop,jsub} to /usr/bin; were obsolete versions.
July 29
- 15:15 scfc_de: tools-webserver-01: rm /var/log/exim4/paniclog
- 15:10 scfc_de: Purged popularity-contest from tools-webserver-01.
- 02:40 scfc_de: Restarted toolwatcher on tools-login.
- 02:11 scfc_de: Reboot tools-login, was not responsive
July 25
- 23:37 Ryan_Lane: added myself to lolrrit-wm tool
- 12:06 wm-bot: petrb: test
- 07:11 wm-bot: petrb: created /var/log/glusterfs/bricks/ to stop rotatelogs from complaining about it being missing
July 20
- 15:19 petan: rebooting tools-redis
July 19
- 07:06 petan: instances were rebooted for unknown reasons
- 00:42 helderwiki: it works! :-)
- 00:41 legoktm: test
July 10
- 18:04 wm-bot: petrb: installing mysqltcl on grid
- 18:01 wm-bot: petrb: installing tclodbc on grid
July 5
- 19:38 AzaToth: test
- 19:36 AzaToth: test for example
- 18:23 Coren: brief outage of webproxy complete (back to business!)
- 18:13 Coren: brief outage of webproxy (rollback 2.4 upgrade)
July 3
- 13:44 scfc_de: Set "HostbasedAuthentication yes" and "EnableSSHKeysign yes" in tools-dev's /etc/ssh/ssh_config
- 12:58 petan: rebooting -mc, it's apparently OOM-dying
July 2
- 16:24 wm-bot: petrb: installed maria on all nodes so we can connect to the db even from SGE
- 12:19 wm-bot: petrb: installing packages -- libmediawiki-api-perl libdatetime-format-strptime-perl libbot-basicbot-perl libdatetime-format-duration-perl
July 1
- 18:39 wm-bot: petrb: started toolwatcher on -login
- 14:22 wm-bot: petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
- 12:05 wm-bot: petrb: starting toolwatcher
- 11:40 wm-bot: petrb: tools is back o/
- 09:42 wm-bot: petrb: installing python-zmq and python-matplotlib on -dev
- 03:33 scfc_de: Rebooted tools-login apparently out of memory and not responding to ssh
June 30
- 17:58 scfc_de: Set ssh_hba to yes on tools-exec-06
- 17:13 scfc_de: Installed python-matplotlib and python-zmq on tools-login for YuviPanda
June 26
- 21:16 Coren: +Tim Landscheidt to project admins, local-admin
- 14:23 wm-bot: petrb: updating several packages on -login
- 13:43 wm-bot: petrb: killing old instance of redis: Jun15 ? 00:06:49 /usr/bin/redis-server /etc/redis/redis.conf
- 13:42 wm-bot: petrb: restarting redis
- 13:28 wm-bot: petrb: running puppet on -mc
- 13:27 wm-bot: petrb: adding ::redis role to tools-mc - if anything will break, YuviPanda did it :P
- 09:35 wm-bot: petrb: updated status.php to version which display free vmem as well
June 25
- 12:34 wm-bot: petrb: installing php5-mcrypt on exec and web
June 24
- 15:45 wm-bot: petrb: changed colors of root prompt: production vs testing
- 07:57 wm-bot: petrb: 50527 4186 22830 1 Jun23 pts/41 00:08:54 python fill2.py eats 48% of ram on -login
June 19
- 12:17 wm-bot: petrb: increasing limit on mysql connections
June 17
- 17:34 wm-bot: petrb: /var/spool/cron/crontabs/ has -rw------- 1 8006 crontab 1176 Apr 11 14:07 local-voxelbot fixing
June 16
- 21:23 Coren: 1.0.3 deployed (jobutils, misctools)
June 15
- 21:40 wm-bot: petrb: there is no LVM on -db, which we badly need; therefore no swap either, nor storage for binary logs :( I've got a feeling that mysql will die OOM soonish
- 21:39 wm-bot: petrb: db has 5% free RAM eeeek
- 18:36 wm-bot: root: removed a lot of "audit" logs from exec-04; they were eating too much storage
- 18:23 wm-bot: petrb: temporarily disabling /tmp on exec-04 in order to set up lvm
- 18:23 wm-bot: petrb: exec-04 96% / usage, creating a new volume
- 12:33 wm-bot: petrb: installing redis on tools-mc
June 14
- 12:35 wm-bot: petrb: updating logsplitter to new version
June 13
- 21:59 wm-bot: petrb: replaced logsplitter on both apache servers with a far more powerful C++ version, saving a lot of resources on both servers
- 12:43 wm-bot: petrb: tools-webserver-01 is running a quite expensive python job (currently eating almost 1 GB of RAM); it may need to be fixed or moved to a separate webserver; adding swap to prevent the machine dying OOM
- 12:22 wm-bot: petrb: killing process 31187 sort -T./enwiki/target -t of user local-enwp10 for same reason as previous one
- 12:21 wm-bot: petrb: killing process 31190 sort -T./enwiki/target of user local-enwp10 for same reason as previous one
- 12:17 wm-bot: petrb: killing process 31186 31185 69 Jun11 pts/32 1-13:14:41 /usr/bin/perl ./bin/catpagelinks.pl ./enwiki/target/main_pages_sort_by_ids.lst ./enwiki/target/pagelinks_main_sort_by_ids.lst because it seems to be a bot running on login server eating too many resources
June 11
- 07:36 wm-bot: petrb: installed libdigest-crc-perl
June 10
- 13:05 wm-bot: petrb: installing libcrypt-gcrypt-perl
- 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix !b 49383
- 08:25 wm-bot: petrb: fixing missing packages on exec nodes
June 9
- 20:44 wm-bot: petrb: moved logs on -login to separate storage
June 8
- 21:24 wm-bot: petrb: installing python-imaging-tk on grid
- 21:20 wm-bot: petrb: installing python-tk
- 21:16 wm-bot: petrb: installing python-flickrapi on grid
- 21:16 wm-bot: petrb: installing
- 16:49 wm-bot: petrb: turned off wmf style of vi on tools-dev feel free to slap me :o or do cat /etc/vim/vimrc.local >> .vimrc if you love it
- 15:33 wm-bot: petrb: grid is overloaded, needs to be either enlarged or jobs calmed down :o
- 09:55 wm-bot: petrb: backporting tcl 8.6 from debian
- 09:38 wm-bot: petrb: update python requests to version 1.2.3.1
June 7
- 15:29 Coren: Deleted no-longer-needed tools-exec-cg node (spun off to its own project)
June 5
- 09:52 wm-bot: petrb: on -dev
- 09:52 wm-bot: petrb: moving /usr to a separate volume, expect problems :o
- 09:41 wm-bot: petrb: moved /var/log to separate volume on -dev
- 09:31 wm-bot: petrb: Houston, we have a problem: / on -dev is 94% full
- 09:28 wm-bot: petrb: installed openjdk7 on -dev
- 09:00 wm-bot: petrb: removing wd-terminator service
- 08:39 wm-bot: petrb: started toolwatcher
- 07:04 wm-bot: petrb: installing maven on -dev
June 4
- 14:49 wm-bot: petrb: installing sbt in order to fix b48859
- 13:28 wm-bot: petrb: installing csh on cluster
- 08:37 wm-bot: petrb: installing python-memcache on exec nodes
June 3
- 21:40 Coren: Rebooting -login; it's thrashing. Will keep an eye on it.
- 14:15 wm-bot: petrb: removing popularity-contest
- 14:11 wm-bot: petrb: removing /etc/logrotate.d/glusterlogs on all servers to fix logrotate daemon
- 09:43 wm-bot: petrb: syncing packages on exec nodes to avoid trouble with missing libs on some of them
June 2
- 08:39 wm-bot: petrb: installing ack-grep everywhere per yuvipanda and irc
June 1
- 20:57 wm-bot: petrb: installed the following on exec nodes because they were present on some and missing on others: cpp-4.4 cpp-4.5 cython dbus dosfstools ed emacs23 ftp gcc-4.4-base iptables iputils-tracepath ksh lsof ltrace lshw mariadb-client-5.5 nano python-dbus python-egenix-mxdatetime python-egenix-mxtools python-gevent python-greenlet strace telnet time
- 20:42 wm-bot: petrb: installing wikitools cluster wide
- 20:40 wm-bot: petrb: installing oursql cluster wide
- 10:46 wm-bot: petrb: created new instance for experiments with sasl memcache tools-mc
May 31
- 19:17 petan: deleting xtools project (requested by Cyberpower678)
- 17:24 wm-bot: petrb: removing old kernels from -dev because / is almost full
- 17:17 wm-bot: petrb: installed lsof to -dev
- 15:55 wm-bot: petrb: installed subversion on exec nodes for legoktm
- 15:47 wm-bot: petrb: replacing mysql with maria on exec nodes
- 15:46 wm-bot: petrb: replacing mysql with maria on exec nodes
- 15:14 wm-bot: petrb: installing default-jre in order to satisfy its dependencies
- 15:13 wm-bot: petrb: installing /data/project/.system/deb/all/sbt.deb to -dev in order to test it
- 13:04 wm-bot: petrb: installing bashdb on tools and -dev
- 12:27 wm-bot: petrb: removing project local-jimmyxu - per request on irc
- 10:54 wm-bot: petrb: killing process 3060 on -login (mahdiz 3060 1964 88 May30 ? 21:32:51 /bin/nano /tmp/crontab.Ht3bSO/crontab) it takes max cpu and doesn't seem to be attached
May 30
- 12:24 wm-bot: petrb: deleted job 1862 from queue (error state)
- 08:26 wm-bot: petrb: updated sql command
May 29
- 21:05 wm-bot: petrb: running sudo apt-get install php5-gd
May 28
- 20:00 wm-bot: petrb: installing p7zip-full to -dev and -login
May 27
- 08:46 wm-bot: petrb: changed the MySQL config to use /mnt as the path for binary logs; this however requires the server to be restarted
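A minimal sketch of that change, assuming the Debian-style /etc/mysql/conf.d/ include directory; the snippet file name and exact binlog path are illustrative, not the actual ones used:
  # point MySQL binary logs at the larger /mnt disk instead of the root fs
  mkdir -p /mnt/mysql-binlogs && chown mysql:mysql /mnt/mysql-binlogs
  printf '[mysqld]\nlog_bin = /mnt/mysql-binlogs/mysql-bin\n' > /etc/mysql/conf.d/binlog-path.cnf
  service mysql restart                 # the restart noted in the entry above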
May 24
- 08:44 petan: setting up LVM on the new exec nodes because it is more flexible and allows us to change the size of volumes on the fly (see the resize sketch after this list)
- 08:28 petan: created 2 more exec nodes, setting up now...
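The "change the size of volumes on the fly" point, as a sketch; the device, volume-group and volume names and sizes are assumptions:
  # initial setup on the instance's second disk
  pvcreate /dev/vdb
  vgcreate exec /dev/vdb
  lvcreate -L 20G -n var exec
  mkfs.ext4 /dev/exec/var
  # later, when the volume fills up, grow it without repartitioning
  lvextend -L +10G /dev/exec/var
  resize2fs /dev/exec/var               # ext4 supports online growth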
May 23
- 09:20 wm-bot: petrb: process 27618 on -login is constantly eating 100% of CPU; changing its priority to 20
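Changing the priority as above is just a renice; a sketch (the PID is the one from the log, the account name in the alternative is hypothetical):
  renice -n 20 -p 27618      # 20 = lowest priority, so it only gets otherwise-idle CPU
  # or for every process of a given tool account (hypothetical name):
  # renice -n 20 -u local-sometool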
May 22
- 20:54 wm-bot: petrb: changing ownership of /data/project/bracketbot/ to local-bracketbot
- 14:28 labs-logs-bottie: petrb: installed netcat as well
- 14:28 labs-logs-bottie: petrb: installed telnet to -dev
- 14:02 Coren: tools-webserver-02 now live; / and /cluebot/ moved there
May 21
- 20:27 labs-logs-bottie: petrb: uploaded hosts to -dev
May 19
- 13:40 labs-logs-bottie: petrb: killing that nano process; it seems to be hung and unattached anyway
- 12:59 labs-logs-bottie: petrb: changed priority of nano process to 19
- 12:55 labs-logs-bottie: petrb: local-hawk-eye-bot's /bin/nano /tmp/crontab.d4JhUj/crontab eats too much CPU
- 12:50 petan: nvm previous line
- 12:50 labs-logs-bottie: petrb: vul alias viewuserlang
May 14
- 21:22 labs-logs-bottie: petrb: created a separate volume for /tmp on -login so that temp files do not fragment the root fs and it does not get filled up by them; it also makes it easier to track filesystem usage
- 13:16 Coren: reboot -dev, need to test kernel upgrade
May 10
- 15:08 Coren: create tools-webserver-02 for Apache 2.4 experimentation
May 9
- 04:12 Coren: added -exec-03 and -exec-04. Moar power!!1!
May 6
- 19:59 Coren: made tools-dev.wmflabs.org public
- 08:04 labs-logs-bottie: petrb: created a small swap on -login so that users cannot bring it to OOM so easily, and so that unused memory blocks can be swapped out in order to use the remaining memory more effectively (swap sketch after this list)
- 08:00 labs-logs-bottie: petrb: making an LVM volume from the unused disk at /mnt on -login so that we can eventually use it somewhere if needed
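The log does not say whether that swap was a file or an LVM volume; a swap-file variant as a minimal sketch, with the size assumed:
  dd if=/dev/zero of=/swapfile bs=1M count=2048   # 2 GB; size is an assumption
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile
  echo '/swapfile none swap sw 0 0' >> /etc/fstab # persist across reboots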
May 4
- 17:50 labs-logs-bottie: petrb: foobar as well
- 17:47 labs-logs-bottie: petrb: removing project flask-stub using rmtool
- 15:33 labs-logs-bottie: petrb: fixing missing db user for local-stub
- 12:51 labs-logs-bottie: petrb: creating mysql accounts by hand for alchimista and fubar
May 2
- 20:49 labs-logs-bottie: petrb: uploaded motd to exec-N as well, with information about which server users connected to
May 1
- 16:59 labs-logs-bottie: petrb: fixed invalid permissions on /home
April 27
- 18:54 labs-logs-bottie: petrb: installing pymysql using pip on the whole grid because it is needed for greenrosseta (for some reason it is better than the python-mysql package)
April 26
- 23:55 Coren: reboot to finish security updates
- 08:00 labs-logs-bottie: petrb: patching qtop
- 07:57 labs-logs-bottie: petrb: added tools-dev to the admin host list so that qtop works, and fixing a bug in qtop
- 07:28 labs-logs-bottie: petrb: installing GE tools to -dev so that we can develop new j|q* stuff there
April 25
- 19:00 Coren: Maintenance over; systems restarted and should be working.
- 18:18 labs-logs-bottie: petrb: we are getting into trouble with memory on tools-db; less than 20% of memory is free
- 18:01 Coren: Begin maintenance (login disabled)
- 13:21 petan: removing local-wikidatastats from ldap
April 24
- 13:17 labs-logs-bottie: petrb: sudo chown local-peachy PeachyFrameworkLogo.png
- 11:37 labs-logs-bottie: petrb: created new project stats and cloned acl from wikidatastats, which is supposed to be deleted
- 11:32 legoktm: wikidatastats attempting to install limn
- 11:15 labs-logs-bottie: petrb: installing npm to -login instance
- 07:34 petan: creating project wikidatastats for legoktm addshore and yuvipandianablah :P
April 23
- 13:32 labs-logs-bottie: petrb: changing permissions of cyberbot and peachy to 775 so that it is easier to use them
- 12:14 labs-logs-bottie: petrb: qtop on -dev
- 12:12 labs-logs-bottie: petrb: removed part of motd from login server that got there in a mysterious way
April 19
- 22:38 Coren: reboot -login, all done with the NFS config. yeay.
- 17:13 Coren: (final?) reboot of -login with the new autofs configuration
- 16:24 Coren: (rebooted -login)
- 16:24 Coren: autofs + gluster = fail
- 14:45 Coren: reboot -login (NFS mount woes)
April 15
- 22:29 Coren: also a test; note how said bot knows its place. :-)
- 22:14 andrewbogott: this is a test of labs-morebots.
- 21:49 andrewbogott: this is a test
- 15:41 labs-logs-bottie: petrb: installing p7zip everywhere
- 08:00 labs-logs-bottie: petrb: installing dev packages needed for YuviPanda on login box
April 11
- 22:39 Coren: rebooted tools-puppet-test (no end-user impact): hung filesystem prevents login
- 07:42 labs-logs-bottie: petrb: removed reboot information from motd
Nova_Resource:Rcm.cac/SAL
2016-05-07
- 22:37 Luke081515: updating the repos
2016-05-04
- 18:00 Luke081515: updating to master
2016-05-01
- 20:51 Luke081515: Updating repos & databases
2016-04-22
- 20:50 Luke081515: Upgrading all repos
2016-04-12
- 21:35 Luke081515: updating all repos before testing a patch from gerrit
2016-04-02
- 23:32 Luke081515: updating the repos
2016-03-30
- 23:41 Luke081515: updating all repos
2016-03-29
- 00:37 Luke081515: updating all repos to master
2016-03-26
- 16:44 Luke081515: updating all git repos
2016-03-25
- 14:10 Luke081515: enabling role monobook
- 14:04 Luke081515: starting the vm and running git-update
- 00:33 Luke081515: rebuild the instance
Nova_Resource:Tools.wikibugs/SAL
2016-05-06
- 19:47 valhallasw`cloud: temporarily offline for Danny_B's batch job
- 09:12 wikibugs: Updated channels.yaml to: 904ce69d7f515ea9b47f621686f612a04ae94cc2 Send WMDE-Design to #wikimedia-de-tech
2016-05-05
- 21:37 wikibugs: Updated channels.yaml to: 6e0a2d140f923140c73b0dd5a356816a1cb47aa5 Add wildcard for Collaboration team (Collab-Team(-.*)?)
2016-04-22
- 00:38 wikibugs: Updated channels.yaml to: 98b78de2f209713741863c9dbf5a14eba7164ccd Update for renamed xTools project
2016-04-21
- 18:04 wikibugs: Updated channels.yaml to: 68d22345224b2e64698527a65b21c1a9cba9f84f Added #wikimedia-interactive
2016-04-15
- 16:40 wikibugs: Updated channels.yaml to: 63a166edb4d8218e1edf42346b5b8a90e4b6d647 Merge "Revert "Log hackathon tasks to #wmhack""
2016-04-11
- 17:58 ircnotifier: valhallasw: Deployed 170e3ace519867782ecb709eb095059b063d1cd1 Merge "Try to fix display of colours appearing directly before numbers" wb2-irc
- 17:57 wikibugs: Updated channels.yaml to: 4acf9e002ad00a8af3553833f99b754bfc0e189c Merge "Add #wikimedia-ai channel"
2016-04-08
- 22:51 wikibugs: Updated channels.yaml to: 2fc4b94daebe62f8e5e8712d753fabb5878e418a Echo was renamed to Notifications
2016-04-01
- 09:54 wikibugs: Updated channels.yaml to: a1754bac96c8fcb14532b808d5ca9db15e8f3c25 Merge "Log hackathon tasks to #wmhack"
2016-03-29
- 18:53 wikibugs: Updated channels.yaml to: ad14e0d1a0a07b775d014e6f8f1edaf145349116 Add two Collaboration team boards
2016-03-28
- 02:30 wikibugs: Updated channels.yaml to: fcffc4d11cddf0e11b5353bf50a9c36f4d989090 Send Community-Wishlist-Survey stuff to #wikimedia-commtech
2016-02-18
- 18:58 ircnotifier: legoktm: Deployed 8e84aa72b250ef18421c47ace7b4422e949c5837 Ignore comments posted to Phabricator by Stashbot wb2-irc
2016-02-16
- 16:30 wikibugs: Updated channels.yaml to: f4951fed10578ac63d80a6ec95e0927aaa90411e Remove ContentTranslations bugs from #mediawiki-i18n
2016-02-11
- 02:16 wikibugs: Updated channels.yaml to: 164d65c02aa22ec6e53f05ec74c35dfd58d11c24 -releng and -devtools changes
2016-02-08
- 22:10 wikibugs: Updated channels.yaml to: 2dd0d574c0e2bfcd5285493664f884f2ddc54b99 Send Education-* to wikimedia-ed
2016-01-07
- 20:53 wikibugs: Updated channels.yaml to: db26b7db94db89a49fac63df54d0189cf39ffc90 Send Labs* to `#wikimedia-labs`
2015-12-21
- 18:31 valhallasw`cloud: and restarted with fab start-jobs. Welcome back, wikibugs.
- 18:30 valhallasw`cloud: ah, there are SGE processes running. OK, killing those as well.
- 18:28 valhallasw`cloud: what's even weirder is that it starts both wikibugs.py and redis2irc.py, which are two distinct SGE jobs. Uuh?
- 18:27 valhallasw`cloud: yet it respawns! What on earth. Again from 208.80.155.186, and killed again.
- 18:26 valhallasw`cloud: killed wikibugs manually, no SGE in sight.
- 18:24 valhallasw`cloud: using `listlogins` in nickserv, we find one running on 208.80.155.186 (-1409), one on 208.80.155.145 (-1405, just restarted)
- 18:20 valhallasw`cloud: duplicate wikibugs, trying qmod -rj
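Condensed, the recovery above looks roughly like this; only qmod -rj and fab start-jobs are confirmed by the entries themselves, and the tool account name and job ID are placeholders:
  qstat -u tools.wikibugs        # find the duplicate grid jobs (account name assumed)
  qmod -rj 12345                 # try rescheduling the stuck job (ID illustrative)
  qdel 12345                     # or delete it outright
  pkill -f wikibugs.py           # kill any stray processes left behind
  fab start-jobs                 # restart everything, as in the 18:31 entry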
2015-12-07
- 20:39 valhallasw`cloud: wb2-irc thinks it's connected but messages don't actually get out to IRC. Restarting.
2015-12-03
- 23:11 wikibugs: Updated channels.yaml to: 74f9c1e0e07d47abc0ca706040faaf90b1ea585d Add PAWS to pywikibot and labs channels
2015-11-04
- 16:44 wikibugs: Updated channels.yaml to: 9a7f239ec5c34604d9e48901fc28e997ea53a5e4 Add #Testing-Initiative-2015 to -releng
- 04:31 wikibugs: Updated channels.yaml to: 775987cc6b7998d7495fcae652546ab2df0d1d6a Send all User-* projects to /dev/null
2015-10-28
- 18:28 wikibugs: Updated channels.yaml to: e5e90fdb7faaa2b992321b1facd2799ae25d61e7 send Mailing lists tickets to #wikimedia-mailman
2015-10-21
- 18:32 wikibugs: Updated channels.yaml to: b4e285f9673929b8547902be466e04d903d3237d Add WMDE-Analytics-Engineering to #wikimedia-de-tech
2015-10-07
- 16:57 wikibugs: Updated channels.yaml to: c78efa6e621b316c26fca060661f59558e8bafa5 Merge "Also exclude TCB-Team- from #wikimedia-fundraising"
2015-10-06
- 00:34 wikibugs: Updated channels.yaml to: 9da0a4809b8d990d2a87d465868d8a8c8fd549b1 Send Beta-Cluster-Infrastructure to #wikimedia-releng
2015-09-27
- 19:34 wikibugs: Updated channels.yaml to: d2f4a855aa8e5f6a9d06e43c35abcc9448f57b73 Add MediaWiki-Codesniffer to -releng
2015-09-24
- 17:41 wikibugs: Updated channels.yaml to: bf1ac0ad9fd2f358aeb62335516c6ca4304b649d Add #MediaWiki-Releasing to -releng
- 16:40 wikibugs: Updated channels.yaml to: eca76c2669d6b0d308b2457b146dbef7f5f91d26 Add #releng-epics to -releng
2015-09-23
- 14:02 wikibugs: Updated channels.yaml to: 06f2a7d0a89baac1a1255c8e2524afa9654e09c1 Send all Community-Tech-* traffic to #wikimedia-commtech
2015-09-22
- 22:54 wikibugs: Updated channels.yaml to: 40a3bfed7706047eb5df3b7f36fda29c54cec3fc Pywikibot-Flow → #wikimedia-collaboration
2015-09-19
- 05:02 wikibugs: Updated channels.yaml to: 4d6ce23148a4fa57c84139465beeb84361ebb6de Add releng-(.*) to catch all releng planning tags
2015-09-18
- 09:20 wikibugs: Updated channels.yaml to: 6a3566594f734c84b9e7d85600ae39b33ab366de releng is now Release-Engineering-Team
- 03:11 wikibugs: Updated channels.yaml to: f9dab0c9be84689bc24046982ef22e22d45402b7 OTRS → #wikimedia-otrs
2015-09-16
- 17:27 ircnotifier: legoktm: Deployed bcce439fda97c0a91b6ef983221f336a3da0cf99 Wait at least 1 second before pushing into redis wb2-phab
- 15:04 wikibugs: Updated channels.yaml to: 08ac39ff3184c179434bb9f36187adcea5ea8f24 Remove ECT and old/dead projects from -devtools
2015-09-07
- 11:07 wikibugs: Updated channels.yaml to: 04c06838cc50d916c6cc11b20776a62b8b5fbdc1 Report ArticlePlaceholder to #wikidata-feed
2015-09-04
- 20:56 wikibugs: Updated channels.yaml to: 0a914ec9da79d3e0b8a6dcbe825a2ac54eb03446 add notifs for #wikimedia-ios room
- 03:13 wikibugs: Updated channels.yaml to: d18a53d6498e5faa77a03b35261f9d26fd51766d Add #Differential for -releng
2015-09-03
- 16:26 wikibugs: Updated channels.yaml to: 6790b1bed753260168db20bb78dcfa5726be2aec Deprecate gitblit, and migrate gerrit
- 05:46 ircnotifier: legoktm: Deployed 1564da8bd2a53f9899e93497f03ba13e4a6b734f Forrestbot → ReleaseTaggerBot wb2-irc
2015-09-02
- 00:03 wikibugs: Updated channels.yaml to: 5f5fbb9243566dd512f1b1bf65ed60e5e08d6e92 Naming is hard
2015-08-28
- 05:14 wikibugs: Updated channels.yaml to: 30a3e422993291ab995a487ae1d4bbc2e7cd4013 Send Community-Tech traffic to #wikimedia-commtech
2015-08-25
- 18:04 wikibugs: Updated channels.yaml to: a541227fe36479f99a74f8d56586bcf4b8f55108 Send Collaboration-Team(-.*)? to #wikimedia-collaboration
2015-08-19
- 08:25 wikibugs: Updated channels.yaml to: b03ba14e0809cf29d6fad807c931b7b9bafb0b2f Add RelEng-Admin, CI-Config, Scap3 plus reorder
2015-08-04
- 17:48 wikibugs: Updated channels.yaml to: 5b039d4ec16094a553be431e4d94f0bf880cfa47 Send GlobalRename changes to #wikimedia-rename
2015-07-30
- 21:11 wikibugs: Updated channels.yaml to: f638b92139d824b52c8e37284b4e5bedf07cf52c Filter WMDE- out of #wikimedia-fundraising
2015-07-28
- 20:59 ircnotifier: legoktm: Deployed 680c8aad81158a3ddb1c4018233c07729c163cc0 Don't notify if multiple ignored actions were triggered wb2-phab
July 2
- 18:21 wikibugs: Updated channels.yaml to: ac57db111909071aa63faf61ff9a2a1fee1c693f xtools moved to wikimedia-xtools Change-Id: Ie84324e718d8025f4a1381f36d9ff5f4e9c5848d
- 17:52 valhallasw`cloud: restarting wikibugs using fab to get verbose logs & logrotate back
- 17:33 ircnotifier: valhallasw: Deployed 0f163852ed56e50bbfeb53377a0913570ee21fea Merge "Revert "Use NOTICE instead of PRIVMSG"" wb2-irc
- 17:29 ircnotifier: valhallasw: Deployed bd6cbfabe79cc254c9525f136ae981ef20479c1e Merge "Use NOTICE instead of PRIVMSG" wb2-irc
June 15
- 20:13 wikibugs: Updated channels.yaml to: 317ea9408296ac9c0e0b8cfe3b9fe1952ac57f04 Change #wikidata to #wikidata-feed
June 10
- 11:38 wikibugs: Updated channels.yaml to: a8b1cb73fc9c7f07e8d8329a5fac09f4974a2c5d Add 3 probjects to #wikimedia-de-tech
June 9
- 17:21 wikibugs: Updated channels.yaml to: fb7b824b7b6310ecbe957d07d66eaa4b1dbb8e6e Move ResourceLoader and Performance-Team #wikimedia-perf
June 5
- 20:59 ircnotifier: legoktm: Deployed 82b0b9f487ece85a40595b80f3f690554743e472 Ignore Forrestbot wb2-phab, wb2-irc
May 25
- 08:18 ircnotifier: valhallasw: Deployed 82b0b9f487ece85a40595b80f3f690554743e472 Ignore Forrestbot wb2-phab, wb2-irc
May 19
- 18:50 ircnotifier: valhallasw: Deployed f9b9d5bda60b9f1f6aac196254f8b6cfff6d58a2 Send Graph-VE MW extension project to VE channel wb2-phab, wb2-irc
May 2
- 12:37 wikibugs: Updated channels.yaml to: f9b9d5bda60b9f1f6aac196254f8b6cfff6d58a2 Send Graph-VE MW extension project to VE channel
May 1
- 22:10 wikibugs: Updated channels.yaml to: 831099cc50dbc6828c2ef5ff8f2e6aa41cd97310 Put a few things into -editing.
- 21:55 wikibugs: Updated channels.yaml to: b6c7fa03a61f5b27061be11900b6e432d500b765 Remove definitions for #wikimedia-mobile
April 29
- 16:43 ircnotifier: valhallasw: Deployed 1d785dc3ad22a434749f8ec0d466180f3de9ea52 channels: Continuous-Integration is now Continuous-Integration-Infrastructure wb2-phab, wb2-irc
April 24
- 21:14 wikibugs: Updated channels.yaml to: 1d785dc3ad22a434749f8ec0d466180f3de9ea52 channels: Continuous-Integration is now Continuous-Integration-Infrastructure
April 21
- 03:39 ircnotifier: legoktm: Deployed 8e88fc89deaa41b2a720845f5d20aa871ffa09d9 Add Blueprint skin to notify list for #wikimedia-design wb2-irc
- 03:39 ircnotifier: legoktm: Deployed 8e88fc89deaa41b2a720845f5d20aa871ffa09d9 Add Blueprint skin to notify list for #wikimedia-design wb2-phab
April 20
- 11:18 wikibugs: Updated channels.yaml to: 8e88fc89deaa41b2a720845f5d20aa871ffa09d9 Add Blueprint skin to notify list for #wikimedia-design
April 18
- 20:24 valhallasw`cloud: file system corruption?? channels.yaml is all \x00s and .git/objects/* is corrupt. Cleared .git/objects, ran git fetch --all and git checkout channels.yaml, which seems to bring wikibugs back to life (recovery sketch after this list)
- 19:43 valhallasw`cloud: tools-redis doesn't respond to commands, which could explain why wb2-phab was hanging. But why is tools-redis completely broken?
- 19:41 valhallasw`cloud: now wb2-phab is functioning again, but wb2-irc is not reporting?! Restarting that as well
- 19:38 valhallasw`cloud: that is, the last message to irc. The bot is still running and doing ping/pongs. However, wikibugs.log is completely silent after that time. wb2-phab.err does have errors, but without timestamps, so it's basically useless. Restarting wb2-phab to see if that helps
- 19:36 valhallasw`cloud: last message in redis2irc.log was 2015-04-18 02:10:26,157
- 19:36 valhallasw`cloud: wikibugs has broken down again. Trying to figure out why.
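The repair from the 20:24 entry, spelled out as a sketch; the checkout path is an assumption:
  cd /data/project/wikibugs/wikibugs2     # path is an assumption
  rm -rf .git/objects/*                   # drop the corrupted object files
  git fetch --all                         # re-fetch objects from the remotes
  git checkout channels.yaml              # restore the zeroed-out working copy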
April 13
- 19:05 wikibugs: Updated channels.yaml to: 4500101f021b8eec83899848932edaee98bd680a Merge "Tools-Labs-xTools to #xtools"
April 7
- 16:13 wikibugs: Updated channels.yaml to: 8a4346f6c5d0f826a9b3099d5f76339d7a64dcad Merge "Remove Quality Assurance from -releng"
April 1
- 20:56 wikibugs: Updated channels.yaml to: f09815aee08458b7fb283db7c7e0aed49e3b149d HACK: Always join channels on privmsg
- 20:56 ircnotifier: legoktm: Deployed f09815aee08458b7fb283db7c7e0aed49e3b149d HACK: Always join channels on privmsg wb2-irc
March 30
- 19:45 wikibugs: Updated channels.yaml to: b2c38567b32d82881baab5c3227f14a9b8e9fff5 Send MediaWiki-API-Team and Blocked-on-MediaWiki-API-Team to #mediawiki-core
March 23
- 18:44 wikibugs: Updated channels.yaml to: 90eed2a902164a9a1cf7930c7d9fb599ec9ae660 Send Commons to #wikimedia-commons-tech, per Steinsplitter
March 18
- 11:49 ircnotifier: yuvipanda: Deployed 23240bd0dc5aebcc2a94b6f1ac268e2e3ad41114 Add more projects for devtools and mobile wb2-phab, wb2-irc
March 16
- 17:16 wikibugs: Updated channels.yaml to: 23240bd0dc5aebcc2a94b6f1ac268e2e3ad41114 Add more projects for devtools and mobile
March 13
- 22:29 legoktm: restarted wb2-irc to see if it rejoins channels properly
- 20:44 wikibugs: Updated channels.yaml to: 9eafe437ff005a3232e7f7e89dbb2be54437f76c tox: Rename channels env to standard py34
March 11
- 04:36 legoktm: restarted both wb2-phab and wb2-irc
March 10
- 19:30 wikibugs: Updated channels.yaml to: 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt
March 9
- 21:11 ircnotifier: legoktm: Deployed 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt wb2-irc
- 20:36 wikibugs: Updated channels.yaml to: 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt
- 20:26 wikibugs: Updated channels.yaml to: 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt
- 20:18 ircnotifier: legoktm: Deployed 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt wb2-irc
March 8
- 19:14 wikibugs: Updated channels.yaml to: 614ee42338f6ab3f8d0705d3f0358523189af00e send WMT bugs to #wmt
March 3
- 20:02 wikibugs: Updated channels.yaml to: 6da78462504cd023e0c31babb5cc56a7eae3a88a Merge "Use brown instead of red for orange (=release) projects"
- 16:44 ircnotifier: legoktm: Deployed 6da78462504cd023e0c31babb5cc56a7eae3a88a Merge "Use brown instead of red for orange (=release) projects" wb2-irc
February 28
- 22:57 valhallasw`cloud: also restart wikibugs; it seems the PRIVMSGs in the log don't actually show up on irc
- 22:53 valhallasw`cloud: false alarm, messages were reported (2015-02-28 22:50:43,202 - irc3.wikibugs - DEBUG - > PRIVMSG #mediawiki-parsoid :10Parsoid, 10VisualEditor, 10VisualEditor-EditingTools, etc) which is a few minutes ago
- 22:52 valhallasw`cloud: restarted wb2-phab to see if we get stuff from phab again
February 24
- 23:16 ircnotifier: legoktm: Deployed 6da78462504cd023e0c31babb5cc56a7eae3a88a Merge "Use brown instead of red for orange (=release) projects" wb2-irc
- 23:16 ircnotifier: legoktm: Deployed 6da78462504cd023e0c31babb5cc56a7eae3a88a Merge "Use brown instead of red for orange (=release) projects" wb2-phab
February 22
- 22:54 ircnotifier: valhallasw: Deployed 6da78462504cd023e0c31babb5cc56a7eae3a88a Merge "Use brown instead of red for orange (=release) projects" wb2-irc
- 21:44 ircnotifier: legoktm: Deployed 1f579477957417a693308fa9a23d2080821eb551 Volunteer? --> Lowest (priority) wb2-irc
February 20
- 20:04 wikibugs: Updated channels.yaml to: 26b8b9f5f812b092b02d33b3b29cf448dafb663a More fundraising projects to #wikimedia-fundraising
- 18:46 ircnotifier: valhallasw: Deployed 4cf33271a75d3655addac95cc16413ab1adc6488 Merge "Always show four tags, most relevant first" wb2-irc
February 18
- 17:31 ircnotifier: legoktm: Deployed 8ba77ed2d2c039a231f3265da01215e721480ce0 Merge "Log ALL the things!" wb2-phab
February 17
- 22:32 wikibugs: Updated channels.yaml to: 8ed0a167e287b7c3374f8b9b7e556e9b4b6180d6 Send AutoWikiBrowser to #autowikibrowser
February 16
- 13:51 valhallasw`cloud: reverted locally (git revert <new formatting commit>) and restarted as it was breaking people's workflows
- 04:35 wikibugs: Updated channels.yaml to: eb4a51a4628a6b26d4e798b3e55d8749231bb72c Add Blocked-on-RelEng to -releng
- 00:54 ircnotifier: legoktm: Deployed 4c82585a9c01bceeb91acabbc5b481ea5928327d Merge "Send Wikibase stuff to #wikidata" wb2-irc
- 00:53 ircnotifier: legoktm: Deployed 4c82585a9c01bceeb91acabbc5b481ea5928327d Merge "Send Wikibase stuff to #wikidata" wb2-phab
- 00:05 ircnotifier: legoktm: Deployed d5922a4d10169ec8870e55de1b74ea9e39dc8c5c Make sure URL is always present wb2-phab
- 00:05 ircnotifier: legoktm: Deployed d5922a4d10169ec8870e55de1b74ea9e39dc8c5c Make sure URL is always present wb2-irc
February 13
- 18:26 ircnotifier: valhallasw: Deployed 0941e5af42ab1c035b023246da5dde30b17c0f63 Remove Phabricator and Code-Review from -releng wb2-irc
- 18:01 ircnotifier: legoktm: Deployed 0941e5af42ab1c035b023246da5dde30b17c0f63 Remove Phabricator and Code-Review from -releng wb2-phab
- 17:30 ircnotifier: legoktm: Deployed 0941e5af42ab1c035b023246da5dde30b17c0f63 Remove Phabricator and Code-Review from -releng wb2-irc
February 11
- 17:53 wikibugs: Updated channels.yaml to: 0941e5af42ab1c035b023246da5dde30b17c0f63 Remove Phabricator and Code-Review from -releng
February 7
- 22:38 ircnotifier: valhallasw: Deployed 6d78d47f1eae25b63f4cd322a6737db58b8d5c7a Rework logging infrastructure wb2-phab, wb2-irc
February 6
- 19:42 ircnotifier: legoktm: Deployed 1b6bbd391ad1f23a8270d3547b2540064e452d94 Fix project tag screen scraping wb2-phab
February 5
- 22:51 ircnotifier: valhallasw: Deployed d9a83a0d71b0dd4500d40ebba5232b2ded362be5 Assume channel list is utf-8 wb2-irc
- 22:13 ircnotifier: valhallasw: Deployed 9054845f4a69a7364f5270e2ada574f696e4f70f Add MoodBar to wikimedia-collaboration wb2-phab
February 3
- 06:05 ircnotifier: legoktm: Deployed 9054845f4a69a7364f5270e2ada574f696e4f70f Add MoodBar to wikimedia-collaboration wb2-phab
- 05:39 wikibugs: Updated channels.yaml to: 9054845f4a69a7364f5270e2ada574f696e4f70f Add MoodBar to wikimedia-collaboration
February 2
- 19:08 wikibugs: Updated channels.yaml to: 490e8ba1784e8ef7b04d2f51d2697f1d670d6cb1 Announce Staging bugs to -releng
January 30
- 04:28 wikibugs: Updated channels.yaml to: 4fe2e5b9f9d699d3547aba5b320fdf9ce1bd96b0 Send fundraising stuff to our channel
January 28
- 14:29 wikibugs: Updated channels.yaml to: 9f4845ee4937cc9bf890bc7ea2251ca6613080e0 Merge "Labs-Team was renamed to Labs"
January 22
- 22:21 wikibugs: Updated channels.yaml to: 6e130fecc19a39a5caed12ca0dda25ad28df62f0 -WikidataRepo is now -WikidataRepository
January 19
- 20:52 valhallasw: is this really broken? :(
January 15
- 20:03 ircnotifier: legoktm: Deployed c61edcfab64d62081edc3ccf89534764017f4a1c Make sure we're in the channel before messaging it wb2-irc
January 14
- 22:52 ircnotifier: legoktm: Deployed 492438a4da3bd10a6e53bd248c997a02edb9d781 Fix wikibugs after Phabricator update wb2-phab
- 12:17 ircnotifier: yuvipanda: Deployed 9521ec19491d35ebc40fdccb34e75e0bd7f9399f Turn ssl off wb2-phab, wb2-irc
- 12:10 ircnotifier: yuvipanda: Deployed 8736032750b4fead35646ea9120621bf9d0ccb7e Only join actual channels wb2-phab, wb2-irc
- 12:09 ircnotifier: yuvipanda: Deployed 8736032750b4fead35646ea9120621bf9d0ccb7e Only join actual channels wb2-phab, wb2-irc
- 11:05 ircnotifier: yuvipanda: Deployed 2b66af26ca2a7343d0743423a4c9fcc6b8296e5e Disentangle tag lists for filtering vs display wb2-phab, wb2-irc
- 11:02 ircnotifier: yuvipanda: Deployed 2b66af26ca2a7343d0743423a4c9fcc6b8296e5e Disentangle tag lists for filtering vs display wb2-phab, wb2-irc
January 12
- 22:04 wikibugs: Updated channels.yaml to: f1ee8fb8bc64186a1613ec9f9faf0aef6315759a Merge "team-practices -> #wikimedia-teampractices"
January 11
- 00:06 wikibugs: Updated channels.yaml to: 2257da8655036ce4555e88c01dad4a85f0b7946e WM-Bot -> #wm-bot
January 10
- 22:26 wikibugs: Updated channels.yaml to: f6b5ed8212e566b726a59de69a707ebea6c70d4e Remove -qa from announce list (moved to -releng)
- 18:06 wikibugs: Updated channels.yaml to: 09630c2cd5beead10c0ab2bfd8df84e7337a0208 LabsDB-Auditor -> labs
January 9
- 18:28 wikibugs: Updated channels.yaml to: 4c4fc344a850a36ef47a7c9965c853a27863baac Make sure to reset to origin/master, and show current sha1 before doing so
January 8
- 10:59 wikibugs: Updated channels.yaml to: 29b1c027a31c7650094b195e70e3a4ac82c05d00 Merge "Add Wikibugs to -labs"
- 09:51 wikibugs: Updated channels.yaml to: 019f6b0366a97df69733f7c80303aec8058ecb79 Wikibugs should listen to the Multimedia project for the multimedia channel
January 7
- 23:32 wikibugs: Updated channels.yaml to: 9538cc69ef4226d248a38fa86dadca6d646b6b37 Merge branch 'master' of https://github.com/wikimedia/labs-tools-wikibugs2
- 15:29 wikibugs: Updated channels.yaml to: cc8bc876e23c6b58f06a7379273f34e858b6ade5 Merge branch 'master' of https://github.com/wikimedia/labs-tools-wikibugs2
January 5
- 21:32 wikibugs: Updated channels.yaml to: 9003536427ce097a62c3a8cec310f7ca4f0edab0 Merge branch 'master' of https://github.com/wikimedia/labs-tools-wikibugs2
- 20:52 wikibugs: Updated channels.yaml to: c45fa33e94f5a34fa7618eaad1669104ace2a342 Merge branch 'master' of https://github.com/wikimedia/labs-tools-wikibugs2
December 31
- 16:48 wikibugs: Updated channels.yaml to: 0ba0b2c47cd593b64c4149931bfdaf022dff230c Merge branch 'master' of https://github.com/wikimedia/labs-tools-wikibugs2
December 22
- 23:18 ircnotifier: legoktm: Deployed 101deca5ee3884d19a61fe7098f8296ddb0c43e0 Escape newlines in IRC output wb2-irc
December 18
- 20:47 wikibugs: Updated channels.yaml to: 9ad8e090c8fb06d487b89255562483f08cf354e3 Send Spam-* to /dev/null
- 20:27 ircnotifier: legoktm: Deployed 3dc8fd7f3f8fdaacec5998913278179382b8594f Report IRC using Python and Yuvi's ircnotifier wb2-irc
December 17
- 00:39 wm-bot: legoktm: Deployed 8502072659ddb8c55ae45026d54867771e3122e7 redis2irc: join channels after reloading config wb2-irc
- 00:28 wm-bot: legoktm: Deployed 8502072659ddb8c55ae45026d54867771e3122e7 redis2irc: join channels after reloading config wb2-irc
December 16
- 23:35 wikibugs: Updated channels.yaml to: 432e66a45273e9798e26a8df08caa5a102eeec97 Add #wikimedia-services reporting
- 22:05 wm-bot: valhallasw: Deployed 3ec300c6605ed2087ad6bf25bf43abb4c0319d18 fab: set use_ssh_config = True (no jobs restarted)
- 21:46 wm-bot: valhallasw: Deployed 366f1b524cb4aecbdf4825a8b96e9f66524fa727 Add fabric runner wb2-phab
- 21:14 wikibugs: Updated channels.yaml to: 9649aa14cf1b8fd63a0e6efd3ac1aff0c351b141 Auto-detect changes to channels.yaml and !log it
- 21:06 wm-bot: valhallasw: Deployed 9649aa14cf1b8fd63a0e6efd3ac1aff0c351b141 wb2-phab, wb2-irc
- 20:44 legoktm: restarting for https://gerrit.wikimedia.org/r/180245
December 10
- 19:18 legoktm: restarting phab listener for https://gerrit.wikimedia.org/r/178880
- 19:16 legoktm: restarting to pick up https://gerrit.wikimedia.org/r/178874 https://gerrit.wikimedia.org/r/178880
December 9
- 19:46 legoktm: restarting for https://gerrit.wikimedia.org/r/178578
- 19:22 legoktm: restarting for https://gerrit.wikimedia.org/r/178561 https://gerrit.wikimedia.org/r/178563
December 4
- 02:48 legoktm: restarted for https://gerrit.wikimedia.org/r/177372
December 1
- 14:31 YuviPanda: killed irc bot for now
November 29
- 21:25 legoktm: restarting for https://gerrit.wikimedia.org/r/176486
- 21:14 legoktm: restarting for https://gerrit.wikimedia.org/r/176483
- 14:19 valhallasw`cloud: deployed 0a6dedd75e203f5005a15e340b2fed5ba4c67224
November 25
- 23:21 legoktm: restarted wikibugs.py listener for https://gerrit.wikimedia.org/r/175890
- 00:07 legoktm: restarting wikibugs for -qa changes
November 24
- 18:29 legoktm: restarting wikibugs for https://gerrit.wikimedia.org/r/175474
- 02:54 legoktm: RIP pywikibugs
November 18
- 11:07 valhallasw`cloud: deployed https://github.com/legoktm/wikibugs2/commit/842d2d25a827dd2311ed98d1e4cd8af078bf10bb
October 11
- 20:14 legoktm: deploying https://gerrit.wikimedia.org/r/166217
September 24
- 04:45 legoktm: deployed https://gerrit.wikimedia.org/r/162201 (add OpenStackManager component to #wikimedia-labs)
August 19
- 19:06 valhallasw`cloud: new version deployed (gerrit 143239 and 143238)
July 1
- 16:47 valhallasw: restarted wikibugs with new channel config / https://gerrit.wikimedia.org/r/#/c/142992/ / Nemo_bis
May 22
- 19:14 valhallasw: changed git repo to have gerrit as master
- 12:27 valhallasw: [email protected] delivery functional again and wikibugs is correctly reporting to IRC
- 12:18 valhallasw: gmail-to-wikibugs delivery is now functional; hopefully [email protected] delivery too...
- 12:17 valhallasw: mail delivery broken; direct mails complain about open("~/mailout.log", "a") in to_redis.py; commented out those lines
April 30
- 20:59 valhallasw: Deployed a48a000
April 28
- 20:11 YuviPanda: restarted wikibugs, seems to have died
- 11:54 YuviPanda: deployed bf1be7b
- 06:42 valhallasw: deployed 2.0-1-gb7f4290
- 06:23 valhallasw: Merging and deploying b7bbf92
- 06:21 valhallasw: NameError: name '_wsp_splitter' is not defined in /data/project/wikibugs/src/pywikibugs/get_unstructured.py. Apparently the line 'from email._header_value_parser import _wsp_splitter, _validate_xtext' had not made it into the git repo, and was cleared by accident on deployment
- 06:17 valhallasw: wikibugs stopped reporting; investigating
Nova_Resource:Tools.heritage/SAL
2016-05-06
- 19:47 Lokal_Profil: Deployed latest pywikibot-core/2.0 from Git
- 19:26 Lokal_Profil: Deployed latest from Git, a724279 , d9ae73d (reverts 766d814 )
- 18:42 Lokal_Profil: Deployed latest from Git, 2d3ee40 (T39974), 766d814, e2fac07 and d2c242a (T39422)
- 14:58 JeanFred: Deployed latest from Git: e5a9f01 and d509343 (T134567)
- 12:15 JeanFred: Deployed latest from Git: db46042, c765e76, b5a731a, 7c27207, d4de720 (T134236), e7823ab & c83003b (T132647), 615ab28
2016-04-20
- 13:01 JeanFred: Deployed latest from Git, 48bce77 and dfbff9b (T132029)
2016-04-01
- 22:51 multichill2: JeanFred did a git pull for Phab:T131344 and others
2016-03-31
- 09:14 multichill: Commented out the Russian Wikipedia in user-config.py for Phab:T131344
2016-03-16
- 20:45 multichill: jsubbed populate_image_table.py for https://phabricator.wikimedia.org/T130107 (see crontab -l for exact command)
2015-08-30
- 14:38 multichill: Made local change to unused_images.py to get it to work, see https://phabricator.wikimedia.org/T110829
- 09:14 multichill: Updated ~/pywikibot to latest version, but still getting a FamilyMaintenanceWarning
2015-08-22
- 13:35 JeanFred: After backporting all local changes to Gerrit, updating local checkout to latest Git version.
2015-07-15
- 16:50 JeanFred: Checked out pywikibot-core
February 23
- 20:30 multichill: Merged https://gerrit.wikimedia.org/r/192258, but can't deploy it because api/includes/FormatHtml.php has local (i18n) changes. Anyone feel like fixing it?
December 21
- 11:49 multichill: After the toolserver.org dns move the http://toolserver.org/~erfgoed/ redirects seem to be broken. Akoopal mentioned this, see https://lists.wikimedia.org/pipermail/labs-l/2014-December/003216.html
September 20
- 15:56 multichill: Fixed https://bugzilla.wikimedia.org/show_bug.cgi?id=70806 and deployed 2 new sk tables
August 27
- 18:05 multichill: Added Oren to the project
July 10
- 13:25 multichill: DNS was broken; because of that the API has been acting up for the last 2 (?) hours
- 11:46 lokal-profil: Corrected commands at commons:Commons:Monuments_database/Harvesting
- 11:20 multichill: Created ~/temp so that the change in https://gerrit.wikimedia.org/r/#/c/145254/1/api/includes/Defaults.php doesn't produce an error any more
- 09:57 lokal-profil: Images and markers in Kml now load from // instead of http://, gerrit
- 09:42 lokal-profil: Added se-arbetsliv a list for Working Life Museums in Sweden, gerrit
- 09:42 lokal-profil: Updated Default.php to point to toollabs instead of toolserver, gerrit
June 29
- 21:43 multichill: I put the RCE mysql conversion User:Akoopal made in ~/rce-nl-data . Still need to import it in Mysql to be useful. Data is CC0
- 19:30 multichill: Web service was down for all accounts. Back up and running. Api seems to have been down from 19:30 to 21:15 (Amsterdam time)
- 13:36 multichill: Burned the old ~erfgoed account on the Toolserver and uploading the backup to ~/toolserver_backup/
June 19
- 17:10 multichill: Fixed database_statistics.py after notification on https://commons.wikimedia.org/wiki/Commons_talk:Monuments_database/Statistics#Bug_in_the_URL . Still have to commit it
June 15
- 11:17 multichill: Did some hacks with Krinkle to get i18n working(ish) again (api.php and html formatters). Still need to commit it
June 14
- 19:28 multichill: Did the first steps to import the data to Wikidata. I wonder when we can deprecate the monument database
- 19:26 multichill: I sent out the Toolserver will die email. http://lists.wikimedia.org/pipermail/labs-l/2014-June/002672.html . I plan to drop the database p_erfgoed_p on the 21st.
- 11:33 multichill: Added Lokal Profil per request at Commons
June 7
- 16:57 multichill: While updating documentation I found https://git.wikimedia.org/summary/wikimedia%2Fwlm-api . Should probably be dropped, everything is in http://git.wikimedia.org/log/labs%2Ftools%2Fheritage.git
- 16:12 multichill: http://toolserver.org/~erfgoed/ now redirects to http://tools.wmflabs.org/heritage/ . Didn't move everything so that might give some 404's
- 16:06 multichill: prox_search completed without problems. update_monuments.sh should now run without failures.
- 15:52 multichill: symlinked ~/prox_search, fixed path (need to commit that), create_table_prox_search.sql, doing a manual run
- 15:35 multichill: Had to increase memory for statistics to 512M. Still need to commit that. jsubbed build_stats_test again and it finished with Memory usage: 396588928
- 15:04 multichill: Symlinked ~/public_html/maintenance and created tables statistics and statisticsct. jsubbed build_stats_test to test it
- 14:49 multichill: Fixed populate_adm_tree.php and populated the table. Still need to commit it
June 4
- 20:04 multichill: Managed to get the image database updated by switching latin1 -> utf8. Still have to commit. https://commons.wikimedia.org/wiki/Commons:Monuments_database/Indexed_images/Statistics
- 19:58 multichill: Pointed https://commons.wikimedia.org/wiki/Template:Monuments_database_more_images to the api on labs. Was 15K hits on the Toolserver (?!)
- 19:23 multichill: https://gerrit.wikimedia.org/r/137398 pretty images live, see http://tools.wmflabs.org/heritage/api/api.php?action=images&imcountry=ad&imid=100&format=html&props=img_name
- 19:17 multichill: Fixed the mysqldump and enabled /data/project/heritage/erfgoedbot/populate_image_table.py
June 1
- 20:18 multichill: Set up cron to run the update_monuments job every night (cron sketch after this list). Some parts of it will still fail.
- 20:05 multichill: Some tweaks in https://gerrit.wikimedia.org/r/136683; database is filled. API is working (admintree and statistics still missing)
- 17:14 multichill: Updated ~/bin/create_all_monuments_tables.sh and created 105 tables. Fired up update_database.py to fill the database
- 17:01 multichill: Pulled pywikibot (compat) and heritage. Symlinked them and set up the bot
- 16:46 multichill: Moved erfgoedbot, public_html & pywikipedia to ~/old/. to make room
- 16:41 multichill: Fixed ~/.database.inc , still have to do the i18n part
- 16:35 multichill: Cleaned out some code in https://gerrit.wikimedia.org/r/136649 and merged it
- 16:18 multichill: Created the s51138__heritage_p database
- 16:16 multichill: Replaced the .my.cnf with the right credentials
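A hypothetical crontab line for the nightly run set up in the 20:18 entry; the schedule, job name and script path are assumptions (the real command lives in the tool's crontab):
  # m h  dom mon dow   command
  0 2 * * * jsub -once -N update_monuments /data/project/heritage/erfgoedbot/update_monuments.sh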
Release Engineering/SAL
Nova_Resource:Tools.admin/SAL
2016-05-06
- 18:45 bd808: Unbroke webservice
- 18:44 bd808: restarted webservice
Nova_Resource:Mobile/SAL
2016-05-05
- 23:22 bd808: Deleted jitsu instance. Replaced with https://tools.wmflabs.org/hatjitsu/
2015-08-24
- 17:05 YuviPanda: kill instance android-build to make space on labvirt1007 (android-builder is the successor)
December 31
- 20:54 bd808: Added jhobs as a project member
January 15
- 08:43 andrewbogott: rebooted mobile-varnish
November 1
- 16:26 MaxSem: Deleted old instances mobile-solr2, mobile-solr3 and mobile-osm2
October 31
- 19:00 andrewbogott: rebased and updated puppet files on mobile-solr2
January 19
- 21:24 Ryan_Lane: adding DNS name for newly allocated IP (mobile-geo.wmflabs.org)
- 21:24 Ryan_Lane: associated new IP with mobile-en
- 21:24 Ryan_Lane: allocated a new IP
- 21:24 Ryan_Lane: upped the floating IP quota to 2
January 4
- 20:31 preilly: associated mobile-feeds host name on wmflabs.org domain
- 20:27 Ryan_Lane: allocated IP 208.80.153.216
- 20:25 Ryan_Lane: upped the quota for floating ips to 1
Nova_Resource:Redirects/SAL
2016-05-05
- 23:21 bd808: Configured redirect for hatjitsu.wmflabs.org
Nova_Resource:Math/SAL
2016-05-05
- 22:58 bd808: Joined project as admin
March 4
- 21:35 andrewbogott: (and also because Howie requested it)
- 21:34 andrewbogott: moved http://drmf-beta.wmflabs.org to point to the drmf-beta instance, and http://drmf.wmflabs.org to point to the drmf instance. Previously it was the other way around, which was super confusing.
September 16
- 19:36 andrewbogott: moving and rebooting mws instance
January 17
- 08:57 andrewbogott: moving math-semantics to a new virt host to avoid a storage crunch. This will reboot the instance.
January 15
- 08:48 andrewbogott: rebooted latexml-test
August 29
- 06:05 Ryan_Lane: adding jiabao to work on math support for visualeditor