Blog

As of 4:39pm PDT:

  • Power was restored and deemed stable
  • We began bringing systems back online

As of 3:20pm PDT:

  • Most websites have been redirected to a page indicating that we are down for emergency maintenance.
  • The power has been shut down.

As of 3:00pm PDT:

  • mailserver, email list, docushare and web servers are all down
  • backup storage is down
  • a 10 minute delay has been requested to finish bringing down a few stragglers

As of 2:05pm PDT:

  • We just received word that CalIT2 will have the power shut down again at 3pm today.  We are scrambling to shut down equipment and prepare for this unexpected/unplanned event.

=============

FM experienced a problem after the maintenance work this morning and needs to shut power down again by around 3 pm.
Please shut down all equipment ASAP, as necessary.  At this time we don't have an estimate of the window that FM requires.
=Tad
Tad Reynales, Manager
Technology Infrastructure
CALIT2 @ UC San Diego

=============


As of 12:00pm PDT:

  • All systems are powered on
  • website DNS entries are still being restored
  • We expect that everything will be back online and ready for end-user validation at approximately 1:30pm PDT
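
While DNS entries are being restored, a quick resolution check can help with validation. The following is a minimal sketch, not the tooling we actually used: the function names `resolves` and `check_dns` are hypothetical, and it only asks the local resolver whether a name returns any address, so it does not confirm that the site behind the name is healthy.

```python
import socket
from typing import Callable, Dict, Iterable


def resolves(hostname: str) -> bool:
    """Return True if the local resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Name lookup failed (for example, the record is not restored yet).
        return False


def check_dns(
    hostnames: Iterable[str],
    lookup: Callable[[str], bool] = resolves,
) -> Dict[str, bool]:
    """Map each hostname to whether it currently resolves.

    `lookup` can be swapped out, which makes the function easy to test
    without touching the network.
    """
    return {name: lookup(name) for name in hostnames}


def report(results: Dict[str, bool]) -> str:
    """Format the results as one line per hostname."""
    return "\n".join(
        f"{name}: {'resolves' if ok else 'NOT resolving'}"
        for name, ok in results.items()
    )
```

For example, `print(report(check_dns(["www.wholebraincatalog.org", "www.wholebrainproject.org"])))` would list which of those two names resolve from the machine where it is run.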

As of 10:45am PDT:

  • stage.nitrc is up
  • docushare is up
  • mail server is up, mail should start coming in
  • websites are coming back up

As of 10:30am PDT:

Power has been restored as of about 10 AM; FM reported successful completion of their work and have left Atkinson Hall as of ~10:30 AM.
=Tad
Tad Reynales, Manager
Technology Infrastructure
CALIT2 @ UC San Diego

As of 10:10am PDT:

  • Power has been restored at CalIT2 and we are starting to bring systems back online

As of 9:51 am PDT:

As of 9:43 this morning:

  • Mail to <username>@ncmir.ucsd.edu is being delayed
  • Websites are down, but we are attempting to redirect them to a maintenance page
  • All CAMERA resources have been shut down
  • All CRBS resources hosted at CalIT2 have been shut down
  • NITRC stage has been shut down.

At 05:52:54 AM PDT, a power event occurred, affecting half of campus.

It is confirmed that CalIT2 and suspected that Holly were affected.
Power has since been restored, but a few systems did not come back up afterward.
Our mail server is included in this outage.

Sat, Nov 05 7:30 AM mail.ncmir.ucsd.edu.
Sat, Nov 05 5:57 AM dev-web.crbs.ucsd.edu.
Sat, Nov 05 5:57 AM tom.crbs.ucsd.edu
Sat, Nov 05 5:57 AM vm-dev-8.crbs.ucsd.edu.
Sat, Nov 05 5:56 AM stitch.crbs.ucsd.edu
Sat, Nov 05 5:56 AM drlittle.crbs.ucsd.edu.
Sat, Nov 05 5:55 AM lilo.crbs.ucsd.edu
Sat, Nov 05 5:55 AM tom.crbs.ucsd.edu.
Sat, Nov 05 5:55 AM 132.239.132.214
Sat, Nov 05 5:54 AM dolphin.crbs.ucsd.edu
Sat, Nov 05 5:54 AM vm0-apps.camera.calit2.net.
Sat, Nov 05 5:54 AM featherie.ucsd.edu.
Sat, Nov 05 5:54 AM navi.crbs.ucsd.edu
Sat, Nov 05 5:54 AM navi.crbs.ucsd.edu
Sat, Nov 05 5:54 AM vihar.crbs.ucsd.edu.
Sat, Nov 05 5:54 AM vihar.crbs.ucsd.edu
Sat, Nov 05 5:54 AM compute-0-8-0
Sat, Nov 05 5:54 AM portal-dev.camera.calit2.net
Sat, Nov 05 5:54 AM vm0-apps.camera.calit2.net.
Sat, Nov 05 5:53 AM compute-0-8-0
Sat, Nov 05 5:53 AM compute-0-8-0
Sat, Nov 05 5:53 AM leibniz.ucsd.edu.
Sat, Nov 05 5:53 AM portal.camera.calit2.net (JCVI a
Sat, Nov 05 5:53 AM www.wholebrainproject.org
Sat, Nov 05 5:53 AM stage-nitrc.crbs.ucsd.edu
Sat, Nov 05 5:53 AM www.wholebraincatalog.org
Sat, Nov 05 5:53 AM www.wholebraincatalog.org
Sat, Nov 05 5:53 AM tom.crbs.ucsd.edu
Sat, Nov 05 5:53 AM bacula.crbs.ucsd.edu
Sat, Nov 05 5:53 AM 132.239.132.247
Sat, Nov 05 5:53 AM lilo.crbs.ucsd.edu
Sat, Nov 05 5:53 AM stitch.crbs.ucsd.edu
Sat, Nov 05 5:53 AM lobster.crbs.ucsd.edu
Sat, Nov 05 5:52 AM porpoise.crbs.ucsd.edu
Sat, Nov 05 5:52 AM seabass.crbs.ucsd.edu
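
The alert list above contains repeats, and some hostnames appear both with and without a trailing dot. As a rough sketch (the helper names `parse_alerts` and `first_alert_per_host` are hypothetical, and the line format is assumed to be exactly the one shown above), the list could be condensed to the earliest alert per host like this:

```python
import re
from datetime import datetime, time
from typing import Dict, Iterable, List, Tuple

# Assumed line format: "Sat, Nov 05 5:57 AM hostname"
LINE_RE = re.compile(
    r"^(?P<date>\w{3}, \w{3} \d{2}) (?P<time>\d{1,2}:\d{2} [AP]M) (?P<host>.+)$"
)


def parse_alerts(lines: Iterable[str]) -> List[Tuple[time, str]]:
    """Parse alert lines into (time of day, hostname) pairs.

    Lines that do not match the assumed format are skipped. A trailing
    dot on the hostname is dropped so that "tom.crbs.ucsd.edu." and
    "tom.crbs.ucsd.edu" count as the same host.
    """
    alerts = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group("time"), "%I:%M %p").time()
        host = match.group("host").strip().rstrip(".")
        alerts.append((when, host))
    return alerts


def first_alert_per_host(alerts: Iterable[Tuple[time, str]]) -> Dict[str, time]:
    """Return the earliest alert time for each host.

    Only the time of day is compared, so this assumes all alerts are
    from the same date, as they are in the list above.
    """
    first: Dict[str, time] = {}
    for when, host in alerts:
        if host not in first or when < first[host]:
            first[host] = when
    return first
```

Sorting the resulting dictionary by time gives a rough picture of the order in which systems dropped off.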

Thank you for your patience while we work on bringing these few systems back up.

If you notice anything UP but not working properly, please submit a ticket via our support website.
Sincerely,

CRBS SysOps

As of 1:30pm PDT, if you notice anything not working properly, please submit a ticket via our support website.

There are a few dev and stage systems that still need to be brought online.

Additionally, we are working on getting the CAMERA cylume Rocks cluster and its associated systems, including the gama server, back online.

Thank you for your patience through this extraordinary event.

Sincerely,
CRBS SysOps

Please see our status web page for additional details.

8-28-2011 SDSC Relocation

The following servers/services should be operating nominally:

  • Everything

Delivery of the switch hardware we need was delayed. It did not arrive until 11:30am. As a result, there has been a corresponding slip in our schedule.

List of affected Virtual Machines (VMs)

A list of affected Virtual Machines (VMs) can be found here

Project Information

CAMERA

Intermittent network interruptions while the network is upgraded this morning.
Oracle databases and Oracle database servers unavailable during NetApp move this afternoon.
victory and constellation oracle servers have been relocated.

CCDB

Intermittent network interruptions while the network is upgraded.
While the maunaloa storage system is moved, the following data stores will be unavailable.

  • CellImageLibrary
  • HarvardData
  • Image Server "scratch" space

NIF

Intermittent network interruptions while the network is upgraded.

  • NIF1, NIF2, NIF4 and nif-crawler servers have been patched, updated and relocated.

NITRC

Intermittent network interruptions while the network is upgraded.
Approximately 30-minute outage while the nitrc.org bare metal server is relocated.

Work to be Done

switch hardware upgrade

Delivery of the switch hardware we need was delayed. It is due by noon today, via FedEx.

"Bare Metal" servers

Servers that are not virtualized will be moved this morning while we are installing the new switch hardware. This will impact:

  • nitrc.org
  • braininfo
  • the Oracle 3-node RAC system and databases hosted there.
  • maunaloa data storage
    • the SVN data repository
    • the CVS data repository

VM migration status

We are hoping the switches will arrive early enough to allow us to migrate the VMs over 10Gb, suspend them over 10Gb while we relocate the NetApp, and then resume them.

Oracle RAC move

Daniel Wei will be assisting us with the move of this equipment.

NetApp move

The NetApp will be one of the last pieces of equipment to be relocated. During this time, all VMs that use shared NetApp storage to facilitate disaster recovery will be suspended. CAMERA Oracle databases will be unavailable. WBC data stored on the NetApp will also be unavailable.

Work completed

VM migration

Four of our VM hosts have now been moved to the new location and a number of VMs have been migrated to these machines.

General Improvements

  • BIRN Portal and GAMA servers have been virtualized
  • NIF server system software updated

NIF status

  • NIF1, NIF2, NIF4, NIF5 have all been moved to the new location. Patches have been applied and the network has been reconfigured.  Networking has been improved by adding redundant/failover links for all of the machines on both the management and public networks.  Management network upgraded to 1GB.
  • Currently I have NIF's website pointing to a different webserver so that when the production nodes go down, we will still have a status page up for NIF.

We have upgraded power and all the necessary network drops in the new location. Progress is being impeded by the lack of redundant 10Gb layer 3 switches for the new location. We've done everything we can think of to expedite the procurement of this equipment, including requesting Saturday morning delivery via FedEx, if necessary.

In addition, we have moved the first VM host to the new location and plan to start migrating VMs to it tomorrow. Unfortunately, without the switches, the process of live migrating the VMs will take much longer than originally planned for.

Vicky

What is Going On

The Oracle RAC at SDSC did not come up properly after the power outage yesterday.

System Affected

oracle-rac1
oracle-rac2
oracle-rac3

Project(s) - Software/Application(s) Affected

CCDB - production iRODS (iCAT database is offline)

Steps Being Taken to Rectify the Situation

Daniel Wei has been contacted (left message) and his assistance has been requested. We are waiting for him to get in touch with us.

ETA (if known)

We upgraded the server and the applications for Jira, Confluence, Crowd, and Fisheye/Crucible.  In general, this should improve performance, fix bugs, and improve the account management interface.

subversion.crbs.ucsd.edu is online

Use your crowd login - you must be a member of the crbs-sysops group to write to the systems repo.

Currently the puppet code has been migrated to SVN.  It was checked in at https://subversion.crbs.ucsd.edu/systems/apps/puppet and the CVS repo for puppet should not be used.

If you're interested in using our SVN repository, contact Kennon Kwok at kkwok_at_ncmir_dot_ucsd_dot_edu.

Gliffy Plugin Installed

I've installed the Gliffy plugin, which allows you to create diagrams, flowcharts, floorplans, network diagrams, etc. from within Confluence.  I put a quick, easy example here.  For better examples, see http://www.gliffy.com/examples/flow-charts/

new cobbler server is online

I've relocated the installation server from my desktop (dev-234) to a production vm (cobbler.crbs.ucsd.edu).  Cobbler services have been turned off on dev-234.  The iso location has been updated in confluence to http://cobbler.crbs.ucsd.edu/cobbler-ks.iso

Bamboo Server upgraded

The Bamboo build test server, bamboo.crbs.ucsd.edu, has been relocated to an upgraded server at SDSC.  Back on the "old" server, both the bamboo and bamboo-dev services were disabled.

I installed Apache Directory Studio on dev-central.crbs.ucsd.edu primarily to allow us to import .ldif files when we are on wireless or at home.  It required X to be operational on the server where it was installed.  I also installed NX for speed.  To run it, execute /usr/local/ApacheDirectoryStudio-linux-x86_64-1.4.0.v20090407/ApacheDirectoryStudio

It's set up to hit production LDAP.  It will prompt you for the password when you try to connect.  Please, DO NOT SAVE THE PASSWORD!

 - Vicky

Agenda:
More fun with DNS and LDAP

  • Atlassian domain change testing
  • Hook dev-crowd and/or dev-xwiki to LDAP

Present at the meeting:

  • Kennon Kwok
  • Sam Lee
  • Larry Lui
  • Gerry Matthews
  • Edmond Negado
  • Bao Nguyen
  • Sean Penticoff
  • Vicky Rowley

Notes:

Confirmed problem with Atlassian apps and the domain change. Server (confluence?) will move to the neuroinformatics.org machine.

Sam will try to follow the directions here to get dev-XWiki working with Kerberos and LDAP:

Authentication with xwiki or other webapps:
http://jira.xwiki.org/jira/browse/XWIKI-2496