Bit



Wednesday Nov. 09, 2005

Present: Fernando, Roger, Jody, Konstantin, Jerry, Jiong

  • Fernando reported on a discussion with the IS dept. (Ross and Ladie) about potential involvement of IS in the management and administration of the HPSCC. We are negotiating with IS on the details and level of their involvement. Ladie and Ross will be meeting with Scott to get his views on the matter. To date Ladie has spent ~3 hours talking with Fernando, and with Jerry and Fernando together, getting introduced to the architecture, configuration and upgrade trajectory of the HPSCC. Fernando briefed Diane Griffin on the developments.

  • Desktop backup: All backups are current except for one, which Jerry is dealing with. There are still MMI machines that have had the Retrospect backup client installed but have not been added to the to-do list on the backup server. Jerry promised to have these machines added by Friday. We are running out of disk space on the backup server. Jerry will look into two possibilities: (i) swapping out some of the internal disks for higher-capacity disks, or (ii) purchasing an external Xserve RAID, base cost for 1TB ~ $6K.

  • Upgrades (current & future)
    • An upgrade GFS 6.0 -> GFS 6.0.2 was implemented. We now have a support contract from Red Hat for the 5 servers in the GFS cluster. The upgrade was recommended by Red Hat to solve the kernel panics that started after we made GFS parameter changes that were also recommended by Red Hat. Performance of the file system since the upgrades has been excellent. Some user feedback: Brian Caffo reports that he is now able to transfer a large number of large files from his machine to the SAN without seriously impacting the performance of the machine. (Previously he could not do this without stalling the system comprehensively.) Rafa reported that he still experiences occasional latency (possibly when Brian is doing his big file transfers). Fernando has experienced this as well, but overall the latency is now very acceptable.

    • An upgrade SGE 5.3 -> SGE 6.0 was implemented. SGE was upgraded so that we could finally implement more powerful resource allocation policies. To date we are only controlling the number of jobs per user (8 jobs max), but this is a stopgap measure. Karl is complaining that this is annoying in the middle of the night when most of the cluster sits idle. Faculty agree that this is not a good policy. Jerry committed to implementing a strawman policy that uses the ticketing system. Fernando and Jerry will discuss the structure of the policy before it is implemented.

    • We also made enigma an SGE submission host, so it is no longer necessary for users to log into usher for anything. The messages of the day on Usher and Enigma have still not been updated to reflect this. Jerry/Jiong will change the messages to encourage people to use Enigma.
    • We noted that the large memory machines are the ones that receive the most use. Jerry will look into the cost of increasing the memory on some of the 2GB machines or 8 GB machines.

  • ~15 MMI students requested access to home directories on the file system using SAMBA. We discussed the pros and cons. There were several concerns among the faculty. If we let MMI students do this, then we need to let everyone do this. Technically, there seemed to be no problem in actually implementing a small number of shared drives. It was pretty much a question of too many unknowns including: 1) Unknown administrative burden of keeping this going as a standard service for ~50-100 users, 2) unknown scalability of NFS/SMB on Masque, 3) unknown security concerns of letting users connect any laptop to the file system. Fernando will raise the issue at the MMI faculty meeting and propose an experimental roll-out of a small service for MMI labs, rather than a completely open service to all students.

  • We decided to change the meeting times to the 2nd Friday of every month at 9am in the Genome Cafe. (The space has already been reserved).

Wednesday Sept. 15, 2005

Present: Tom DiGiacinto, Jody, Jiong, Brian, Roger & Fernando

  • Athena status
    • Recycling has been called. The hardware should be out of the server room in the next week or so.

  • GFS status
    • GFS is working very well. Latency is much lower. Backups are no longer hanging. Logins are not stalling unless we do a huge i/o job. Even when stalls occur they are not long-lived like before. We now have a support contract from Red Hat which we will use to do the final optimization. Support coverage is 12-hours/day, 5 days/week with guaranteed 4 hour response. Cost is ~ $12.5K for 5 nodes.

  • NFS status
    • The solution of the global problems with GFS revealed that NFS on Usher is slow. It looks like the "out of the box" configuration parameters are very suboptimal. The cluster head node is doing nothing but NFS serving and it is still underutilized. Jerry/Jiong will bump up the number of allowed NFS processes to 24 and reboot usher. This should further improve file i/o and reduce latency on Usher.

  • Maintenance
    • We decided to institute regular maintenance shutdowns every other Monday so that users can plan their long jobs appropriately. The first regular maintenance shutdown will occur next Monday and will be used to update the number of NFS processes on Usher and Masque.

  • Genome Cafe
    • R is not installed on the iMacs in the Genome Cafe. Also, the time of day is not set correctly. Jerry will fix these.

  • Biopile
    • Brian noted that the window manager under X on one of the iMacs in the Genome Cafe was "old". Tom suggested how to change the .bashrc so that it would use a different window manager.

  • University of Kentucky
    • Jody attended a management conference and talked to the IS folks at the University of Kentucky. She said they would be happy to talk to us about how they manage their multi-platform environment.

Thursday June 8, 2005

Present: Jiong, Roger, Rafa, Brian, Fernando, Benilton & Jerry

  • HPSCC status
    • It appears that some "problems" with the cluster are actually due to a lack of user education. We discussed several complaints of flaky behavior and traced two of them to this cause: (1) A student report that 'qsub' was refusing to submit jobs, even after several days, was not consistent with the experience of the faculty. It turned out that the student did not understand the error message that the job was generating. (2) Reports that jobs have to be submitted several times before they run appear to be traceable to users not appreciating the memory requirements of their jobs.
    • A real problem appears to be that rsync or scp hangs the file system. Evidently this is a problem at other sites as well. Brian volunteered to perform some experiments to see if we can recommend a work-around, before we consider restricting the use of rsync or scp.
    • Roger will start a FAQ to help with user education.

  • Backup status
    • We reviewed the status of each and every customer on Jody's backup list. We cross-checked this list against a report from the backup server, which we used to determine who was already being backed up. The customers fell into three categories:
    • Retrospect client installed and backup was run -- only about half a dozen biostat users are actually being backed up.
    • Refused backup -- Rafa will interview the users who refused backups to determine their backup plan.
    • Retrospect client installed but Retrospect server not updated -- Either Jerry had problems resolving the client or else there was a miscommunication between Jerry and Jiong.
    • We requested that Jerry and Jiong cross-post their communication to bitSupport@jhsph.edu. The process also revealed that some users in MMI had fallen through the cracks. Jerry is working to restore the backup capabilities of these MMI users. Henceforth, a review of the retrospect backup report will be a regular part of every BIT meeting.
    • The tape drive is broken, so we are not generating tape archives to send off site. Jerry is getting a box to ship the drive back to Exabyte.

  • Web server: Action items from previous BIT meeting are still outstanding: (1) move biostat web server to blacknet, (2) get the web server on the retrospect backup plan. Jiong will submit a request for a blacknet application immediately. Jiong will install retrospect client on web server immediately.

  • Decommissioning Athena: Rafa will post the schedule. We will go to Ross to ask for advice on how to dispose of Athena. Jiong reports that Sun will not take a trade-in for downgrades, only for upgrades. eBay is a possibility.

  • NFS/SMB/X11 services for Windows machines: Brian and Fernando spent about an hour yesterday trying to get Karen's machine to properly put up an X-window. It looks like Hummingbird Exceed is not properly configured. Jiong thinks he knows what is misconfigured and will change Karen's settings today. If this doesn't fix the problem, another possibility is for Karen to upgrade to Windows XP (she currently runs Windows 2000). There is no reason why the commercial solution shouldn't work. SMB/NFS services will be run from a dedicated server (temporarily Masque) so that these services do not get interrupted when the cluster, post or enigma have to be rebooted. Jiong will install NFS and SMB on Masque, then he will shut down NFS on enigma and see who complains. Jiong will set up SMB authentication for Biostat users; Jerry will set up authentication for MMI users when he frees up.

  • New business: Apple safety recall & student education for Excel, Access and Stata. Unanimous consent that these are not BIT issues.

  • Upgrade plans: Deferred until next BIT meeting

Thursday Apr. 21, 2005

Present: Jerry, Benilton, Brian, Rafa, Roger, Fernando

  • Exabyte tape drive still not talking to the backup software. Jerry reports that Exabyte insists that the problem is not with their drive. Jerry ordered a new SCSI card to test whether the card is the problem. The card is here. Jerry is testing the backup system using "norm," rather than taking down the cluster. No backups since the problem started (~4 wks). Instructed Jerry to back up the Biostat partition onto the MMI and Epi partitions. This provides a temporary backup, but no off-site backup is possible until tape backup is working again.

  • GFS status reviewed. No sign of the latency problem. Users such as Aidan very pleased with performance and stability. Students, e.g. Benilton, Sheng Luo pleased with performance but still concerned about occasional glitches. Glitches seem to fall into two classes: (1) Users kicking off jobs that stall the file system. Jerry recovers by killing the job. (2) Mystery problems that require a GFS reboot. Hopefully separating the storage cluster from the compute cluster (see below) will improve the stability of the storage system.

  • In principle, we could separate the storage cluster from the compute cluster right now (Paracel finally sent the riser cards and Fernando has a 24-port GigE unmanaged switch that he could donate temporarily until the Paracel switch arrives). But we will hold off on the new architecture until the tape backup problem is solved.

  • Paracel still hasn't sent us the agreement to waive the final year of hardware backups. No response to direct queries about this. They did send the riser cards and the replacement for the broken compute node (which Jerry installed yesterday).

Thursday Mar. 3, 2005

Sorry, too busy to write minutes. -- FP

Thursday Feb. 3, 2005

Present: Rafa, Jiong, Jody, Benilton, Tom, Jerry, Roger and Fernando

  • Problem with Low Humidity alarms in the Server Room continues. Jerry and Jody will coordinate with facilities to get control of humidity in the room.
  • Replacement PDUs (Power Distribution Units) are coming from Paracel, but we don't have a clue when they will arrive. We resolved to place an order for dumb PDUs from APC in order to get the 11 nodes that are currently off-line back on-line ASAP.
  • GFS continues to be a problem. Lots of latency in logins and ls commands, especially on weekends. There's not too much that Jerry and Jiong can do until they take the Red Hat Storage course. In the meantime Jerry will query Micah to see whether he can come out on a paid basis to upgrade us to the latest version of GFS.
  • Training: Jerry and Jiong will both take the Red Hat Storage course in Columbia in May. Jerry and Jiong will also take the Sun Grid Engine course.
  • Some problems with Sun Grid Engine input queues. Evidently some users are waiting for up to 5 hours before their jobs are allowed in. It turns out that some users were advised to submit as many jobs as possible in order to stress the system. OK it's stressed. Resolved to advise users to not submit more than 12 jobs at a time. Also resolved to start looking at SGE's policy management features so users can submit lots of jobs without clobbering each other.
  • Reviewed Bit committee Todo list.
  • Resolved to create an email list for support help calls. Jerry and Jiong will respond to help calls; Fernando and Roger will monitor the list.

Thursday Dec. 16, 2004

Present: Rafa, Jiong, Jody, Roger, Ingo & Fernando

  • Briefly discussed continuing GFS latency problems. The consensus is that it is associated with large file activity. Fernando is developing protocols to characterize the problem. Results will be posted on the SystemOptimization page.
  • Discussed Training schedule for Jiong and Jerry. Jerry is taking Linux training this week, but missed the deadline for the GFS course. The next local course is offered in May.
  • Most of the time was spent reviewing the to-do list which can be found on the TodoList page.

Tuesday Nov. 16, 2004

Present: Jerry, Brian, Rafa, Jiong, Jody, Roger, Ingo & Fernando

  • Discussed enigma usage and whether there were biostat users who were still using Athena as their primary machine. Decision taken to encourage remaining users to fully embrace enigma. Jiong and Rafa will send out an email to the Biostat department.
  • Storage issues on the new system
    • Still some latency occurring in the GFS system. May be related to transfer of large files.
    • Aidan reports that he cannot write files that are larger than 2Gig from the 2Gig slaves on usher. Jerry suspects it's related to the cache size.
    • Some reports of latency for windows machines NFS'ed to enigma.
  • Committee wants to know what will be the final disposition of the computing lab (Genome, or general computing?)
  • Mosix will be removed from Biopile in favor of PVM-based system (need to check with Epi)

Tuesday Oct. 12, 2004

Present: Fernando, Jerry, Brian, Rafa, Jiong, Jody, Roger, Ingo & Tom DiGiacinto (in place of JP)

  • We reviewed the status of backups.
    • Athena -- Jiong reported that offsite backups are happening.
    • MMI desktop backups -- Jerry reported that MMI tapes were not going off site, but this will change immediately.
    • Biostat macintosh desktop backups -- We decided to have the IS department do the macintosh backups in the Biostat department. This system is a clone of the MMI backup system and is managed by Jerry. The advantage of using the IS system is that they would continue to provide support in the event something happens to Jerry. Also, they would be responsible for installing clients in the machines to be backed up.
    • SAN backup -- Arkeia software is now proven (actually used to recover the MMI partition). The Arkeia demo license has expired. The last backup was on Wednesday. A license has been ordered (post-meeting update: Arkeia will expedite shipment of the registration keys if we show that the order has been made). The backup schedule for the SAN is as follows:
      • Backup schedule:
        • Weekly full backups with daily incrementals. All to tape.
        • Monthly "archival" backups, kept on site.
      • Weekly backups will be on a 3 week rotation schedule:
        • Current full is kept on site, previous two are rotated off. For each returned set of tapes, another set will be sent off in its place.
        • The tapes sent off site are -only- from the full backups. Incrementals will stay on site.
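
The 3-week rotation of weekly full backups described above can be sketched as follows. This is only an illustration under stated assumptions: the tape-set labels A, B and C, the 0-based week numbering, and the function name are hypothetical and do not come from the Arkeia configuration. Incrementals and monthly archival backups stay on site and are not modeled.

```python
# Hypothetical labels for the three sets of full-backup tapes in the rotation.
TAPE_SETS = ("A", "B", "C")


def full_backup_locations(week):
    """Return where each weekly full-backup tape set sits during `week` (0-based).

    The set written this week stays on site; the sets holding the
    previous two weekly fulls are off site.
    """
    locations = {}
    for age in range(len(TAPE_SETS)):
        written_in = week - age
        if written_in < 0:
            break  # the rotation has not been running long enough yet
        tape_set = TAPE_SETS[written_in % len(TAPE_SETS)]
        locations[tape_set] = "on site" if age == 0 else "off site"
    return locations


for week in range(4):
    print(week, full_backup_locations(week))
```

From the third week onward every set is accounted for: one on site and two off site, with the oldest set returning just in time to be overwritten by the next full backup.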

  • Fernando reported that Paracel's demise appears related to corporate restructuring at the top-level parent company (Applied Biosystems). The mass-spectrometry branch of the company (ABI) evidently also experienced a dramatic shakeup in their high-level management in the past several weeks.

  • Jiong and Tom reported that the Enigma cut-through should be ready by the end of the day.

  • An extensive discussion occurred over how best to provide desktop computing for the students. Rafa and Brian volunteered to write down the points that were made in the discussion and to make further recommendations based on additional feedback from the committee members. The recommendations and observations were to be forwarded to Jody & Scott.

Tuesday Aug. 31, 2004

Present: Jody, JP, Sean, Jiong, Roger, Ingo, Brian, Rafa & Fernando

  • Brian stated that the response to the survey to determine web serving needs was not huge, but he did see a surprising number of responses indicating plans for more extensive dynamic content. The current departmental web-server configuration was discussed (NFS mount of Athena file system on a second machine that acts as a web server). We discussed the need for a safer web publishing configuration and the need to separate development from production machines. Jiong and Roger will look into how to do this by using e.g. cvs, rsync or something along those lines. They will develop the solution in the next three weeks.

  • We reviewed the status of the Paracel rollout. Micah from Paracel completed the GFS installation over the weekend. Front-1 hung while he was testing it over the weekend, but he did not believe that this was a GFS problem. At the meeting Jiong reported that Front-1 had hung overnight and that he was unable to "unfence" GFS on Front-1 when he rebooted it. Micah had already left, and Jerry was at home and had not been contacted. Jiong was going to contact Jerry & Paracel for further support. (Note: GFS was promptly unfenced by Jerry once he was made aware of the problem. The hanging problem was a Rocks problem caused by the fact that we installed Rocks on both front ends. Micah had warned us about potential problems. -- FP Sep. 2).

  • Jody suggested that we gather software version and vendor information for the various machines in the system. The concern is that we have not documented exactly what is not supported by Paracel. This would also help Jerry and Jiong to administer the system.

  • Brian mentioned that PVM was broken on Biopile and requested that Jerry take a look at it.

  • We agreed to meet in 2 weeks to discuss Biostat computing planning items that Jody wants to bring to the committee.

Thursday Aug. 5, 2004

We discussed the status of the Paracel rollout. Paracel is working hard to get the Red Hat "Global File System" installed. The latest update is that Paracel discovered a bug in GFS. The bug is specific to systems with 64-bit Opteron processors. The fact that we hit this bug certainly demonstrates that our system is on the bleeding edge -- I guess no one ever tried to do what we're doing using Opteron systems! Anyway, this explains why Jerry failed in his attempt to install GFS -- the cards were stacked against him. Paracel and Red Hat engineers are now collaborating to fix the problem.

Fernando and Jerry reminded everyone that it was in the plan from the beginning for the cut-through to be migrated to the new Athena ASAP. Fernando will resend the PowerPoint viewgraphs showing the rollout plan that we discussed back in May. The more detailed plan was hashed out in June (between Jerry, Jiong and Fernando) and can be found below.

There was a disagreement as to whether a plan existed for the departmental web server. It was decided to conduct a survey of the faculty to determine: (1) what are the current web serving needs of the department and (2) what is the trajectory for web serving needs. We will use the information to make a hardware/architectural recommendation for web server support. Brian Caffo is in charge of doing the survey.

Thursday Jun. 3 2004

The meeting today is cancelled because Fernando quadruple-booked himself for the 11am-1pm time-slot. The purpose of the cancelled meeting was to discuss the migration plan to the Paracel machines. Jerry, Jiong and I met yesterday to hash out some more details relating to the migration of users after the machines are installed by Paracel. Here is a summary of the current status and rollout plan. Please let us (Jerry, Jiong, Jody or myself) know of any concerns or recommendations. Also, please feel free to edit the plan as you see fit.

Current status

  • The machines are here (!) and sitting in the MMI machine room. Paracel personnel will install the machine as soon as the machine room is turned over to us.

  • There is a tentative install date of June 15 - June 18. This is tentative pending the machine room being turned over to us.

  • The machine room is almost ready to go. We tested the A/C last week. The A/C produces a tolerable low-frequency rumble in the computer lab. Jody is in charge of finding out about a service contract.

  • The electrical contractors will be finished this week; they are installing outlets and pulling ethernet cables as we speak. The remaining delay is that a hole needs to be drilled in the concrete floor of the machine room for ethernet cables. Once this is done, the room can be turned over to us for installation of the machines. The floor may need to be X-rayed before the drilling can be done. The drilling hardware is on site and I've been told this may get done as soon as the end of this week or next week.

  • Paracel will leave us with working machines that will be configured by Jerry and Jiong. They will get the machines up and running in an order constrained by the need to have a maintainable SAN and the urgency of Francesca and Aden's project. The order of the user/machine migration is as follows:
  1. SAS machine (Jiong)
  2. New Athena (Jiong)
  3. New cluster (Jerry)

The SAN

  • The 4TB SAN will be partitioned between Epi, MMI and Biostat according to the 20%-20%-60% fractional ownership formula:
    • 2400 GB Biostat
    • 800 GB MMI
    • 800 GB EPI

  • The Biostat 2.4 TB will be further partitioned into a 300GB and a 2100GB partition. The purpose of the 300GB partition is to let the SAS machine be brought up immediately, without having to wait for the new Athena to be set up. Partitioning of the SAN will be handled by Paracel, but Jerry and Jiong will be present.
    • Jerry and Jiong have been discussing further partitioning the 2.1 TB to keep with the current convention of separating project data space from user home directories.

  • Users will find their data under a different absolute path than before (e.g. /home/biostat/..., /home/mmi/..., or /home/epi/...). Jiong and Jerry should be meeting as we speak to develop a maintainable directory structure for the SAN. Users may find it necessary to modify scripts that have hard-wired absolute paths.

  • In the interests of speed, the SAS machine will be set up with its own 300GB partition on the SAN. The machine should be usable within one to two weeks after the Paracel install.
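
The partition sizes quoted in this section follow directly from the ownership formula, as the short check below shows. It is a sketch only: it assumes "4TB" means 4000 GB (which is what the 2400/800/800 figures imply), and the function and variable names are made up for illustration.

```python
# Assumption: the "4TB" SAN is treated as 4000 GB, consistent with the sizes above.
SAN_TOTAL_GB = 4000

# Fractional ownership in percent, from the 20%-20%-60% formula.
OWNERSHIP_PERCENT = {"Biostat": 60, "MMI": 20, "EPI": 20}

# Size of the partition set aside for the SAS machine, in GB.
SAS_PARTITION_GB = 300


def partition_sizes(total_gb, ownership_percent):
    """Split total_gb among departments according to their percentage shares."""
    if sum(ownership_percent.values()) != 100:
        raise ValueError("ownership shares must add up to 100%")
    return {dept: total_gb * pct // 100 for dept, pct in ownership_percent.items()}


sizes = partition_sizes(SAN_TOTAL_GB, OWNERSHIP_PERCENT)
biostat_remaining_gb = sizes["Biostat"] - SAS_PARTITION_GB

print(sizes)
print("Biostat space left after the SAS partition:", biostat_remaining_gb, "GB")
```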

Athena

  • Migration of Athena users will be handled by Jiong. This will happen in a staggered manner that will minimize downtime for individual users.

  • Jiong will migrate user accounts to the SAN and configure the new home directories.

  • Users data will be moved one mount-point at a time. The following are the individual mount points that will be moved:
      /users/faculty
      /users/faculty1
      /users/faculty2
      /users/faculty3
      /users/student
      /users/other
      /users/center
      /users/project
      ...
      /users/project10

  • Only users in the mount-point that is currently being moved will experience a service disruption that should last no more than 1/2 day.

  • Users will find their data under a different absolute path (e.g. /home/biostat/..., /home/mmi/..., or /home/epi/...). Jerry and Jiong will come up with pros and cons of various organizations.

  • Old Athena data will be moved to the SAN and remain transparently accessible (via NFS) to old Athena users. Home directories will NOT be wiped and data will remain under a different mount-point (read-only?) just to give users warm fuzzies. Beta-test users will be allowed to log into the new Athena to test applications. 1 - 2 weeks.

  • Ultimately, only sunrays will be allowed to log into old athena. Old Athena goes away when sunrays go away.
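
Since data will move from the /users/... mount points listed above to new paths under /home/..., users may want to locate hard-wired references to the old mount points in their scripts before the migration. The following is a minimal sketch of such a check. The function name and the sample script (including the user name jdoe) are hypothetical, and because the final directory layout on the SAN has not been decided, the sketch only reports old paths and does not attempt to rewrite them.

```python
import re

# Matches the old Athena mount points listed above (faculty, faculty1..3,
# student, other, center, project..project10) plus any path below them.
OLD_PATH_RE = re.compile(
    r"/users/(?:faculty\d*|student|other|center|project\d*)(?!\w)(?:/[^\s'\"]*)?"
)


def find_hardwired_paths(script_text):
    """Return (line_number, path) pairs for each old-style absolute path found."""
    hits = []
    for line_no, line in enumerate(script_text.splitlines(), start=1):
        for match in OLD_PATH_RE.finditer(line):
            hits.append((line_no, match.group(0)))
    return hits


# Hypothetical example script; the user and file names are made up.
example_script = """cd /users/faculty2/jdoe/data
echo starting
Rscript /users/project10/run.R
"""

for line_no, path in find_hardwired_paths(example_script):
    print(f"line {line_no}: {path}")
```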

The Cluster

  • Jerry will handle the cluster migration. The bulk of the work is associated with migration of Epi users from Biopile to the new cluster. Migration to the new cluster will proceed in a similar manner to the Athena migration.
    • Installation/Configuration of software on the new cluster
      • Epi-specific (lots of stuff)
      • Biostat-specific (e.g. R)
      • MMI specific (e.g. Paracel Blast & databases)
    • Migration of user account information for the Epi accounts (or create new accounts)
    • home directories in /home/epi/ which is mounted on SAN.
    • Copy user data from BioPile Epi accounts.

  • We will hold a Sun Grid Engine (SGE) usage class for users of the new cluster.

Thursday Feb. 5 2004

The meeting was cancelled, but Fernando prepared a report summarizing the status of the current Biopile as well as where we stand on getting a new machine. The highlights are as follows:

Friday Oct. 7 2003

  • Present: Brian Caffo, Roger Peng, Ingo Ruczinski, Jerry Gilyeat, Rafael Irizarry, Fernando Pineda, Jiong Yang, Aiden McDermott

This was a continuation of yesterday's meeting. Aiden questioned the wisdom of focusing on 64-bit architectures. Discussion also focused on whether we should have a separate SAS machine. After the meeting, Roger volunteered to try to compile R in 64-bit mode on Athena. He demonstrated that it works in 64-bit mode and does indeed address more than 4Gig without choking. Here is his complete report. We decided to recommend ordering a single 64-bit node to test compatibility. We will consider testing SAS on this machine as well, if there exists a suitable 64-bit license.

Thursday Oct. 6, 2003

  • Present: Brian Caffo, Roger Peng, Ingo Ruczinski, Jerry Gilyeat, Rafael Irizarry, Fernando Pineda, Jiong Yang

Discussion focused on a response to Scott Zeger's request for a recommended high-performance computing architecture. In particular, should we have a single machine or multiple machines? Consensus seems to be that two machines are probably necessary: a 64-bit computing cluster, and an Athena-like file server that would also be used for general computing. Several key issues arose:

  • software compatibility with 64-bit processors -- What compatibility problems are we likely to encounter with commercial and open-source software?
  • license issues for, e.g. SAS, Matlab -- Do 64-bit versions exist for these codes and what are the licensing issues?

Consensus was that we should focus on 64-bit architectures -- probably AMD's Opterons due to their better compatibility. 1 Gb/s interconnection technology appears to be at the sweet spot and is already included in most vendors' machines. We decided it would be a good idea to buy a 64-bit box to test the compatibility of our software. Also decided to continue the meeting the following day.

Thursday July 10, 2003

  • Present: Ingo Ruczinski, Jerry Gilyeat, Fernando Pineda, Jiong Yang, Chris McCullough, Sean Prigge

  • Jerry presented a report summarizing technical issues to consider for the next generation cluster. The report focussed on four topics: (1) cpu technology, (2) interconnection technology, (3) Linux distribution and (4) potential vendors.
    • CPUs: Biopile uses 32-bit Xeon CPUs. 32 bit technology is nearing the end of its lifespan. Competing 64-bit chips are Intel's Itanium and AMD's Opteron. We should switch to 64-bit technology for our next cluster. One disadvantage of the Itanium is that 32-bit performance is poor, so legacy applications might not run as well on it.
    • Interconnect technology: Biopile uses 100 Mb/s connections. GigE (1000 Mb/s) is now the "standard" technology for intranets. GigE appears to be at the price/performance sweet spot. Switches are reasonably priced ($5K for a 24-port switch?) because of the economies of scale associated with the intranet/internet market. There exist more exotic technologies (e.g. Myrinet) which are optimized for cluster applications, but these require special-purpose interface cards and special-purpose switches. These are more expensive than GigE (for example, a Myrinet interface card is over $1K compared to "free" for GigE -- since GigE interfaces are already on the motherboards!) because the cluster market is smaller than the intranet/internet market. This is not likely to change too much in the next year.
    • Linux Distribution: Biopile uses Redhat. Redhat does not yet support the Opteron but does support Itanium. Linux itself has been ported; it is the distributions that are catching up. SuSE and Mandrake currently support the Opteron. This may change in a year.
    • Vendors. Biopile came from a small "white-box" vendor. Jerry compiled a long list of potential vendors. Dell is one of the biggest, but they currently do not sell a 64-bit cluster (except as a custom-built machine). This may change in a year.

  • There was a discussion on whether we should upgrade the cluster. The consensus was that the machine was far from saturation and that, until there were more users or more jobs running on it, it didn't make sense to upgrade. Two experimental upgrades were floated: (1) a GigE switch, (2) a couple of 64-bit compute nodes. There was concern that this would result in a heterogeneous system and that it might cause some performance degradation for naive users.

  • Fernando reported that the ftp problem, which precluded us from connecting to ftp.ncbi.nih.gov, was solved by Ladie and others in the IS department.

  • Fernando reported that the inventory of the MMI personal computers was finished. Not everyone responded, but we have enough data to make some conclusions for the MMI self-study.

Thursday May 30, 2003

  • Present: Ingo Ruczinski, Rafael Irizarry, Jerry Gilyeat, Fernando Pineda, Art Giovanetti, Jiong Yang

  • Fernando summarized the discussion at the CIT meeting. Key points were that the CIT approved the initial rollout of the blacknet. For the moment the rollout will proceed using the current model of distributed data centers. The CIT discussed various cost-recovery approaches, but nothing was settled on.

  • Rafa requested some scripts to update his version of 'R'. Jerry will provide him with these.

  • The status of off-site backup was discussed. Jiong was charged with delivering the tapes to Ladie. Art said he would ping Ladie about sending out regular reminders to sysadmins. Jerry will have back-up scripts ready in 2-3 weeks.

  • Jiong and Jerry have been unable to do NFS mounting from biopile. They suspect network problems but have not completely eliminated the possibility that the problem is with biopile. They will do some tests to eliminate this possibility and coordinate with Ladie if required.

  • Jiong and Jerry will be attending the ClusterWorld conference in San Jose this June.

Thursday May 1, 2003

  • Present: Al Scott, Ingo Ruczinski, Rafael Irizarry, Jerry Gilyeat, Chris McCullough, Fernando Pineda, Jiong Yang

  • Offsite backup for Biostats scheduled to be launched in 2 weeks. Action item: Chris will order more tapes and will ask Ross about tape return policy. Offsite backup for MMI will start when we order a tape drive. Action item: Jerry will spec-out a tape drive.

  • Rafael's server was approved for the blacknet. He, Jerry and the current sysadmin will get together to discuss its transfer to the W5307 server room.

  • SPAM filtering doesn't work very well. Action item: Jerry will talk to Ladie.

  • NCBI BLAST is up and running on Biopile. A new disk to hold local copies of the NCBI databases was ordered and installed. The URL for the local BLAST is http://biopile.mmi.jhsph.edu/blast (all lower case). A problem uncovered while installing BLAST is that we can't access the NCBI FTP servers from behind the jhsph firewall. Jerry contacted Ladie a week ago. Fernando will follow up.

  • Jerry will be testing software solutions for the MMI/MRI web server which will make the server easily usable by Microsoft Access and ASP users. The following summarizes what he has found so far: "After a bit of research, my recommended solution (pending full product evaluation) is to use SunOne ASP from Sun Microsystems instead of Apache::ASP or iASP. The cost of the software is approximately $635, including support, media and documentation. Hardware upgrade costs for the server, all from Dell, are $497.17 for a second 2 GHz CPU and $331.96 for an additional 1 GB of RAM. These hardware upgrades are recommended regardless of which ASP solution we use. Total recommended cost is approximately $1500. Evaluation versions of the software packages are free, as well. For the evaluation, I would like to work with both Bin and Konstantin to make sure that their needs are met with respect to functionality and features. I'd like to kick the evaluation and testing off as soon as possible. Testing will involve the installation of two additional web server daemons onto the server, independent of what's already there. I will then install iASP on one and SunOne on the other, so we can do independent evaluations of the performance and overall feature set of each. Because part of the testing will involve talking to the database server, its IP address and the ports opened on it will be needed."

  • Disk space on Biopile is getting tight. We decided to order another disk.

  • A one/two-year plan for developing a new computing infrastructure was put on the table for exploration and elaboration. A key requirement for Biostats is that users be able to mount their file system from a file server and be able to run jobs interactively. It is also important to support a capability like the current SunRays for the students. Right now it appears that these requirements are performance killers for a big cluster. Thus we will explore the possibility of having two clusters. The current Biopile would serve as the new Athena. It would be expanded to 16-32 nodes and be used as the application/file server for interactive login sessions. The SunRay lab would be replaced by a clone of the Linux lab in Hampton House. (The Hampton House Linux lab was set up at a cost of ~$1.2K per workstation.) These workstations would be networked to Biopile, which would serve as the file-system/application server. Unlike the SunRays, these could be placed anywhere in the building. Getting the Athena/SunRay replacement up and running should be straightforward since we would be leveraging efforts already expended in developing Biopile and the Hampton House Linux lab. Finally, we would install a new cluster to be used exclusively for high-end batch computing. Presumably this cluster would have a high-end backplane (e.g. Myrinet or Gigabit Ethernet). This system would have 64-128 nodes. Jobs could be submitted from Biopile using a batch queue. To users of Biopile it would look as if they were merely submitting their job to a queue on Biopile, but in fact the job would be going to the SuperComputing cluster.
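
The cost figures quoted in Jerry's web server recommendation above can be checked with a few lines of arithmetic. This is only a sanity check on the numbers as stated in the minutes: the dictionary labels are made up for illustration, and the software price is the approximate figure from the quote.

```python
# Figures quoted in the recommendation, in US dollars.
costs = {
    "SunOne ASP software (approximate)": 635.00,
    "second 2 GHz CPU": 497.17,
    "additional 1 GB of RAM": 331.96,
}

# Hardware is recommended regardless of which ASP solution is chosen.
hardware = costs["second 2 GHz CPU"] + costs["additional 1 GB of RAM"]
total = sum(costs.values())

print(f"hardware subtotal: ${hardware:.2f}")
print(f"total:             ${total:.2f}")
```

The itemized figures sum to a little under $1500, which is consistent with the "approximately $1500" total given in the recommendation.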

Thursday April 3, 2003

  • Present: Brian Caffo, Ingo Ruczinski, Rafael Irizarry, Jerry Gilyeat, Fernando Pineda

  • The main topic of discussion was the status of the Blacknet, which is up and running out of one closet on the 5th floor. Two MMI servers are already on the Blacknet and are located in the temporary MMI/Biostats computer room (W5307). (These minutes are being served off one of the machines.) The IS department added 6 new network drops to the room. There are now 14 100BaseT network drops in the room.

  • We discussed moving Aiden's server and Rafael's server into W5307 and getting them onto the Blacknet ASAP. Jerry and Rafael were going to discuss with Aiden the possibility of putting Rafael's database application onto Aiden's machine, so as to minimize the number of separate servers.

  • Reviews of the BioPile cluster continue to be good. The version of R that Jerry optimized for BioPile runs approximately 4 times faster than on Athena. The need for an accounting system to measure usage of the system was raised again.

  • After the meeting we took a tour of W5307.

  • After the meeting Chris McCullough called Fernando and mentioned that Scott Z. had approved the off-site backup protocols and that we should present these at the next Biostats faculty meeting.



Bit.MeetingMinutes moved from Bit.Minutes on 19 Dec 2004 - 16:55 by FernandoPineda