Planet Site5: Bringing Site5 Team Bloggers Together
Posted by David Felstead (Senior Engineer) at June 22, 2006 01:24 AM.
Anyone who has been around Ruby and Rails (or any new software technology) for a while has seen or heard the question asked countless times: Does it scale?
Inevitably the one who is asking the question is referring to the scalability of the software framework, asking whether it can be easily expanded to accommodate copious amounts of requests and traffic. The question is one of those that can’t be answered simply, and no doubt the Rails aficionados will be rolling their eyes at the same question being raised yet again, as the vast majority of applications written with Rails will never grow to the size that will require them to scale. The Site5 Engineering team’s recent work on our new server monitoring and task management system, Squire, has forced us to look more closely into the scaling issue, and not just on a technological basis. With the massive growth that Site5 has been experiencing in recent months, it seems that some of the ways we used to take care of things with regard to server management just weren’t going to carry us through into the future.
The crux of my argument is this: when there is enough growth to warrant a re-evaluation of the scalability of an application, chances are that you’re going to have to re-evaluate your business processes in a similar way. Luckily, the Site5 Management team recognizes this and has planned accordingly. In fact, most of the engineering team’s effort is being poured into future-proofing our fleet.
Site5 now has hundreds of new customers joining us every month and new servers being added all the time, so tasks that were once very simple for our support staff and system-admin gurus start to become more difficult. The off-the-shelf systems we use to monitor our server fleet start to become inadequate – they don’t provide enough detail on the sources of any issues occurring, and often require manual intervention from support staff to resolve. Though most issues take only a few minutes to resolve, a few minutes multiplied by a few hundred servers starts to become a big drain on resources. It might have worked before, it might still work today, but it won’t work in the future.
So enters our new server monitoring system, Squire. This neat little piece of software (written completely from scratch by us on the Site5 Engineering Team) is a purpose-built web-hosting monitoring system. It’s closely integrated with our hand-built Synco CRM system and even with Site5 Backstage to proactively monitor and gather detailed statistics on each machine in our fleet. It automatically detects and resolves common issues on our machines and instantly notifies the support team when a problem that can’t be resolved automatically is encountered. In addition, it provides the customer service staff and support staff with quick links to customer information should customers need to be contacted in the case of problems. In fact, we have a few little Backstage features in store to keep our customers that much more informed… stay tuned!
Our system administration and customer service teams work extremely hard to keep our server fleet healthy and our customers happy. With Squire we hope to not only make their lives easier, but also to keep our customers up to date, informed, and, most importantly: happy.
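For readers curious what a "detect, resolve automatically, otherwise notify support" loop looks like in general, here is a minimal Python sketch of the pattern. It is not Squire's actual code, which is not shown in this post. The `Check` class, the `run_checks` function and the `notify_support` callback are all hypothetical names chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One hypothetical health check with an optional automatic fix."""
    name: str
    is_healthy: Callable[[], bool]                    # True when all is well
    auto_fix: Optional[Callable[[], bool]] = None     # True if a fix was applied


def run_checks(checks: List[Check], notify_support: Callable[[str], None]) -> List[str]:
    """Run every check; try the automatic fix first, escalate otherwise.

    Returns the names of the checks that had to be escalated to a human.
    """
    escalated = []
    for check in checks:
        if check.is_healthy():
            continue
        # Attempt the automatic fix, then re-test to confirm it really worked.
        if check.auto_fix is not None and check.auto_fix() and check.is_healthy():
            continue
        # No fix available, or the fix did not help: hand it to support.
        notify_support(check.name)
        escalated.append(check.name)
    return escalated
```

The design choice worth noting is the re-test after the fix: an automatic fix only counts as a resolution if the check passes afterwards, so anything the software cannot verifiably repair still reaches a person.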
Operations and Tactics - Blindsided with vengeance — heat is our nemesis
Posted by Todd Mitchell (Chief Operations Officer) at June 03, 2006 05:57 AM.
The last month has been one hell of a roller coaster. Late nights, all-nighters, travel, weeks away from my family, upset clients, and we’re still not done. Why all of a sudden are things off kilter? Heat. Heat in a data center and within our chassis is the reason.
We have to go back to the first week of May. All of a sudden and without notice we started to see machines fail at an abnormal rate. Within a week we had 15 machines fail, all with primary hard drive failures. This means hours of downtime for our clients. Obviously, after being blindsided with 15 drive failures in a week, we thought: a bad batch. Sometimes hard drives have errors during the manufacturing process and you get a bad batch. And then more drives started to fail.
Fast forward to May 10th. Adam (Site5’s CTO) and I decide to head to our data center in NJ. I leave at 19:00 and Adam leaves at 00:00. We both arrive around 02:00 ET. Both beat from the day and the travel, we crash for the night and get started first thing in the morning.
We head to one of the data centers in NJ, scan our cards, get our faces scanned and in the door. Everything appears fine until we get to our cage where our cabinets and servers are housed. We find our cage abnormally hot and with insufficient cooling. Obviously both Adam & I had suspicions before we headed to NJ, but the discovery of excess heat solidified our initial thoughts.
Although hot, it didn’t seem like the ambient temperature could have caused the carnage we saw the week before…and then we opened one of our chassis (server cases). That’s when we found another issue. The chassis design has 4 critical flaws:
1. The full size (550 watt) Power Supply Unit (PSU) is located in the front of the chassis restricting cold air intake.
2. We had full size DVD-ROM drives in the second front bay, again resulting in reduced cold air intake.
3. The hard drives we use were stacked on top of each other at the rear of the chassis.
4. There were insufficient fans within the chassis, which resulted in less draw/exhaust.
After discovering these critical faults in the chassis design, we started to poke around the data center and stumbled upon another critical fault, this time with the air conditioning. Rather than using standard aluminum ducting in our cage, our data center used an air sock. If you’ve ever been to an airport you’ve seen an air sock. The air conditioning sock functioned in the same way: affix the sock to the primary air conditioning outlet, and when the air conditioning is blowing the sock expands. The sock is perforated to allow air flow to the cage rows. This wasn’t working at all.
Now it was obvious to Adam and me what our issue was. Heat had very quickly become our nemesis (failing hard disks, PSUs and poor server performance). In addition to the air conditioning problems and the chassis design, we also had a rack density issue: essentially, too many servers within a specific amount of square footage.
As you can imagine, at this point Adam and I are beside ourselves, extremely irate with our data center for neglecting these issues. We end up calling a meeting with one of the data center owners and our sales rep to discuss the issue. As the discussion progresses we start to pull in the facilities manager, the senior network operations guy and the lead hardware guy, and the room slowly fills. After many hours of discussion we arrive at several solutions:
- The air conditioning sock needs to go away. This was completed a few days later.
- We need supplemental air conditioning to help with our density. An additional 13 tonnes of air conditioning was brought in as of last week. There’s another 20 tonnes of air conditioning coming within 10 days.
- We’re going to reduce our density by half. This is in the works, but essentially we’re going to occupy twice as much space as we do today with the same amount of equipment (servers and networking gear). This space has already been built out; we just need to figure out the logistics and wait on the new air conditioning unit.
- We’re going to change every single chassis. We’re going to rid ourselves of the poor chassis design for a new design with much better cooling and hardware placement. These chassis have been ordered and should be delivered next week. Once the chassis are at the data center, we’ll schedule maintenance and dispatch a maintenance notice. Downtime for the chassis swaps will be less than 20 minutes.
Within the next few weeks we’ll have all of the abovementioned issues permanently resolved. This has been a tremendously eye-opening event for all of us at Site5. Clearly this is the very type of situation we will avoid at all costs in the future. To the clients that have been affected by these issues (hardware failures), we sincerely apologize. As you can tell from the plans outlined above, we’re doing everything within our control to permanently resolve the outstanding issues.
For all of our new clients, you will not be affected by these issues. All new servers are housed in our new chassis and are being cooled by the supplemental air conditioning until the new 20 tonne Liebert air conditioner is installed.
I’ll continue to post updates about this issue and other ongoing issues as the information becomes available.
On behalf of Site5, I’d like to thank all of our clients for their patience and continued patronage. We truly appreciate your business.
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at May 27, 2006 10:35 PM.
Several people who have contacted me for assistance lately have indicated that they were hesitant to do so. I just wanted to post a quick note saying that I'm here to help. I don't mind helping with development questions, looking into long standing issues, or generally helping where I can. I'm pretty good about getting back to everyone that e-mails me, but sometimes it takes a day or two (and most requests can be taken care of much more quickly by our support staff, so you should usually seek help from them first). In addition to hearing from our staff (on behalf of our clients), I really don't mind working with Site5 customers directly on issues that they are having.
One thing to be aware of, however, is that I don't actually directly oversee support on a day-to-day basis at Site5. Customer Service staff fall under the leadership of our COO Todd Mitchell. He and I work together very closely to implement the plans laid out by the management team as a whole and keep things running day to day.
Basically, I'm just trying to say that our customers shouldn't hesitate to contact me; I really am happy to help where I can.
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at May 25, 2006 01:22 AM.
May has been a pretty busy month for me. I spent the majority of the month in New Jersey, on location with one of our vendors (along with Todd), to make sure that the rash of drive failures we’ve suffered lately is put to an end once and for all. I was once again reminded why living out of a hotel room (even if it is a nice one) stinks. Jersey wasn’t a total loss, because in addition to working out issues with our vendor, Todd and I also got the time to sit down and work up a schedule for operations projects that need Engineering Team attention.
Today, with a pretty substantial Backstage update, we start full speed down that path. In reality the release we pushed live today was actually finished before I left New Jersey, but the extra few days were spent waiting for support levels to even out and testing the changes we made. Now the team is hacking on the core incident/outage system that will be the heart of the new operations work we will be doing for a few months.
Posted by Todd Mitchell (Chief Operations Officer) at May 15, 2006 03:19 PM.
Things have exploded for Site5 over the last 12 months, from the number of exceptionally great clients to employees and servers. We’ve gone from a mom and pop shop (think corner store where you know everyone’s name) to a serious contender in the web hosting industry.
That said, our support department is evolving. Previously we had 1 level of support. This level handled everything related to our machines, from front-end customer support to full-on server replacements/restores. This worked extremely well when we had a smaller number of servers in our fleet and a smaller number of clients. Since neither holds true today, we’re changing to meet our clients’ needs.
First, we’re adding a second layer of support. Known as Level 3’s internally, these highly skilled Linux system administrators handle all complex/advanced fleet and network issues. How does this affect you? Well, first, the Level 3’s are proactive. This group handles all maintenance, security updates, software upgrades, etc.
The group is also reactive. If they see issues with a server or one of our networks, they move quickly on the issue. With ample pull at all of our vendors and service providers, having system admins dedicated to overall performance will improve our overall service offering.
Over the next couple of weeks, you’ll see this group in full swing. From maintenance notices going out signed by them to quicker reboots and fixes when an issue does pop up. Many of you have already been in touch with Andrew Galloway, a great Sr. System Admin (Level 3) and once we finalize the other two positions, expect a whirlwind of performance improvements and overall stability across our fleet.
This shouldn’t take away from our amazing crew who handle all technical support requests. Our front-end system admins, the guys who respond to your support requests, also have a redefined role. Rather than having them wear too many hats, we’ve reduced their overall load. This will help ensure that all responses through our help desk are fast, efficient and accurate. We’re continually adding to this group (expect to see a few new faces over the coming weeks)!
If you’ve experienced our Level 3’s in action or simply want to comment on the above, I’d like to hear from you. Unfortunately I can’t respond to every single email, but I do read everything that comes into my inbox. tmitchell (@@at@@) site5.com.
Posted by Todd Mitchell (Chief Operations Officer) at May 07, 2006 11:05 PM.
Sales/Billing Reps - Remote / Telecommute
Site5 is once again on the hunt for some very exceptional people. If you’re looking for a challenging career in the web hosting industry, Site5 is the company for you. One thing to keep in mind–Site5 is not your average company. Site5 is filled with talented people and teams that like and want to work. Site5 is progressive, original and proactive. Site5 is exceptional and we’re in need of exceptional people.
How do we create such a dynamic and enjoyable work environment?
- We only hire great people.
- Our employees are dependable and reliable.
- We encourage learning additional skills.
- The management team actually cares.
Sound like a company you’d like to be part of? Excellent–because we’d love to meet you!
Our Sales & Billing groups at Site5 are an integral part of our success. Often times the sales & billing group are the first faces & interaction that clients & potential clients have with Site5. It is very important to us to ensure that these groups are on the ball and up to date.
Site5 has had exceptional growth over the last few months and we need to add to these groups! You will be responsible for handling all incoming requests from our clients. All incoming inquiries are queued and handled on a first come, first served basis. So if you want to sell a service you believe in or have exceptional customer service skills and would like to work in our billing group, you should apply ASAP!
Like most employees at Site5, you will sport many hats throughout your day. Some include, but are not limited to, Site5 evangelist, customer service guru and master of client satisfaction.
Our work environment is fast paced & progressive. The prerequisites for this position include strong troubleshooting / problem solving skills, impeccable customer service abilities as well as the ability to learn / pick up new applications/concepts/ideas at a very rapid pace.
Very strong communication and interpersonal skills are paramount.
Essentials to apply for our billing & sales groups:
- Experience on Service Level Agreement help desks is a plus.
- Ability to meet & exceed predetermined goals.
- Experience in the web hosting industry.
- Excellent written & verbal communication skills.
We are currently looking to fill the following shifts:
M-F 09:00 - 17:00 ET - billing group
M-F 08:00 - 16:00 ET - sales group
M-F 12:00 - 20:00 ET - sales group
Please send a text (ASCII) or HTML version of your resume to careers@site5.com with the subject line: sales/billing positions - Remote (WT040706-002). Please be sure to include your group & shift preferences.
Site5 Internet Solutions, Inc. is an equal opportunity employer.
Site5 does NOT accept resumes from agencies or similar. Please do not remit invoices to our contact email address, physical mailing address, fax or to any of our employees or contractors. Site5 will not be held responsible for any fees related to unsolicited resumes/correspondence.
Operations and Tactics - Linux/cPanel Systems Administrators - Remote / Telecommute
Posted by Todd Mitchell (Chief Operations Officer) at May 07, 2006 10:49 PM.
Linux/cPanel Systems Administrators - Remote / Telecommute
Site5 is once again on the hunt for some very exceptional people. If you’re looking for a challenging career in the web hosting industry as a Sr. Linux System Administrator, Site5 is the company for you. One thing to keep in mind–Site5 is not your average company. Site5 is filled with talented people and teams that like and want to work. Site5 is progressive, original and proactive. Site5 is exceptional and we’re in need of exceptional people.
How do we create such a dynamic and enjoyable work environment?
- We only hire great people.
- Our employees are dependable and reliable.
- We encourage learning additional skills.
- The management team actually cares.
Sound like a company you’d like to be part of? Excellent–because we’d love to meet you!
System Administrators at Site5 are an integral part of our success. Our administrators keep our site, our clients’ sites and our infrastructure in amazing condition. You will be part of a shift-based team that analyzes and finds solutions for daily issues that arise in our server fleet.
System Administrators are also responsible for direct, front-end support for all of our valued clients. Our clients expect & deserve the best possible support when an issue arises with their account/service/server, etc. Removing level I/II help desk admins ensures that our clients’ issues are resolved quickly & efficiently, the first time around.
Like most employees at Site5, you will sport many hats throughout your day. Some include, but are not limited to, Site5 evangelist, customer service guru and master Linux sysadmin.
Our work environment is fast paced & progressive. The prerequisites for this position include strong troubleshooting / problem solving skills (Linux & basic hardware), a hardcore knowledge of Linux & networked environments, impeccable customer service abilities as well as the ability to learn / pick up new applications/concepts/ideas at a very rapid pace.
Very strong communication skills, interpersonal skills & scripting skills are paramount.
Essentials to apply for the Site5 Sr. System Administration position:
- 3+ years experience as a Linux Systems Administrator in a production environment.
- Experience on Service Level Agreement help desks is a plus.
- Experience in the web hosting industry.
- Proven technical troubleshooting experience.
- Experience with Apache 1.3/2, PHP 4/5, MySQL 4 or Exim is a plus.
- Excellent written & verbal communication skills.
- Hardcore Linux experience in a production environment is a must.
We are currently looking to fill the following shifts:
M,R,F,S,S 07:00 - 15:00 ET
M,T,W,R,F 15:00 - 23:00 ET
M,R,F,S,S 23:00 - 07:00 ET
R = Thursday
Please send a text (ASCII) or HTML version of your resume to careers@site5.com with the subject line: Systems Administrator - Remote (BL40706-001).
Selected applicants will receive a one-time cash bonus upon successful completion of our initial probationary period. Cash bonuses range from $2,000.00 USD - $5,000.00 USD depending on shift & qualifications.
Site5 Internet Solutions, Inc. is an equal opportunity employer.
Site5 does NOT accept resumes from agencies or similar. Please do not remit invoices to our contact email address, physical mailing address, fax or to any of our employees or contractors. Site5 will not be held responsible for any fees related to unsolicited resumes/correspondence.
Posted by Todd Mitchell (Chief Operations Officer) at April 30, 2006 03:22 AM.
Wow, the last few weeks at Site5 have been amazing. We’re moving and shaking behind the scenes. Improving our existing services while forging ahead as an industry leading company.
We just recently hired a new Sr. System Administrator. You may see some responses from Andrew Galloway over the next few weeks. He started last Tuesday and has taken to Site5 like a fish to water.
Andrew will primarily be off the help desk. One of his many job functions is to ensure our server fleet is running at its very best. He’ll be doing proactive maintenance, reactive measures to correct random issues and a ton of other things.
So if you see Andrew around, shoot him a welcome message!
Adam's Development Blog - Backstage: Ticket System Functionality is coming!
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at April 20, 2006 09:30 PM.
The most common request we have received since Backstage was launched is the ability for customers to view and respond to tickets online using a web interface. This functionality had existed in MySite5 (MS5), and was not included in the initial release of Backstage for several reasons. First, we really did not know that so many people loved this functionality in MS5. Second, we have plans to replace our current ticket system with something more closely integrated with our Synco/Backstage platform. The other big issue is that the back-end database structure of the ticket system we currently use makes even basic operations (such as getting a list of open tickets, or viewing a ticket) very complex, involving queries to half a dozen tables or more.
Internally, we have heard our clients’ concerns/requests, and we are reorganizing our engineering schedule as a result. The ticket system replacement is being pushed back a bit on the schedule, so we have started working on an interface for our current ticket system within Backstage. Currently, we expect the initial functionality to allow you to view and respond to current tickets and view past tickets.
Posted by Todd Mitchell (Chief Operations Officer) at April 13, 2006 02:37 AM.
I know, I know. A little late. But late is better than never. Site5’s amazing engineering team released Backstage (Site5’s customer portal) about 2 weeks ago. Since then, everything has been flawless. If you haven’t already done so, and if you’re a Site5 client, log into Backstage. It will truly change the way you manage your hosting account(s).
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at April 12, 2006 02:38 AM.
It has been almost two weeks since we released Backstage to the Site5 community, and the response has been completely overwhelming. The massive amount of feedback we have received has been completely invaluable. People have shared their thoughts, feelings and opinions on everything from our color choices to browser compatibility.
In designing the interface of Backstage, we focused primarily on current versions of Mozilla Firefox, Safari, and Internet Explorer (for Windows). While our previous MySite5 apparently worked in a few additional browsers, we felt that limiting our browser focus to the most popular browsers among our clients was a good decision for the release overall, because it let us focus on usability and keep the design as lightweight as possible. A number of people asked why we didn’t include Internet Explorer for the Macintosh, and the simple answer is that, to the best of my knowledge, no one supports that browser any more (neither Apple nor Microsoft). I would be interested to hear from other developers and designers out there what browsers you develop for when making sites/applications.
Over the first two weeks of production we have made about a dozen small incremental releases correcting bugs, adding small features, and improving system performance. I cannot sing the praises of Capistrano loudly enough. It truly makes deploying small incremental releases of your application a joy rather than a nightmare.
We have another iteration of Backstage in the pipeline at the moment, but after that is wrapped up our primary development focus will move towards some more behind the scenes software to assist our Operations staff in their day-to-day work.
Operations and Tactics - Site5 Internet Solutions, Inc. critical maintenance notification [Planned]
Posted by Todd Mitchell (Chief Operations Officer) at April 04, 2006 06:06 PM.
Site5 Internet Solutions, Inc. critical maintenance notification [Planned]:
Date: 04/05/2006 - 04/10/2006
Start time: 00:00 GMT
Estimated end time: 06:00 GMT
Services/Equipment: demodocus, leto, copernic, cheiron, neysa
Type of work: Hardware swap / restore
Purpose of work: Replace primary hard disk
Impact of work: See below:
Due to space constraints on a number of servers, Site5 will be upgrading the hard disks in the abovementioned servers. This maintenance will require that we take the server offline for a period of time to complete this critical maintenance.
During this maintenance you will not have access to your web hosting account(s). We do not anticipate any data loss & email sent to your domains during this maintenance window will be queued & delivered as soon as the server comes back online.
Total downtime expected is approx. 5 hours.
I apologize in advance for this downtime. I understand that it is inconvenient, but it is required at this time.
Should you have any questions or concerns related to the abovementioned maintenance, please feel free to contact us at http://www.site5.com/about/contact.php.
Regards,
Todd Mitchell
Site5 Internet Solutions, Inc.
Note: this email has been sent from an email address that does not exist. This was done in order to prevent bounce back messages from clients who have outdated contact email addresses on file. Please check our web site at http://www.site5.com for our current contact email addresses.
Operations and Tactics - Update related to Backstage, technical support requests & FlashBack PRIME
Posted by Todd Mitchell (Chief Operations Officer) at April 04, 2006 03:20 PM.
Site5 has recently released & is preparing to release new services for our clients. As a result, I’d like to provide everyone with a few updates to ease concerns:
Backstage: After many months of work, Backstage (replacement for MySite5) was released on Friday. All clients received brand new login information for the new system. Throughout Friday & over the weekend we started to receive various bug reports, feature requests, and general ‘how do I do this’ type emails.
As a result of this influx in tickets, technical support has been a little backed up. We have added staff to the morning & evening shifts to help with the number of tickets flowing into the system. Support responses will be a little slower than usual as we work through this influx of tickets. Per our guarantee, all tickets will still be responded to within 24 hours, and most of them will receive a response within 2 hours, if not even sooner!
I should note that all urgent server issues will continue to be handled immediately. We have redundant monitoring systems in place that detect and notify us of issues. So if you notice that a server is offline, chances are, we are aware of the issue.
Please bear with us as we work through this influx. I expect to have things back to normal within the next 72 hours (if not sooner).
FlashBack PRIME: We will be re-seeding (preparing) FlashBack PRIME on all of our servers over the next 48 hours. This process started last night and will continue over the next 2 nights. During this process you will notice an increase in system load. This is due to the disk access required by FlashBack PRIME. We do have monitoring systems in place, so if you notice a higher than normal load, we will notice it on our monitoring screens.
New Orders: Due to an issue with new order provisioning, I have suspended order provisioning for the next couple of hours as we troubleshoot an issue where welcome emails are not being dispatched from our billing system. I’ll reply back as soon as order provisioning is resumed.
I apologize in advance if any of the abovementioned issues have caused you any inconvenience.
Operations and Tactics - What is system load & how does it affect me?
Posted by Todd Mitchell (Chief Operations Officer) at April 03, 2006 11:15 PM.
I get this question often and naturally as a web host, we deal with load on a daily basis. Clients see load in their control panels, sometimes see a green or red light next to a load average. But what does it all mean? Is a low load good? Is a high load good? Why are there 3 load averages?
First, let’s handle the 3 load numbers. When looking at load on a server you’ll see the following:
root@leto [~]# w
18:45:26 up 7 days, 7:49, 4 users, load average: 2.64, 2.21, 2.04
Scroll over to the right and you’ll see 2.64, 2.21 & 2.04. The first number is the 1 minute load average, the second is the 5 minute load average, and the third is the 15 minute load average. This is a very small historical representation of load on a server.
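If you want to pull those three numbers out of a line like the one above in a script, a minimal Python sketch might look like this. The function name `parse_load_averages` is my own invention; only the format of the sample line comes from the `w` output shown above.

```python
import re


def parse_load_averages(line):
    """Return the (1-minute, 5-minute, 15-minute) load averages found in a
    `w` or `uptime` header line, or raise ValueError if none is present."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", line
    )
    if match is None:
        raise ValueError("no load average found in: %r" % line)
    return tuple(float(value) for value in match.groups())


# The header line from the `w` output above.
sample = "18:45:26 up 7 days, 7:49, 4 users, load average: 2.64, 2.21, 2.04"
one_minute, five_minute, fifteen_minute = parse_load_averages(sample)
print(one_minute, five_minute, fifteen_minute)
```

On the machine the script itself runs on, Python's standard `os.getloadavg()` returns the same three values directly on Unix-like systems, so parsing is only needed for captured or remote output.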
But what is load? On Linux, the load number is a running average of how many processes are either using the CPU, waiting for their turn on it, or blocked waiting on disk I/O. Anything that makes processes queue up (CPU utilization, slow disk I/O, swapping when RAM runs short, and the list goes on) pushes the number higher. Basically, load is an overall representation of how busy a server is.
That said, load is NOT a good indicator of how a server is actually performing. People generally associate a high load with poor performance. More often than not, this is not the case. You need to consider the server’s hardware configuration, the software applications on the server, etc. You can see many servers running with loads of 50 - 100 and the users will not notice any performance degradation.
This generally holds true with our servers. Our web servers & MySQL servers are finely tuned & tweaked. If a server has a load spike, the server is often able to handle the increase in resource usage without too much issue. The load average numbers will increase, but are not an accurate representation of the system’s overall performance or ceiling.
So if a system has a high load, this doesn’t mean poor performance. It means that the server is busy utilizing its resources rather than sitting idle. Any active multi-user server will have a load. A higher than normal load is a hint that something may be wrong, but it doesn’t prove that something is wrong.
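To illustrate the earlier point about hardware configuration, one common rule of thumb (my addition, not something Site5 has said it uses) is to read the load average relative to the number of CPU cores. The helper name `load_per_core` below is hypothetical, and the result is only a rough guide, for all the reasons given above.

```python
import os


def load_per_core(load_average, cores):
    """Scale a load average by the number of CPU cores in the machine."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


# The same load reads very differently on different hardware.
print(load_per_core(2.64, 1))   # single-core box: more work than one core can run at once
print(load_per_core(2.64, 8))   # eight-core box: plenty of headroom

# On a Unix-like machine the live values come from the standard library.
if hasattr(os, "getloadavg"):
    one_minute, _, _ = os.getloadavg()
    print(load_per_core(one_minute, os.cpu_count() or 1))
```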
I think a real indicator is load times for pages and general sluggishness. If the server feels slow, there’s something wrong. That is a better test than attempting to interpret what appears to be a rather arbitrary number that only the kernel source developers from the 70s can explain correctly.
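If you would rather put a number on "feels slow" than rely on the load average, a minimal Python sketch for timing a page fetch could look like the following. The helper names `timed` and `fetch` are my own, and this is just one simple way to measure it, not a tool Site5 provides.

```python
import time
import urllib.request


def timed(action):
    """Run `action()` and return a (result, elapsed_seconds) pair."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def fetch(url, timeout=10):
    """Download a page and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `timed(lambda: fetch("http://example.com/"))`, with the placeholder address replaced by a page on your own site, returns the page body together with the number of seconds the request took. Repeating the measurement a few times at different hours gives a far better picture of how the server feels to visitors than a single reading.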
Adam's Development Blog - Typo Release: 2.6.0 w/ bundled rails (and Rails 1.1.1)
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at April 03, 2006 12:13 PM.
The Typo Team has made a release available that includes Rails 1.0 bundled into the vendor directory with Typo. The Rails Core group posted a quick note about the issues that followed the release of Rails 1.1 (for most people, these issues related to Typo not working). Site5 strongly recommends that everyone deploying production applications on our shared hosting servers take the time to freeze all of their gems (including Rails) on versions that they know to work properly with their software. Site5 also continues to make previous versions of Ruby gems available after upgrades, which means customers will also be able to take advantage of the new RAILS_GEM_VERSION constant that will be available in 1.1.1.
I will be delaying plans to push forward with a fleet-wide Rails upgrade until the 1.1.1 release becomes available.
The Fivefold Path - Testing Resumes in Certifying Keys
Posted by Matt Lightner (CEO and Lead Systems Architect) at April 03, 2006 12:03 AM.
Through holes in security, I successfully appended new and productive resources into limited features of our lifeline servers. Just observing key environmental outcomes, nearly each descriptor appreciated. Your loads and terminal effects will exponentially heighten our processor’s effectiveness; you only use html and ‘data accumulation’ gradients on our ‘data operating nexus environment.’ Since internal testing exceeded 5 levels of varying effectiveness, servers in the study call users simply to operate metric external research simulators. Judging on killed experiments, samples are reinforced excepting ‘first user naturalization.’ Keeping everything equal, productivity increases through a set equation calculating resources every time.
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at March 31, 2006 05:56 AM.
This evening the Engineering Team here at Site5 launched Backstage, the replacement for our older MySite5 centralized login system. Kevin made a great post on the Site5 weblog including screen shots of the new improved interface. I've discussed a lot of the major changes that went into this system in a previous blog post, so I won't bore everyone with the Changelog.
The e-mails are literally going out as I speak, but this represents another fairly large step forward in software at Site5. Now that the primary client portal is so closely integrated with our core CRM system, we have unprecedented flexibility to integrate a number of the cool ideas that we've been cooking on the back burner. Even as this first version just hits release, we already have three pretty cool additions to it in progress and the entire development team seems extremely excited about the potential for expanding Backstage.
With this first release, I feel like we finally gave usability its proper reverence. We spent the time and effort to make everything in the interface as apparent as possible from the moment you log in for the first time. Anyway, this post is a little rambling so I think that will be all for now. I'm sure after I wind down I'll have more to say, but my feelings at this point can be perfectly expressed in one word: "Woot".
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at March 29, 2006 07:06 AM.
Earlier this evening I began the process of pushing Rails 1.1 out to Site5's hosting fleet. All servers should have this upgrade by morning (as of this writing more than 80% have it already). If you don't have it by tomorrow, please open a ticket with our support staff and they will get it installed for you. If you have an application that requires a previous version of rails, those versions are still available on all of our hosting systems. You can find the release announcement on the RubyOnRails weblog, and if you're interested in information about what has changed, you should refer to great posts from Scott Raymond and Mike Clark on the subject.
For those Site5 customers using typo, the typo team is aware of the issues with typo and Rails 1.1, and I'm sure more information will be forthcoming on how they plan to address the situation (and what they recommend you do in the meantime). For the time being, they are recommending (in their IRC channel) that typo users use a frozen version of Rails 1.0. You can do this one of two ways. First, you can type "svn export http://dev.rubyonrails.org/svn/rails/tags/rel_1-0-0 vendor/rails" from inside the typo directory at a shell prompt. The second option is to download a copy of Rails 1.0 (which I have already frozen) and upload it into the vendor directory of your typo install.
Just a note to everyone deploying production Rails applications (including typo) at Site5: we recommend that you follow some advice posted on the Rails weblog and freeze the version of rails your application is using so you can upgrade when you're ready, rather than just using whatever the latest rails available on the server is. This is something that we do with all of our internal applications.
UPDATE: I've rolled back rails on our systems for the time being because it was just causing a headache for too many customers. I'll look at completing this roll out again in the future. For the time being we will be sticking with rails 1.0. If you want to use the new functionality in rails 1.1 for your application, you can run "svn export http://dev.rubyonrails.org/svn/rails/tags/rel_1-1-0 vendor/rails" from inside your rails application directory. I still recommend you follow the advice provided above about freezing the framework for your application, especially if it is in production. For the time being it looks as if we will need to begin addressing rails upgrades in the same way we address php upgrades (longer wait after release to accommodate testing and maintenance notices). I guess I was optimistic after the relatively painless rails 1.0 release that future framework updates could be rolled out as rapidly.
Posted by Todd Mitchell (Chief Operations Officers) at March 28, 2006 06:23 PM.
Ever wonder what happens to a bad server? You know, a server that constantly has hardware issues, never runs right and just frustrates everyone involved (Site5 staff & clients)?
Fortunately, we took some video of how Site5 handles bad servers. We like to set an example so that other servers know their fate should they get out of line. Check out the video over on our combined corporate blog.
Enjoy.
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at March 28, 2006 11:26 AM.
Let’s start with the hard truth. I don’t participate in our community forums nearly as much as I should. There are a variety of reasons for that, but the past 24 hours have reminded me yet again why this situation has come about. At this point, I’ve pretty much started viewing our forums almost entirely through an RSS reader (so I do still get the general feel for what is going on there). When I respond to a thread it probably gets closer attention than some of the other threads floating around at the time. I would assume that this is in no small part because my name appears in bright red all over our forums. Periodically this attention sparks a positive reaction. It might spawn a rich discussion or inspire others to spend a few extra moments reading over content that benefits the community. I usually just do it in the hopes of sharing a bit of wisdom, and perhaps a little insight into what is going on in my corner of Site5. More often than not a well-intentioned response from me causes a negative reaction. Either the thread goes off topic, or a member of the community posts their concerns in an effort to draw my attention to them. The result is either the thread spinning out of control, or one of our moderators taking action to resolve the problem.
The other unfortunate consequence is that while posts are made based on the information available at that time, when things change most people never think to return to a previous post they may have made on the subject to revise it. This results in misinformation floating about, and it really isn’t even specific to forums. Blogs, un-maintained websites, even paper-based documentation can all easily fall out of sync with the reality of a situation over time.
Today, I saw a forum thread fly by in my RSS reader that caught my eye. It was a customer asking about the current status of some of our outstanding software projects. As faithful readers of this blog, you no doubt remember that I’m trying to strike the delicate balance between not promising specific deadlines (which would always be subject to change) and providing our customers with an idea of what is going on. My response was essentially that one of our products was in the process of being deployed fleet wide and to expect an official update regarding the second project in the next two days. I noted that I simply did not want to pre-empt a formal announcement (which has already been written and is ready to go) with an impromptu announcement made on the forums. It seemed to me (then as well as now) that this was the best course of action available. It was a specific answer to a specific question, and a quick note that more news would be forthcoming.
When I was perusing the management review queue this evening (something I also do to keep abreast of what our clients’ concerns are) I stumbled onto someone complaining about my response. Specifically, about the part saying that further information would be forthcoming.
From now on, I’ll just put this out there in no uncertain terms. If you are a Site5 customer, and you have concerns about any of our current projects, my inbox is always open. I would ask that you first attempt to get a response regarding the issue from our support department. My e-mail is easily located on the about management page, but e-mailing my first initial last name at site five dot com will also work. Until I can work out how to be a beneficial member of our forum community, I will be taking a hiatus from them (outside of minor comments and official announcements).
Posted by Todd Mitchell (Chief Operations Officers) at March 28, 2006 02:16 AM.
Great news over at NewsGator. Nick & crew have hit a major milestone. FeedDemon 2 is now Gold! As I have stated in the past, FeedDemon is by *far* the best RSS aggregator available today for the Windows platform.
If you haven’t already done so, go grab a copy of the FeedDemon trial on the NewsGator site. Within a week you’ll switch from your existing client / web app.
If you don’t know what RSS is or why you’d need an aggregator, download a copy anyway. FeedDemon will ease you into RSS.
Once again, congrats to Nick et al. Very nice work on the most recent version. Now, when is version 3 due?
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at March 27, 2006 11:24 PM.
It's that time again! I'll be attending PhillyOnRails tomorrow night (March 28th at 7pm).
Posted by Todd Mitchell (Chief Operations Officers) at March 20, 2006 05:31 PM.
I’m super excited. Today, 2 new system administrators are joining our highly skilled technical-operations team. We’re adding one additional system administrator to first shift (07:00 - 15:00 ET) and another to second shift (15:00 - 23:00 ET).
It will of course take about a week of training before the 2 new admins are fully up to speed. However, each of them is a great addition to our team. Expect to start seeing replies from them in the very near future!
Karmic Coding - Interview on Site5 and Web 2.0
Posted by David Felstead (Senior Engineer) at March 18, 2006 01:54 AM.
Two days ago I was interviewed by Charles Wright of The Bleeding Edge, a blog that is a spin-off of his column of the same name in The Age and The Sydney Morning Herald newspapers. Charles has been working on a 2000-word magazine feature on Web 2.0 services and technologies and has been interviewing people from around Melbourne, Australia who are involved in this new web shift.
We had a nice chat on the phone about my own views on the whole “Web 2.0” hype, Ruby on Rails, why Site5 is the best web host out there and about some of our projects, especially Flashback. A brief summary of the results of our chat can be seen here, and I’ll be sure to post as soon as his magazine feature is released!
Posted by Todd Mitchell (Chief Operations Officers) at March 15, 2006 03:06 AM.
If you run a WordPress powered web site, you should be aware that WordPress recently (5 days ago) pushed out a security release to patch up a couple of private issues. If you haven’t already upgraded your WordPress install, you might want to do so ASAP.
More details over at WordPress.
Operations and Tactics - If your garage door breaks, leave it down for the night
Posted by Todd Mitchell (Chief Operations Officers) at March 15, 2006 02:39 AM.
So I live in a condo–recently moved out of a house. Love the condo experience. No house maintenance, living with other people is cool and I don’t have to shovel the snow.
One bad thing about condo living, someone else makes decisions. Last night the garage door broke, apparently a spring popped or something rendering the door somewhat useless. So rather than leaving the door down, the maintenance person decided it was a good idea to leave it up for the night.
At the outset, this seems like a good idea. People can come and go without issue. One problem: people who don’t live in the building can come and go as they please. And well, that’s exactly what happened.
Sometime between midnight and 6 AM someone walked into the garage and spray painted about 15 cars. Nothing nice, nothing impressive. Just spray painted crap across a bunch of cars and trucks. It turns out that the paint is soluble–it’s a good thing. It would probably cost $5,000.00 to repaint my truck.
Anyway, when I got up this morning and found out about the garage…well….I wanted to go back to bed.
Operations and Tactics - Senior Linux Ninja/Guru (Systems Administrator) - Remote / Telecommute
Posted by Todd Mitchell (Chief Operations Officers) at March 13, 2006 01:00 PM.
Site5 is once again on the hunt for some very exceptional people. If you’re looking for a challenging career in the web hosting industry as a Sr. Linux System Administrator, Site5 is the company for you. One thing to keep in mind–Site5 is not your average company. Site5 is filled with talented people and teams that like and want to work. Site5 is progressive, original and proactive. Site5 is exceptional and we’re in need of exceptional people.
How do we create such a dynamic and enjoyable work environment?
- We only hire great people.
- Our employees are dependable and reliable.
- We encourage learning additional skills.
- The management team actually cares.
Sound like a company you’d like to be part of? Excellent–because we’d love to meet you!
Are you a Linux ninja? Do you live & breathe Linux? Do you dream of shell sessions & bash scripts? Are you the best Linux sysadmin you know? If you answered yes to all of the above, we need to hear from you–keep reading.
System Administrators at Site5 are an integral part of our success. Our administrators keep our site, our clients’ sites and our infrastructure in amazing condition. You will be a responsible part of a shift based team that analyzes and finds solutions for daily issues that arise in our server fleet.
System Administrators are also responsible for direct, front-end support for all of our valued clients. Our clients expect & deserve the best possible support when an issue arises with their account/service/server, etc. Removing level I/II help desk admins ensures that our clients’ issues are resolved quickly & efficiently, the first time around.
Like most employees at Site5, you will sport many hats throughout your day. Some include, but are not limited to, Site5 evangelist, customer service guru and master Linux sysadmin.
Our work environment is fast paced & progressive. The prerequisites for this position include strong troubleshooting / problem solving skills (Linux & basic hardware), a hardcore knowledge of Linux & networked environments, impeccable customer service abilities as well as the ability to learn / pick up new applications/concepts/ideas at a very rapid pace.
Very strong communication skills, interpersonal skills & scripting skills are paramount.
Essentials to apply for the Site5 Sr. System Administration position:
- 3+ years experience as a Linux Systems Administrator in a production environment.
- Experience in critical infrastructure / data center environments.
- Experience on Service Level Agreement help desks is a plus.
- Experience in the web hosting industry is a plus.
- Proven technical troubleshooting experience.
- Guru knowledge of Linux, basic networking and shell scripting.
- Knowledge of Perl/Ruby/Ruby on Rails programming a plus.
- Experience with Apache 1.3/2, PHP 4/5, MySQL 4 or Exim is a plus.
- Excellent written & verbal communication skills.
- Hardcore Linux experience in a production environment is a must.
Please send a text (ASCII) or HTML version of your resume to careers@site5.com with the subject line: Senior Systems Administrator - Remote (HO030106-001).
Selected applicants will receive a one-time cash bonus upon successful completion of our initial probationary period. Cash bonuses range from $2,000.00 USD - $5,000.00 USD depending on shift & qualifications.
Applicant must be a legal resident of the United States of America and have the ability to work legally within the United States of America. Site5 Internet Solutions, Inc. is an equal opportunity employer.
Site5 does NOT accept resumes from agencies or similar. Please do not remit invoices to our contact email address, physical mailing address, fax or to any of our employees or contractors. Site5 will not be held responsible for any fees related to unsolicited resumes/correspondence.
Karmic Coding - Under the hood with Ruby - partial mock objects for unit testing
Posted by David Felstead (Senior Engineer) at March 12, 2006 03:10 AM.
Automated unit testing is now a mainstream concept in software development. The basic idea, for those who haven’t experienced it, is to write a battery of methods to poke and probe components of your application to make sure it’s doing what it should be, and report any failures – some sort of script or program is then run over the test battery to pick out any problems. Although it involves a larger up-front time investment, as your code evolves and expands it’s a massive time saver as it takes care of a lot of your regression testing automatically. Couple that with a continuous integration tool (we Site5 engineers use CIA with our Ruby on Rails projects) and you can (with a little effort) end up with a very tight, well tested development cycle.
One of the more useful concepts in automated unit testing and test driven development is the idea of mock objects – basically these are clones or extensions of your application’s objects modified slightly to allow them to operate in your test environment. Typically you will mock objects because:
- You cannot use the real object (perhaps it interfaces to an external component like a credit card processing gateway) or;
- The object you are testing against isn’t finalized or completed or;
- You want the object to behave (or fail) in a certain, predetermined way.
However, like all good programmers, I’m kind of lazy, and unfortunately, writing mock objects can be a very testing (heh) and arduous process, especially writing a million different mocks for a million different scenarios. The dynamic nature of ruby makes extending on the fly much easier, and I’m going to outline a little hack I thought up to help myself with point 3 above.
Now overriding the functionality in all the objects of a particular class is spectacularly easy in ruby – you can simply do something like this:
Before:
  puts 1   # => 1
# Here's our weird contrived mock object
class Fixnum
  def to_s
    "I don't like numbers"
  end
end
After:
  puts 1   # => I don't like numbers
What happens, though, if you want to override a method on a single instance of a class, and not on every object of its kind? Typically you would define a mock class, and create an instance of it. But in Ruby there is an easier and faster way that doesn’t involve writing a different mock class for each different scenario – and it is made possible by the singleton class. This clever bit of ruby hackery lets you override the behaviour of a single instance of a class, creating what I’ve decided to call a partial mock object. To demonstrate, I’ve written a small method called override_method which will override the behaviour of the specified method in the passed object, like so:
# Overrides the method +method_name+ in +obj+ with the passed block
def override_method(obj, method_name, &block)
  # Get the singleton class/eigenclass for 'obj'
  klass = class << obj; self; end
  # Undefine the old method (using 'send' since 'undef_method' is private)
  klass.send(:undef_method, method_name)
  # Create the new method
  klass.send(:define_method, method_name, block)
end
# Just an example class
class Foo
  def do_stuff
    "I'm okay!"
  end
end
# Test code
list = []
5.times { list.push(Foo.new) }
# We override the method here!
override_method(list.first, :do_stuff) { "I'm NOT okay!" }
list.each_with_index { |f, i| puts "(#{i}) #{f.do_stuff}" }
Outputs:
(0) I'm NOT okay!
(1) I'm okay!
(2) I'm okay!
(3) I'm okay!
(4) I'm okay!
As you can see, only the behaviour of the first object in the array has been changed – the rest have remained untouched. Because of this, you can embed these partial dynamic mock objects deeply into your code without needing to specially instantiate a mock object there, or to write a ‘clever mock’ that only triggers the determined behaviour in certain objects.
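For comparison, newer Ruby versions (1.9 and later) ship a built-in that gets the same per-instance effect without reaching into the singleton class by hand. The sketch below uses Object#define_singleton_method; it is an alternative to the helper above, not a description of how Site5's code works, and it repeats the Foo example class so that it runs on its own.

```ruby
# Same example class as above
class Foo
  def do_stuff
    "I'm okay!"
  end
end

a = Foo.new
b = Foo.new

# Override do_stuff on one instance only
a.define_singleton_method(:do_stuff) { "I'm NOT okay!" }

puts a.do_stuff   # overridden instance
puts b.do_stuff   # untouched instance
```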
Where this code comes in really handy is when you need an object to raise a difficult to simulate exception (like a disk full error) on a certain method to test your error handling – simply call override_method and pass in a call to raise and voila! Dynamic partial mock objects on the fly!
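A minimal sketch of that disk-full scenario follows. ReportWriter and save_report are hypothetical names invented for the illustration, and override_method is repeated from above so the snippet runs standalone; Errno::ENOSPC is the standard Ruby exception for "no space left on device".

```ruby
# Helper from the post, repeated so this sketch is self-contained
def override_method(obj, method_name, &block)
  klass = class << obj; self; end
  klass.send(:undef_method, method_name)
  klass.send(:define_method, method_name, &block)
end

# Hypothetical class whose write would normally touch the disk
class ReportWriter
  def write(data)
    data.length   # stand-in for a real disk write
  end
end

# Hypothetical code under test: it should survive a full disk
def save_report(writer, data)
  writer.write(data)
  :ok
rescue Errno::ENOSPC
  :disk_full
end

writer = ReportWriter.new
# Make this one instance behave as if the disk were full
override_method(writer, :write) { |*args| raise Errno::ENOSPC }

puts save_report(writer, "some data")
puts save_report(ReportWriter.new, "some data")
```

Only the overridden instance raises; a fresh ReportWriter still takes the normal path, which makes it easy to test both branches side by side.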
Posted by Adam C. Greenfield (CTO and Hosting Systems Manager) at March 11, 2006 12:16 AM.
This week has been pretty fast and furious at Site5. We had several people out this week due to different situations, so that has slowed down progress on some of our projects. However, we have two or three projects that are already in the testing/deployment phase at this point. This means that quite a bit of the outstanding work falls in my lap, but Todd took one of the large burdens off of my shoulders by agreeing to oversee the deployment of one of our major products. Scott has been making progress in pounding out a few new features for Backstage and I'm wrestling with some display issues with the interface that are driving me nuts.
Todd finally got the remaining hardware we needed to expand our backup system (and add another 8 Terabytes of disk space to it) earlier today and no doubt his weekend will be pretty busy getting all that into production. This will enable us to move forward with the deployment of a much anticipated update (not saying more yet) to one of our projects. Also on the operations side of things, we discussed today rolling out updates to PHP and mod_fastcgi. The PHP upgrade I think is a basic version upgrade, but the mod_fastcgi update will resolve some of the inconsistent issues with FastCGI based Rails applications that have been reported to us. You can expect to hear more from Todd about both those updates as they are scheduled and completed.
Site5 will also be sponsoring another PhillyOnRails meeting this month. Erin announced this meeting to the list last week and I just got the final contracts from the Holiday Inn. As a result of the great turn out we had last month I've increased the amount of seating we will be setting up, however please RSVP so we aren't in the same boat we were last month (with almost twice the number of people we expected). We have talks on Script.aculo.us and web application security lined up this month so it should be another great meeting.
Operations and Tactics - Telling the Difference Between Web 1.0 and Web 2.0
Posted by Todd Mitchell (Chief Operations Officers) at March 08, 2006 06:36 PM.
The guys over at Tucows posted this on their blog today. I thought it was rather humorous:
Posted by Todd Mitchell (Chief Operations Officers) at March 07, 2006 02:32 PM.
I saw a post fly through my RSS aggregator the other day and thought it was interesting. When new technologies are developed there’s often a squabble over the name, the icon, who’s listed on the founders page, etc. So I thought it was clever that someone has jumped ahead of this and created a standard for the RSS/XML orange icon.
You can grab the original artwork from FeedIcons. I should also note that the Mozilla Foundation (creators of FireFox, Mozilla, Thunderbird, etc.) as well as Microsoft have adopted this new icon. So there is very little doubt in my mind that this icon will take off.
Operations and Tactics - Is it even possible to make FireFox quicker?
Posted by Todd Mitchell (Chief Operations Officers) at March 07, 2006 12:33 AM.
If you’re an avid FireFox user, you’re probably well aware of this extension. If you’re new to FireFox or don’t like to dabble with FireFox’s internal workings–this extension is for you.
FasterFox is an interesting beast. The main issue with this extension is that you will probably only use it once. You’ll tweak your FireFox install and then forget about it. But it does save you from having to mess around with about:config.
So give the extension a shot. Try not to go so crazy on the number of simultaneous connections to a server. That tends to hurt hosts if too many people are doing it.
Posted by Todd Mitchell (Chief Operations Officers) at March 06, 2006 04:07 PM.
Jeremy Zawodny posted an interesting technique on sorting email and it has me rethinking my sorting methodologies. I’m essentially in the same boat–looking for a convenient & efficient way to correctly sort email so that everything is taken care of and nothing falls between the cracks. At the moment I receive on average around 700 - 900 emails a day. Of those, I have to pay attention to about 400 and act on fewer than 50.
Right now I use a folder sort option. Anything that is from this person or contains this in the subject, push to this folder. This has worked for a couple of years but the system is slowly breaking down as email intake increases.
Hopefully Jeremy will have a follow-up post on how the new methodology is working. But I started to wonder, what does he use as a calendar & contact manager? What do you use to manage email/contacts/calendar?
Posted by Todd Mitchell (Chief Operations Officers) at March 06, 2006 02:16 PM.
I used to pipe all of my music through Windows Media Player. It just seemed to work well for what I needed and didn’t bog my system down like older versions of iTunes.
Over the weekend I installed the most recent version of iTunes and I’m impressed at the changes Apple has made to make iTunes less of a draw on system resources. Previously iTunes would make my machines drag hardcore to the point where I removed it from my system.
So far, so good. I’ve been using iTunes for a few hours now and I like it. Music sorting/organization just works. And offloading music to CDs is a no brainer. I’m going to use iTunes for the next couple of weeks to see if it keeps pace. I’ll report back more in a couple of weeks.
The Fivefold Path - Table row alternation helper for Rails
Posted by Matt Lightner (CEO and Lead Systems Architect) at March 05, 2006 11:01 PM.
Anyone who has ever had the misfortune of doing large-scale user interface construction will be familiar with table row background color alternation. While alternating row backgrounds arguably increase usability, they almost certainly increase developer headaches. There have been many attempts to simplify this process, including some fiendishly clever Javascript libraries. However at the end of the day, I personally prefer having the class definitions hard-coded into the HTML output.
Fortunately, Rails helpers make the process of table row alternation a non-issue. There are about as many ways to go about solving this problem as there are developers, but I’ve found that a very simplistic solution worked perfectly well. Here’s the code I ultimately ended up sticking with:
# Helper method for table row class alternation
#
# Usage: In your view call <%= alt -%> somewhere within your TR tag.
def alt(class_name = 'alt')
  # Start the counter on first use
  @alternator || reset_alt
  @alternator += 1
  # Odd rows get the class attribute, even rows get nothing
  (@alternator % 2 == 1) ? " class=\"#{class_name}\" " : ''
end
# Call <% reset_alt -%> before the first row of your table to
# ensure that all tables start with the same color row.
def reset_alt
  @alternator = 0
end
Simply define an “alt” class in your CSS and you’re all set!
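To show the helper in context outside a full Rails app, here is a small standalone sketch that renders a table with Ruby's bundled ERB library. FakeView is a hypothetical stand-in for the Rails view context (in a real application the two methods would live in a helper module), and the row data is made up.

```ruby
require 'erb'

# Hypothetical stand-in for a Rails view with the helper mixed in
class FakeView
  def alt(class_name = 'alt')
    @alternator || reset_alt
    @alternator += 1
    (@alternator % 2 == 1) ? " class=\"#{class_name}\" " : ''
  end

  def reset_alt
    @alternator = 0
  end

  # Render a one-column table, alternating the row class
  def render(rows)
    template = <<~'HTML'
      <% reset_alt -%>
      <table>
      <% rows.each do |row| -%>
        <tr<%= alt %>><td><%= row %></td></tr>
      <% end -%>
      </table>
    HTML
    ERB.new(template, trim_mode: '-').result(binding)
  end
end

html = FakeView.new.render(%w[one two three])
puts html
```

With three rows, the first and third pick up the class attribute and the second is left plain, so the CSS rule for "alt" only has to describe the highlighted rows.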
Posted by Todd Mitchell (Chief Operations Officers) at March 03, 2006 10:44 PM.
I dropped the last.fm plugin into my wordpress installation. You can now see the last 10 songs I’ve listened to. Since a lot of people ask about what music I listen to, I figured this was more than appropriate.
Enjoy.
Operations and Tactics - The music that’s pumping through my speakers this week
Posted by Todd Mitchell (Chief Operations Officers) at March 03, 2006 07:08 AM.
Picked up a few new albums this past week. Currently have the following in my playlist:
Dem Franchize Boyz, Gorillaz, Matisyahu, The All-American Rejects & Yellowcard.
Todd
Operations and Tactics - How to score a hotel room in NYC while sitting in traffic
Posted by Todd Mitchell (Chief Operations Officers) at March 03, 2006 07:02 AM.
Kevin Hazard & I decided to stay in NYC for a few days earlier this week. One hitch, we didn’t have a hotel room. We dropped a Site5 team member off in Brooklyn and then headed into Manhattan.
Thinking that the only way we can get a hotel is on the net (for a reasonable price), I pulled my Yukon over. Oddly enough, double parked across the exit to a fire station. We pull out my laptop, pop in the high gain wireless card…and bang…we have access to ~8 networks.
We decide to jump onto a network. I know, I know. Some people think this isn’t cool. My thinking is: if you leave your wifi access point open and someone is connecting to check mail or surf the net for a short period of time, I think it’s fine. I leave my wifi access point open so that people who visit my city can get internet access if need be.
I digress. Kevin suggests that we jump on hotwire.com to see what we can find in the Times Square area. We find a hotel in the area, 4 star, internet access & parking. If you haven’t used hotwire, it’s an interesting service. You pay the price you see, but you only find out the actual hotel name *after* you pay.
That said, we find a decent deal and lock it in. It turns out to be an insane deal. We scored a room at the Westin for $99/night (rack rate is $379/night). So, as you can see, we saved a ton of cash.
Following that night, we used hotwire.com again. We ended up getting a very good deal at the Hyatt just down the street (next to Grand Central). Hotwire has a perfect score with me at this point.
Todd
Posted by Todd Mitchell (Chief Operations Officers) at March 03, 2006 06:48 AM.
I spent last Thursday to Sunday in New Jersey with a ton of Site5 team members. Six of us drove, flew and took the train into New Jersey for a few meetings. Mostly to do a few checks on our equipment/networks in NJ. Also just to take some time away from the computer to hang out with our fellow employees, some of whom we were meeting for the first time!
Great time was had all around. A couple of odd notes about NJ if you’ve never been there before.
1. You can’t make left hand turns. You must use ‘jug handles’ to make left hand turns. You basically have to do a figure eight to make a left hand turn. Amazing waste of time, gas and land.
2. You can’t pump your own gas. No matter what gas station you go to, they’re all full service. I like to pump my own gas, it gets me out of the truck. So this law kind of annoys me.
3. Don’t speed. The state troopers are ruthless.
Todd
Karmic Coding - Simple, resilient interprocess locking In Ruby
Posted by David Felstead (Senior Engineer) at March 03, 2006 05:21 AM.
One of the major hurdles in developing FlashbackPRIME was handling repository locking between processes. Since we basically started from the ground up, unfortunately we lost some of the benefits of having established libraries do this work for us, and as it happened, had to re-invent the wheel in some areas.
What we wanted was a very simple, barebones method of locking an arbitrary resource (at the application level) so that all requests to it could be serialized. The more difficult part was making sure that the locks were available between processes, since FlashbackPRIME consists of several components – the web application and the back end daemons handling sweeping and restoring. So the inter-process requirement pretty much ruled out ruby’s Mutex and its associates, and I didn’t want to have to rely on a daemon or service running, so that eliminates DRb and Rinda. Going back to basics, it seemed that simple filesystem based file locks were a good match. The filesystem allows exclusive locking of files, and it is more or less a portable solution – preferable, since I develop on Mac OS X whereas Site5’s production servers are mostly CentOS Linux.
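The primitive behind this approach is File#flock, which is part of core Ruby. The sketch below is a minimal illustration of how a non-blocking exclusive lock behaves when a second file handle competes for the same file; the temp-file path is made up for the demo, and the behaviour described assumes a local filesystem on Linux or Mac OS X.

```ruby
require 'tmpdir'

# Hypothetical lock file path for the demonstration
path = File.join(Dir.tmpdir, "flock-demo-#{Process.pid}.lock")

first  = File.open(path, "a")
second = File.open(path, "a")

# With LOCK_NB, flock returns 0 on success and false if it would block
got_first  = first.flock(File::LOCK_EX | File::LOCK_NB)
got_second = second.flock(File::LOCK_EX | File::LOCK_NB)

# Closing the first handle releases its lock
first.close
got_retry = second.flock(File::LOCK_EX | File::LOCK_NB)

puts "first: #{got_first.inspect}, second: #{got_second.inspect}, retry: #{got_retry.inspect}"

second.close
File.delete(path)
```

Because a failed non-blocking attempt simply returns false, the caller is free to decide how long to wait before trying again.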
The final requirement is that the locking mechanism needs to be resilient – if a lock collision is detected, the application should continue to attempt to attain the lock several times before raising an exception. Since there is no single central arbiter to handle distributing locks, I decided that a random exponential backoff retry strategy (ala Ethernet) would be sufficient – access to the resource should be rare and more or less randomly distributed, so whilst this isn’t a foolproof method, it has tested very well.
# Try to lock the resource and execute the passed block within the context of the lock
def try_lock(options={})
  # Set default options (:lock_file is accepted as an alias for :lockfile_path)
  lockfile_path = options[:lockfile_path] || options[:lock_file] || 'lockfile.lock'
  retries       = options[:retries] || 10
  retry_period  = options[:retry_period] || 0.5
  # Shared or exclusive lock?
  locking_method = options[:readonly_lock] ? File::LOCK_SH : File::LOCK_EX
  retries.times do |attempt|
    lockfile = File.open(lockfile_path, "a")
    # With LOCK_NB, flock returns false instead of blocking if the lock is held elsewhere
    locked = lockfile.flock(locking_method | File::LOCK_NB)
    if locked then
      begin
        # Record the pid of the lock holder in the lockfile
        lockfile.truncate(0)
        lockfile.puts(Process.pid)
        lockfile.flush
        return yield
      ensure
        # Closing the file releases the lock, whether the block returned or raised
        lockfile.close
      end
    else
      lockfile.close rescue nil
      # Calculate exponential random backoff ala ethernet
      backoff_time = rand * retry_period * (2 ** attempt)
      STDERR.puts("Lock on '#{lockfile_path}' failed (pid:#{Process.pid}) - " +
                  "#{attempt+1}/#{retries} (backing off " +
                  "#{sprintf("%.2f", backoff_time)} seconds)")
      sleep(backoff_time)
    end
  end
  # If we get here, we're out of retries
  raise "Locking Error"
end
It’s not the shortest piece of code, but it’s proven to be very reliable thus far. It’s used in the following way (all parameters are optional by the way, the defaults are in the code above):
try_lock(:lockfile_path => '/var/run/lockfile.pid',
         :readonly_lock => false,
         :retries       => 5,
         :retry_period  => 0.5 ) {
  # ...code accessing shared resource goes in here...
}
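To get a feel for how long the backoff can actually wait, here is a rough sketch that computes the upper bound of the random backoff window for each attempt, using the same formula as try_lock (rand * retry_period * (2 ** attempt), where rand is always below 1). The helper name max_backoff_windows is made up for this illustration and is not part of FlashbackPRIME.

```ruby
# Upper bound (in seconds) of the random backoff window for each attempt,
# mirroring the formula used in try_lock: rand * retry_period * (2 ** attempt).
# max_backoff_windows is a hypothetical helper, for illustration only.
def max_backoff_windows(retries = 10, retry_period = 0.5)
  (0...retries).map { |attempt| retry_period * (2 ** attempt) }
end

# The settings from the usage example: 5 retries, 0.5 second base period
windows = max_backoff_windows(5, 0.5)
puts windows.inspect                          # [0.5, 1.0, 2.0, 4.0, 8.0]
puts windows.inject(0.0) { |sum, w| sum + w } # 15.5 seconds, worst case in total

# The defaults (10 retries) grow much further
defaults = max_backoff_windows
puts defaults.last                            # 256.0 seconds for the final attempt
```

The actual waits are random and on average about half of these bounds, but it is worth keeping in mind when choosing :retries that the window doubles with every attempt.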
This is a nice little way of synchronizing access to shared resources via application level code – it’s relatively portable (Linux, Mac OS X and Windows so far) and most importantly of all, it’s fairly resilient and self-repairing. Even deadlocks aren’t a major issue, as they will time out eventually – not ideal, but better than the alternative. One final caveat though: these lockfiles will not work across NFS mounted drives! That can be done, but I suspect you’ll need to look at doing some cleverer POSIX style locking using fcntl and its brethren.
Posted by Scott Deming (Senior Engineer) at February 27, 2006 03:50 AM.
Every now and then you run into a problem that can be resolved in only one way. A kludge. A kludge is generally what happens when you are forced to work around problems you can’t control. It happens a lot more frequently than it should, and it almost always comes back to bite you in the ass.
A few months ago was such a time. In the paragraphs that follow I’ll present to you a kludge that was perfectly functional but gave me some serious headaches. I am not entirely sure how my associates felt about this particular kludge but it was both discomforting and annoying. It should also have never been an issue, but I’ll elaborate on that further down.
The Stage
Allow me a few more moments and then I’ll get to the point.
In a nutshell, Flashback is a versioning system that automatically sweeps and stores changes for Site5 customers’ web space. The key function is to provide rapid undo capabilities for simple or massive changes, allowing a customer to restore individual files or even their entire web space to any point in time, with just a few clicks.
The Problem
Flashback originally used Subversion as the underbelly for the version control system. Subversion has a nasty habit of storing its meta-data in .svn directories that happen to be right there within the data being versioned. For anyone who has tarred up their work area while using Subversion (or CVS for that matter) this is evident, and it generally isn’t annoying; in our case, though, it was a major roadblock. You see, we can’t be polluting users’ home directories with a bunch of Subversion meta-data without their consent. Even with their consent it would be inconsiderate at best, and destructive at worst. We just couldn’t allow this behavior to persist. Unfortunately Subversion doesn’t provide an alternative location for storing this meta-data, and so our adventure begins with finding a work-around.
Phase 1 – Research
Many ideas were tossed around, from using a union file system to staging the directories prior to versioning them, to patching Subversion directly. I am pretty sure each and every member of the Site5 Engineering Team had an idea or two, but I rather unfortunately forget who had what ideas so I am unable to provide credit where credit is due. At one point I think we even discussed using something other than Subversion!
Phase 2 – Trial and Error
I worked diligently in an attempt to get a working UnionFS, hacked just right, so we could hide the .svn directories in a completely different place transparent to the customer. I am pretty sure David Felstead started working on the idea of staging the data, and even had a good deal of success with that.
Phase 3 – Aha!
With almost any tricky problem comes a tricky solution. This was no exception. I had a bright idea and I got to work immediately. I didn’t even bother to tell anyone about it until it was about 90% done; I knew it would work. All good kludges work, no matter how ugly they are. If it doesn’t work it isn’t a kludge, it’s a catastrophe.
The Kludge
So you want to know what it was do ya? It was almost too simple. You see, Linux (and UNIX in general) has this great facility in ld.so(8) that allows you to pre-load a set of dynamic libraries before the executable loads its own shared libs. This is invoked by setting the LD_PRELOAD environment variable prior to execution. So with a bit of strace(1) magic I set out to write my own shared library whose entire purpose was to intercept all file and directory related calls made during an invocation of Subversion (either by library call-out or svn executable) and rewrite the file paths, relocating every instance that contained “/.svn/” to a different directory tree.
Here is a small sample of how the code looked for the fopen(3) intercept function:
001 /**
002  * intercept fopen
003  */
004 FILE *fopen(const char *path, const char *mode)
005 {
006     static FILE *(*orig_func)(const char *, const char *);
007     if (!orig_func) {
008         orig_func = (FILE *(*)(const char *, const char *)) dlsym(RTLD_NEXT, "fopen");
009     }
010
011     IF_DOT_SVN(path) {
012         char *new_name = adjusted_filename(path);
013         FILE *ret = orig_func(new_name, mode);
014         free(new_name);
015         return ret;
016     }
017
018     return orig_func(path, mode);
019 }
This fopen function is loaded before the application loads its own libraries (including the standard C library). Since this version of fopen is loaded first, it trumps any that is loaded later. The call to dlsym on line 8 is how we find the original from the standard C library. So now, any time the application makes an fopen function call, the path is rewritten to point to an internal directory tree, outside of the user’s web space, prior to calling the true fopen function. In all there were 24 different functions that had to be intercepted in order for Subversion to be completely covered. Lucky for us, Subversion delegates most of these tasks to the Apache Portable Runtime, which is pretty easy to mine. Using strace (or truss) is nice to detect system calls, but you still have to figure out where those calls originate. GDB can be extremely useful in this case; I highly recommend it.
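The sample above leans on two helpers that aren’t shown, IF_DOT_SVN and adjusted_filename. As a rough illustration of the path rewriting idea only, here is a sketch in Ruby rather than the original C. The SHADOW_ROOT location, the method names and the exact matching rule (any path component named .svn) are all assumptions made for this example; the real library may well have behaved differently.

```ruby
# Hypothetical root of the relocated metadata tree; the real location
# used by the kludge is not given in the post.
SHADOW_ROOT = "/var/lib/flashback/svn-meta"

# True when any component of the path is a ".svn" directory.
# (Assumed matching rule, standing in for the IF_DOT_SVN macro.)
def dot_svn?(path)
  path.split("/").include?(".svn")
end

# Map a path inside a .svn directory to the same relative location under
# the shadow root; every other path is returned untouched.
def adjusted_filename(path, shadow_root = SHADOW_ROOT)
  return path unless dot_svn?(path)
  File.join(shadow_root, path)
end

puts adjusted_filename("/home/bob/site/.svn/entries")
# /var/lib/flashback/svn-meta/home/bob/site/.svn/entries
puts adjusted_filename("/home/bob/site/index.html")
# /home/bob/site/index.html
```

Every intercepted call would run its path argument through a function like this before handing off to the real implementation, which is why missing even one of the intercepts could leave metadata in the wrong place.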
Post Mortem
Ultimately this strategy proved to work very well. We were able to separate the Subversion meta-data from the users’ data. This went into production, and there were no problems caused by pre-loading our custom “fix_dotsvn.so”. It was easier than installing a UnionFS. It was faster than staging all of the data. But it did come with its own set of baggage. We had to be absolutely certain that the kludge was in place or we could easily corrupt a repository beyond repair. It’s a very easy mistake to make and not one that is easily fixed. This is a high price to pay.
This is why this is a Load-Bearing Kludge.
Load-Bearing Kludge
There is no real definition that I am aware of. The term actually comes from Eugene Szedenits, Jr., an individual I regard quite highly, with whom I worked at Clareos prior to joining the Site5 Engineering Team. I will attempt to provide the definition as I see it, and with Gene's blessing:
Any kludge you cannot remove without causing the entire application to die a horrible death.
That about sums it up.
What now?
Our Load-Bearing Kludge no longer exists. Thanks to David Felstead’s incredible work we were able to supplant Subversion with a versioning system that he wrote in pure Ruby. For our application it blows the doors off of Subversion in both speed and reliability. You have to read this entry in his blog to get the full picture.
At the beginning of this article I mentioned that I’d explain why I don’t think we should have ever been put in this place to begin with. The explanation is simple: Subversion should not be polluting your source code with meta-data directories. There are better ways. It isn’t likely going to change any time soon, and I’m not likely to start using any of the alternative systems out there because in reality Subversion is the best I’ve seen. I just don’t like it when applications dirty up my source trees.
Karmic Coding - On performance: Sometimes the wheel just ain't up to scratch
Posted by David Felstead (Senior Engineer) at February 23, 2006 04:01 AM.
It’s one of the cornerstone concepts of programming these days – Don’t Re-Invent the Wheel. These days there are so many third party libraries, utilities and frameworks available that more often than not you would be crazy to write the difficult stuff yourself. Occasionally, however, you find yourself outside the “more often”, and run into one of those “not” situations, one where just throwing more hardware at the problem won’t make it go away. Just recently, the Site5 Engineering Team (of which I am a member) ran into one of those “not” situations. The product? Flashback.
The problem
Flashback is a really nice piece of software. It’s a file explorer for your webspace with a difference – it not only allows you to see your website now, but also as it was a day ago. Or a week ago. Or a month. You get the idea. Any changes you make in your webspace are picked up and versioned by the Flashback engine and are recorded for posterity. You want to revert back to your old layout? No problem. Want to retrieve those images you accidentally deleted? They’re there.
The core of the original Flashback used to be the source control management software Subversion (or SVN), which is a great tool to add to any developer’s repertoire – and joy of joys, it even comes with an external API and, more importantly to us, bindings for Ruby. Now at first glance, one would assume that SVN would be pretty fast and performant – after all, it’s written in C and has a thriving open-source community contributing to its development. Unfortunately, that assumption (the word should have raised alarm bells) came back to bite us. Whilst being a great source control system, it turns out that when it comes to performance and efficiency, SVN is a real dog. And you know what? That’s fine. It is the “more often” than the “not” that you don’t care about performance in managing your source code, and for what it’s designed for, SVN ain’t so bad. Anywhere outside its comfort zone though… BZZZZZT! – no good.
YOU (yes you) can always do it better
Off on a tangent for a second – back when I was at university, probably in my second year of a computer science degree, we were assigned the typical task of implementing a Quicksort algorithm in C, and benchmarking it against various other sorting algorithms. Of course, the cynics and the realists in the group wondered what the point of this was. Any programmer worth their salt knows that the C standard library’s qsort function implements the Quicksort – Why Re-invent the Wheel? The surprise came when the class implemented the algorithm themselves and benchmarked it against the original qsort function. The result? Around 80% of the class had implemented a faster version of the algorithm, and these were second year uni students! A similar revelation came when a friend of mine, studying for his PhD, re-implemented some of the functions in string.h (rather than relying on the standard library) in a very CPU intensive experimental search engine. The result? It ran about 40% faster.
The moral? When it comes to performance, you can always do it better. Why? Because you know the problem you’re trying to solve.
FlashbackPRIME – a faster, more efficient wheel
So it turns out that SVN wasn’t up to scratch, and not viable for long term deployment – it’s just too slow and too much of a resource hog. So what to do? The first step was taking a few benchmarks. As a test, I implemented a few algorithms (change detection and repository updating) in pure Ruby and measured them against the same functions in SVN. The performance results were amazing – the pure Ruby solution outperformed the C based SVN (with Ruby bindings) by several orders of magnitude – it was literally hundreds (sometimes thousands) of times faster. With this data in hand, the Site5 Management Team gave me the go-ahead to re-implement the guts of Flashback, and with our lovely modular design of the first system, slotting it in was a breeze.
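To make the idea of change detection a little more concrete, here is a minimal sketch of one way a sweep could work: take a snapshot of file sizes and modification times, then compare it with the previous snapshot. This is purely an assumption for illustration. The post doesn’t describe FlashbackPRIME’s actual algorithm, and the names snapshot and diff_snapshots are invented here.

```ruby
require "find"

# Build a snapshot of a directory tree: path => [size, mtime] for each
# regular file. (Hypothetical helper, not FlashbackPRIME's real code.)
def snapshot(root)
  result = {}
  Find.find(root) do |path|
    next unless File.file?(path)
    stat = File.stat(path)
    result[path] = [stat.size, stat.mtime.to_i]
  end
  result
end

# Compare two snapshots and report which paths were added, deleted or
# modified between them.
def diff_snapshots(old_snap, new_snap)
  common = old_snap.keys & new_snap.keys
  {
    :added    => (new_snap.keys - old_snap.keys).sort,
    :deleted  => (old_snap.keys - new_snap.keys).sort,
    :modified => common.select { |path| old_snap[path] != new_snap[path] }.sort
  }
end

# Example with hand-written snapshots instead of a real directory
before = { "index.html" => [120, 1000], "logo.png" => [900, 1000] }
after  = { "index.html" => [150, 2000], "about.html" => [80, 2000] }
p diff_snapshots(before, after)
```

A sweep built this way only needs one stat call per file, which hints at why knowing the exact problem you’re solving leaves a lot of room to beat a general purpose tool.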
The final feature set of FlashbackPRIME is comparable to SVN’s:
- Both systems use a filesystem based repository
- They both have atomic, transactional commits with rollback capabilities
- Both have storage engines based on delta compression
- Both can store arbitrary metadata on items
There are a lot of things that SVN does that FlashbackPRIME does not, but the guts of the functionality is the same… and the results? Incredible. Here are some rough timings:
| Task | Flashback/SVN | FlashbackPRIME |
| Populating large repository (several gigabytes, thousands of files) | about 2 hours | about 126 seconds |
| Sweeping same repository for changes | about 38 minutes | about 81 seconds |
| Sweeping smaller repositories | about 15 seconds | less than 1 second! |
Very unscientific figures, but you get the gist.
Sometimes it comes to a point where the wheel just won’t cut it any more, and these are the times that YOU as a developer need to take control and say “You know what? I can do better than that.”
Posted by Matt Lightner (CEO and Lead Systems Architect) at February 22, 2006 06:49 PM.
Over here in Engineering, we like to do fun things every once in a while… (and even less often than that we’ll do some real work)
After reading a comment left on Todd’s blog regarding Last.fm, I got the idea to use their “recently played songs” RSS feed to assemble an aggregated recently played songs list for the Engineering Team.
The recently played songs aggregator uses the same architecture (based largely on the Rails Planet crawler.rb engine) as our blog feed aggregator: every minute it polls each engineer’s Last.fm RSS feed and records any new entries in the engineering website’s main MySQL database.
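The “records any new entries” part is the only slightly fiddly bit, since each poll returns tracks that were already seen a minute earlier. Here is a small sketch of that de-duplication step, assuming each feed item carries a unique guid. The TrackLog class and its in-memory storage are stand-ins invented for this example; the real site stores entries in a database through Rails models.

```ruby
require "set"

# In-memory stand-in for the tracks table (hypothetical, for illustration).
class TrackLog
  attr_reader :entries

  def initialize
    @seen = Set.new
    @entries = []
  end

  # Record only the items whose guid has not been seen before and
  # return the newly recorded items.
  def record(engineer, items)
    # Set#add? returns nil when the guid is already present
    fresh = items.select { |item| @seen.add?(item[:guid]) }
    fresh.each { |item| @entries << item.merge(:engineer => engineer) }
    fresh
  end
end

log = TrackLog.new
log.record("matt", [{ :guid => "a", :title => "Song A" },
                    { :guid => "b", :title => "Song B" }])
# A minute later the feed still lists Song B, plus one new track
new_items = log.record("matt", [{ :guid => "b", :title => "Song B" },
                                { :guid => "c", :title => "Song C" }])
puts new_items.map { |item| item[:title] }.inspect  # ["Song C"]
```

In the real aggregator the same check would more likely be a uniqueness lookup against the database, so that it survives restarts of the polling engine.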
From beginning to end, it took me approximately 30 minutes to write the model, controller, view and polling engine (isn’t Rails great?). If you’d like to do something similar, I’ve posted code for the modifications made to the Rails Planet source to help get you started.
Check it out and let me know what you think!
http://engineering.site5.com/tracks
Operations and Tactics - engineering.site5.com now web 2.0 compatible ;)
Posted by Todd Mitchell (Chief Operations Officer) at February 22, 2006 05:20 AM.
Our amazing engineering team, between complex projects, spent some time revamping the Site5 Engineering site. There have been a ton of changes, as outlined on Matt Lightner’s blog. The most obvious change is the look: it now contains 1/2 the calories AND is web 2.0 compatible.
If you haven’t already done so, add the Site5 Engineering RSS feed to your aggregator.
Todd
Posted by Matt Lightner (CEO and Lead Systems Architect) at February 19, 2006 08:18 PM.
As you may have guessed, the new Engineering Team website is up and running. Woo hoo!
This version begins to fulfill the original objectives we had for the site, and will become more and more valuable a resource as we add new functionality and content. Still, this release provides numerous style enhancements and new functionality:
- Cleaner, more Web 2.0-ish design.
- Pure-MySQL database backend for increased speed (the old site pulled from SQLite and MySQL)
- Developer listing with individual developer profiles, statistics and maybe even a picture. Here I am, for example.
- Project listing and project detail pages, showing all development activity, a synopsis of the project and stats by developer.
- An improved activity timeline.
- Integration with our Engineering Team IRC channel via a custom rbot plugin. Allows for viewing developer online status (idle time is almost working, but I ran into an odd DRb issue at the last minute).
- Replaced Python-based Planet feed aggregator with a hacked-up version of the Rails-based Rails Planet.
There will be lots more to come, including a public developer Pastebin, where you can get handy snippets of code from Site5 Engineers, a plugin repository, and maybe even an open-sourced application or two.
(Note: All of the links in this post go to engineering2.site5.com. The new site is currently available at that domain, and should be available at the original engineering.site5.com domain once any cached DNS records clear)
Posted by Kevin Hazard (General Administrator) at February 15, 2006 07:46 PM.
Just to get this post on the front page of the Engineering site (sneaky ulterior motive), I wanted to let everyone know that you can visit Site5’s official blog at weblog.site5.com to stay up to date on future plans, current events, and random things that we find entertaining… usually with regard to technology and web hosting, but don’t hold us to that. If you have a few minutes, check out the first few articles, and you will get a glimpse of some of the projects you are seeing (or are not seeing) from the engineering team here.
Posted by Matt Lightner (CEO and Lead Systems Architect) at February 04, 2006 07:01 PM.
While the current Site5 Engineering Team website is nothing to scoff at, it is nowhere near the size and scope that we had originally intended. Our original plan for the site included a lot more dynamic content, as well as in-depth information on each of the projects we’re working on, not to mention the Site5 Engineers themselves.
This weekend and early next week, Rod and I will be preparing the second version of the E5 website. This one will encompass all of our original vision, along with a few pretty sweet surprises. E5 fans (and we know there are a lot of you out there) should check back frequently, as we’re not going to give you any more warning before the new site drops. Yeah, it’s not nearly as cool as the next iPod technology, but it’s still a little bit suspenseful, no?
The Fivefold Path - Watch Us Engineer (Almost Live!)
Posted by Matt Lightner (CEO and Lead Systems Architect) at February 02, 2006 03:41 PM.
Second only to watching paint dry, you might find it interesting to watch our engineering team’s code commit logs. For people unfamiliar with source code management systems (such as CVS or Subversion), a “commit” is how a developer applies changes made to his or her local copy of a project to the central repository.
Committing is done intermittently throughout the development process (based on our engineering policies, generally between 5 and 15 times per developer per day—when notable checkpoints are reached). Each commit entry on our development activity page includes a time, the name of the developer checking code in, the name of the Site5 project affected (e.g. “Flashback”, “FlashbackPRIME” [Flashback’s up-and-coming replacement from Senior Engineer David Felstead], “Synco” [our new billing and CRM backend], etc.) and a message from the developer about the changes contained within. A word of caution though… the commit messages are completely uncensored, so there’s no telling what kind of crass (or upsettingly nerdtacular) content you’ll find there!
Got your lawn chair and tanning lotion ready to go? Head on over to the Site5 Engineering Team development activity page! (or “SETDAP” as we don’t refer to it)
The Fivefold Path - Replacing my.Site5 with Synco and Rails
Posted by Matt Lightner (CEO and Lead Systems Architect) at January 29, 2006 12:23 AM.
From the beginning, part of our huge master plan has been for Synco to replace my.Site5—our client-side account management system. My.Site5 is not a control panel per se, but rather a “meta-panel” that ties all of a customer’s various control panels together into a single system with only one login. This means that our customers with MultiSite accounts can access their MultiAdmin interface (which is what is used to create and manage MultiSite websites), and the SiteAdmin control panel for each MultiSite account without having to remember and re-enter a separate login and password for each. As far as I am aware, no other company with cPane