Server Help

Lincoln

NAXJA Member #321
NAXJA Member
Location
ID
Ever had one of those weeks? The server is two hours away and I don't really want to make the trip down and still not figure out the problem.

Stats:
2X Athlon 2000 MP's
2X 36GB SCSI (mirroring through Adaptec card)
512 MB Ram
Sony DDS-3 backup drive running on its own card
OS is NT 4.0 Server SP 6.0a
Using NT Backup scheduling using the AT command

I'm having the dang machine lock up during backup. No mouse movement, nothing. A hard power off is the only way to bring it back up. It's not writing to any of the logs either.

The last entry in the system log shows it starting the backup of the D: drive. I ran the backup last night from here and watched it make it through the D drive, but it hung on the E. Those are the only two volumes.

Ideas? I'll try and check back during the day, but it just happens that I'm trying to implement a project I've been working on for six months this week. We are up to 25 code transports (changes to existing programs) and about 10,000 lines of new code. To top it off I told customer service yesterday they had to manually update 2400 orders in the system, that got me a good butt chewing. Not a good week. :rolleyes:

Thanks,
Lincoln
 
Well, one of my rules with backup is that if all else fails change the SCSI ID. I know the feeling though, I've got a server that's about an hour and a half away with what appears to be a stuck tape in it...been dreading driving down there to retrieve it...

Sequoia
 
Dumb question but if the drives are mirrors why are you backing up both of them ???

512 is kind of light in my book but the.
Total of 72gb, compression ?? is it perhaps waiting for a second tape ??
 
I've been playing around with it again tonight.

It appears I might be losing something on the server and maybe the backup processing is just setting it off. ???

Rich, there are three partitions on each of the drives and I'm only backing up two of them. 512 seems to be enough. It's only using about 100 to boot with all the services running and I've never seen max usage go over about 180. We're only backing up about 4.2 GB.

GSequoia, what might the SCSI Id have to do with it? I forgot to mention it's not a new system, it's been up about six months.

I'm in NetMeeting on it right now so I can see the screen. I can get a response as far as the title bars going to foreground or background, but minimizing, the start bar, and ?? get no response. It also appears the clock has stopped running. It made it past the backing up point this time and stopped about 1/3 through the verify process.

The only thing that has been consistent is that the backup runs about an hour and then goes to hell. I was watching the services about 10 minutes before and didn't find any services hogging memory.

:huh:

Thanks,
Lincoln
 
I am sure ya covered it... but rule out things like Temp. etc... I know nothing about NT. Only linux, sorry.
 
Not to be too critical BUT, I fix a lot of NT systems where the installer decided to partition the drive, like taking a 10gb drive, giving the OS 3gb and leaving 7gb on the other partition, then applies some service packs and can't even defrag it after that because free space needs to be at least 15%. It has been defragged, yes ?? It is just much easier to sell them an additional drive and keep the whole thing dedicated to the OS. How full is that system or C drive? If I'm not mistaken the native backup does not stream directly to tape but builds an image, writes to tape and then continues. I do a lot of Veritas NetBackup DataCenter installs and configs using fiber HBAs thru Brocade switches which generally have EMC NAS or SANs on them. In the case of NBDC a full image is created on the master and then written to tape; run out of disk on the master and all you get are failures.
As far as the SCSI IDs go, the SCSI HBA can have different effects; it depends on the MB, BIOS, HBA drivers, etc. Promise HBAs seem to cause the most problems. There can also be an issue with the order the cards are put in the MB. If it's PCI I put the video card first, then the 1st HBA, the 2nd HBA and finally the NIC. Is the MB a REAL server MB or a ws that's been 'promoted' :D Backup can also have problems with open files, exchange files.
Is the problem duplicatable on both full and incremental backups ??
If it has been running and all of a sudden started doing this I'd swap out the hba the tape is on. How old are the tapes, IBM tells us the helical scan tapes which are used in 4mm and 8mm drives are good for about 10 full uses before oxide wear causes errors. Does the onsite person, if any, clean the drives according to any kind of schedule ?? Drive failure or errors should show up in the logs though.
 
There is loads of free space on the drives: 90% free on the OS partition and 75% free on the other. The tapes are all new also.

I think Mr. Fixer might have hit it. I was running all of the monitoring software I have on it last night. After running something that was processor intensive, it appeared the temp on the CPU was going to spike and then it locked. The processor usage never really goes above 5 or 10% unless you're doing backups or moving lots of data.

I tried to install some better software, but it locked (same as above) each time. I'm really thinking the backup locks are a result, not the problem. If they can make it through the day I'm going to take it down to one CPU (remove the suspect that is spiking) and swap the memory. I had a friend helping last night and that was his best guess also.

My reasoning is that every time I've had a peripheral problem like video, NIC, HD, or controller card, the OS still tried to write a memory dump. It doesn't make it that far now.

Oh well, I'm going to try and run down some memory. I can't believe Crucial headquarters and plant are within a 15 minute drive and the only way I can get memory from them is Fed Ex. Where's JKTXJ, he used to work out there. I'll just have him grab some off the line. :)

If I could get caught up in my day job, I'm ready to build a Linux box and start moving things around. And Rich, your comment about an on-site person was just uncalled for. Some days I feel lucky to find someone that knows what the power switch looks like.

Also, :shhh: , I'm really sick today and didn't make it to work.

Thanks,
Lincoln
 
Count your blessings, I've been running into direct hire MCSEs with no experience, common sense or even an inkling of how to fix a problem. As soon as something comes up, off to the knowledge base they go and treat it like a bible. Good luck. Why not up it to 2000 Server? You will have a lot fewer problems.
 
We've been talking about Win2K, but haven't really had any problems with the OS. We want to change backup software first.

Crucial is letting me pick the memory up at the plant. I'm waiting on my order confirmation right now.

I had problems clocking the bus up to 266 MHz early on, so I dropped it to 200 and left it at that. They think I got a bad chip and it finally started to crap out completely. My dad was watching the temp on the #1 CPU this morning and told me it was bouncing around a lot. He said the graphs for the CPU usage and heat looked almost identical. I think it has to be that.
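For anyone who wants to put a number on "the graphs looked almost identical", here is a minimal Python sketch that computes the Pearson correlation between two logged series. The sample readings are made up for illustration and are not from this server, and `pearson` is just a name chosen for this sketch. Keep in mind that temperature tracking load is what any healthy CPU does, so a high value only confirms the two move together; it does not by itself prove the chip is bad.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length, at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples taken at the same moments (illustration only).
cpu_usage_percent = [5, 10, 60, 95, 20]
cpu_temp_degrees = [80, 81, 88, 95, 83]

if __name__ == "__main__":
    r = pearson(cpu_usage_percent, cpu_temp_degrees)
    print(f"usage/temperature correlation: {r:.2f}")
```

A value near 1 means the temperature rises and falls with the load. Comparing the same figure for each CPU under the same workload would say more than either number alone.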

Thanks,
Lincoln
 
Hmmm, this box that I'm posting from is a tekram with Dual 550's slot1's and both processors are the same stepping, in it, 2 256's for 512Meg and mandrake 9, it runs two instances of seti, burns cd's and lets me browse, telnet, ftp. Now the memory sticks are samsung mainly because I have had the best luck with it and I always specify from my suppliers that I want matched pairs or foursomes when ordering.
 
Just an idle thought.....

Check and make sure that *ALL* of the fans are running in the case.
 
I am NOT an IT or computer guy, but, I manage a room that all the IT runs in. I keep it at 70, if it hits 75 I call out the troops, if it hits 80 I get calls from IT "my so-and-such is not working right...is there a problem in the room?"
Remember that 80 in the room can be 100+ at the case EXHAUST.

Rev
 
Yea, we had an intermittent T1 problem. Verizon troubleshot it for several weeks; turned out the cleaning guy was unplugging the fiber channel interface at the demarc and plugging in his vacuum cleaner on to one of my UPS ckts. He was in and out in less than 5 minutes, pure luck that I was there real late one nite and observed it. Verizon came in the following week and replaced the outlets they used with a 3 prong twist lock. Cleaning guy bought an adapter, didn't speak English, so finally I just ran some conduit and another circuit off the main and gave him an outlet with a picture of a vacuum cleaner taped above the box. I love my job, always something new going on...
 
Well, my post last night got wiped and I can't remember what I was going to tell you.

The server should be here this afternoon. If that doesn't work I'm going to take the old server and make it a backup domain controller and then transfer everything back to it until I can figure out the broken one.

Thanks,
Lincoln
 
Well it's running again. That monitoring software is crap though.

It said CPU 1 temp was spiking. After getting it to lock and checking the BIOS, CPU 2 was about 10 degrees hotter than 1. At 95 degrees that was still well below the lowest alert setting.

I jerked the second CPU anyway and bingo, no more crashes. I ran back to back backups all night last night, plus anything I could find that would use the CPU a lot. Everything fine.

Also, ever since I built this machine I had to underclock it. I always assumed it was the memory or MB and never tried to figure it out. I clocked it back up and it's running fine. CPU temp is still staying in the mid 80's with a 76 degree room temp.

I figured I would call AMD tomorrow and see if I can't get it replaced. It's an OEM processor, but I figured it was worth a try.

Thanks for all the help,
Lincoln
 
Just curious, I stay away from amd, do amd processors have stepping similar to intel ??
You also mention that it's an OEM processor, do they have the same type warranty as intel, OEM=DOA vs retail=3 years ??
 
I've been using amd for about 8 years and this is the first problem I've had. Stepping?

Yes, the warranty is a little better than intel's. 90 day OEM and 3 years retail.

I didn't get a chance to call them today. I've known a few people that forgot to put the heat sink on and fried the processor. AMD replaced them.

Thanks,
Lincoln
 
Stepping is based on the dies used to manufacture the processor. The dies wear out and need to be replaced, some problems are fixed, and updates to the processor are done with the next set of dies and the next million or twenty are grown. Generally you need to run multiprocessors with the same stepping, at least with Intel. Sometimes you can get away with different steppings, but not often and surely not on a machine that has to run all the time. If you have a Linux box up and running you can get the stepping out of the hardware application that lists all the hardware; Win2K also has this capability. That's one of the reasons when I build a customer's multi processor system I generally try to buy a batch of processors at the same time and keep 3 spares for every 17 or so that I build. Trying to locate a replacement for a 2 or 3 year old processor can be a real PIA and cost more from the speculators that stockpile them.
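On an x86 Linux box the stepping also shows up as a `stepping` line per processor in `/proc/cpuinfo`, so a quick check for mismatched CPUs can be scripted. The following is a minimal sketch, not a definitive tool: the function names and the `SAMPLE` text are made up for illustration, and it assumes the usual `key : value` layout of that file.

```python
def steppings_by_cpu(cpuinfo_text):
    """Map each 'processor' entry to its 'stepping' value (both kept as strings)."""
    result = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current = value
        elif key == "stepping" and current is not None:
            result[current] = value
    return result


def steppings_match(cpuinfo_text):
    """True when every listed processor reports the same stepping."""
    return len(set(steppings_by_cpu(cpuinfo_text).values())) <= 1


# Made-up excerpt in the /proc/cpuinfo layout (illustration only).
SAMPLE = """processor : 0
model name : Example CPU
stepping : 2

processor : 1
model name : Example CPU
stepping : 3
"""

if __name__ == "__main__":
    print(steppings_by_cpu(SAMPLE))
    print("match" if steppings_match(SAMPLE) else "mismatch")
```

To check a real machine, read the file and pass its contents in, for example `steppings_match(open("/proc/cpuinfo").read())`.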
 
Not a clue on the stepping. When I get the chance to call them I'll ask. If they do, maybe I'll just opt out and they can run on a single for a while. Or buy a couple of faster ones. :)

Since I have got them running again I've been trying to get the pickup done. I got my camper two months ago and still haven't been able to get it out. It's time to play.

Thanks,
Lincoln
 