We have moved permanently! Join us @ http://forum.flexraid.com
Emergency! Validate reports data corruption, further tests do not confirm this. What to do?
gorman


Joined: 03/11/2008 09:57:22
Messages: 166
Offline

I always do rsynch before running validate. Usually I run quick validate.

Yesterday I decided to do a more thorough validation and... here is an excerpt from my log:

[2009-01-17 13:18:27,562] INFO : Total process size = 2479798683
[2009-01-19 01:36:43,843] INFO : Total process size = 18893623942
[2009-01-19 12:21:34,593] INFO : Total process size = 21995238206
[2009-01-20 02:10:06,750] INFO : Total process size = 19185784529
[2009-01-28 17:54:45,437] INFO : Total process size = 139918986994
[2009-01-29 11:39:18,468] ERROR: Corrupted: G:\mystuff.mkv
[2009-01-29 11:39:18,734] ERROR: Corrupted: G:\System Volume Information\_restore{D3A7F6F1-7F50-4C8D-8EC9-654579F44EF2}\RP99\change.log.1
[2009-01-29 12:00:27,468] ERROR: Corrupted: G:\mystuff.mkv
[2009-01-29 12:58:55,406] ERROR: Corrupted: M:\System Volume Information\_restore{D3A7F6F1-7F50-4C8D-8EC9-654579F44EF2}\RP98\change.log.4
[2009-01-29 13:02:20,062] ERROR: Corrupted: M:\mystuff.mkv
[2009-01-29 13:03:27,609] ERROR: Corrupted: M:\mystuff.mkv
[2009-01-29 13:20:38,265] ERROR: Corrupted: M:\mystuff.mkv
[2009-01-29 13:44:29,062] ERROR: Corrupted: M:\mystuff.mkv
[2009-01-29 13:44:34,515] ERROR: Corrupted: E:\mystuff.mkv
[2009-01-29 14:06:58,203] ERROR: Corrupted: K:\mystuff.rar
[2009-01-29 14:14:53,531] ERROR: Corrupted: K:\mystuff.flac
[2009-01-29 14:32:16,000] ERROR: Corrupted: M:\mystuff.mkv
[2009-01-29 14:36:38,078] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 14:39:25,859] ERROR: Corrupted: E:\mystuff.mkv
[2009-01-29 15:24:55,281] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 15:29:22,593] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 15:31:54,171] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 15:35:16,890] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 15:37:35,156] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 15:40:53,718] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 15:42:52,015] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 15:50:22,906] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 16:00:04,421] ERROR: Corrupted: E:\mystuff.mkv
[2009-01-29 16:12:33,875] ERROR: Corrupted: K:\mystuff.cbr
[2009-01-29 17:55:15,671] ERROR: Corrupted: K:\mystuff.flac
[2009-01-29 17:58:31,859] ERROR: Corrupted: K:\mystuff.xls


Now... at the beginning are all the rsynch processes run in the past few days. Then there's the corruption reports but... the video stuff plays with no problem. The RAR file (which is kinda big, over 400MB with multiple files inside) tests correctly (so no corruption)... the comics stuff (cbr) displays correctly. The XLS opens and works perfectly.
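When the same paths show up again and again, as in the excerpt above, it can help to count the reports per drive and per file before deciding what to do. The following is a minimal sketch that assumes the log format shown in this post; the function name `summarize_corrupted` is hypothetical and not part of FlexRAID.

```python
import re
from collections import Counter

# Matches lines such as:
# [2009-01-29 11:39:18,468] ERROR: Corrupted: G:\mystuff.mkv
CORRUPTED = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+ERROR\s*:\s+Corrupted:\s+(?P<path>.+?)\s*$"
)

def summarize_corrupted(lines):
    """Count 'Corrupted' log entries per drive letter and per full path."""
    per_drive = Counter()
    per_path = Counter()
    for line in lines:
        match = CORRUPTED.match(line)
        if not match:
            continue  # INFO lines and anything else are ignored
        path = match.group("path")
        per_path[path] += 1
        per_drive[path[:2].upper()] += 1  # e.g. "G:" (assumes drive-letter paths)
    return per_drive, per_path
```

Feeding it the lines of the log file (for example `summarize_corrupted(open(log_path))`, where `log_path` is wherever your log lives) shows at a glance whether the reports cluster on one drive or one file.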

The corruption reports seem clearly wrong. I could understand movies having only a few frames corrupted and me not noticing them but Excel and RAR are picky when it comes to damaged stuff.

What could have happened? What is the right course of action in cases like these? Should I rebuild the whole array from scratch (I sure hope not)?

I had forgotten to turn off System Restore on these drives, but those are hidden files and, according to the documentation, should be irrelevant.

Also, supposing one of the files really were corrupted (it is not, as far as I can tell), does one have to rebuild the whole DRU, or do you just delete what's reported as corrupted and rebuild the DRU? And what about corruption spread across multiple DRUs (as in this case, even if the corruption is not real)?

Thanks for your answers, I'm kind of uneasy with this sort of errors creeping up. I use this solution to feel safe, not the opposite.

This message was edited 2 times. Last update was at 30/01/2009 10:46:45

Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

Are you running an anti-virus?

The only time I have seen false corruption reports has been because of AV interference.
I would configure your AV to ignore FlexRAID activities.

I would shut down the AV and run a verify on the RAID.

In the case of a real corruption, I would move the offending files somewhere else and rsynch (no need to re-create).

Server (VMware ESXi): dual Quad 8356@2.4Ghz | ASUS KFN5-D SLI | 16GB (4x 4GB) DDR2 667Mhz ECC REG w/Parity [Chipkill] | Radeon X300 | Intel 160GB SSD (VM datastore) | 6+ TB storage
File Server VM (running FlexRAID): 512MB RAM | 2 vCPUs | 6TB storage | Parity on 2TB NAS
gorman


Joined: 03/11/2008 09:57:22
Messages: 166
Offline

Thanks for the answer. I am running AntiVir, yes. I don't know if I can configure to ignore specifically FlexRAID activities. I'll have a look.

Again, thanks for the quick answer, I really appreciate it!

But in case files are corrupted, how do you recover them? Delete corrupted files and restore DRUs? Will FlexRAID rebuild only the missing files?

Also, let's say I run rsynch today. Tonight I add some files and FlexRAID has not rsynched yet. One DRU fails. Will I lose only the new files I have not synched or the whole array would be compromised? These two points are really unclear in the documentation.

This message was edited 1 time. Last update was at 30/01/2009 11:10:50

Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

gorman wrote:Thanks for the answer. I am running AntiVir, yes. I don't know if I can configure to ignore specifically FlexRAID activities. I'll have a look.

Again, thanks for the quick answer, I really appreciate it!

You are welcome.
Which anti-virus are you using?
If you cannot exclude processes, I would either disable realtime system scan on the FlexRAID DRUs or set up the realtime scan to only scan "modified" files.
Then, you can schedule manual scans on the DRUs during times FlexRAID isn't running.

gorman wrote:
But in case files are corrupted, how do you recover them? Delete corrupted files and restore DRUs? Will FlexRAID rebuild only the missing files?

I would move the corrupted files outside of the DRU and do a restore (do this one DRU at a time).
Yes, FlexRAID will only rebuild the missing files.

gorman wrote:
Also, let's say I run rsynch today. Tonight I add some files and FlexRAID has not rsynched yet. One DRU fails. Will I lose only the new files I have not synched or the whole array would be compromised? These two points are really unclear in the documentation.

Only the recently added files (since the last rsynch) would be at risk.
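To illustrate why data that was present at the last sync can be restored after a single drive failure, here is a toy sketch of byte-wise XOR parity. It is only an assumption for illustration: the thread does not say how FlexRAID computes its parity, and the helper names `xor_parity` and `rebuild` are hypothetical.

```python
def xor_parity(blocks):
    """XOR all blocks together, treating shorter blocks as zero-padded."""
    size = max(len(block) for block in blocks)
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def rebuild(surviving_blocks, parity, length):
    """Recover the one missing block from the parity and the surviving blocks."""
    missing = bytearray(parity)
    for block in surviving_blocks:
        for i, byte in enumerate(block):
            missing[i] ^= byte
    return bytes(missing[:length])
```

In this toy model the parity only reflects the bytes that existed when `xor_parity` was last run, so anything added afterwards is not represented in it until the next sync.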

gorman


Joined: 03/11/2008 09:57:22
Messages: 166
Offline

As I mentioned, I run AntiVir, which I discovered allows processes to be excluded from the Guard thing.

Good. Will rerun verify and report back. Thanks for all the good and quick info.
gorman


Joined: 03/11/2008 09:57:22
Messages: 166
Offline

It wasn't that. Still getting file corruption reported for files that are not corrupted. RAR files that test out correctly, FLAC files that test out correctly (against the internal checksum)...

I substituted two hard drives two weeks ago (1.5TB Seagates with 1TB WDs) but I copied all the files and FlexRAID should ignore system and hidden files, right?

Could different cluster size matter in this case?
Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

You mean, you disabled your AV and still got false corruption reports?

I would try disabling the whole AV and testing.
Then, you can configure the process exclusion thing if the AV was the cause.
If you have ZoneAlarm or some Trojan horse protection utilities, disable them too (just to see).

FlexRAID will ignore hidden files but not hidden folders.
Non-hidden files within hidden folders will be picked up.
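The rule described above (skip hidden files, but still descend into hidden folders) can be sketched as a directory walk. This is only an illustration of the stated rule, not FlexRAID's code; `is_hidden` and `files_picked_up` are hypothetical names, and the leading-dot fallback is an assumption added so the sketch also behaves sensibly on non-Windows systems.

```python
import os
import stat

def is_hidden(path):
    """True if the Windows hidden attribute is set, or the name starts with a dot."""
    attrs = getattr(os.stat(path), "st_file_attributes", 0)  # Windows only
    if attrs & getattr(stat, "FILE_ATTRIBUTE_HIDDEN", 0):
        return True
    return os.path.basename(path).startswith(".")

def files_picked_up(root, hidden=is_hidden):
    """Yield non-hidden files, descending into every folder, hidden or not."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not hidden(full):
                yield full
```

Running `files_picked_up` over a drive root would list any non-hidden files sitting inside hidden folders, which are the ones that would be picked up under this rule.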

There is no question that some disk utility is interfering.
The real question is: which one?

This message was edited 1 time. Last update was at 31/01/2009 09:07:23


gorman


Joined: 03/11/2008 09:57:22
Messages: 166
Offline

I don't know what to say. I disabled the antivirus and got the same corruption reports.

I would even try to move one of the reported files and repair the DRU but... considering that something is messed up I'm *really* afraid that this will destroy some other files...

And yet again, the "corrupted" RAR tests perfectly fine and this time I've even extracted files from it. Everything perfect.

Edit: decided to try this approach. I'm gonna move all the "corrupted" files on a non FlexRAID drive. Rsynch. Move them back in position. Rsynch. Validate.
Let's see if the "false positives" stop.

This message was edited 1 time. Last update was at 01/02/2009 07:54:55

Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

Indeed, this is weird.

Yeah, I would definitely move the flagged files out of the DRU, rsync, and then move them back.

If the same files are being consistently reported, then maybe there is something to it.
If the RAR files have a recovery record, that might mask any real corruption.
Movie and video files can withstand a few corruptions here and there.

gorman


Joined: 03/11/2008 09:57:22
Messages: 166
Offline

Moved files, rsynched. Moved files back, rsynched. Validated.

10 files, all among the ones previously reported, are still reported as corrupted. One of them is the .rar, still testing out perfectly. Another one is one of the two .flac, again testing out perfectly against its own CRC mechanism (and here I am sure there's no redundancy of data whatsoever).
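One independent check for a file that keeps being flagged is to hash it several times and see whether the reads agree with each other. The sketch below uses only Python's standard `hashlib`; the names `file_digest` and `reads_are_stable` are hypothetical. Note the limitation: repeated reads may be served from the operating system's cache, so matching digests are only weak evidence, while differing digests would point to a problem in the read path rather than in the stored data.

```python
import hashlib

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def reads_are_stable(path, passes=3):
    """Hash the file several times; report whether every pass agreed."""
    digests = {file_digest(path) for _ in range(passes)}
    return len(digests) == 1, digests
```

Keeping the digest of a known-good copy also gives something to compare against after moving the file out of the DRU and back.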

I'm using the WebUI, if that matters.

This message was edited 2 times. Last update was at 02/02/2009 02:15:02

Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

Are any of these files small enough for you to upload them somewhere where I could download them?

I sure would like to know what's up with those files.

gorman


Joined: 03/11/2008 09:57:22
Messages: 166
Offline

"Unfortunately" the .cbr and .xls files are not reported as corrupted anymore. The smallest file is the FLAC one. It's a DTS, FLAC encoded file. 256MB. If you wish to advise me through a PM where I could put it for you, I'd gladly do it.
Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

PM sent.

Vaskill


Joined: 11/01/2009 15:38:31
Messages: 50
Offline

I think I might be having the same problem.

I started getting Validation errors on a single drive. Okay, so I schedule a bad-sector scan. No errors found. So I backed-up the file as "corrupted" and restored it from Parity. After a successful restore, there is a size difference between the two files and the newly restored file passes a new validation pass. Okay, but now I have 8+ files over 3 of 5 drives of the array. So I do a bad-sector scan on all, including the Parity. No bad sectors found.

At this point, I do not want to keep restoring files without figuring out what the underlying problem is.

Any suggestions?
Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

Vaskill wrote:I think I might be having the same problem.

I started getting Validation errors on a single drive. Okay, so I schedule a bad-sector scan. No errors found. So I backed-up the file as "corrupted" and restored it from Parity. After a successful restore, there is a size difference between the two files and the newly restored file passes a new validation pass. Okay, but now I have 8+ files over 3 of 5 drives of the array. So I do a bad-sector scan on all, including the Parity. No bad sectors found.

At this point, I do not want to keep restoring files without figuring out what the underlying problem is.

Any suggestions?


To me it seems like the file got silently corrupted.
Corruption does not always occur from bad sectors.
This is where most people are uninformed.

Often, corruption will come from memory corruption.
This is why it is recommended to run ECC system memory.
Memory corruption can be benign, or it could cause your system to crash, or it could lead to silent data corruption.

My personal server uses 4 bit parity ECC (Chipkill) protection for that reason.

The fact that the two files are of different sizes does concern me and does not correlate with how the application works.
Something tells me that the file in question was in the middle of being changed by another process when it was being validated.

Can you post your logs?
Please make sure to not confuse files marked as changed and those marked as corrupted.

I will add a new feature that will only validate specific files (as specified by you) for the next release.
This is just in case the files reported are false positives.

mheloy


Joined: 04/09/2008 00:56:08
Messages: 44
Offline

It seems I have a similar problem... I will try to disable my Anti virus then run rsynch and then verify

thanks
Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

Everyone should definitely configure their anti-virus to ignore FlexRAID's activities or set it up to exclude the DRU paths from paths protected in realtime (scan the DRU paths only in manual mode).

Some anti-virus will degrade FlexRAID's performance and/or corrupt some of the read operations.

Vaskill


Joined: 11/01/2009 15:38:31
Messages: 50
Offline

The enhancement you identified would be FANTASTIC for troubleshooting "data creep" issues!

This is interesting, as I have never heard of silent corruption! I will start another thread where maybe you can educate me on this phenomenon (i.e. RAM causing errors on already written data).

As for my logs, no problem as usual, but because I Rsync every 4 hours, and only Validate once a week with Warn level debugging, what level would be most useful to you?

Enhancements for logs and possible expansion for email notification from web interface:
It would be great if you could add some basic enhancements for logs. This information would just make it much easier to navigate and understand performance/issue items:
///
Rsync Start
Rsync New/Changed Files
Rsync Total Size
Rsync End
///
Validate Start
Validate End
///

I have attached a log at Warn level covering two Validates. I am not (as far as I know) confusing the word "corrupted" with the word "changed", but let's have a look.

This message was edited 1 time. Last update was at 06/03/2009 17:33:21

Brahim


Joined: 09/04/2008 23:28:33
Messages: 2883
Offline

1. On data corruption... read here: http://blogs.zdnet.com/storage/?p=191
In summary, data corruption on already written data due to memory corruption is really at the filesystem level (as opposed to at the cluster level).

2. To see more logs, you need to set the log level to "DEBUG".

I will look a little into these corruption reports.
For now, you should back up those files flagged as corrupted to a separate directory and restore them.

Vaskill


Joined: 11/01/2009 15:38:31
Messages: 50
Offline

1. Thanks for the link. I actually have read that, but what I do not understand from that article is the cause. I know it happens, but what causes it (and thus, why is it happening to me, if it is)? I do not see how good, tested RAM causes data fatigue on already verified data on a hard disk.

2. Thanks, just needed to know what level you needed for this type of error. A validate at DEBUG level can be a bit overwhelming when all you may need is the INFO level.

3. Okay, some more fuel for the fire:

Manually invoked an Rsync. Success.

Backed up my first corrupt file, deleted the original and restored it from Parity (all my files are large 8GB+ video files). The backed up file is 7,829,088KB, the restored file is 7,827,400KB. Both play fine with the same Time Code on the last frame, but the restored file has 10 fewer frames in total than the "corrupted" one.

Backed up my second corrupt file, deleted the original and restored it from Parity. The backed up file is 9,379,408KB, the restored file is 8,905,317KB. Original plays fine with the Time Code on the last frame used to check completeness, but the restored file is badly corrupted with only 24.75 minutes of viewable film, more than the "corrupted".
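When a backed-up file and its restored counterpart differ in size, as in the two cases above, it can be useful to find where the contents first diverge. The sketch below is a plain byte comparison, with `first_difference` as a hypothetical helper name; it says nothing about which copy is the correct one.

```python
def first_difference(path_a, path_b, chunk_size=1 << 20):
    """Return the byte offset of the first difference, or None if identical.

    If one file is a strict prefix of the other, the offset returned is the
    length of the shorter file.
    """
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                for i, (x, y) in enumerate(zip(chunk_a, chunk_b)):
                    if x != y:
                        return offset + i
                # No differing byte in the overlap: one chunk is shorter.
                return offset + min(len(chunk_a), len(chunk_b))
            if not chunk_a:
                return None  # both files ended together with no difference
            offset += len(chunk_a)
```

An offset near the end of the shorter file would be consistent with a file that was still growing when it was synced, while an early offset would suggest the contents themselves differ.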

I stopped there. I think this may be an issue of Rsync running in the middle of a recording? Does this make sense? Your "skip file if still being written to or in use" feature (identified in another post) would certainly correct this.
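One rough way to approximate a "skip file if still being written to" check outside of FlexRAID is to see whether a file's size and modification time hold still over a short interval. This is a hypothetical sketch, not the feature mentioned above: `snapshot` and `looks_idle` are made-up names, and a file that pauses between writes would still pass.

```python
import os
import time

def snapshot(path):
    """Return (size, modification time in ns) for a file."""
    info = os.stat(path)
    return info.st_size, info.st_mtime_ns

def looks_idle(path, wait_seconds=5.0, sleep=time.sleep):
    """True if size and mtime are unchanged after waiting wait_seconds."""
    before = snapshot(path)
    sleep(wait_seconds)
    return snapshot(path) == before
```

A scheduled job could call `looks_idle` on each recording and postpone the sync if any file is still growing.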

This message was edited 1 time. Last update was at 07/03/2009 16:20:21

 