replication hangs on one table

publisher - sql2000 sp4 distributor - sql 2005 sp1 subscribers - mix of sql2005 sp1 and sql 2000 sp4 we have a database with around 300 tables and replicate it to around 10 different subscribers. some in the same datacenter and others in different offices. We have around 50-60 different publications for this. Some have 20 or so tables others have only one table if they are large tables. Tables range in size from a few hundred rows to over 20 million. Some tables replicate a few commands, others 100000 or more commands on a daily basis. Around 6 weeks ago we started having problems with one table. It's 1.4 million rows and replicates around a few thousand commands on a daily basis. We saw a backlog of around 150000 to 400000 commands. We had some replication issues at this time with this distributor and we traced these problems to memory errors and the replication job would not start or stop. After restarting the job it pushed all the commands through to the subscribers. A few weeks ago we started noticing another backlog again and this time it wouldn't clear. A few times we just ran another snapshot at night but then we started to investigate. We noticed that some people were doing mass updates of 30000 or more rows and this time they weren't going through. It would just build up and cause blocking on the subscribers along with spiking the CPU to 50% or more on a constant basis. As soon as we turn off the replication for this publication everything goes back to normal. Meanwhile all the other publications are happily replicating tens of thousands of rows with no problems. One thing I noticed is that these updates cause a command to transaction ratio of thousands to one where everything else is less than 10 to 1. I thought this was the cause but then the senior DBA updated around 20000 rows in another publication in one update and it went through within a few minutes. Microsoft is absolutely no help. They are looking through logs and they tried to tell us to upgrade our hardware but we have some ancient subscribers for other busy publications with no issues. Right now we are thinking that it might be corruption on the publisher table and we'll be creating a new one this weekend. Any other ideas

Answer this question

replication hangs on one table

  • plaszlo

    As indicated in the doc I linked above, and in my last post, using SQL for article properties is very slow. You need to use stored procedures, not SQL. If you profile the subscriber, you'll find a lot of recompiles as well, which is most likely slowing everything down.
  • Andrew Moiseev

    running a test now in our testing environment. i restored a copy of the db from last week to make sure it was more current.

    I think the problem is at the subscribers. I'm running a profile session and i'm getting a 2388 on the CPU each time it processes the command and 10350 or so reads to a write. Going to make sure it's an index seek once it goes through. only updating 11000 rows.

     

    the query is

     

    update table

    set billdate = 3 where resellerid = 2130

     

    on the subscriber it's

    update table

    set billdate = 3 where sid = xxxxxxx

    SID is the primary key for the table

     

    i'm also going to try to simulate this on the publisher by first selecting everything into a temp table and then run an update against all the SID's in the temp table to see how long it takes on the publisher.


  • My Vizai

    If SID is the primary key, then updateing the primary key itself will send across a delete/insert pair, and not an update statement.
  • Steev

    should have been more clear

     

    we have 50 or 60 publications on this one distributor and a few transfer 200000 commands a day or more. Only one publication is having problems. All the others are running fine. We also did a test update of updating 80000 rows yesterday in another publication and it went through in minutes. Same db, same distributor and same subscribers as those in the problem publication.

    last night i did a test on our test environment with late 1990's hardware. Table in question hangs when you run a mass update. Another table I did a 1.2 million row update and it went through in 3 hours. in production our problem table gets hung up on 10000 row updates.

    this is all distributor to subscriber. logreader works fine and the publisher to distributor works fine as well. takes a minute or so to get all the commands to the distributor.

    i've looked at the subscribers without the problem publication running and resources are OK. CPU is single digits and there is enough RAM. Once the problem publication runs the CPU spikes to 50% or more depending on the subscriber.

     

    distributor has 4GB ram and there is close to 1GB free. as i said before, all the other agents run with no problems. Around 50-60 publications, 10 or so subscribers and I think it's close to 300 total jobs in the job monitor. They all run on varying schedules from 5 to 15 minutes depending on the volume of commands for that agent. I've seen backlogs sometimes but when i restart the agent they take seconds or minutes to clear for 5000 to 80000 commands. In this one publication we start seeing problems if we reach a backlog of 2000 commands

    nothing changed except for the distributor back in late december to early january. but we scripted all our replication and set up the agents manually just as it was on the old SQL 2000 distributor for the last few years. We had some problems in january but i set up all the jobs on a schedule instead of running continously and it fixed our desktop memory heap issues and everything has been running fine until 2 weeks ago or so.

     

    Even on this publication in question one time around a month ago i messed up the schedule and it wasn't running for a day or so. 180000 commands in the queue. Took maybe 30 minutes or less to clear when i restarted it.

     


  • RMORAR

    but why is it not happening when we use sql for replicating other tables and large batch updates i ran a 60000 row update on another table in our test environment and replication was set up for sql statements. it's running at normal speeds we are seeing in production for this table and other tables other than the problem table.

     

    i'm runnning a profiler right now and the sql 2005 subscriber for the temp table we created from the data in the production one is also seeing large amounts of reads and we are using stored procs for it's replication

     

    we had used sql for replication for years with no issues and for large updates. we started the stored procs on the advice of a PSS engineer from a case that we had open last month on a totally different issue. the size of the database and the table didn't grow that much in the last year.


  • SekharPC

    I can see one reason why you're getting slow performance. Why are you not using the custom stored procedures article properties Using SQL (which uses sp_executesql) is very expensive, you leave little chance for plan reuse. Have you read the performance doc I linked above


  • Abongs

    Did you rule out hardware - have you run perfmon to ensure disk is not bottlenecking Have you been monitoring SQL Server memory and machine memory How many replication agents is your distributor machine running What hardware is your distributor - memory, CPUs, disk layout, etc.

    You mention there's a backlog of 150k to 400k commands - you didn't say which agent is falling behind - logreader agent or distribution agent

    You mention you see updates causing a cmd to txn ration of thousands to one. There are reasons for this - every row that is touched by an update statement is sent as one update statement. If you have one update statement that touches 20k rows, 20k update statements will be sent to the subscriber. If you update primary key, it will send delete/insert pair for every row touched.

    If you're getting blocking at the subscriber, then you need to monitor the activity at the subscriber to see why there's blocking. If you're executing a batch update of 20k commands, and it's all in one transaction, then the distribution agent will not commit the transaction until all 20k rows have been successfuly applied. So yes, there could be blocking of read/access to rows while they're being applied.

    There's a basic perf doc you should read if you haven't done so already - http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/tranrepl.mspx.

    9 times out of 10 if there's performance problems, it's at the subscriber, and it's because of large batch updates as you mentioned above conflicting with other read activity, indexes, triggers, etc. Distribution agent commands being applied are very optimal - clustered index updates and deletes - can't get anymore optimal than this.


  • Gromlir

    still not done testing, but here is what i have so far

     

    Testing -

    publisher - sql 2005 sp2

    distributor - sql 2005 sp2

    subscriber - sql 2000 sp 4

    fully restored production database from late last week and replicating real tables from restored db with all relationships in place

     

    problem table using sql statements - 10300 reads per each command

    another table i ran an update on replicated using sql statements - 75 reads per command

    in both cases the snapshot with all indexes was done last night or this morning to avoid fragmentation

     

    Production -

    publisher sql 2000 sp 4

    distributor sql 2005 sp 1

    subscribers - mix of sql 2000 sp4 and sql 2005 sp1

     

    real problem table with all relationships with sql statements - slow today with 10300 reads per command

    copy of problem table with no relationships using stored procs - 4500 reads per command on sql 2005 subscriber (some commands go faster and i ran alter index on all the indexes on the sql 2005 sp1 subscriber before the update. The sql 2000 subscribers where i didn't defrag before the update seem to be going a lot faster at the speed of the other tables)

    daily replication of other tables using stored procs - around 75 reads per command

     

    My next test is going to find some sql 2000 servers i can use in another test environment and try to replicate this again on pure sql 2000.

     

    doublechecked and we have indexes on the columns being updated. we checked the execution plan and it's an index seek

     


  • Guy Baron

    but we are seeing a huge difference in performance on the same machine in the same db. reason for sql is i said above. we had procs, but had the same problem and switched to sql to make it easier to read the verbose logging. we have replication running on a table that we created for testing from the prod table and even with the procs it's running 50 times slower than other agents from the same publisher over the same distributor and to the same subscribers which are also running a much higher volume of commands on a daily basis

     

    ran checkdb over the weekend on a restored copy of the db and it is OK along with the distribution db

     

    in the test environment the subscriber will fly on one table and crawl on the same problem table as in production with both agents using sql commands to replicate data. db is the same because it's a full restore from production. one table i can do a 50000 row update and it will replicate in 10 minutes to the subscriber. the problem table in the same enviroment with the same subscriber will have a huge amount of reads per every command. and this is right after a snapshot and fresh index creation in both tests.

     

    even in prod we see a big difference in speed on our copy of this one table compared to other tables being replicated even after a fresh snapshot and an alter index to defrag all the indexes just in case

    since we didn't see this issue when our environment was 100% sql 2000 i'm going to try to get a few machines in a replication environment to try to replicate this with at least a sql 2000 distributor

     

    i also tried the update statements in our test environment on the publisher just like the distributor is sending to the subscriber and they went through with no problem and the profiler showed a maybe 100 reads per command. why such a difference when the indexes should be in better shape on the subscriber

     and trying this on the subscriber the same command goes through via management studio with a few reads, but when the distributor sends it there is a huge backlog.

     

    wrote a cursor to build a bunch of updates in our test environment just like the ones i see via the profiler being applied at the subscriber. I run them through management studio on the publisher and the subscriber and they go through fast and profiler session backs me up. make a value change to bill_date column and apply at the publisher again and when the same command is replicated it goes slowly again and shows over 10000 reads and a high CPU compared to a few reads via management studio. i ran profiler on the publisher and the same command was applied 100 times faster on the publisher

    something peculiar is going on


  • OmegaMan

    i'll try that out

     

    i had verbose logging turned on and tried a few of the execution plans for a few of the updates on one of the subscribers and they were all index seeks or clustered index seeks. i'll check the pss diag i had running as well since i think it also had a profiler.

    only thing i understand is why only that one publication i checked the index fragmentation on that table at the subscriber and they are all less than 5%. we replicate indexes as part of the replication and even the day after a new snapshot we saw a huge backlog once someone does an update. I also don't understand why this is happening on six different subscribers 5 of which are SQL 2000.


  • amritpal singh

    YOu need to identify two things:

    - whether slowdown is due to distribution agent reading from distribution db or from writing to subscriber.

    - what the update consists of. How many rows, how large of a transaction, whether the subscriber db is growing or not, etc. If this is a transaction with a million changes, that could equate to a lot of tempdb storage at the subscriber. Did you check to see if tempdb is growing Are the BLOB columns being updated Did you try breaking the publisher update job into smaller batches

    You also mention distribution agent hangs. How did you confirm this Is it really hanging, or is it applying commands at a very very slow rate I also want to caution you on using that tab in replmonitor, the one that shows how many commands are backlogged in the distribution database - that's a very expensive query it executes, it does a count(*) on the MSrepl_commands table, and can take up a considerable amount of resources.


  • BorisAro

     Greg Y wrote:
    If SID is the primary key, then updateing the primary key itself will send across a delete/insert pair, and not an update statement.

     

     

     

    exec sp_executesql N'update [dbo].[table] set [Bill_date] = 3 where [SID] = @P1', N'@P1 nvarchar(9)', N'zxxxxxxxxx'

     

    this is what i'm seeing in the profiler on the subscriber and i double checked and sid is the PK. indexes on the subscriber should be OK because i just ran a snapshot

     

    after this is over i'm also going to run a 10000 or so row update on another table and see how that looks in the profiler. all the other tables seem to take huge batch updates with no problem.


  • shizuka.a

    If it's the distribution agent that's the problem, then it has nothing to do with a corrupt publication as you suspected earlier, regenerating that publication will not fix anything.

    So it looks like you've already identified the problem - when a distribution agent begins applying a large batch of commands for a given publication, CPU spikes. If this is true, then you should definitely be able to figure out the problem by running a profiler trace at the machine for that particular agent to see exactly what's causing this. If the CPU is spiking at the distributor, then it's the proc/query that's reading the commands from the distribution database, if it's at the subscriber, then it's the sp_MSins/upd/del* procs being applied at the subscriber db.

    Regardless, profiler should be able to tell you something. It could be a bad plan, db autogrow, tempdb growing, etc. If you have time, maybe you should try breaking the batch update into smaller batches, maybe that will fix the problem. You could also look into replicating the execution of a stored proc, this way you don't fill up the distribution db with all those commands.


  • RayClark096

    If you want to know why it's slow on one machine but not another, then you need to do some investigation. Obviously the distribution agent does nothing more than apply the SQL command, that's it. It's up to you to profile the subscriber to see why it's slow, I mentioned this many times in my posts above. If you want better performance, then use stored procedures, not SQL. This is well documented. You never mentioned why you're using SQL, but if there's no business reason to do so, then don't use it.
  • replication hangs on one table