LDAP send queue limits cause event 2070 and 2084

I recently worked on an issue where the domain controllers kept intentionally disconnecting the Exchange servers. The error messages that described the reason for the disconnect were rather misleading, and we ended up wasting quite a bit of time taking steps that had no chance of improving the situation. In this blog post, I’m going to document this behavior in detail, in hopes of saving anyone else who runs into this a lot of time and effort.

The Problem

The behavior we observed was that Exchange would lose its connection to its config DC. Then, it would change DCs and lose connection to the new one as well. This would repeat until it exhausted all in-site DCs, generated an event 2084, and started hitting out-of-site DCs, often returning the same error. Usually, the error we saw was a 0x51 indicating the DC was down:

Log Name:      Application
Source: MSExchange ADAccess
Event ID: 2070
Task Category: Topology
Level: Information

Description:
Process w3wp.exe () (PID=10860). Exchange Active Directory Provider lost
contact with domain controller dc1.bilong.test. Error was 0x51 (ServerDown)
(Active directory response: The LDAP server is unavailable.). Exchange
Active Directory Provider will attempt to reconnect with this domain
controller when it is reachable.

Network traces revealed that the DC was intentionally closing the LDAP connection. Once we discovered that, we set the following registry value to 2 in order to increase the logging level on the DC:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics\16 LDAP Interface Events

With that set to 2, the DC started generating a pair of Event ID 1216 events every time it disconnected Exchange. The second 1216 event it generated wasn’t particularly helpful:

Log Name:      Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Event ID: 1216
Task Category: LDAP Interface
Level: Warning
Description:
Internal event: An LDAP client connection was closed because of an error.

Client IP:
192.168.0.190:8000

Additional Data
Error value:
1236 The network connection was aborted by the local system.
Internal ID:
c0602f1

But the first one gave us something to go on:

Log Name:      Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Event ID: 1216
Task Category: LDAP Interface
Level: Warning
Description:
Internal event: An LDAP client connection was closed because of an error.

Client IP:
192.168.0.190:8000

Additional Data
Error value:
8616 The LDAP servers network send queue has filled up because the client
is not processing the results of it's requests fast enough. No more
requests will be processed until the client catches up. If the client
does not catch up then it will be disconnected.
Internal ID:
c060561

The LDAP client, in this case, is Exchange. So this error means the Exchange server isn’t processing the results of the LDAP query fast enough, right? With this information, we started focusing on the network, and we spent days poring over network traces trying to figure out where the network bottleneck was, or whether the Exchange server itself was just too slow. We also found that sometimes, the 2070 event would show a 0x33 error, indicating the same send queue problem that was usually masked by the 0x51 error:

Log Name:      Application
Source: MSExchange ADAccess
Event ID: 2070
Task Category: Topology
Level: Information

Description:
Process w3wp.exe () (PID=10860). Exchange Active Directory Provider lost
contact with domain controller dc1.bilong.test. Error was 0x33 (Busy)
(Additional information: The LDAP servers network send queue has filled
up because the client is not processing the results of it's requests fast
enough. No more requests will be processed until the client catches up.
If the client does not catch up then it will be disconnected.

Active directory response: 000021A8: LdapErr: DSID-0C06056F, comment:
The server is sending data faster than the client has been receiving.
Subsequent requests will fail until the client catches up, data 0, v1db1).
Exchange Active Directory Provider will attempt to reconnect with this
domain controller when it is reachable.

We removed antivirus, looked at NIC settings, changed some TCP settings to try to improve performance, all to no avail. Also, we weren’t able to reproduce the error using various LDAP tools. No matter what we did with Powershell, LDP, ldifde, or ADFind, the DC would not terminate the connection. It was only terminating the Exchange connections.

We eventually found out that this error had nothing to do with how fast the LDAP client was processing results, and it is possible to reproduce it. In fact, you can reproduce this LDAP error at will in any Active Directory environment, and I will show you exactly how to do it.

LDAP Send Queue 101

Here’s how Active Directory’s LDAP send queue limit works. The send queue limit is a per-connection limit, and is roughly 23 MB. When a DC is responding to an LDAP query, and it receives another query over the same LDAP connection, it first checks to see how much data it is already pushing over that connection. If that amount exceeds 23 MB, it terminates the connection. Otherwise, it generates the response to the second query and sends it over the same connection.

Think about that for a minute - it has to receive another LDAP query over the same LDAP connection while it’s responding to other queries. You can do that? Yep. As noted in the wldap32 documentation on MSDN:

The rules for multithreaded applications do not depend on whether each thread shares a connection or creates its own connection. One thread will not block while another thread is making a synchronous call over the same connection. By sharing a connection between threads, an application can save on system resources. However, multiple connections give faster overall throughput.

Until now, I had always thought of LDAP as a protocol where you send one request and wait for the response before sending your next request over that connection. As it turns out, you can have multiple different threads all submitting different requests over the same connection at the same time. The API does the work of lining up the requests and responses and getting the right responses back to the right threads, and LDAP has no problem with this - at least, not until you hit the send queue limit.

This is why we could never reproduce this issue with other LDAP tools. Every single one of those tools issues one request and waits for the response, and in that case, it is impossible to get disconnected due to the send queue limit.

The Solution

In the case of Exchange, we share the config DC connection between multiple threads. One thread would kick off a complete topology rediscovery, which involves querying for all the virtual directories in the environment. In this particular environment, there were thousands of virtual directories, and the properties on the OWA virtual directories can be relatively large. The DC would generate a response containing a page of virtual directory objects (we were using a page size of 1,000), and due to the number of properties on those objects, this response exceeded the 23 MB limit.

By itself, that wasn’t enough to cause a problem. The problem happened when some other thread came along and used the same LDAP connection to ask for something else - maybe it just needed to read a property from a server object. When that second query hit the DC while the DC was still sending us the response to the virtual directory query, the DC killed the connection due to the send queue limit.

So, how can you avoid this? As a user of software, there’s not much you can do except delete objects until the LDAP response is small enough to be under the send queue limit, or reduce the MaxPageSize in the Active Directory LDAP policies to force everything to use a smaller page size.
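If you go the MaxPageSize route, that value lives in the default LDAP query policy and can be changed with ntdsutil. A sketch of the commands (the DC name "dc1" and the value 500 are illustrative; remember this change affects every LDAP client, not just the misbehaving one):

```powershell
# Illustrative only: lower MaxPageSize to 500 via ntdsutil.
# "dc1" is a placeholder DC name; this affects ALL LDAP clients.
ntdsutil "ldap policies" "connections" "connect to server dc1" "quit" `
  "set maxpagesize to 500" "commit changes" "show values" "quit" "quit"
```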

As a developer of software, there are a few approaches you can take to avoid this problem. One is to not submit multiple queries at the same time over a single connection; either wait for the previous query to return, or open a new connection. Another approach is to reduce the page size used by your query so that the response size doesn’t exceed the send queue limit. That’s the approach we’re taking here, and the page size used for topology rediscovery is being reduced in Exchange so that the LDAP response to the virtual directory query doesn’t exceed the send queue limit in large environments.

Note that this update to Exchange will fix one very specific scenario where you’re hitting this error due to the size of the virtual directory query in an environment with hundreds of CAS servers. Depending on your environment, there may be other ways to produce this error that are unrelated to the virtual directories.

Let’s Break It On Purpose

After I thought I understood what was happening, I wanted to prove it by writing some code that would intentionally hit the send queue limit and cause the DC to disconnect it. This turned out to be fairly easy to do, and the tool is written in such a way that you can use it to reproduce a send queue error in any environment, even without Exchange. Note that causing a send queue error doesn’t actually break anything - it just makes the DC close that particular LDAP connection to that particular application.

In order to produce a send queue error, you need a bunch of big objects. In my lab, I used a Powershell script to create 500 user objects and filled those user objects with multiple megabytes of totally bogus proxyAddresses values. Here’s the script:
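The script does something along these lines (a sketch; the OU path, user count, and address sizes are illustrative):

```powershell
# Sketch: create 500 test users and bloat proxyAddresses with bogus values.
# The OU path, user count, and sizes here are illustrative.
Import-Module ActiveDirectory

$ou = "OU=TestUsers,DC=bilong,DC=test"
$bigValue = "x" * 1000   # each bogus proxy address is roughly 1 KB

for ($i = 1; $i -le 500; $i++) {
    $name = "TestUser$i"
    New-ADUser -Name $name -Path $ou -Enabled $false

    # A couple thousand ~1 KB values gives each user multiple MB of junk
    $addresses = 1..2000 | ForEach-Object { "smtp:$name-$_-$bigValue@bilong.test" }
    Set-ADUser -Identity "CN=$name,$ou" -Add @{ proxyAddresses = $addresses }
}
```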

If you run this script, you’ll end up with some objects that look like this:

Lovely, isn’t it? I needed a way to make these user objects really big, and stuffing a bunch of meaningless data into the proxyAddresses attribute seemed like a good way to do it.

Now that you have enough big objects that you can easily exceed the send queue limit by querying for them, all you need is a tool that will query for them on one thread while another thread performs other queries on the same LDAP connection. To accomplish that, I wrote some C# code and called it LdapSendQueueTest. Find the code on GitHub here: https://github.com/bill-long/LdapSendQueueTest.

Once you compile it, you can use it to query those big objects and reproduce the send queue error:

In this example, 1 is the number of threads to spawn (not counting the main thread, which hammers the DC with tiny queries), and 50 is the page size. Apparently I went a little overboard with the amount of data I’m putting in proxyAddresses, because with these objects, the error reproduces even with just 1 thread and a relatively small page size of 50 or even 30. The only way I can get the tool to complete against these test users is to make the page size truly tiny - about 15 or less.

In any real world scenario, you can probably get away with a larger page size, because your objects probably aren’t as big as the monsters created by this Powershell script. The tool lets you point to whatever container and filter you want, so you can always just test it against a set of real objects and see.

Conclusion

The bottom line is this: When you see this error from Active Directory telling you the client isn’t keeping up, the error doesn’t really mean what it says. If you take a closer look at what the application is doing, you may find that it’s sharing an LDAP connection between threads while simultaneously asking for a relatively large set of data. If that’s what the application is doing, you can reduce the MaxPageSize in the LDAP policies, which will affect all software in your environment, or you can delete some objects or delete some properties from those objects to try to get the size of that particular query down. Ideally, you want the software that’s performing the big query to be updated to use a more appropriate page size, but that isn’t always possible.


Cleaning up Microsoft Exchange System Objects - part 2

In a post last month, called Cleaning up Microsoft Exchange System Objects (MESO), I described how to determine which objects can be eliminated from the MESO container if you have completely removed public folders from your environment. But what if you still have public folders?

As I mentioned in my previous post, you only need MESO objects for mail-enabled public folders. When you mail-enable a public folder, Exchange creates a directory object for it, and when you mail-disable or delete the folder, Exchange is supposed to delete the directory object. Unfortunately, that doesn’t always work like it should, and you can end up with a lot of public folder objects in the MESO container that don’t point to any existing folder.

To make matters worse, it’s not very easy to figure out which directory objects point to an actual folder. You can’t assume much from the name itself - you could have dozens of public folders all named “Team Calendar” in different parts of the hierarchy, so which directory object points to which folder?

When you send email to a mail-enabled public folder, Exchange uses the legacyExchangeDN attribute on the directory object to look up the folder in the public folder database (or public folder mailbox in the case of Exchange 2013). However, the legacyExchangeDN property on the public folder in the database is an internal property - you can’t see it, even using tools like MFCMapi. So matching them up that way is not an option.

However, you can go in the other direction. Rather than taking a directory object and trying to find the store object, you can start with the store object and easily find the corresponding directory object. When you read the MAPI property PR_PF_PROXY on the folder, the store finds the correct directory object and returns its objectGUID. This is essentially what happens when you run Get-PublicFolder \Some\Folder | Get-MailPublicFolder in Exchange Management Shell.

Thus, in order to figure out which public folder directory objects are not linked to anything, you would need to retrieve all the directory objects that exist and then determine which ones are linked to folders based on PR_PF_PROXY or the Powershell cmdlets. After you eliminate those, you know that any public folder directory objects left over are not linked to anything, and they can be deleted.

There are a few ways you could go about this. One would be to use a client API such as Exchange Web Services to enumerate the public folders and check the property that way. While I do use EWS in a lot of my scripts, there is one big drawback to using it for this sort of operation - the fact that there is no way to use admin rights via EWS. As I explained in an old post called Public Folder Admin Permissions Versus Client Permissions, it doesn’t matter what admin rights your user account has when you’re using a client like Outlook. Outlook never attempts to pass admin flags at logon, so if you don’t have client permissions to a public folder, you won’t be able to see that public folder, even if you’re logged on as an org admin. EWS works the same way - there is no way to pass admin flags via EWS. This means that if you use EWS, you might not see all the public folders, so you might erroneously delete public folder directory objects that are actually still in use.

You could work around this limitation by granting yourself client permissions to all the public folders. Another option is to use MAPI, where you can pass admin flags. Of course, writing a MAPI tool is not trivial.

A better approach is to just use Exchange Management Shell. While this can be slower than EWS, the management shell uses your admin rights, so you will be able to see all public folders in the hierarchy, even if you don’t have client permissions to them.

However, there is one other caveat to be aware of. Sometimes, public folders can have directory objects when the public folder is not flagged as mail-enabled. This is described in KB 977921. If the folder is in this state, email sent to the folder will succeed, even though the management shell says the folder is not mail-enabled. You should be sure your folders are not in this state before you start making decisions about what to delete based on what Exchange Management Shell says, or else you might delete a directory object for a folder that is actually functioning as a mail-enabled folder.

That said, I created a simple script that demonstrates how you can check for unneeded public folder directory objects using Exchange Management Shell. Note that this script only identifies the unneeded directory objects. I’ll leave the actual deletion of them as an exercise for the reader. Hint: The $value in the loop at the end is the distinguishedName of the directory object. It’s probably a good idea to sanity check the results, and you might want to export the directory objects before you start deleting things.
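The heart of that check looks something like this (a simplified sketch, not the full script; cmdlet behavior can vary by Exchange version):

```powershell
# Sketch: find MESO public folder directory objects with no matching folder.
# Run in Exchange Management Shell; simplified for illustration.

# 1. Start with every public folder directory object that exists
$directoryObjects = @{}
Get-MailPublicFolder -ResultSize Unlimited | ForEach-Object {
    $directoryObjects[$_.DistinguishedName] = $true
}

# 2. Remove each directory object that a real folder still links to
Get-PublicFolder "\" -Recurse -ResultSize Unlimited |
    Where-Object { $_.MailEnabled } |
    Get-MailPublicFolder |
    ForEach-Object { $directoryObjects.Remove($_.DistinguishedName) }

# 3. Anything left over is not linked to any existing folder
foreach ($value in $directoryObjects.Keys) {
    Write-Host "Unlinked directory object: $value"
}
```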

Download the script (You may need to right click->save as)


Event 9414 means your OAB is missing objects

Today, I want to highlight a behavior that isn’t really called out anywhere in any existing documentation I can find. This is the behavior that occurs when Offline Address Book generation on Exchange 2010 logs an event 9414, such as this one:

Event ID: 9414
Source: MSExchangeSA
One or more properties cannot be read from Active Directory for 
recipient '/o=Contoso/ou=Administrative Group/cn=Recipients/cn=User 1'
in offline address book for '\Offline Address List - Contoso'.

When we stumble across a bad object like this, the OAB generation process will often skip a few good objects (in addition to the bad object) due to the way we handle the bookmark. As a result, User 1, from the event above, won’t be the only thing missing from your Offline Address Book. If you turn up logging to the maximum so that OABGen logs every object it processes, you can figure out which objects are being skipped by observing which objects do not appear in the event log.

The bottom line is: If you want your OAB to be complete, you must fix the objects that are causing 9414s, even if the objects named in those events aren’t ones you particularly care about.

So, why does it work this way, you ask?

The 9414 event was born in Exchange 2010 SP2 RU6. Before that, one of these bad objects would make OABGen fail completely and log the chain of events in KB 2751581 - most importantly, the 9339:

Event ID: 9339
Source: MSExchangeSA
Description: 
Active Directory Domain Controller returned error 8004010e while
generating the offline address book for '\Global Address List'. The
last recipient returned by the Active Directory was 'User 9'. This
offline address book will not be generated.
- \Offline Address Book

Unfortunately, the old 9339 event didn’t know what the actual problem object was. OABGen was working on batches of objects (typically 50 at a time), and when there was a problem with one object in the batch, the whole batch failed. All that OABGen could point to was the last object from the last successful group, which didn’t really help much.

Thus, the OABValidate tool was born. The purpose of this tool is to scour the Active Directory looking for lingering links, lingering objects, and other issues that would trip up OABGen. As Exchange and Windows both changed the way they handled these calls, the behavior would often vary slightly between versions, so OABValidate just flags everything that could possibly be a problem. Which object was actually causing the 9339 wasn’t certain, but if you fixed everything OABValidate highlighted, you would usually end up with a working OAB.

In large environments with hundreds of thousands of mail-enabled objects, cleaning up everything flagged by OABValidate could be a huge, time-consuming process. On top of that, residual AD replication issues could introduce new bad objects even as you were cleaning up the old bad objects.

Finally, thanks to a significant code change in Exchange 2010 SP2 RU6, Exchange was able to identify the actual problem object and point it out in a brand new event, the 9414. In addition, OABGen would skip the object and continue generating the OAB, so that it wasn’t totally broken by a single bad object anymore. This was a huge step forward that not only made OABValidate obsolete for most scenarios, but resulted in a situation where these OABGen errors can often go unnoticed for quite some time.

When someone finally does notice that the OAB is missing stuff, and you go look at your application log, you might think you can ignore these 9414s since they don’t mention the object you’re looking for. However, OABGen does still process objects in batches, and when it trips over that one bad object, the rest of the batch typically gets skipped.

So if you find that your OAB is missing objects, the first thing to do is check for 9414s and resolve the problems with those objects. While this does take a bit of work, it’s much better than the methods you had to use to resolve this sort of issue before SP2 RU6.


Mailbox lock contention in Exchange 2013

In Exchange Server, when a call into the Information Store fails, we often report a diagnostic context. This information is extremely useful for those of us in support, because we can often use it to see exactly where the call failed without having to do any additional data collection. Unfortunately, diagnostic context info is mostly useless to customers, because it’s impossible to make sense of it without the source code. In this post, I’ll describe one specific thing you can look for in a diagnostic context to identify calls that are failing due to contention for the mailbox lock.

In Exchange 2013, changing something in a mailbox usually involves acquiring a lock so that other changes cannot be made at the same time. If an operation has grabbed the mailbox lock, any other operations that want to change things have to wait. They will line up and wait for the mailbox lock, and will eventually time out if they don’t get it within a reasonable amount of time. However, there’s a limit to how long the line itself is allowed to get. Once 10 operations are already waiting for the lock, any additional operations fail instantly with MAPI_E_TIMEOUT (0x80040401).

If you have a diagnostic context from Exchange 2013, perhaps from an event that was logged in the Application Log, then you can check for this situation by looking for LID 53152 with dwParam 0xA. Here is an example:

Lid: 55847   EMSMDBPOOL.EcPoolSessionDoRpc called [length=150]
Lid: 43559 EMSMDBPOOL.EcPoolSessionDoRpc returned [ec=0x80040401][length=170][latency=0]
Lid: 32881 StoreEc: 0x80040401
Lid: 50035
Lid: 64625 StoreEc: 0x80040401
Lid: 52176 ClientVersion: 15.0.775.34
Lid: 50032 ServerVersion: 15.0.775.6034
Lid: 50128
Lid: 1494 ---- Remote Context Beg ----
Lid: 53152 dwParam: 0xA
Lid: 43632 StoreEc: 0x80040401
Lid: 58656 StoreEc: 0x80040401
Lid: 35992 StoreEc: 0x80040401
Lid: 1750 ---- Remote Context End ----
Lid: 1494 ---- Remote Context Beg ----
Lid: 53152 dwParam: 0xA
Lid: 43632 StoreEc: 0x80040401
Lid: 58656 StoreEc: 0x80040401
Lid: 35992 StoreEc: 0x80040401
Lid: 1750 ---- Remote Context End ----
Lid: 50288
Lid: 23354 StoreEc: 0x80040401
Lid: 25913
Lid: 21817 ROP Failure: 0x80040401
Lid: 17361
Lid: 19665 StoreEc: 0x80040401
Lid: 37632
Lid: 37888 StoreEc: 0x80040401

You’ll notice the LID we’re interested in is at the top of the remote context. The fact that LID 53152 shows a dwParam of 0xA means that we already have 0xA (decimal 10) operations waiting on the mailbox lock, so we purposely make this call fail instantly, without waiting at all. This usually results in a MapiExceptionTimeout and a StorageTransientException in all sorts of different places.
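If you want to scan an Application log for this signature, something like the following (illustrative) search works; it simply matches the LID 53152 / dwParam 0xA pattern in the message text:

```powershell
# Sketch: find recent Application log events whose diagnostic context
# contains "Lid: 53152" followed by "dwParam: 0xA". The -MaxEvents cap
# is arbitrary; widen it as needed.
Get-WinEvent -LogName Application -MaxEvents 5000 |
    Where-Object { $_.Message -match "Lid:\s*53152\s+dwParam:\s*0xA\b" } |
    Select-Object TimeCreated, Id, ProviderName |
    Format-Table -AutoSize
```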

Once you’ve identified that mailbox contention is causing the error, there’s still the question of why there is so much contention for the mailbox lock. Are there dozens of clients all trying to make changes in the mailbox at the same time? Is an application hammering the mailbox with requests? You still need to investigate to find the root cause, but after understanding this piece, you can at least begin to ask the right questions.


Delegated setup fails in Exchange 2013

In Exchange 2013, the built-in Delegated Setup role group allows users to install new Exchange 2013 servers after those servers have been provisioned with the /NewProvisionedServer switch. However, you may find that even after provisioning the server, when a member of Delegated Setup attempts to install the server, it fails. The setup log from the delegated setup attempt shows:

[11/07/2013 21:11:33.0015] [1] Failed [Rule:GlobalServerInstall] [Message:You must be a member of the 'Organization Management' role group or a member of the 'Enterprise Admins' group to continue.]

[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedBridgeheadFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install or upgrade the first Mailbox server role in the topology.]

[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedCafeFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install the first Client Access server role in the topology.]

[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedFrontendTransportFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install the first Client Access server role in the topology.]

[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedMailboxFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install or upgrade the first Mailbox server role in the topology.]

[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedClientAccessFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install or upgrade the first Client Access server role in the topology.]

[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedUnifiedMessagingFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install the first Mailbox server role in the topology.]

This occurs if legacy Exchange administrative group objects exist from when Exchange 2003 was still present in the organization. Unfortunately, setup does not handle this gracefully in the delegated setup scenario.

To fix the problem, you could delete the legacy administrative groups, but we don’t recommend this. Instead, a safer approach is to simply add an explicit deny for the Delegated Setup group on the legacy administrative groups. This prevents setup from seeing those admin groups, and it proceeds as normal. After setup is finished, you can remove the explicit deny to put the permissions back in their normal state.
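In ADSI Edit this means adding a Deny ACE for the Delegated Setup role group on each legacy administrative group object. The same idea in PowerShell looks roughly like this (a sketch; the administrative group DN is illustrative):

```powershell
# Sketch: add a Deny-Read ACE for the Delegated Setup role group on a
# legacy administrative group object. The DN below is illustrative.
Import-Module ActiveDirectory

$adminGroupDN = "CN=First Administrative Group,CN=Administrative Groups," +
                "CN=Contoso,CN=Microsoft Exchange,CN=Services," +
                "CN=Configuration,DC=contoso,DC=com"

$sid  = (Get-ADGroup "Delegated Setup").SID
$rule = New-Object System.DirectoryServices.ActiveDirectoryAccessRule(
    $sid,
    [System.DirectoryServices.ActiveDirectoryRights]::GenericRead,
    [System.Security.AccessControl.AccessControlType]::Deny)

$acl = Get-Acl "AD:\$adminGroupDN"
$acl.AddAccessRule($rule)     # remove this rule again after setup finishes
Set-Acl -Path "AD:\$adminGroupDN" -AclObject $acl
```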

Setting the explicit deny is fairly easy to do in ADSI Edit, but I’ve also written a simple script to make this easier when you have a lot of legacy admin groups. The script takes no parameters. Run it once to add the Deny, and run it again to remove the Deny:

Download the script (You may need to right click->save as)


MapiExceptionNotFound during content replication

Today, I want to talk about another public folder replication problem we see repeatedly. Aren’t you glad PF replication is gone in Exchange 2013?

This is one of the rarer public folder replication issues that we see, and it’s caused by the attributes on the database. Actually, a database in this state sometimes causes a problem and sometimes does not, and I want to explain why that is.

The way this problem surfaces is that you see an event 3085 stating that outgoing replication failed with error 0x8004010f. If you try something like Update Content, you’ll get some error output with a diagnostic context that looks like this:

Error:
Cannot start content replication against public folder '\SomeFolder' on public folder database 'PFDB1'.

MapiExceptionNotFound: StartContentReplication failed. (hr=0x8004010f, ec=-2147221233)
Diagnostic context:
Lid: 1494 ---- Remote Context Beg ----
Lid: 19149 Error: 0x0
Lid: 25805 Error: 0x0
Lid: 11752 StoreEc: 0x8004010F
Lid: 25260
Lid: 19149 Error: 0x0
Lid: 25805 Error: 0x0
Lid: 11752 StoreEc: 0x8004010F
Lid: 25260
Lid: 19149 Error: 0x0
Lid: 3010 StoreEc: 0x8004010F
Lid: 3010 StoreEc: 0x8004010F
Lid: 3650 StoreEc: 0x8004010F
Lid: 3010 StoreEc: 0x8004010F
Lid: 3010 StoreEc: 0x8004010F
Lid: 3650 StoreEc: 0x8004010F
Lid: 2492 StoreEc: 0x8004010F
Lid: 2108 StoreEc: 0x8004010F
Lid: 18128 StoreEc: 0x8004010F
Lid: 18536 StoreEc: 0x8004010F
Lid: 18544 StoreEc: 0x8004010F
Lid: 18560 StoreEc: 0x8004010F
Lid: 18740 StoreEc: 0x8004010F
Lid: 1267 StoreEc: 0x8004010F
Lid: 33819 StoreEc: 0x8004010F
Lid: 27225 StoreEc: 0x8004010F
Lid: 1750 ---- Remote Context End ----
Lid: 26322 StoreEc: 0x8004010F

There are many problems that could cause some diagnostic output that looks similar to this. For this particular problem the error must be MapiExceptionNotFound, and the sequence of Lids will usually be pretty close to what you see here.

This error occurs when the replica list on a public folder contains the GUID of a public folder database which does not have an msExchOwningPFTree value. It’s easy to find a database in this state with an ldifde command to dump the properties of any public folder database objects where this value is not set:

ldifde -d "CN=Configuration,DC=contoso,DC=com" -r "(&(objectClass=msExchPublicMDB)(!(msExchOwningPFTree=*)))" -f unlinkedpfdb.txt
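If you prefer PowerShell, the equivalent query with the ActiveDirectory module looks like this (the search base is illustrative):

```powershell
# Sketch: same query via the ActiveDirectory module.
Import-Module ActiveDirectory
Get-ADObject -SearchBase "CN=Configuration,DC=contoso,DC=com" `
    -LDAPFilter "(&(objectClass=msExchPublicMDB)(!(msExchOwningPFTree=*)))" `
    -Properties msExchOwningPFTree |
    Select-Object Name, DistinguishedName
```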

To fix the problem, you can either:

  1. Delete the folder, if you can figure out which one it is.
  2. Populate the msExchOwningPFTree value.
  3. Delete the database in question from the Active Directory.

Option 1 is usually not desirable, but I included it to illustrate the fact that a database in this state only causes a problem if existing folders ever had replicas on it. Keep in mind that the replica list you see in the management tools only shows you the current active replicas. The internal replica list tracks every replica that has ever existed, forever. Even if you remove all replicas from the database in question using the management tools, the GUID of that database is still present in the internal replica list, and it always will be. Thus, you cannot unlink a database from the hierarchy if any existing folder has ever had replicas on it - at least, not without breaking replication.

This is important, because certain third-party software will purposely keep public folder databases around that are not linked to the hierarchy. And that works fine, as long as they don’t have replicas, and never did.

Option 2 is the proper approach to fixing this situation if the database is still alive. Perhaps someone manually cleared the msExchOwningPFTree while troubleshooting or trying to affect the routing of emails to public folders. Just set the value to contain the DN of the hierarchy object. You can check your other PF databases to see what it should look like, as they should all have the same value. A few minutes after setting the value, replication should start working again.
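For option 2, the fix can be sketched like this (both DNs are illustrative; copy the real msExchOwningPFTree value from one of your healthy public folder databases):

```powershell
# Sketch: populate msExchOwningPFTree on the broken PF database object.
# Both DNs are illustrative placeholders.
Import-Module ActiveDirectory

$brokenDbDN  = "CN=PFDB1,CN=Databases,CN=Exchange Administrative Group," +
               "CN=Administrative Groups,CN=Contoso,CN=Microsoft Exchange," +
               "CN=Services,CN=Configuration,DC=contoso,DC=com"
$healthyDbDN = "CN=PFDB2,CN=Databases,CN=Exchange Administrative Group," +
               "CN=Administrative Groups,CN=Contoso,CN=Microsoft Exchange," +
               "CN=Services,CN=Configuration,DC=contoso,DC=com"

# Read the hierarchy DN from a database that is still linked correctly
$hierarchyDN = (Get-ADObject -Identity $healthyDbDN `
    -Properties msExchOwningPFTree).msExchOwningPFTree

Set-ADObject -Identity $brokenDbDN -Replace @{ msExchOwningPFTree = $hierarchyDN }
```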

If the database has been decommissioned, perhaps ungracefully, and it no longer exists, then you can go with option 3 and simply delete the Active Directory object for the database using ADSI Edit. When the GUID in the replica list does not resolve to an object in the AD, that’s fine - that’s the normal state for a folder that once had replicas on databases that aren’t around anymore, so it doesn’t cause any problem.


Public Folder Replication fails with TNEF violation status 0x00008000

Edit 2014-06-02: For an update on this issue, please see this post.

Edit 2015-06-09: For the most recent update, please see this post. We now have a much better solution.

In Exchange 2010, you may find that public folder replication is failing between two servers. If you enable Content Conversion Tracing as described in my Replication Troubleshooting Part 4 post, you may discover the following error:

Microsoft.Exchange.Data.Storage.ConversionFailedException: The message content has become corrupted. ---> Microsoft.Exchange.Data.Storage.ConversionFailedException: Content conversion: Failed due to corrupt TNEF (violation status: 0x00008000)

There are other types of TNEF errors, but in this case we’re specifically interested in 0x00008000. This means UnsupportedPropertyType.

What we’ve found is that certain TNEF properties that are not supposed to be transmitted are making it into public folder replication messages anyway. These properties are 0x12041002 and 0x12051002.
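As an aside of mine (not part of the original troubleshooting): a MAPI property tag packs the property ID into the high 16 bits and the property type into the low 16 bits, which is how you can read these two tags. A few lines of PowerShell show the decoding; 0x1002 is PT_MV_LONG, a multivalued 32-bit integer type.

```powershell
# Split a MAPI property tag into its property ID (high word) and
# property type (low word). Requires PowerShell 3.0+ for -shr/-band.
function Get-MapiPropertyTagInfo {
    param([uint32]$Tag)
    [PSCustomObject]@{
        Tag          = '0x{0:X8}' -f $Tag
        PropertyId   = '0x{0:X4}' -f ($Tag -shr 16)
        PropertyType = '0x{0:X4}' -f ($Tag -band 0xFFFF)
    }
}

Get-MapiPropertyTagInfo 0x12041002
Get-MapiPropertyTagInfo 0x12051002
# Both tags share the type 0x1002 (PT_MV_LONG)
```
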

To fix the problem, you can manually remove those properties from the problem items using MFCMAPI, or you can use the following script.

The script accesses the public folder via EWS, so you must have client permissions to the folder in order for this to work (just being an administrator is not sufficient). Also, it requires EWS Managed API 2.0. Be sure to change the path of the Import-Module command if you install the API to a different path.

The syntax is:

.\Delete-TNEFProps.ps1 -FolderPath "\SomeFolder" -HostName casserver.contoso.com -UserName administrator@contoso.com

With this syntax, the script only checks for problem items in the specified folder. If you want it to fix those items, you must add -Fix $true to the command. Optionally, you can also add the -Verbose switch if you want it to output the name of every item as it checks it.

Edit: Moved the script to gist.github.com - easier to maintain that way

Edit: Updated the script to automatically recurse subfolders if desired. To do so, add -Recurse $true. For example, to process every single public folder, pass -Recurse $true with no folder path:

.\Delete-TNEFProps.ps1 -HostName casserver.contoso.com -UserName administrator@contoso.com -Recurse $true

Download the script (You may need to right click->save as)


Cleaning Up Microsoft Exchange System Objects (MESO)

Someone recently posted a question on an old blog post of mine:

Bill,

We have eliminated our public folders, and I would like to clean out the MESO folder. There are still hundreds of objects that probably serve no purpose, but I don’t see a way of determining which are still necessary.

Some examples:

  • EventConfig_Servername (where the server is long gone)
  • globalevents (also globalevents-1 thru 29)
  • internal (also internal-1 thru 29)
  • OAB Version 2-1 (and 2-2 and 2-3)
  • Offline Address Book Storage group name
  • OWAScratchPad{GUID} (30 of them)
  • Schedule+ Free Busy Information Storage group name
  • StoreEvents{GUID} (31 of them)
  • SystemMailbox{GUID} (over 700 of them)

Most of the SystemMailboxes are Type: msExchSystemMailbox, but 3 are Type: User. I found one that was created last month. Apart from the SystemMailboxes, most everything else has a whenChanged date of 2010. What to do?

Thanks, Mike

When it comes to public folders, you only need MESO objects for mail-enabled folders, and a folder only needs to be mail-enabled if people are going to send email to it. No one ever needs to send email to any of the system folders that are part of your public folder tree.

Everything in Mike’s list except the very last item is a directory object for a system folder, and even if the public folders were still present in the environment, these objects would serve absolutely no purpose. It is fine to delete them whenever you want, though if the folders themselves are still present, you might want to do it gracefully with Disable-MailPublicFolder.

The SystemMailbox objects are trickier. Each SystemMailbox corresponds to a database, and the database is identified by the GUID between the curly braces. To determine if the SystemMailbox object can be safely deleted, you need to determine if that database still exists. This is easy to do with a simple PowerShell command:

([ADSI]("LDAP://<GUID=whatever>")).distinguishedName

Here’s an example from one of my labs. You can see that the first command I typed returned nothing, because the GUID didn’t resolve (I purposely changed the last digit). The second one did resolve, returning the DN of the database.

You could also use a simple script to check all the SystemMailbox objects in a particular MESO container and tell you which ones don’t resolve:

# Check-SystemMailboxGuids.ps1
#
# This script checks for SystemMailbox objects that have GUIDs
# which correspond to nonexistent databases.

#####
#
# Change this to the MESO container you want to check
#
$mesoDN = "CN=Microsoft Exchange System Objects,DC=bilong,DC=test"
#
#####

$mesoContainer = [ADSI]("LDAP://" + $mesoDN)
$sysMbxFinder = New-Object System.DirectoryServices.DirectorySearcher
$sysMbxFinder.SearchRoot = $mesoContainer
$sysMbxFinder.PageSize = 1000
$sysMbxFinder.Filter = "(cn=SystemMailbox*)"
$sysMbxFinder.SearchScope = "OneLevel"

$sysMbxResults = $sysMbxFinder.FindAll()
"Found " + $sysMbxResults.Count + " System Mailboxes. Checking GUIDs..."

foreach ($result in $sysMbxResults)
{
    $cn = $result.Properties.cn[0]
    $guidStartIndex = $cn.IndexOf("{")
    $guidString = $cn.Substring($guidStartIndex + 1).TrimEnd("}")
    $guidEntry = [ADSI]("LDAP://<GUID=" + $guidString + ">")
    if ($guidEntry.distinguishedName -eq $null)
    {
        "Guid does not resolve: " + $cn
    }
}

"Done!"


Moving from TechNet blogs to Jekyll on Windows Azure

For nearly a decade, I’ve been occasionally blogging on the Exchange Team Blog, and later on my personal TechNet blog. Those platforms are stable, easy to use, and perfectly acceptable. But they’re not much fun. I want something I can tweak, break, and put back together again.

Now that cloud hosting has become so cheap (free web sites on Windows Azure!) and managing/updating a web site has become so easy (deployment from GitHub or a local Git repository!), I’ve decided to try blogging on a platform that is basically the complete opposite of every other major blogging platform.

It’s called Jekyll, and it’s the platform used for GitHub Pages. What makes it so different is that your blog is a static site - it’s just html and css files sitting on disk, which are served up to the browser as-is. No controllers, no server-side view engine, and no database. To add a new blog post, you literally just drop a text file in a folder, and run Jekyll to update the html files. Done.
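As an illustration (the file name, date, and front matter values here are made up), a post is just a Markdown file with a small YAML front matter block dropped into the _posts folder:

```markdown
---
layout: post
title: "Cleaning Up MESO Objects"
date: 2014-05-01
---

Post content goes here, written in plain Markdown.
```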

A complex content management system with an underlying database, such as Wordpress, is more user-friendly as a hosted solution. However, when you’re running the site yourself, all that complexity can make for a lot of extra work. Being able to manage my blog posts by just altering text files in a folder is pretty amazing.

Did I mention it also has code highlighting for practically every language under the sun, including Powershell? Now when I post a script that is a hundred lines long, it might actually be somewhat readable.

if ($ExchangeServer -eq "")
{
    # Choose a PF server
    $pfdbs = Get-PublicFolderDatabase
    if ($pfdbs.Length -ne $null)
    {
        $ExchangeServer = $pfdbs[0].Server.Name
    }
}

Alright, I’ve gushed about Jekyll enough. If you’re interested in a different kind of blogging platform, go check it out. Otherwise, stay tuned for more Exchange-related posts.
