Tuesday, December 22, 2009

SharePoint 2007 BLOBS

Hi All,

As we know, SharePoint 2007 is a web-based file system. It has a particular folder structure and files, and on top of that it has a site model and an associated API. In the hierarchy we have lists and folders, and we code against them.

The guidance for SharePoint Products and Technologies 2007 recommends limiting the content database to 100 GB per site collection as a best practice for WSS v3 and MOSS 2007 deployments. SharePoint is built on SQL Server 2005/2008, so studying the database is worthwhile, and it is always better to optimize our farm. A very common question arises here:

Q. Why is the SharePoint 2007 database restricted to 100 GB?
A. To explain, we first have to understand the SharePoint 2007 databases.

To dissect this issue, we need to understand SharePoint's 3 main databases:

Farm (such as SharePoint_Config): It has tables like Servers, Globals, Web-Apps, WSP Solutions. It is related to Central Administration.

Web App (such as WSS_Content_80): It has tables like Sites, Webs, Workflow, WebParts, Roles, Permissions,  Recycle bin, All Documents, Features, Event Receiver, Lists, Pages etc.

SSP (such as SharedServices): It has tables like Search (properties), SSP (My Site host config, Profiles, BDC config, Excel Services) etc.

Please refer to the detailed and crisp article "Understanding on SharePoint database" by Joel Oleson.

Let's discuss this in detail: 

1) SharePoint Config: 

It's the soul of the farm. It relates to "Central Administration". All the global settings, WSP solutions, timer jobs and configuration are stored here. When we make updates on the Operations tab, we are actually making changes to the config database.
The config db knows all of the server names of the servers in your farm, databases, and it knows the web applications by name.  Even down to the site collection, it keeps track of what site collections are in what databases.

It should be noted that a farm has just ONE "SharePoint_Config" database. Whether the farm has a single server or multiple servers, they all share that same configuration database.

2) Content Database:

This is the most important database for this post. All the documents which we upload through the SharePoint UI are stored in the content database. It holds a lot of information: metadata, content types, and context tables such as Webs and Sites. For our discussion, the relevant tables are "AllDocStreams", which holds all the BLOBs, and "AllDocs".

All the documents and files which are uploaded or attached sit in the "AllDocStreams" table. A few other points to note:

>> This table is the one which stores ALL documents and files.
>> One of its columns, "Content", stores the document itself.
>> The "Content" column is of the binary "image" data type.

We can also observe that there are a few important columns such as:
>> ID           -  The document identifier of the document.
>> SiteID     - The Site Collection Identifier of the site collection containing the document.
>> Size       - The size of the document stream, in bytes. For a ghosted (or uncustomized) document, this can be NULL.

It's essential to understand that the WSS file system has 2 main tables, namely:
  1. AllDocs - The file system itself.
  2. AllDocStreams - The binary content of uploaded files.
“Customized” or “Unghosted” pages are saved into the WSS file system instead of on the hard drive. This document shows how to return them to the physical disk.

So, we have observed that the SharePoint content database has a table which stores all the documents that users upload through the SharePoint UI, and that table is "AllDocStreams".
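To make the split between the two tables concrete, here is a minimal sketch using an in-memory SQLite database. The schema is a heavily simplified assumption (the real content database has many more columns, and the LeafName column and the helper functions here are only illustrative), but it shows the idea: browsing works on the metadata rows, and the binary stream is read only when a document is opened.

```python
import sqlite3

# Hypothetical, heavily simplified stand-ins for the two tables discussed above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE AllDocs (
        Id       TEXT PRIMARY KEY,  -- document identifier
        SiteId   TEXT NOT NULL,     -- owning site collection
        LeafName TEXT NOT NULL,     -- file name (illustrative column)
        Size     INTEGER            -- stream size in bytes
    );
    CREATE TABLE AllDocStreams (
        Id      TEXT PRIMARY KEY,   -- same identifier as AllDocs.Id
        SiteId  TEXT NOT NULL,
        Content BLOB                -- the uploaded file's bytes
    );
""")

def upload(doc_id, site_id, name, data):
    """Store one file: a metadata row plus a row holding the binary stream."""
    conn.execute("INSERT INTO AllDocs VALUES (?, ?, ?, ?)",
                 (doc_id, site_id, name, len(data)))
    conn.execute("INSERT INTO AllDocStreams VALUES (?, ?, ?)",
                 (doc_id, site_id, data))

def list_documents(site_id):
    """Browse a site collection using metadata only; no BLOB is read."""
    rows = conn.execute(
        "SELECT LeafName, Size FROM AllDocs WHERE SiteId = ? ORDER BY LeafName",
        (site_id,))
    return rows.fetchall()

def open_document(doc_id):
    """Fetch the binary stream, as happens when a user opens the file."""
    row = conn.execute(
        "SELECT Content FROM AllDocStreams WHERE Id = ?", (doc_id,)).fetchone()
    return row[0] if row else None

upload("doc-1", "site-A", "plan.docx", b"\x00" * 2048)
upload("doc-2", "site-A", "budget.xlsx", b"\x01" * 4096)
```

Calling list_documents("site-A") touches only the small metadata table, while open_document("doc-1") is the only call that pulls the bytes out of the stream table.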

What is a BLOB? 
A BLOB differs from regular database data in that it is not forced into a certain structure. A large proportion of data files are represented as unstructured binary files—that is, binary large object (BLOB) data files.

A large proportion of the data in a typical enterprise deployment of Windows SharePoint Services (WSS) is stored as unstructured binary data streams (BLOBs) associated with SharePoint files. These BLOBs often comprise the majority of your data.

In short, BLOBs are Binary Large OBjects: containers of unstructured bytes of data. In particular, when users search for a document they do not look for the document as a whole but for a category or a particular column, i.e. they query the metadata from the database.

SharePoint data that is not meta-data (documents – most other list items are completely meta-data) is stored in BLOBs in SQL databases.  BLOBs typically account for 60-70% of all content storage.  Most SharePoint operations act against the meta-data, not the BLOB data – until you go to click on the link and open the document.

We now have a rough knowledge of BLOBs: they are just unstructured data, and users do not really query the whole document but look for some kind of category or tag.

Problems with BLOBS when content database reaches 100+ GB 

SharePoint's strength is in document management and cross-team collaboration, i.e. Enterprise Content Management (ECM) solutions. When it comes to large-scale ECM solutions, however, SharePoint 2007 falls short.

I have read a beautiful article by Michael Nemtsev in which he goes into detail about large-scale ECM solutions.

A small snippet from his article:

Requirement 1: Large ECM scenarios can require to store millions documents in the single folder or document repository, including deep, nested folding structure to accommodate all of the organization’s documents.
Issue: SharePoint 2007 has limitation of sub-folders support, number of documents stored in list libraries, crawling limitations and etc
- SharePoint 2010 solves such issues, and you can store millions items in folders and sub folders without impact on querying data, rendering and search

He argues that SharePoint 2010 solves this. I have yet to create an environment to simulate this setup.

When we have a large ECM deployment and the content database swells beyond 100 GB, we can choose 3rd party components such as "Documentum" and "Open Text". Their strategy is as follows:

Typically, SharePoint may have multiple content databases, each limited to 100 GB. These third-party components capitalize on the fact that 60-70% of the storage volume in these databases is binary large object (BLOB) data, which is not needed for queries against the database, and they offload this BLOB content to an external storage device. Using this technique, we raise the effective content database and site collection limit to far more than 100 GB (though still under 400-500 GB).
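The effect of offloading can be checked with some rough arithmetic. The sketch below is only a back-of-the-envelope calculation based on the 60-70% figure quoted above; the function names are made up for illustration, and the results are not a vendor or Microsoft limit.

```python
def sql_footprint_gb(total_content_gb, blob_fraction):
    """Size left inside SQL once the BLOBs are moved to external storage."""
    return total_content_gb * (1.0 - blob_fraction)

def max_content_gb(sql_limit_gb, blob_fraction):
    """Total content that fits if only non-BLOB data counts against the limit."""
    return sql_limit_gb / (1.0 - blob_fraction)

# Assumption: BLOBs make up 60-70% of content storage, as stated in the post.
for fraction in (0.60, 0.70):
    left = sql_footprint_gb(100, fraction)
    total = max_content_gb(100, fraction)
    print(f"BLOB share {fraction:.0%}: a 100 GB database keeps about {left:.0f} GB "
          f"in SQL; a 100 GB SQL budget covers about {total:.0f} GB of content")
```

Under these assumptions, a 100 GB budget for the data that stays in SQL corresponds to a few hundred GB of total content, which is in line with the range mentioned above.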

SQL storage is inherently expensive, and the more data that lives in SQL, the greater the performance load to retrieve it. In particular, backup and recovery are slow for large data sets. SharePoint 2010 comes to the rescue with something called "RBS (Remote BLOB Storage)".

In SP 2007, we have to rely on 3rd party components built on EBS (External BLOB Storage), and the vendors were responsible both for managing the external BLOB storage and for creating the API libraries to interface with SharePoint. Microsoft still supports EBS, but it is deprecated.

RBS is fully managed code, can be scoped to individual content databases (instead of at the farm level), can be configured and managed via PowerShell, supports many providers (including third-party), and supports migration both ways.
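The general idea behind externalizing BLOBs can be sketched in a few lines. This is not the EBS or RBS API: the ExternalBlobStore and ContentDatabase classes below are hypothetical, with a temporary folder standing in for the external storage device and SQLite standing in for SQL Server. The database keeps only metadata and a reference, while the bytes live outside it.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path

class ExternalBlobStore:
    """Hypothetical file-based store standing in for an external storage device."""
    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data):
        # Use a content hash as the blob reference kept in the database.
        blob_id = hashlib.sha256(data).hexdigest()
        (self.root / blob_id).write_bytes(data)
        return blob_id

    def get(self, blob_id):
        return (self.root / blob_id).read_bytes()

class ContentDatabase:
    """Keeps metadata and a blob reference in SQL; the bytes live elsewhere."""
    def __init__(self, store):
        self.store = store
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE Docs (Name TEXT PRIMARY KEY, Size INTEGER, BlobId TEXT)")

    def upload(self, name, data):
        blob_id = self.store.put(data)
        self.conn.execute("INSERT INTO Docs VALUES (?, ?, ?)",
                          (name, len(data), blob_id))

    def find(self, min_size):
        """Metadata query: answered from SQL alone, the store is not touched."""
        rows = self.conn.execute(
            "SELECT Name FROM Docs WHERE Size >= ? ORDER BY Name", (min_size,))
        return [r[0] for r in rows]

    def open(self, name):
        """Opening a document follows the reference out to the external store."""
        row = self.conn.execute(
            "SELECT BlobId FROM Docs WHERE Name = ?", (name,)).fetchone()
        if row is None:
            raise KeyError(name)
        return self.store.get(row[0])

store = ExternalBlobStore(tempfile.mkdtemp())
db = ContentDatabase(store)
db.upload("plan.docx", b"a" * 2048)
db.upload("note.txt", b"hi")
```

Here db.find(1000) is answered from the small SQL table, and only db.open("plan.docx") reads from the external store, which is why the database itself stays small.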

Check out the RBS team blog for more information.

To conclude this post, SP 2007 could not support large ECM deployments on its own and relies on external vendors through the SharePoint API. SP 2010 overcomes this issue, and RBS interacts seamlessly with it.


Sky Soft
Michael Nemtsev (ECM solutions)

1 comment:

mithra said...

Interesting blog. It would be great if you can provide more details about it. Thanks you

Sharepoint Remote Blob Storage
