Gathering a detailed dataset on your SharePoint Sites


A step-by-step guide to getting SharePoint Sites information using the
Microsoft Graph Data Connect (MGDC), Azure Synapse and Power BI

0. Overview

This blog provides step-by-step instructions on how to extract SharePoint and OneDrive site information and use it to run analytics for your tenant. If you follow these steps, you will end up with a Power BI dashboard like the one shown below, which includes total bytes per site type, number of sites by owner, and total file count by month created. You can also use the many other properties available in the SharePoint Sites dataset.

Sample Dashboard

To get there, you can split the process into 3 distinct parts:

  1. Set up your tenant for the Microsoft Graph Data Connect, configuring its prerequisites
  2. Configure and run a pipeline to get SharePoint Sites using Azure Synapse
  3. Use Power BI to read the data about SharePoint Sites and show it in a dashboard

1. Setting up the MGDC

The first step in the process is to enable the Microsoft Graph Data Connect (MGDC) and its prerequisites. You will need to do a few things to make sure everything is ready to run the pipeline. Here’s what you will need:

  • Create an approvers’ group for MGDC. This is a mail-enabled security group that will have the list of people that can authorize a new MGDC data request. This group will authorize MGDC requests.
  • Enable MGDC in your Microsoft 365 Admin Center. This is where your Tenant Admin will check the box to enable the MGDC, point to the approvers group and enable the use of SharePoint datasets.
  • Create an application identity to run your pipelines. This is an application created in Azure Active Directory with the right permissions to run your pipelines and write to your Azure Storage account.
  • Create an Azure Resource Group for all the resources we will use for MGDC, like the Azure Storage account and the Azure Synapse workspace.
  • Create an Azure Storage account. This is the place in your Azure account where you will store the data coming from your pipeline. This is also the place where Power BI will read the data for creating the dashboards.
  • Grant the application identity the required access to the Storage account. This makes sure that the application identity has permission to read and write to the storage.
  • Finally, create a container and folder in your Storage Account. This is the location where the data will go.

Let’s look at each one of these.

1a. Create the MGDC Approvers’ Group

You will need to create a mail-enabled Azure Active Directory security group. The group will have the list of people in your tenant that can authorize an MGDC pipeline to run. Here are the steps:

  • Navigate to the Exchange Admin Center at https://admin.exchange.microsoft.com/
  • Select the option to manage Groups on the left
  • Select the option to Add a group
  • Select the option to create a Mail-enabled security group, then click on the Next button
  • Enter the group name and description, then click on the Next button
  • Assign owners to the group, then click on the Next button
  • Assign members to the group, then click on the Next button
  • Enter the group e-mail address and check the box to require owner approval to join the group, then click on the Next button
  • Review everything and click on the Create Group button
Create Group

1b. Enable the MGDC

The final preparation step is to go into the Microsoft 365 Admin Center and enable the Microsoft Graph Data Connect.

  • Navigate to the Microsoft 365 Admin Center at https://admin.microsoft.com/
  • Select the option to Show all options on the left
  • Click on Settings, then on Org settings, then find the settings for Microsoft Graph Data Connect
Org settings
  • Check the box to turn MGDC on and enter the Approver’s group you created in step 1a
MGDC settings
  • Make sure to check the box to enable access to the SharePoint and OneDrive datasets
  • Now you must wait 48 hours for the data to become available for your tenant.
    You can continue with the configuration, but do not trigger your pipeline before that.

1c. Create the Application Identity

You will need to create an Application in Azure Active Directory and grant it some basic permissions. You will also need to set up an authentication mechanism, like a secret. You will use this Application later when you configure the pipeline. Here are the steps:

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Azure Active Directory service in the list of Azure services
  • Select the option for App Registration on the list on the left
  • Click the link to New Registration to create a new one
App registrations
  • Enter an app name, select “this organizational directory only” and click on the Register button
  • On the resulting screen, select the link to Add a certificate or secret
  • Select the “Client secrets” tab and click on the option for New client secret
  • Enter a description, select an expiration period, and click the Add button
  • Copy the secret value (there’s a copy button next to it). We will need that secret value later.
  • Secret values can only be viewed immediately after creation. Save the secret before leaving the page.
Certificates and secrets
  • Click on the Overview link on the left to view the details about your app registration
  • Make sure to copy the application (client) ID. We will need that value later as well.
App overview
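If you want to sanity-check the application (client) ID and secret before building the pipeline, you can request a token with the OAuth2 client credentials flow that Azure Active Directory uses for application identities. The sketch below only builds the request; the tenant, client ID, and secret shown are placeholders, and the pipeline will handle authentication for you.

```python
from urllib.parse import urlencode

def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the Azure AD v2.0 client-credentials token request (URL + form body)."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Scope for Azure Storage, since the same app will write to the storage account
        "scope": "https://storage.azure.com/.default",
    })
    return url, body

# Placeholder values for illustration only
url, body = build_token_request("contoso.onmicrosoft.com", "<application-id>", "<secret-value>")
# POSTing this form body to the URL returns a JSON payload containing "access_token"
```

A POST of this body to the token endpoint should succeed once the app registration and secret from this step are in place.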

1d. Create the Resource Group

You will need to create an Azure Resource Group for all the resources we will use for MGDC, including the Storage Account and the Synapse Workspace. Here are the steps:

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Resource groups service in the list of Azure services

Resource groups
  • Click on the Create link to create a new resource group
  • Select a name and a region
  • IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant
Create a resource group
  • Click on Review + Create, make sure you have everything correctly entered and click Create

1e. Create the Storage Account

You will need to create an Azure Storage Account to store the data coming from SharePoint. This should be an Azure Data Lake Gen2 storage account. You should also authorize the Application you created to write to this storage account. Here are the steps.

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Storage accounts service in the list of Azure services
Storage accounts
  • Click on the Create link to create a new storage account
  • Select a subscription, resource group (created in step 1d), account name, region, and type (standard is fine)
  • Make sure your new account name contains only lowercase letters and numbers.
  • IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant
Create a storage account
  • Click on the Advanced tab. Under Data Lake Storage Gen2 check the box to Enable hierarchical namespace.
  • Click on Review, make sure you have everything correctly entered and click Create
  • Wait until the deployment is completed and click on Go to resource
  • Click on the Access keys option on the left to see the keys to access the storage account.
  • Click on Show for one of the two keys and use the copy icon whenever you need the key.
Access keys

1f. Grant access to the Storage Account

You will need to grant the application identity the required access to the Storage Account. Here are the steps:

  • In the Storage account you just created, click the Access Control (IAM) option on the left.
Access Control (IAM)
  • Click on the Add link on the horizontal bar and select the option to Add role assignment
  • In the Role tab, select the built-in Storage Blob Data Contributor role and click on the Next button
  • In the Members tab, select user, group or service principal and click on the Select members link
  • In the Select members window, click on the application id you created in item 1c and click the Select button
  • Then click on the Review + Assign button
Add role assignment
  • Review the role assignment and click on the Review + assign button.
  • You’ve now completed the role assignment

1g. Create a container and folder in your Storage Account

The last setup step is to create a container and folder for the data you will bring from MGDC. Follow these steps:

  • In the Storage account you just created, click the Containers option on the left.
  • You will see only the default $logs container in the list. Click on the + Container link on the horizontal bar, enter a name for the new container, and click Create
Container with folder
  • Click on the newly created container and, inside it, click on + Add Directory. Enter a folder name and click Save
  • With that, you have a location to later store your data with the path as container/folder
Folder
  • You’ve now completed all the preparation steps
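The container and folder you just created will be addressed later, from both Synapse and Power BI, through the storage account's Data Lake Storage (DFS) endpoint. A minimal sketch of how that URL is composed, using hypothetical names for the account, container, and folder:

```python
def adls_folder_url(account: str, container: str, folder: str) -> str:
    """Build the ADLS Gen2 (DFS endpoint) URL for a container/folder."""
    return f"https://{account}.dfs.core.windows.net/{container}/{folder}"

# Hypothetical names for illustration
print(adls_folder_url("mgdcstorage", "mgdc", "sites"))
# https://mgdcstorage.dfs.core.windows.net/mgdc/sites
```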

2. Run a Pipeline

Next, you will configure a pipeline in either Azure Data Factory or Azure Synapse. We will use Synapse here. You will trigger this pipeline to pull SharePoint data from Microsoft 365 and drop it on the Azure Storage account. Here’s what you will need:

  • Create a new Azure Synapse workspace. This is the place where you create and run your pipelines.
  • Use the Copy Data tool in Azure Synapse. This tool will help you with the task.
  • Create a new source to get the SharePoint sites dataset from Microsoft 365.
  • Create a new destination with a storage folder in Azure Storage to receive the data
  • Deploy and trigger the pipeline
  • Have someone in the Approvers group authorize the request
  • Verify that the pipeline has finished running and that the data is available

Let’s look at each one of these.

2a. Create the Azure Synapse workspace

To get started, you need to create an Azure Synapse workspace, if you don’t already have one. Here are the steps:

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Azure Synapse Analytics service in the list of Azure services.
  • Click on the Create link to create the new Azure Synapse workspace
  • Enter the subscription, resource group (created in step 1d), the new workspace name, region, storage account name (created in step 1e) and new file system name.
  • IMPORTANT: You must use a region that matches the region of your Microsoft 365 tenant
Create Synapse workspace
  • Click on the Security tab. Select the option to Use only AAD authentication. Click on the Review+create button.
  • Click Create. Wait until the deployment is completed and click on Go to resource

2b. Use the Copy Data tool in Azure Synapse

Our Synapse pipeline will use a data source (Microsoft 365) and a data sink (Azure Storage). Let’s start by configuring the data source. Follow these steps:

  • Navigate to the Azure Portal at https://portal.azure.com
  • Find the Azure Synapse Analytics service in the list of Azure services.
  • Click on the name of your Azure Synapse workspace (created in item 2a)
  • Click on the Open link inside the big box for Synapse Studio
Azure Synapse
  • In the Synapse Studio, select the fourth icon on the left to go to the Integrate page.
  • Click on the big + icon and select the option for the Copy Data tool to start.
Integrate
  • Keep the options for the Built-in copy task and Run once now. Then click the Next button.
Properties
  • You will then have to define the source and destination

2c. Define the data source

The first step is to define your data source, which will be the Microsoft Graph Data Connect (MGDC source). Here are the steps you should take:

  • On the Source data store page, click on the New connection option
Copy data store
  • On the New connection page, enter “365” on the search box and select Microsoft 365 (Office 365)
New connection type
  • Click the Continue button to reach the page to define the details of the new connection.
  • Enter the Name and Description for the new connection
  • Also enter the Service principal ID and the Service principal key. These are the application id and the secret that we captured in step 1c.
New connection properties
  • Click on the Test connection option on the bottom right to make sure the credentials are working
  • Then click on the Create button to create the new connection and go back to the Source data store page
  • This time around, the connection will be filled in and the list of datasets will be available
  • Check the box next to SharePointSitesDataset_v0_Preview and click on the Next button
Source data store properties
  • In the Apply Filter page, keep the default scope.
  • Select SnapshotDate as a column filter and select a date.  Use the same date for Start and End Time
  • IMPORTANT: Valid dates go from 23 days ago to 2 days ago.
  • IMPORTANT: You cannot query dates before you enabled SharePoint dataset collection.
Apply filter
  • Click on the Next button to finish the source section and move to the destination section
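The valid date window mentioned above can be sketched as a quick calculation. This is only an illustration of the stated rule; MGDC enforces the actual limits on the server side:

```python
from datetime import date, timedelta

def valid_snapshot_window(today: date) -> tuple[date, date]:
    # MGDC accepts SnapshotDate values from 23 days ago up to 2 days ago
    return today - timedelta(days=23), today - timedelta(days=2)

start, end = valid_snapshot_window(date(2022, 6, 30))
print(start, end)  # 2022-06-07 2022-06-28
```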

2d. Define the data destination

Next, you need to point to the location where the data will go, which is an Azure Storage account. Here are the steps:

  • On the Destination data store page, click on the New connection option
Destination data store
  • Select the option for Azure Data Lake Storage Gen 2
New connection type
  • Click the Continue button to reach the page to define the details of the new connection.
  • Enter the Name and Description for the new connection
  • Change the Authentication type to Service Principal, add the Storage account name from the drop-down list.
  • Enter the Service principal ID and the Service principal key. Again, these are the application id and the secret that we captured in step 1c.
New connection properties
  • Click on the Test connection option on the bottom right to make sure the credentials are working
  • Then click on the Create button to create the new connection and go back to the Destination data store page
  • This time around, the connection will be filled in and a few options will be available
  • Enter a Folder path. This is the container and folder you created in step 1g; you can browse to it.
Destination data store properties
  • Click on Next to reach the Review and finish page of the Copy Data tool

2e. Deploy and trigger the pipeline

Now we will deploy the pipeline and run it. Follow the steps:

  • In the Review and finish page, click the Edit link on the top right to enter a name and description for your pipeline. Then click Save.
Enter task name
  • Click on the Next button to start the deployment.
Deployment complete
  • Once it’s all finished, click on the Monitor button to see how the pipeline is running
Pipeline runs
  • If you click on the Pipeline name, you will see the detail for each activity in the pipeline. In this case, you should see only one activity in the pipeline, which is the copy of the dataset.
Consent pending
  • You should see the status as ConsentPending, which means this request needs to be approved.

2f. Approve the Request

Before the data can be extracted, someone in the approvers group will need to approve your request.

  • Everyone in the group you defined in steps 1a and 1b will receive an e-mail requesting approval. The user triggering the request will be on the CC line.
Approval e-mail
  • Have someone in the Approvers group approve the request as described in the e-mail.
  • They must use the Privileged Access Management (PAM) page in the Microsoft 365 Admin Center.
Privileged access requests
  • Click on the latest request to see the details for the Data Access Request, including who is requesting data, which dataset is being requested, which columns are included, where the data is going to be stored, etc.
Data access request
  • Click on the Approve button to allow the request to pull the data.
  • IMPORTANT: The user approving the request cannot be the same user issuing the request.
  • After approval, the pipeline should progress to the next stage
Extracting data
  • Wait until the status for the activity and pipeline reaches Succeeded. This could take a few minutes, depending on the number of sites in your environment.
Pipeline succeeded
  • Once the pipeline has finished running, the data will be in Azure Storage, in the container and folder that you have specified. It shows as one or more JSON files, plus a metadata folder with information about the request.
Folder with JSON files

3. Create a Power BI Dashboard

The last step is to use the data you just got to build a Power BI dashboard. Here’s what you will need:

  • Create a new Power BI file.
  • Query the data from the Azure Storage account.
  • Create your dashboard.

3a. Create a new Power BI file

Now that you have the data in Azure Storage, you can bring it into Power BI to build reports and dashboards. To get started, open Power BI Desktop and create a new, blank report. Save it as a new Power BI file.

3b. Query the Data

Now you can bring the data into Power BI, directly from Azure.

  • In your new Power BI report, on the Home tab, click on the Get Data dropdown menu and click on More
Power BI get data
  • In the list of sources, select Azure, click on Azure Data Lake Storage Gen2 and click on Connect.
Get data - Azure Data Lake Storage Gen2
  • Enter the URL of the container and folder you created in step 1g (the Data Lake Storage endpoint, in the form https://<account>.dfs.core.windows.net/<container>/<folder>) and click OK
Enter URL
  • On the next screen, you need to authenticate to the storage account.
  • Select the option to provide an account key. This is the access key you copied in step 1e.
  • Click Connect
Enter account key
  • In the following screen you will see the list of JSON files coming from the storage account
List of JSON files
  • Note that you get two JSON files, but keep in mind that one of them is just the metadata file
  • Click on the Transform Data button to load all the files into a Power Query
  • The Power Query Editor window will show, with the files listed.
  • First, change the query Name from Query1 to a more meaningful name
  • Next, scroll to the left until you find the Folder Path column
  • You should see one of the paths that includes a metadata folder. We want to filter that out.
Folder Path column
  • On the row with the Folder Path that includes the word metadata, right click that cell, select the Text Filters option and then the Does Not Contain option. That will get rid of that row only.
Text filters - does not contain
  • Now that you removed the row for the metadata, scroll all the way to the right to find the Content column
  • On the Content column, click on the icon with two down arrows called Combine Files (see arrow below)
Combine files
  • At this point Power BI does a lot of work on the data, including loading the JSON files, renaming the columns, and expanding the columns that contain structures (like Storage Metrics and Owner).
  • You can now just click on the Close and Apply button to close the Query Editor
Close and Apply
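Under the covers, the dataset files are newline-delimited JSON, one site per line, and the step above filters out the metadata folder before combining them. If you ever want to inspect the files outside Power BI, here is a minimal sketch of the same filter-and-parse logic; the file paths and contents below are made up for illustration:

```python
import json

def load_site_rows(files: dict[str, str]) -> list[dict]:
    """Parse MGDC output files (newline-delimited JSON), skipping the metadata folder."""
    rows = []
    for path, content in files.items():
        if "metadata" in path:  # same idea as the "does not contain" filter in Power Query
            continue
        for line in content.splitlines():
            if line.strip():
                rows.append(json.loads(line))
    return rows

# Made-up sample content for illustration
sample = {
    "mgdc/sites/part-00000.json": '{"Id": "site1"}\n{"Id": "site2"}',
    "mgdc/sites/metadata/result.json": '{"status": "done"}',
}
print(len(load_site_rows(sample)))  # 2
```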

3c. Create the Dashboard

Now that the data is available in Power BI, let’s create some dashboards.

  • After you close the Query Editor and return to the main Power BI window, all the Sites data will be available for you to create reports and dashboards. You will find the columns in the Fields pane on the right.
Fields pane
Dashboard complete

4. Conclusion

You have triggered your first pipeline and populated a dashboard. Now there’s a lot more that you could do. Here are a few suggestions:

  1. Investigate the many datasets in MGDC, which you can easily use in your Synapse workspace
  2. Trigger your pipeline on a schedule, to always have fresh data in your storage account
  3. Extend your pipeline to do more, like join multiple data sources or take the data to a SQL database
  4. Publish your Power BI dashboard to share with other people in your tenant

You can read more about the Microsoft Graph Data Connect at https://aka.ms/mgdcdocs. There you will find many details, including a list of datasets available, complete with schema definitions and samples.
