Knowing What’s in a Site Document Library is Difficult

Often, I learn when the need arises to do something. For instance, SharePoint Online sites are fine repositories for documents and other information, but they can be hard to control in terms of what is in document libraries or know about everything stored in SharePoint site files. Few of us have the patience (or time) to review the contents of every folder in a document library to decide what should be retained and what is past its best by date. Tools like retention policies and labels help by removing items automatically after their retention period lapses, but these are broad-brush blunt mechanisms that deal with every item in the same way rather than on an item’s merits.

Which brings me to my opportunity for learning. The Office 365 for IT Pros team has been working on our eBook since 2014. Over nine editions and countless revisions, we have accumulated a large amount of content in the SharePoint Online site we use. It’s easy to find out how much storage a site uses through the SharePoint admin center or with PowerShell, but that doesn’t tell us what’s in a site. I wanted to know what documents were in what folders, the size of the documents, and their authors.

Searches for a feature to do the job found nothing. Microsoft 365 includes SharePoint activity reports and file and folder sharing reports and usage reports. Site contents will tell me the total number of files in the document library (and even more in the preservation hold library), and the storage metrics for the site informs me about storage consumption (Figure 1). But nowhere do I find a simple listing of documents.

Storage metrics for a SharePoint Online site
Figure 1: Storage metrics for a SharePoint Online site

Processing SharePoint Site Files with PowerShell

Traditionally, PowerShell modules have handled the automation of administrative activities, like generating a list of SharePoint Online sites or Microsoft 365 Groups. The SharePoint Online PowerShell module is firmly in this tradition and doesn’t offer any ways to navigate within a site and report what’s found there. The PnP PowerShell module includes cmdlets to open folders and list files found there, and I’m sure that you could create a document report with its cmdlets.

I chose to use Microsoft Graph API requests instead on the basis that I knew the Graph better than PnP. More importantly, I had some code that I could reuse (rule number 1 of any PowerShell project is to find some code to start off). In this case, I had a script to decrypt protected SharePoint files by removing sensitivity labels from the files.

Steps to Create a SharePoint Site Files Report

Conceptually, the steps to create a report listing the files found in a SharePoint Online site are straightforward:

  • Connect to the Microsoft Graph with the correct permissions. It’s always a good idea to run the Disconnect-MgGraph cmdlet beforehand to remove any previous session.
  • Identify the target site.
  • Find the document library (for the purpose of this article, we assume the target site has a single document library; this is usually the case for most sites today, especially those connected to Teams).
  • Find the folders in the document library.
  • Report the documents stored in the folders and any sub-folders in those folders.

Although I use Graph API requests in the script to retrieve information about folders and files, I decided to use the Invoke-MgGraphRequest cmdlet from the Microsoft Graph PowerShell SDK to run the requests instead of a cmdlet like Invoke-RESTRequest. Using the SDK means that I didn’t have to create a registered app in Azure AD to hold the necessary Sites.Read.All permission needed to read site data. This is an acceptable method to use on a one-off basis. In production, it would be better to use a registered app and a certificate for authentication. The Microsoft Graph PowerShell SDK fully supports this approach, which has these advantages:

With those thoughts in mind, let’s consider some points of interest in the implementation.

Drives, Items, and Folders

The Microsoft Graph deals with SharePoint (and OneDrive) document libraries like computer drives (here’s an article about reporting OneDrive for Business shared files). After identifying the target site, the next step is to fetch whatever drives exist within the site. The third line looks for a drive named Documents. This is the English language version of the default document library. You’ll probably have to change this to make the script work against sites configured in other languages.

$Uri = "https://graph.microsoft.com/v1.0/sites/$($Site.Id)/drives"
[array]$Drives = Invoke-MgGraphRequest -Uri $Uri -Method Get
$DocumentLibrary = $Drives.Value | ? {$_.name -eq "Documents"}

Next, we get the items in the document library. These can be individual files and they can also be folders. Here, I fetch the contents and then filter them into folders and files.

Next, we get the items in the document library. These can be individual files and they can also be folders. Here, I fetch the contents and then filter them into folders and files.
$Uri = "https://graph.microsoft.com/v1.0/sites/$($Site.Id)/drives/$($DocumentLibrary.Id)/root/children"
[array]$Items = Invoke-MgGraphRequest -Uri $Uri -Method Get
# Find sub-folders that we need to check for files
$Folders = $Items.Value | ? {$_.Folder.ChildCount -gt 0 }
# And any files in the folder
$Files = $Items.Value | ? {$_.Folder.ChildCount -eq $Null}

One important point is that the code doesn’t include anything to handle pagination. Graph API requests limit the number of items they return to avoid problems which might happen if they returned very large amounts of data. In this instance, the Invoke-MgGraphRequest cmdlet returns up to 200 items. This should be enough for the root of a document library, but it might not be. If this is the case, you’ll need to check if the information returned by the request includes a nextlink (a URI to the next page of available data). If a nextlink exists, the script needs to run another Graph request to follow the link to retrieve the waiting data. You can see how to do this in the GetFilesFromFolder function in the script.

After finding the files and folders, the script reports what it discovers (Figure 2) and creates a report. Processing is rapid because the script only reads information. Generating a report for a site holding 1,308 documents took about five seconds. Another site with 7,512 documents took 15.

Running the script to report files in a SharePoint Online site
Figure 2: Running the script to report files in a SharePoint Online site

Output

The output is in a PowerShell list, so it’s easy to generate whatever output you prefer. I often use the Out-GridView cmdlet to review the output of a script (Figure 3), but you could also output a CSV file, an Excel spreadsheet (using the excellent ImportExcel PowerShell module), or generate a HTML file. For the latter, you could try the PSWriteHTML module (see its documentation for examples).

Using Out-GridView to examine details of files found in a SharePoint Online site
Figure 3: Using Out-GridView to examine details of files found in a SharePoint Online site

Not a Complete Solution

It’s important to emphasize that the script (downloadable from GitHub) is not a complete solution. Instead, it’s a proof of concept to demonstrate how to interact with SharePoint document libraries using Graph API requests. Lots more could be done to improve error handling, handle pagination for the root folder, handle multiple layers of folders, deal with sites that have multiple document libraries, and so on. I’ve tested the code against multiple sites, and it appears to work well. At least, it does in my tenant. I’m interested in what others think and the improvements you make. The script is, after all, just PowerShell code, so anyone can change (improve) it.

About the Author

Tony Redmond

Tony Redmond has written thousands of articles about Microsoft technology since 1996. He is the lead author for the Office 365 for IT Pros eBook, the only book covering Office 365 that is updated monthly to keep pace with change in the cloud. Apart from contributing to Practical365.com, Tony also writes at Office365itpros.com to support the development of the eBook. He has been a Microsoft MVP since 2004.

Comments

  1. Paul Conaty

    Could this be used to determine files with and without a Retention label applied? Would be useful for tracking rollout of labels.

  2. Laura

    I’m trying to call the creation date of the sharepoint sites in the api graph report, but I can’t get it to bring me data, I don’t know if I’m correctly calling the field $Site.” createdDateTime”

  3. Jim Chisholm

    I use Excel and Power Query to extract file metadata from an entire site.
    It’s then straightforward to present the data in Sliceable report tables for the entire team to use in determining file disposition.
    Thanks for posting your scripts – they’re good learning tools.

    1. Avatar photo
      Tony Redmond

      Sounds like a good idea. Maybe you’d write up the approach in an article for us?

Leave a Reply