Thanks Vince. This will be the major topic of today’s meeting.

-mdg


On Oct 24, 2018, at 12:38 AM, Vincent Bareau <vbareau@EBSCO.COM> wrote:

I agree that it would be useful to create a formalized technical blueprint for the direction of reporting support within Folio. However, my perspective is that there has been an established direction for quite some time. Expressed in the form of a technical blueprint, it would contain the following:

The evolution of the currently established technical direction is as follows.
  • The Reporting SIG master spreadsheet - started in May 2017
    • This documents existing use cases from institutions involved with the SIG.
  • An analysis period followed to examine the cases found in the above spreadsheet with the goal of guiding the design of a solution within the constraints of Folio.
  • On Nov 27, 2017, Peter Murray presented to the Reporting SIG an overview of Data Warehouses vs. Data Lakes.
  • In parallel, around this time, Qulto was tasked to evaluate the available landscape of open source Reporting/Analytics platforms and frameworks. A key consideration was that if one were to be included in Folio, its licensing model would need to be compatible with Folio's Apache v2 license.
    • The conclusion was that there were few available OSS reporting frameworks, and no serious contenders with a licensing model compatible with Folio's Apache 2 license.
    • Furthermore, it was deemed that most of these tools were focused on the visualization aspects of reporting, predicated on direct access to a monolithic back-end SQL database. 
    • Finally, it was noted that these OSS solutions lagged severely behind the capabilities of commercial offerings, be they building block offerings such as AWS Redshift or turnkey solutions.
  • The chosen technical direction was presented at the Dec 04, 2017 Reporting SIG meeting and discussed for several meetings afterwards. To summarize:
    • the design was informed by the use cases in the Reporting SIG Master Spreadsheet.
    • some use cases represent import/export functionality, which is best handled as part of the appropriate apps rather than as reports.
    • some use cases represent reports limited to a single domain and best handled as "in-app reports" within the individual apps - these are typically dashboards, statistical or otherwise tabular in nature.
    • the remaining use cases represent cross-domain reporting which requires data provided from disparate apps. These would be handled in an analytics system.
    • Some early notions of using a Data Lake and Edge APIs are introduced
  • A slightly refined version of the same presentation was given in a Reporting breakout session at the Madrid Dev. Meeting in Jan 2018.
    • This added the notion that Folio might provide a report designer (a UX consideration, which can therefore be dealt with later)
  • As a result of the Madrid discussions, a small team, led by EBSCO's Matt Reno, created a Proof of Concept to illustrate how a Data Lake could be used to generate a near real-time, cross-domain analytics report (circulation). This was demoed to the Reporting SIG.
    • Okapi modifications to extract data; AWS Kinesis as the data stream; AWS S3 as the Data Lake; AWS Glue + AWS Athena as Data Abstraction (i.e. simple Data Warehouse); BIRT as the visualization tool
    • Missing from the PoC were Edge APIs to resolve transactional identifiers (i.e. UUIDs).
    • Lessons learned from the PoC were used to inform the design of AES and refinement of the Data Lake integration (message queue)
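To make the PoC pipeline concrete, here is a minimal sketch of the side-streaming idea (Okapi event -> stream -> Data Lake). The event field names, tenant, and identifiers below are illustrative assumptions, not the PoC's actual schema; an in-memory list stands in for Kinesis/S3.

```python
import json
import uuid
from datetime import datetime, timezone

def make_circulation_event(tenant, action, item_id, user_id):
    """Build a raw transaction event as Okapi might side-stream it.

    All field names here are illustrative, not the PoC's actual schema.
    """
    return {
        "eventId": str(uuid.uuid4()),
        "tenant": tenant,
        "domain": "circulation",
        "action": action,          # e.g. "checkout" / "checkin"
        "itemId": item_id,         # transactional identifier (UUID)
        "userId": user_id,         # resolved later via Edge APIs
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

# Stand-in for the Data Lake (in the PoC: Kinesis stream -> S3 bucket).
data_lake = []

def stream_to_lake(event):
    """Append the raw event, unmodified, to the lake (schema-on-read)."""
    data_lake.append(json.dumps(event))

stream_to_lake(make_circulation_event(
    "diku", "checkout",
    "11111111-1111-1111-1111-111111111111",
    "22222222-2222-2222-2222-222222222222"))
```

Downstream, a Glue/Athena-style layer would impose a table over the raw JSON at query time, which is what made the lake usable as a simple Data Warehouse in the PoC.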
The following constraints guided the established technical direction.
  • Folio is based on micro-services and a monolithic database is not available
  • Folio operates on data which may not be stored in Folio but is externally linked instead (e.g. a KB or Student Information Systems)
  • Open Source analytics tools and Frameworks are not available to be incorporated into this project
  • Commercial analytics offerings are much more capable
  • Institutions may already subscribe to commercial analytics tools or platforms - different ones for different institutions
  • Institutions will want to integrate other (non-Folio) campus systems to their analytics platform
  • Building a homegrown analytics platform from the ground up is a very large undertaking. Case in point: Alma has not developed its own but chose instead to license Oracle Analytics.
This all led to the logical conclusion that:
  • Folio should not attempt to develop its own homegrown analytics platform
  • In order to preserve freedom of choice and best-in-breed options for analytics, Folio should allow integration with existing (commercial) Analytics platforms. 
  • Folio should seek to support the needs of existing Analytics platforms
  • The choice of a Data Lake over a Data Warehouse is the more flexible one, since a Data Warehouse can always be implemented downstream of a Data Lake: a Data Lake is by definition comprehensive, the Data Warehouse selective. Neither, however, should be part of Folio per se.
  • Folio should provide a streaming mechanism to feed any Data Lake hosted externally (or other external systems).
  • Folio should provide a message queue which can be used to buffer events that are exported by a stream
  • Folio should provide Edge APIs to be used for identifier resolution (to supplement streaming mechanism). Identifier resolution also means that the current state (metadata) of the object to which the identifier belongs can be retrieved through the Edge APIs.
  • Edge APIs also offer an abstraction layer that can help mitigate internal Folio data schema changes from external systems (such as a reporting system).
  • Furthermore, Edge APIs could also be used to deliver any bulk data export. This is actually already being developed in the case of OAI-PMH.
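The Edge API role described above can be sketched minimally as follows. The resolver, internal field names, and external shape are all invented for illustration; the point is that a consumer resolves a transactional UUID to the object's current metadata through a stable external representation, even if Folio's internal schema changes.

```python
# Stand-in for a Folio storage module's internal records.
# The internal field names (left) are free to change between releases.
_internal_items = {
    "11111111-1111-1111-1111-111111111111": {
        "effectiveLocationId": "loc-42",
        "itemLevelCallNumber": "QA76.9",
        "barcode": "31234000001",
    },
}

def resolve_item(uuid_):
    """Edge-API-style resolution: map a transactional UUID to the
    current metadata of the object, in a stable external shape."""
    rec = _internal_items.get(uuid_)
    if rec is None:
        return None
    # The abstraction layer: these external names stay fixed even if
    # the internal schema above is renamed or restructured.
    return {
        "id": uuid_,
        "location": rec["effectiveLocationId"],
        "callNumber": rec["itemLevelCallNumber"],
        "barcode": rec["barcode"],
    }
```

A reporting system fed only raw stream events would call such a resolver to turn the UUIDs carried in those events into human-meaningful values, which is the "supplement to the streaming mechanism" mentioned above.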
Meanwhile,
  • In June 2018, the Reporting SIG generated a proposal to create a "Reference Implementation" for a Data Warehouse. This would be a Data Warehouse downstream of a Data Lake (as per the established technical direction) - in other words, aligned with any AES effort.
  • A PoC project kicked off in the Fall of 2018 and very soon morphed into a Library Data Platform (LDP), which now intends mostly to bypass any data lake, focusing instead on rebuilding a copy of Folio storage as a monolithic database using bulk data exports of all Folio objects.
  • The LDP project appears to be running toward the clear risk of creating a homegrown analytics platform as part of Folio. This is not to be taken lightly, nor should we underestimate the scope of such an effort.


_
V

From: tech-council@ole-lists.openlibraryfoundation.org <tech-council@ole-lists.openlibraryfoundation.org> on behalf of Peter Murray <peter@indexdata.com>
Sent: Monday, October 22, 2018 6:00:10 PM
To: tech-council@ole-lists.openlibraryfoundation.org
Subject: Re: Reporting
 

The requirement for a batch ETL took me somewhat by surprise -- beyond the initial load I thought everything could be found through processing a data lake of raw transaction data side-streamed from Okapi.  I now understand better the desire for processing on the as-is state of data in a module.  I'm still trying to understand Vince's concept of an "edge API" for entity resolution (UXPROD-355 through UXPROD-358) and how they relate to the overall goal.  Tod's description of the state of what libraries want matches my expectations from my days as a systems librarian.  I can also relate to David's desire to not have FOLIO impose tight specifications for a data warehouse as he has a desire to use existing data warehouse capabilities.  

Regarding Mike's point #1: "It is concerning that the data warehouse approach may be too closely tied to schemas inside FOLIO in that changes to FOLIO would force upkeep/maintenance of the ETL" -- there is a need for a versioning aspect here that may not be stated yet.  The data (whether side-streamed from Okapi or a batch dump from the storage module) needs to carry schema versioning information with it so the ETL process knows how to handle it.  I would offer that a data lake of module transaction data with a version label is _in scope_ for FOLIO.  Where I start to get uncomfortable is whether the "T" part of ETL is in scope for FOLIO or not.  I remember discussions with Harry about whether there exists some sort of meta-language for describing transformations into data warehouses, but finding nothing of the sort.

If such a thing doesn't exist, then we do need to formalize some sort of way for a module to describe the data it has as well as characterizing what is available as an AES side-stream from Okapi and what is available in bulk.
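One hedged sketch of what such a formalization could look like: each side-streamed or bulk-exported record carries a schema version label, and the ETL's "T" step dispatches on it. The version strings, field names, and transforms below are all illustrative assumptions.

```python
TRANSFORMS = {}

def transform(schema_version):
    """Register a 'T' step for one schema version of a record type."""
    def register(fn):
        TRANSFORMS[schema_version] = fn
        return fn
    return register

@transform("loan-1.0")
def loan_v1(payload):
    return {"loanId": payload["id"], "dueDate": payload["dueDate"]}

@transform("loan-2.0")
def loan_v2(payload):
    # v2 renamed dueDate -> dueDateTime; the warehouse shape is unchanged,
    # so a FOLIO schema change stays contained in this one transform.
    return {"loanId": payload["id"], "dueDate": payload["dueDateTime"]}

def etl(record):
    """Route a versioned record to the matching transform."""
    fn = TRANSFORMS.get(record["schemaVersion"])
    if fn is None:
        raise ValueError("no transform for " + record["schemaVersion"])
    return fn(record["payload"])
```

The design point is the one made above: as long as every record declares its version, upkeep after a FOLIO module change is a new registered transform rather than a rework of the whole ETL.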


Peter
On Oct 22, 2018, 11:06 AM -0400, Mike Gorrell <mdg@indexdata.com>, wrote:
Others - please chime in with thoughts.

-mdg


On Oct 19, 2018, at 1:24 PM, Tod Olson <tod@uchicago.edu> wrote:

To lay out some of the context explicitly:

The Reporting SIG raised the need for implementing libraries to write custom reports that gather data from across the disparate data domains or storage modules. This is an operational need for most if not all libraries to go live. These reporting needs range from communicating out to external entities like government and national organizations to local operational needs, including the need to do research on the state of one's data. Having the data split into different silos is a large obstacle, and having a solution that will work in a hosted environment is also a large obstacle.

The Reporting SIG has presented these needs to the Product Council and asked for them to be met by what has been called the Data Warehouse Reference Implementation. The PC seems to be in agreement with the Reporting SIG that these are critical needs to be met at the project level. Put another way: how many libraries will adopt a system that cannot demonstrate solid reporting? That is a valid project-level concern.

Some of the pushback within the TC is about what is and is not properly part of FOLIO, arguing that a data warehouse is outside of FOLIO proper and would be a considerable drain on resources. To me this seems like misalignment around scope between the TC on one hand and the Reporting SIG and PC on the other.

There is an additional question about the technology needed to support reporting. The Reporting SIG has tried to state the outcomes, but not prescribe technologies, so as not to preclude some possible solutions. That said, many in the Reporting SIG are accustomed to direct SQL access to the data and have a bias in that direction for the usual reasons: comfort with the capabilities, and a sunk cost in training that could be leveraged. The TC (or some delegated party) should feel free to propose a non-RDBMS solution, but acknowledge the need to sell a different type of solution.

It is also the case that the Reporting SIG, in anticipation of a data warehouse reference, is currently dividing up the reports in the master spreadsheet and looking for people within the institutions (often the SIG member) who can take responsibility for writing those reports. These are all going into JIRA, assigned to SIG members. I don't know if that's useful context for the TC. It does speak to the Reporting SIG expecting to do a lot of the work themselves. And the intent is to be able to share reports, once written, among the libraries.

I rather suspect that there will need to be some direct dialogue between the Reporting SIG and the TC.

-Tod

On Oct 17, 2018, at 12:55 PM, Mike Gorrell <mdg@indexdata.com> wrote:

Here are my notes/thoughts after our meeting today. Please comment as necessary.

  1. The Tech Council (and FOLIO at large) doesn’t have a technical blueprint on meeting reporting needs for FOLIO.
  2. We have 2 potentially related efforts: a reference implementation of the Library Data Platform (UXPROD-1128) and what turned into the AES (UXPROD-330) - both of which were earlier thought to represent the ‘official’ approach to reporting. These two aren’t aligned.
  3. FOLIO needs an official approach to reporting, including a technical blueprint and a clear delineation between what the FOLIO platform provides and whatever an implementation/tenant needs to provide themselves.

Specific thoughts related to the LDP document and the proof of concept that is being built:
  1. It is concerning that the data warehouse approach may be too closely tied to schemas inside FOLIO in that changes to FOLIO would force upkeep/maintenance of the ETL
  2. The original thought of FOLIO providing a Data Lake (from which a tenant would create potentially its own data warehouse or other reporting capabilities) may alleviate some of the maintenance burden associated with changing FOLIO Modules/schemas.
  3. The proof of concept is scheduled to finish in a week or two. We agreed to provide the core team support needed to finish this.
  4. It is assumed that the LDP effort aligns with what the Reporting SIG desires. Inasmuch as the LDP isn’t totally aligned with the TC, that means the TC is not aligned with the Reporting SIG, which is a problem.

The Tech Council has architectural responsibility for FOLIO and its approach to reporting. As such, the following actions are recommended:
  1. Creation of a reporting architectural blueprint. This would include where data is streamed vs. batch exported, where data lakes may be involved, and generally how tenants would report against their FOLIO content and activity
  2. We need to resolve open questions related to the LDP approach/prototype, starting with documented and prioritized concerns. These should be discussed with Nassib. It’s possible those discussions impact the Blueprint. 
  3. We need to align with the Reporting SIG. Once we have performed the actions above we should meet to present the blueprint and discuss their needs and concerns.

Looking forward to your comments. I’d like to have some email dialog (or would you prefer a Google Doc?) so that we have as many issues ironed out prior to next Wednesday’s meeting. If anyone wants to volunteer to draft any part of the reporting blueprint or documented concerns with LDP, please do volunteer.

Thanks!

-mdg


To unsubscribe from this list please go to http://archives.simplelists.com
