EVENT DETAILS
Road Testing Community Data CO-OPs – Social Data in Action
22 July 2021
Speakers:
Prof Jane Farmer, Swinburne node, ADM+S (chair)
Prof Anthony McCosker, Swinburne node, ADM+S (chair)
Assoc Prof Amir Aryani, Head of Social Data Analytics, Social Innovation Research Institute, Swinburne
Prof Paul Henman, UQ node, ADM+S
Watch the recording
Duration: 1:01:07
TRANSCRIPT
Prof Jane Farmer:
Thanks everyone for being here. Obviously some more people will trickle in as we move along here. Welcome to our fifth and final webinar in the social data in action series. Hopefully you’ve been to all of the other ones and Amir, this is the culmination. So, we’re looking for something fabulous, no pressure. Yeah so, it would be great to see your faces if you feel brave enough at any point and definitely at the end of the Q and A session. So, let’s move on, Paul.
And so, I just want to acknowledge that we are hosting, or I am hosting this webinar from the lands of the Wurundjeri people of the Kulin Nation, and I also acknowledge the traditional custodians of the various lands on which you’re all working today, and the Aboriginal and Torres Strait Islander people participating in this webinar, and I pay my respects to elders, past, present and emerging, and celebrate the diversity of Aboriginal peoples and their ongoing cultures and connections to the lands and waters.
Okay so, Amir’s going to speak for 25 to 30 minutes. We do encourage you to stack your questions into the chat as we go along, and obviously we’ll call for questions at the end, which we’d love you to put into the chat. Anthony, if he ever makes his way through Zoom and into this webinar, and I will be managing the Q and A at the end, and we will be recording this session. So, if you have any challenges or problems with that, please get in touch with Paul via swinburne.edu.au.
I might start by introducing Amir. So, Amir Aryani is the head of the Social Data Analytics Lab in the Social Innovation Research Institute at Swinburne, and the lab applies contemporary and emerging cooperative data analytics techniques to provide insight into health and social problems. Amir is a computer scientist by background and he has worked with illustrious international institutions including the British Library, ORCID, and the Institute for the Social Sciences in Germany, and on projects funded by the ARC, the NHMRC and the National Institutes of Health. I just wanted to note, before Amir gets started, that Amir is a great example, I think, of how community and social data projects and innovation benefit from a mixed team with a mix of different inputs. So, if anyone was present at Sarah Williams’ talk recently in this webinar series, she noted how project teams should have data scientists, social science specialists and people with lived experience, and a lot of our projects have community organisations involved in them. So, in this way, high quality social data projects become a space where the vital facets of knowledge are melded for innovation and insight. Today Amir is going to focus on our innovative work with community data co-ops and collaboratives. I’m now officially handing over to you, Amir.
Assoc Prof Amir Aryani:
Thanks, Jane. Thanks for the very nice introduction. Okay, let me test the technology. Can everyone see my screen? That’s good, thank you. Thanks a lot. Okay, sure, thanks. So, following what Jane said, I’m planning, in the next 30 minutes, to give you a bit of background about Swinburne’s venture into data cooperative projects, but also an overview of the toolbox of data science tools that we have available at Swinburne. This is also a partnership with four other universities, so a lot of the tools and platforms that we’re going to talk about are available – some of them in your institute directly, or indirectly through your collaborative links.
The concept of data co-ops has been around for a while, but in the context of this presentation we are talking about the concept of a data co-op more abstractly, as an element of bringing data collaboration and data cooperation from different groups together. And from that point of view it’s not just about data cooperatives as the form of organisation we know; it’s more about the concept of actually enabling collaboration between groups, teams and communities. Now, to start, let’s talk about a couple of background things about how we got here. As Jane mentioned, my background is computer science. I have worked a lot across different fields, with scientists from geophysics to chemistry, to biology, to social science, and now more than ever we have access to advanced data analytics capabilities in different fields. Social science and humanities are in a very interesting position: when we look at the commercial sector, there is a lot of data capability there; when we look at the research sector in these domains, a lot is left to be desired. And I’ll get to this in a moment, but in the commercial space, data analytics and AI being a game changer is now well in the past – it used to be a game changer, now it’s an essential component. In 2018 I used to have this quote that in five to ten years AI would be an integrated part of a lot of different systems. Given the pandemic, a lot of those things have been substantially accelerated and all of those things are already in place. The biggest change has happened with unstructured data. Previously working with that was a big challenge, and the information sat in silos – now it is all interconnected. If you scan a coffee cup image on your phone, the AI system underneath it knows where you are standing – in a coffee shop, in a shopping centre. It knows you have previously purchased coffee from that coffee shop, and all of these elements come together to tell it: okay, I know exactly, this is their Starbucks coffee cup. So, it provides very high accuracy given other connected information. So, these two transitions – from unstructured data to structured data, and from disconnected data to knowledge graphs – are now embedded into a lot of commercial platforms.
The other concept that is getting a lot of traction in modern work is augmented intelligence. That is bringing AI to the point of actually being an active assistant in a lot of day-to-day activities. The best example of how this operates is, say, AI assisting your driving. It would tell you that you are going too fast, you need to turn right, have you paid attention to the road work ahead? It knows about the different conditions – whether it is raining on the road – and basically it provides substantial assistance to the operation of the car. The same thing can happen in organisations. It is already happening across a lot of the commercial sector – in supply chains, in transport, in a lot of fields – and it enables effective decision-making. Now, these are all in the big corporate space, but there is also a lot of activity happening in social and urban organisations across the globe. We have initiatives like Nesta in Europe that work a lot with platforms like collective intelligence and open data platforms. There’s a lot of effort around data collaboratives in the United States – we have the Urban Institute, the data coin – and also in New Zealand we have the centre for social data. And here in Australia, at the Social Data Analytics Lab, or SoDa Lab, we started doing work with the not-for-profit sector to create similar capabilities. The main driver was lifting up the data literacy and data capabilities in the sector, and that also showed that, apart from the data skills, there are lots of infrastructure components and governance elements missing, and that has been the motivation behind the whole data co-op platform.
Now, what is the data co-op platform? It’s basically a methodology based on the idea that to create effective data projects that make a change, we need data, yes, of course. But that’s not the only thing we need. We need people – domain experts, data scientists, researchers, people who can actually transform data into actionable insights – and they cannot do it by themselves. They actually need analytics capability. So, that’s where universities usually come into play, and sometimes commercial providers. But the centre point of a data co-op is actually the people who make a difference – who make an impact using all the insights that they derive from the data sets. During the course of the projects that we have done in the last couple of years, we built a model that says: if you need to do a trusted data partnership project that potentially leads to a data collaboration, you need infrastructure that supports this. And that infrastructure covers things like data storage and data access, artificial intelligence capabilities, and knowing how to dispose of the sensitive data after finishing the project. We have the infrastructure, but we also need a governance model – that is where we deal with the ethics problems. How do we actually manage the data life cycle, answering questions like those in the risk management Five Safes model? How do we actually make our data findable, accessible, interoperable and reusable – FAIR data? These are the operational pillars, if you like, of the concept of a trusted data partnership. When you have them in place, then you can actually focus on creating the data collaborative projects, and in that layer, the data collaboration layer, there are a number of different initiatives and concepts. To name a few of them: knowledge transformation, collective intelligence – this is the space where you can look at data communities, and data cooperatives also sit in that layer.
Now, I thought it would actually be useful to look at different perspectives from communities regarding concerns about data sharing. In 2019 we ran a workshop in Canberra with a number of different government departments, not-for-profit sector organisations and researchers. The main question on the table in the roundtable discussion was responsible data sharing. And we had in the room researchers who needed access to data, but we also had a lot of data custodians and data providers. One of the things that came out of that conversation – we recorded, transcribed and analysed the text later on; it was a two-day workshop, and on the second day we went back with the results of the analytics – is that there is a lot of discussion about data access, and it’s not complaining that ‘we want data we can’t get access to’ – it’s kind of the other way around. A lot of data providers wanted to make their data reusable and usable to derive value from it, and they wanted to provide data to the research sector and the commercial sector. The problem is there’s always this kind of shroud of doubt about how we actually make data reusable in ethical ways, and how we can actually derive value from data without compromising privacy and security. So, that has been one of the topics of conversation. There was a lot of interest in making data a kind of first-class citizen of research and science. There are roadblocks to making that happen. And then when you go past the ethical conversation, there’s a lot of discussion around governance, data linkage and data value from both sides of this conversation. Now, if you think about government data at the moment, you have the Five Safes model in place. This is the model.
This is a recommendation by data commissioners in Australia, which basically says that any data project that wants to access government data should answer risk assessment questions on five different pillars or components. The first is: is it a safe project? So, you have to justify that the project you are doing is a safe project. The second is: is it a safe group of people who are running this project? Then there are safe settings, which comes back to concerns like: some data sets cannot be stored on servers overseas. Then there is safe data – that is, if everything else fails, what do I need in the data to protect itself? Do I need de-identification, encryption and so forth? And then safe output – who is going to manage the output of this? Now, this is a very good model to think about what can happen in a lot of our research projects in the university sector. It is not currently applied outside government, but there’s a lot of appetite for expanding this model to the not-for-profit sector, the commercial sector, and the university and education sector. This is also a good segue to look at the risk assessment model. The classical way of managing the risk of a data project usually comes from the likelihood of something bad happening and the severity of the impact of that problem – a classic model. When we look at data projects, we look at the data and the output; both of them are content whose risk we need to manage. Then there are the use cases and settings of the project, the context of the project’s operation, and also the people who are involved. So, we can look at the Five Safes concept in this way, and for a lot of research projects it enables that conversation: is it a safe project to go ahead? Now, with all of that background, that is where we got to the idea of the data co-op platform. We knew that a lot of our operation at Swinburne initially was based on the co-design model, and that was a recipe for success for a lot of our projects. That was also the beginning of the concept of the data co-op. So, we said we want to set up a trusted data partnership and we want to create value out of it. We established an iterative model where, in each iteration, we were actually reading information from different sources – from government data to not-for-profit sector community data sets and social media. We had a model for running a number of different co-design workshops where the data is actually answering specific questions. We workshop these with the community and project partners, we get the feedback, we send the feedback to the data engineers, and from that conversation we produce data products that later on lead to insights. Now, this process repeats these workshops again and again in order to provide richer insights and more impactful, actionable insights.
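As a rough illustration of the Five Safes screening and the likelihood-times-severity risk scoring described above, here is a minimal Python sketch. It is not the data commissioners’ tooling or the lab’s actual process; the question names, scales and threshold are illustrative assumptions only.

```python
# Illustrative Five Safes screen plus a likelihood-x-severity risk score.
# Field names, scales and the tolerance threshold are assumptions for this sketch.

from dataclasses import dataclass

FIVE_SAFES = ["safe_project", "safe_people", "safe_settings", "safe_data", "safe_output"]

@dataclass
class RiskItem:
    description: str
    likelihood: int   # 1 (rare) .. 5 (almost certain)
    severity: int     # 1 (negligible) .. 5 (severe)

    @property
    def score(self) -> int:
        # Classic risk matrix: risk = likelihood x severity
        return self.likelihood * self.severity

def screen_project(answers: dict[str, bool], risks: list[RiskItem], threshold: int = 12) -> bool:
    """Return True only if every Five Safes question is answered 'yes' and no
    individual risk exceeds the (illustrative) tolerance threshold."""
    missing = [q for q in FIVE_SAFES if not answers.get(q, False)]
    high = [r for r in risks if r.score > threshold]
    if missing:
        print("Unresolved Five Safes elements:", missing)
    for r in high:
        print(f"High risk ({r.score}): {r.description}")
    return not missing and not high

# Example usage with made-up answers and a made-up risk
answers = {q: True for q in FIVE_SAFES}
risks = [RiskItem("Re-identification from small area counts", likelihood=2, severity=5)]
print("Proceed:", screen_project(answers, risks))
```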
Now, the idea of the infrastructure was to create a platform that enabled this, given all of the things that I mentioned as requirements of a trusted data partnership. And this turned into a funded project – an ARC grant, with Swinburne leading the grant, and Griffith, the University of Melbourne and UTAS as the partners. And we built a platform, which I will take you through, that basically provides the capability for enabling these data co-op projects. Now, in the next couple of minutes we will mainly focus on the data infrastructure activities, because these are the new components. In a lot of the previous webinars we’ve talked about the ins and outs of data governance and the challenges around collaboration, but the data infrastructure is new from our operational point of view. We have almost finished developing a lot of the components and they are now at the point of providing service to our projects. So, this part is, if you like, all the shiny objects in our toolbox. We are going to talk about artificial intelligence, data visualisation, and some of the other elements that enable these components to come together, such as the data access and data linkage model. Now, we have a hybrid data co-op infrastructure today, based on the needs and requirements of the system. I know this looks like quite a techy perspective of how the different pieces go together. This is actually from one of our internal documents – there’s a tool we are using that transforms 2D to 3D. We have a 2D version of this that we use for status checking of our servers, and then this produces a nice 3D visualisation. But it also captures some important essence of how the operation is divided into different layers. So, we have the social media layer of data that we are managing and running, which collects continuous information from the media and social media. We have the public data layers – gateways to the main government and education sector data sets, the AURIN platforms, the ABS statistics, data.gov.au – and then we have the secure data layers that predominantly run on the Azure cloud, and that has been quite a good instrument for enabling a lot of the data co-op projects we are running, which I will mention. This is quite essential for curating and holding data in a secure environment that can be used for producing insights. Without going into the detail of this, the main function of a lot of these boxes is to turn unstructured data into structured data, connect that information together, and provide a tool for us to create data insights out of a mishmash of ideas and a lot of different disconnected information. For example, one of the things that we have in our secure data space is that we are plugged into Cognitive Search from Azure. So, when we get audio files or a PDF file, we basically transform that information to text and then produce a knowledge graph from the content of the text. So, that provides analysable material out of the unstructured information. And we also have a secure space where we receive information from our clients or partners. We can store that information so that it is disposable at the end of the project, or securely archivable, depending on the process that we confirm in ethics.
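The platform described above uses Azure services (such as Cognitive Search) for the unstructured-to-structured step. As a much simpler stand-in to show the general idea, the sketch below pulls named entities out of free text with spaCy and links co-occurring entities into a small graph with networkx. The library choices and entity types are assumptions for illustration only, not the lab’s pipeline.

```python
# Simplified stand-in for "unstructured text -> knowledge graph":
# extract named entities and link entities that co-occur in the same document.
# Requires: pip install spacy networkx && python -m spacy download en_core_web_sm

import itertools
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def text_to_graph(documents: list[str]) -> nx.Graph:
    graph = nx.Graph()
    for text in documents:
        doc = nlp(text)
        # Keep organisations, places and people as nodes
        entities = {ent.text for ent in doc.ents if ent.label_ in {"ORG", "GPE", "PERSON"}}
        # Link every pair of entities that appear in the same document
        for a, b in itertools.combinations(sorted(entities), 2):
            if graph.has_edge(a, b):
                graph[a][b]["weight"] += 1
            else:
                graph.add_edge(a, b, weight=1)
    return graph

g = text_to_graph(["The Red Cross ran a food relief program in Bendigo with Swinburne University."])
print(list(g.edges(data=True)))
```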
Now, a lot of this infrastructure, at the end, produces two main front-ends: either we produce analytics dashboards – I will show you one of those – or we produce Jupyter notebooks that basically show exactly what information has driven what kind of insight, and how we can walk back through those processes. That’s really in the interest of reproducibility of the science, but also for any fact checking, if you need to know exactly how we got to a given number. Now, a data insight that we get out of this system usually connects to multiple different elements. So, it is linked to the original data source; it links to the software that actually derives the data insight and tells us exactly how we got from the data source to the data set it produced – a transformed data set is often the result of the work, and I’ll show you some of these kinds of transformations in a minute. It connects to the organisations that are linked to that data insight. It tells you what other publications are linked to this, and who the researchers are. And in our ecosystem, everything is linked to ORCID and DOIs, so this gives quite good transparency across the connected graph. Now, these are some examples of public insights. One of the things I mentioned, if you remember, is that anything coming out of the Azure side is kind of a private data insight – it comes from private data sources – so I’m not going to present those. I’m just going to go through the public insights that, over the course of our data projects, have been useful to our partners.
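To make the linkage described above more concrete – an insight tied back to its source data, the code that derived it, the organisations and researchers, and persistent identifiers like DOIs and ORCID iDs – here is a minimal sketch of what such a provenance record could look like. The field names and placeholder identifiers are assumptions, not the platform’s actual schema.

```python
# Illustrative provenance record for a data insight; field names are assumptions.

from dataclasses import dataclass, field

@dataclass
class DataInsight:
    statement: str                 # the human-readable finding
    source_datasets: list[str]     # e.g. DOIs or URLs of the original data
    derived_dataset: str           # the transformed dataset the insight rests on
    software: str                  # notebook / repository that produced it
    researchers: list[str] = field(default_factory=list)   # ORCID iDs
    organisations: list[str] = field(default_factory=list)
    publications: list[str] = field(default_factory=list)  # related DOIs

insight = DataInsight(
    statement="Share of survey respondents who contacted a mental health service",
    source_datasets=["https://doi.org/10.xxxx/example-dataset"],        # placeholder DOI
    derived_dataset="mental_health_contacts_by_year.csv",               # placeholder file
    software="https://github.com/example-org/insight-notebooks",        # placeholder repo
    researchers=["0000-0000-0000-0000"],                                # placeholder ORCID
    organisations=["Swinburne Social Data Analytics Lab"],
)
print(insight)
```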
So, these are some of the examples that you can see. They are actually open source code – you can go to GitHub and see how the code runs. For example, this is an insight we derive from the AIHW public data set relating to mental health services. We know that more than 38 percent of Australians in that survey actually contacted mental health services during 2018. Now, when you look at an insight like this, you know exactly where the data came from, so you know the source of the data. You can see the transformation of it. You see the visualisation, and basically it is not just a statistic – it is all the steps that take you to that fact, and that is quite important for a lot of projects. At times we would wonder, well, how did we get to this statistic? And this process actually answers that question.
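A notebook that keeps every step visible, in the spirit described above, might look something like the following sketch. The file name, column names and figures are hypothetical, not the actual AIHW extract; the point is only that the source, the transformation and the final statistic sit together so anyone can re-run them.

```python
# Notebook-style sketch: source -> transformation -> statistic, all visible.
# File and column names are hypothetical.

import pandas as pd

# Step 1: source data (in the real notebooks this would be the published table)
df = pd.read_csv("mental_health_survey_2018.csv")   # hypothetical local extract

# Step 2: transformation - keep the survey year of interest
df_2018 = df[df["year"] == 2018]

# Step 3: the statistic, with the intermediate numbers kept visible
contacted = df_2018["contacted_service"].sum()
total = len(df_2018)
print(f"{contacted} of {total} respondents = {contacted / total:.1%} contacted a service")
```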
Now, as exciting as Jupyter notebooks are for data scientists, they’re not always useful for everyone. So, to make this more useful we have visualisations and data dashboards like this. This is from one of our other projects, with Bendigo, where some data sets are simply transformed into a data visualisation. This is built on top of Power BI and again connects to our Azure infrastructure. Here we see people’s access to the internet in Bendigo, and basically this tells us the story that about 20 percent of people in the City of Bendigo had no access to the internet in 2016 – that’s based on the survey done at that time. But you can also see this across the different suburbs and areas of Bendigo on the right. So, in the graph on the right, the red bar is the number of people who don’t have access to the internet, compared with the red and yellow together, the total population of that area. So, this is much more tangible for people in Bendigo, and in workshops they actually work with this. So, there are two faces to this: the Jupyter notebooks are a quick way of producing insights, and this is the one that we take to a lot of our workshops.
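The figure behind a dashboard like the Bendigo one – the share of people with no internet access against the total population of each area – can be sketched roughly as below. The suburb names and counts are invented, and the real dashboard sits on Power BI over Azure rather than pandas.

```python
# Illustrative per-area "no internet access" share; all numbers are invented.

import pandas as pd

census = pd.DataFrame({
    "suburb": ["Suburb A", "Suburb B", "Suburb C"],   # placeholder areas
    "no_internet": [1200, 450, 300],                  # the "red bar"
    "population": [5600, 3100, 2400],                 # red + yellow together
})

census["pct_no_internet"] = (census["no_internet"] / census["population"] * 100).round(1)
print(census.sort_values("pct_no_internet", ascending=False))
```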
Now, I initially promised to talk about AI, so just so you know, this is where it gets more technical. We use AI to tell us which data sets are actually linked together. There are so many ways to link information together. One of them is using the place-based concept. There are different social variables or incidents that happen in different areas, and AI can find correlations between them. These are a lot of different social variables, and when you see blue here, it tells us that the characteristics of two different variables live together. For example, the chance of having a high income and having three different cars – yeah, maybe. How about home ownership and having two or three cars? What about your job and the chance of renting? And this information, using the power of AI, can be calculated in so many different ways, through all kinds of spatial lenses – from suburbs to states, to the country, to different regional areas. So, you can basically look at all the different permutations of potential connections between different communities. And we have done this at a bigger scale – I’ll show you the results of that – but potentially it provides us with some insights. When we were doing our projects in the City of Glen Iris, at the time we found very interesting patterns between the different industries that people work in and the different kinds of dependencies they had on different social services, and the number of cars that they owned, or ownership of their properties. And some of those may or may not produce valuable research outputs, but for the policy makers and the not-for-profit organisations in that area, they’re quite insightful for providing the right services to the right people. Now, when we put this into action – this is a picture of Victoria, using the same model. We asked the AI to tell us: what are the communities living here? And it gave us this colourful picture of different groups. We didn’t know in the beginning what they are and who they are. We gave them some names, like the green ones, which we call CBD, since most people living there are in high-rise buildings and they’re renting – a lot of them are students – but they’re not just in the Melbourne CBD. In the boxed areas on the left side and the right side of the map, you can see there are other CBD-type structures with those green boxes in those areas.
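As a rough sketch of the place-based correlation idea above – aggregating social variables to a common geography and seeing which ones “live together” across areas – something like the following could be used. The variables and values are invented for illustration.

```python
# Pairwise correlation of invented area-level social variables.
# Positive values suggest characteristics that tend to co-occur across areas
# (the blue cells in the matrix described above).

import pandas as pd

areas = pd.DataFrame({
    "area": ["A", "B", "C", "D", "E"],
    "median_income": [1800, 2400, 1500, 2100, 1300],
    "three_plus_cars": [0.18, 0.26, 0.12, 0.22, 0.09],
    "home_ownership": [0.62, 0.71, 0.48, 0.66, 0.41],
    "renting": [0.30, 0.22, 0.45, 0.27, 0.52],
}).set_index("area")

print(areas.corr().round(2))
```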
I just want to draw your attention to two of the big major communities. We have identified the orange ones as the sort of areas that have high-income families living there, mostly in the age range of 40 to 54, with kids aged 5 to 19 – it is unlikely that you will find people in these areas who are 25 to 34; they move to other suburbs. And they are unlikely to have moved recently, so they’re established. They have high education, they’re professionals, and they’re in managerial jobs. When you look at the other one, the purple one, the population is almost double that of the orange areas. These areas have a high level of immigration – people who came from overseas – and compared to the orange areas they have a lower median income, and they’re less likely to be aged 55 to 69. They have moved house more recently, and they’re more likely to rent. And they are, interestingly enough, less likely to engage in volunteer work. So, there are lots of different variables that you can look at for people living in different areas. Now, this is just one slice, and I think there are many other ways that we can do this with different focuses – and this is just one data set, so there are different layers of data that you can add to achieve this kind of computation. Now, we transform similar information into 3D visualisation. This is one of the capabilities we have in the data co-op platform: it takes the data and basically produces 3D maps. We have a pipeline for it. The data cleaning is still quite mechanical and done manually – that is always the most expensive part of a lot of our projects – but then the remainder of the process is automated.
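The community-grouping step described above can be sketched, in a much-simplified form, as clustering areas on standardised census-style variables and then profiling each cluster so it can be given a human name. The features and values below are invented, and the lab’s actual pipeline is considerably richer.

```python
# Simplified community clustering: standardise area features, cluster, profile.
# Requires: pip install pandas scikit-learn. All numbers are invented.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

area_df = pd.DataFrame({
    "area": ["A", "B", "C", "D", "E", "F"],
    "median_income": [2300, 2250, 1400, 1350, 1900, 1450],
    "median_age": [44, 46, 33, 31, 39, 34],
    "renting_rate": [0.22, 0.20, 0.55, 0.58, 0.35, 0.52],
})

features = ["median_income", "median_age", "renting_rate"]
X = StandardScaler().fit_transform(area_df[features])
area_df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Profiling each cluster is what lets you name it, e.g. "established
# high-income families" versus "younger, renting, lower-income areas".
print(area_df.groupby("cluster")[features].mean().round(2))
```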
Also, it’s a good time to mention that these multiple data layers are very important in projects. These are, for example, screenshots from some of the vulnerability layers that we have for one of our projects. You can consider different data sets, and they can be enabled in different ways to provide information to the participants in a data co-op project. But these layers, in combination, can also feed into the AI for the clustering that I mentioned. So, this provides for both the human users, to understand what is happening, and for the AI, to provide that kind of augmented intelligence.
Now, speaking about the social media that I mentioned earlier, that is another source of information that we have available for a lot of our projects. I think at the moment we have more than two billion tweets in our data lake, and this is just one of our projects that uses the bushfire data – seven hundred thousand tweets we collected in a specific period of time, all related to the bushfires that happened on the east coast, which I think started from Victoria. All of this information is analysable in so many different ways and is accessible through a machine-driven API, but you also have dashboards like this where you can query the data, read it and pull information from the system. We also have access to the commercial Twitter API, which enables us, for a given project, to run a specific query against the archive of the entire Twitter data from the last 10 years and find out, for example, how many people tweeted about government policies on COVID-19 in different areas. And that information can be mapped effectively and provides that kind of specific lens for place-based research. Now, I mentioned briefly the secure virtual machines and secure data access – that’s one of the boxes in our ecosystem. Secure handling is essential for working with sensitive data. When we have sensitive information, one of the problems we historically had was people putting it on their laptop and walking around with it – if you lose the laptop, you’re potentially losing the sensitive information on your hard drive. Now that we have this pipeline, we have also created a virtual machine on the cloud that has all the data analytics tools readily available, from Python to graph databases to machine learning tools. One of the key pieces of functionality for the staff using this is that they can resume their work at any time – it’s just sitting on the cloud. When they turn it on, it’s immediately available. When they finish their work they can just turn off their laptop, but they don’t have to turn off the machine. It also enables us to keep that bubble of data and compute in one place, and when the project is over, we can either archive it or dispose of it. So, it very much provides the assurance that data is not going to leak or get lost, and it also provides another function: for projects that have longevity, we have that rich reproducibility of the data science. We can always go back to the same environment – the code is there, so we can rerun it, and we have the data in place. Now, there are a lot of sensitive data platforms in Victoria funded by the Australian Research Data Commons and other groups around medical and health data. This is not up to that standard – it serves our needs, but it’s not really designed for interconnectivity and interoperability with other systems like hospitals. This is just a secure space, and a very effective model for working with sensitive data that lands in our ecosystem.
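A query against a tweet collection of the kind described above – filter by keyword and time window, then count by area for place-based mapping – might look roughly like this. The file and column names are assumptions about how such an extract could be laid out, not the lab’s actual data lake.

```python
# Illustrative query over a hypothetical tweet extract: keyword + time filter,
# then counts per (geocoded) area for mapping. Column names are assumptions.

import pandas as pd

tweets = pd.read_parquet("tweets_archive.parquet")   # hypothetical extract

mask = (
    tweets["text"].str.contains("bushfire", case=False, na=False)
    & (tweets["created_at"] >= "2019-11-01")
    & (tweets["created_at"] < "2020-03-01")
)
bushfire_tweets = tweets[mask]

# Count of matching tweets per area, e.g. to feed a map layer
print(bushfire_tweets.groupby("area_name").size().sort_values(ascending=False).head(10))
```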
Now, a quick note about the data co-op projects that we are running and the experience that we have gained from them. One of the examples we had recently was work with three not-for-profit organisations, as part of funding from the Lord Mayor’s Charitable Foundation. The data project that we ran here started from the concept of working with local data to produce insights, and this was a closed-ecosystem piece of work with the domain experts from those organisations and our data science team, looking at the public and private data and saying, well okay, we have all of these data sets – what do they actually tell us? It was quite a good data exploration exercise that led to very actionable insights for the organisations participating. It also provided a good way for us to understand what the requirements are for small to medium enterprises to actually engage in data projects. So, previously we had a lot of experience with local governments and bigger organisations like Red Cross; this was an exercise at a more contained level. And one of the things that we learned from this exercise is that there is a lot of value in public insights. A lot of the information that we searched for and curated for them came from public sources. Yes, their private sources were very useful and we got a lot of insights from those – unfortunately, I cannot share those insights in this presentation, since they are private and come from the secure data sets – but here are some examples from the public information that we derived during the lifetime of those projects. We found, for example, that people who are earning between two thousand and three thousand are basically driving or commuting around twenty kilometres in Australia. Now, interestingly enough, if you make more than three thousand, then you actually drive or commute less, based on the ABS data. We looked at information related to mental health and anxiety. For example, we found that about 32 percent of females reported experiences of some kind of anxiety at some point in their life – this is all data going back to 2007. And given the nature of the insight, that has been valuable for the partners in the project. The same goes for disability and so forth – during the lifetime of those workshops, we produced a number of these insights for the group.
This is one of the examples of the private results that is not very sensitive. This relates to Good Cycles. We mapped their information using the engines that we have, and we found that, in the way they measure the travel distance of their staff, they have basically contributed more than four thousand dollars to the community by saving transport time for their staff. Basically, the way Good Cycles works is that they send services to different areas, and in doing that they focus very much on hiring younger people and making them job ready for society. So, in that context, one of the things they were looking at was all the travel of these people and the services that they provide.
Now, lessons learned from this particular type of project, and also from the infrastructure that we are running. The first thing we found is that data acquisition and data cleaning are the most expensive components. That’s not a surprise for industry, to a great degree, but I think for the education sector it was, to some level, surprising. The other thing that we found – and we knew this from the beginning, so it confirmed our initial understanding – is that data collaboration is an iterative process. You don’t get the data, analyse it, write a paper or give it to the client, and walk away. It needs to be done as an iterative process where you are continuously working with them: you refine the result, you get the insight, you go back to the data sets. You basically build a pipeline of data and human interaction in a way that actually produces value from data. The other thing is that data visualisation is not the goal, but it is what makes a difference. If you try to transform data into actionable insights without visualisation, you just have data that no one understands and no one uses. We also found that there is great value in public data sets – I cannot emphasise that enough. There has been a lot of investment in Australia around the reusability of research data, and there are a lot of efforts and activities by different groups to tap into existing data sets rather than collecting the information again and again. And our whole journey highlights the same thing: looking at existing data sets and reusing them can provide a lot of value for research and for the not-for-profit sector. And finally, data linkage is what you need to do for a lot of projects, but a lot of the time the only way to connect data sets together is based on a sense of place. So, if you aggregate information for a given area, then understanding the correlation of different phenomena and social variables in that space is the best way of looking at the connection between the information. And that is what we use for a lot of our collaborative exercises, to bring information from different partners and different organisations together. I think on that note I can finish this presentation, and I can just say that this has been an amazing journey so far. There are a lot of capabilities here at Swinburne and also across all of our partners. If you have projects that you think could benefit from some of this, I definitely want to hear from you and work with you. Jane has been quite effective and amazing at going around finding all kinds of different projects that keep us always busy – and we don’t complain. So, if you have something that you want to collaborate with us on, yes, raise your hand. On that note, I can pass the microphone back to Jane and Paul.
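The “linkage by sense of place” point above – joining otherwise unlinkable data sets by aggregating both to a shared geography and studying the relationship at that level – can be sketched as follows. The data sets, area codes and values are invented for illustration.

```python
# Place-based linkage sketch: aggregate two invented data sets to a shared
# area code (e.g. an SA2 or LGA), join on that code, then look at the
# relationships between variables at the area level.

import pandas as pd

service_use = pd.DataFrame({
    "sa2_code": [101, 102, 103, 104],
    "service_contacts_per_1000": [42.0, 55.5, 31.2, 60.1],
})
census = pd.DataFrame({
    "sa2_code": [101, 102, 103, 104],
    "median_income": [1650, 1320, 1880, 1210],
    "no_internet_rate": [0.12, 0.21, 0.08, 0.24],
})

linked = service_use.merge(census, on="sa2_code")   # the "sense of place" join
print(linked.corr(numeric_only=True).round(2))
```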
Prof Jane Farmer:
Thank you. Anthony, I think, is in charge of this next bit.
Prof Anthony McCosker:
Sure, I can help to coordinate questions. Please feel free to add a question, either in the chat or by raising your hand through the participants tab. Do you want to exit the screen sharing, Amir, just so that we can see each other a little bit better? And yeah, we very much welcome anyone to jump in with a question, if you want to know a little bit more, or if you have thoughts of your own on working with data in this way. I have questions – I always have questions – so I’m just going to kick off, because I get to decide who asks the first question.
So, I’m just jumping in. One of the things that we always find frustrating, Amir – and I’m wondering if you can just tell us a little bit more about your experience in this space – you’re talking about the difficulties in the sort of, I guess, operational layer with partner organisations coming together and working together in the process of sharing data, but also looking for what kinds of insights might be shared insights, not just insights that will help their organisation specifically, or their mission, etc. I have a bunch of questions around that, particularly in terms of the difficulties in building trust and building data agreements. But the question that I’m kind of interested in at the moment is what you think about the increasing role of data stewards or data custodians in these organisations – people who seem to take responsibility or are most interested in pushing forward with data projects – and what your experience has been around that?
Assoc Prof Amir Aryani:
That was a very long set of questions – let me work from the back, going forward. On the stewardship positions and data custodians: I think one of the problems that we have with a lot of projects is that when we start an engagement with an organisation – it doesn’t matter whether it is a government department or something smaller – there’s always a shroud of mystery about what data they have, who owns the data, and how we can actually access that information. That is one of those areas where, when we start a project, often even at the point of signing the contract, we don’t know what it is. Now, you’re right, as we actually start tapping into those data sets, different interests, or sometimes competing interests, start to bubble up as people start to share given data sets. So, you will always have influencers in those workshops and those conversations who try to take the direction of the whole workshop, the direction of the conversation, off course, if you like. And this is not new to research – it exists even in the commercial sector – given that anyone who shares a resource will have some kind of agenda attached to it. Now, in the context of universities partnering with industry, this gets a bit more complicated, because we often operate as a kind of provider of research services. And that puts us in a very strange position: on one side they expect us to be fair and do ethical research; on the other side, there are components moving around underneath that make things difficult, because there are different rules and different expectations that go with them. Now, this is something that we deal with, and I’m sure you have a lot of experience doing this too. We deal with this often during those workshops – often during the workshop is where the main work happens, where we try to showcase different features and draw people’s attention to different facts. But at the end of the day, there is a lot of people management involved in actually coordinating those activities. Now, one step back before this is that we need to sign those data sharing agreements, legal documents and contracts, and that is the most complicated part of it, because we often get into a lot of challenges and difficulties around access to a given data set when it comes to the legal requirements. And those requirements often come with expectations attached, and that is usually the most complicated process for projects like this. So, I don’t know how deeply I managed to go into the questions that you asked – if I forgot something, let me know.
Prof Anthony McCosker:
No, absolutely. There’s a question from Paul. I’m not sure if you want to jump on, Paul, and ask a question but it’s a pretty quick question you had. Paul?
Prof Paul Henman:
I’m just happy to do it. I was very keen to ask: the Australian Research Data Commons and the federal government’s initiatives around humanities research infrastructure, or digital research infrastructure, have obviously been going for some while. Have you had any connections with that and with some of the initiatives that are starting to be put forward?
Assoc Prof Amir Aryani:
Yes, that’s right. Thanks, Paul. That’s actually a very good question. Interestingly enough, the first component of this project was funded by the ARDC, so we are closely working with them. Some of the components around the data governance of this project have been done in direct consultation with the ARDC, so they’re quite involved in the data governance layer process that we’re establishing. And in the infrastructure layer, we are quite connected to the NCRIS facilities. We are working with the ADA, the Australian Data Archive, and we are working with AURIN in that space – we are getting information from AURIN into our system – and we are also working with those kinds of NCRIS facilities on the concept of common infrastructure. So, in that way, we are quite aware of the sector and we’re working with the players in that domain. Also, all the DOIs that are minted out of our system come from the ARDC services.
Prof Anthony McCosker:
Okay, so a couple of questions. Lee, it’s a long question there. I think, did you want to jump on and ask that one?
Participant 1:
Yes, we have this frustration in a cross-border community, Albury Wodonga, which really operates as one community, but it’s just so incredibly difficult to get simple data that tells you basic things. For example, I saw a presentation the other day on Victorian BreastScreen participation data for the catchments that we serve in Victoria, and there was an assumption made from that data that the rates in Wodonga and Indigo might be lower than the state averages in Victoria because people were going to New South Wales. But you shouldn’t really draw that conclusion unless you can test that assumption with the New South Wales equivalent data, to know where people were coming from. And that’s just a small example of a zillion things every day that are very frustrating in this community. So, my question is probably really about the model that you’re using, that overlay – would that sort of thing be applicable in an environment like this, or have you come across those sorts of issues before?
Assoc Prof Amir Aryani:
So, you mentioned two different things. Let me just rephrase this, to be sure that I understand your question correctly. The first thing is that you have a problem accessing the data in a complete form – it’s almost like a small slice of data that doesn’t tell you the whole story. And the other thing is about looking at different data sets from other sources that basically provide the big picture. Is that what you are asking?
Participant 1:
Yes, that’s a very good summary, thank you.
Assoc Prof Amir Aryani:
Thanks. So, on the first one, this is one of the classic risks of data science. It’s almost like, if you think about it as a biology concept, taking a very small sample and then deriving a conclusion. We say, well, this drug is very safe because we had just five people testing it and all of them had no problem, and then we apply this to one billion people. That’s the classic example – in data science you want a data set that is complete and based on a normal distribution, has been collected correctly, and basically provides coverage of the majority of the cohort or population that is the subject of the study.
I remember, without mentioning the name, I was in one of these workshop presentations by one of the commercial providers, and the data was about Australia, so I asked them to zoom into a given area. At that time we had a project in the City of Glen Iris, and I found that the actual number of people in that area who had answered that survey was only two – you don’t derive information or any conclusion about the whole community from just two people. So, this is a very big risk in social science, because it is not recognised as much as it has been recognised in biology, health and other sectors. A lot of papers in social science get reviewed where, if you look at the sample size, it’s been like 1,000 people answering the survey, and then a conclusion is derived. If you had a similar drug-test example, it’s not going to get approved in any shape or form, and the paper is not going to get published. So, this is a real problem, and part of the reason is that data collection in social science is complicated. Now, the problem that you mentioned is slightly different – it sounds like you have a problem in the area of actually accessing the right data set. The other thing is that overlaying data sets from other sources is definitely the way to go. In a lot of cases we use a concept called proxy data. I may not have access to information about people commuting in a given area, but I might be able to access the petrol purchases and the energy consumption in that domain. So, that can be something I can use as a proxy to find out about the usage of cars. This is an example of how information can be used in different ways to derive conclusions about things that we don’t have data about. This by itself is a risky activity, because using the wrong proxy you might derive a wrong conclusion, but it is a way to actually cover the gap.
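One simple way to treat the proxy-data idea Amir describes – checking that a candidate proxy actually tracks the target in areas where both are observed, before relying on it elsewhere – is sketched below with invented numbers.

```python
# Proxy-data check sketch: where direct commuting data is unavailable, confirm
# that a candidate proxy (e.g. fuel sales) tracks the target in areas where
# both ARE observed. All values are invented for illustration.

import pandas as pd

calibration = pd.DataFrame({
    "area": ["A", "B", "C", "D", "E"],
    "fuel_sales_index": [80, 120, 95, 140, 60],        # candidate proxy
    "observed_km_commuted": [18, 27, 21, 31, 14],      # target, where known
})

r = calibration["fuel_sales_index"].corr(calibration["observed_km_commuted"])
print(f"Proxy-target correlation in calibration areas: {r:.2f}")
# A weak correlation here would be a warning that the proxy could mislead.
```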
Prof Jane Farmer:
Can I just say as well that it’s lovely to see you again, Lee, and this whole issue of data for rural areas is something we’re really interested in. We have dabbled with this kind of data bricolage concept, which is like chucking in all the data sets you can get to see if you can get some findings, to put it crudely – because, as Amir said, there are often small numbers across massive areas. But I also get what you’re saying about borders, right, because you’ve got different data sets, different ways of collecting the data sets, and different accessibility of the data sets. So, I think that what Amir’s saying about looking at other data sets that we might use together is probably a way that we could go, or that you could go ahead with.
Assoc Prof Amir Aryani:
And one more thing to mention that might be useful: sometimes the data from other sources provides a very important complementary part of the picture. An example of this in an urban area is homelessness data. If you look at just one local government you might get a picture that relates to their services, but if you look at all the neighbouring LGAs then you potentially see a different story, given that the people who are dealing with that problem move from area to area. So, sometimes in these data co-ops there is actually a necessity to get the data from multiple different sources, especially across a geographical area, to get a better picture.
Prof Anthony McCosker:
Jane, did you want to add your additional question that you popped in the chat as well? I have more too.
Prof Jane Farmer:
Oh well, I’m just conscious that the talk maybe sounded like ‘oh my god, how would we even start doing this?’ But by the same token, I know that we have started at point zero, or from scratch, with a number of organisations, so I wanted to make it seem not super scary. So, my question, Amir, is: what advice would you give to a small organisation that maybe doesn’t have a specialist workforce but is really interested in trying to look at what extra value they might get from the data that they collect?
Assoc Prof Amir Aryani:
So, there are two different things that might help them. One is that with all the infrastructure I mentioned, we did the engineering work, so it is actually extremely simple for people to go into data co-op workshops and make those data products and insights happen. All the things that I mentioned are kind of running behind me, and if you are running a not-for-profit organisation and trying to do a data project – not even a data co-op project, just a data project – you won’t see all of this infrastructure in detail. You just see that all of these insights and services are working, and that’s the intention of this.
The other thing is that, as with my examples around Urala and Good Cycles and the other projects I mentioned, they started in that context of being small data co-ops with a small number of data sets, and it grew very gradually. Now, the recipe for this is that those workshops are a very good vessel, in a way, to get to a bigger plan. So, you start with a small project, you go through a number of different iterations, but confined to like three to five months, and from those you will have a much better understanding of what is possible. And I think that is a very good opening for any data project in that space: just start in a small pilot space where it’s manageable and produce a useful but limited number of insights. That gives a taste of what can happen, but it also provides insight into what is possible, and that’s where you can plan and go ahead.
Prof Anthony McCosker:
There’s a question here, Amir, from Erin, and it’s a really good question, because we’ve just started a project about those connections between data sets within government, and access across government departments where there is essentially goodwill towards data sharing but still a lot of concern, and a lot of issues around trust in that sense. Erin, did you want to ask your actual question?
Participant 2:
Sure, Anthony. That’s a great intro. It’s really about whether or not we can piggyback off these previous successes in passing the Five Safes type risk assessment check when the next data opportunity comes along to access state or commonwealth government data sets. So, you know, does it stand us in good stead? Is there a way we can leverage those past successes, or do we just have to do the rounds and complete those processes every time?
Assoc Prof Amir Aryani:
So, Erin, I’m not actually aware of a formal process for this in government right now. The federal government is looking at a process for accrediting different organisations to access sensitive data from government, but that’s not based on the Five Safes model – that’s a much more detailed verification of an organisation’s capabilities. Now, when it comes to the Five Safes, at the moment it’s still sitting at the level of a recommendation. It’s a framework, but it’s not a detailed enough framework that you could say, I’m going to pass these things by going through these steps one at a time. And as a result, as you said, you’re passing this again and again for every data project, for every single data set. There’s no record of it where somebody can just say, look, I’ve done a Five Safes assessment for this project, it’s safe, so I can have a go. It’s not like ethics, where you’ve done ethics once and then you go and access many different data sets. It’s a conversation you have for every data set at every government department, and I know this is frustrating and expensive to do, but at least it provides a framework, because previously we didn’t have that. We were talking to different data custodians and they didn’t even know what questions they were meant to ask. So, at least there is now a way to communicate and a way to actually ask the right questions, but unfortunately there’s no way to record or reuse the answers to those questions.
Prof Anthony McCosker:
Just building on that, Amir. There’s a lot of uncertainty, I guess, but also interest, in how we move from principles to clear processes around ethical questions – and I guess some ethical issues in dealing with data at, say, the community sector level, the health sector level, et cetera, outside of government. I’m just wondering about your thoughts on whether, or how, you see building those ethical questions into the design around the data engineering side of things – the kind of work that you would want to do in the background in order to set up those processes smoothly. How do we build ethical practice in at that level?
Assoc Prof Amir Aryani:
So, there is an established practice of research data management that taps into a fairly superficial level of this conversation. What you do with the data, in a sense, follows those rules, but that doesn’t actually answer the question of what rules should be applied. Those will be dynamic from project to project, and that was the whole idea of the trusted data partnership model – that is the model that we follow. So far we have built it into our infrastructure, but we’re also trying to produce a governance model around this sort of practical question. Following what Erin just asked, we don’t want to have the same problem internally in our own ecosystem. So, if we know what questions to ask, what rules to follow, and what procedures need to be in place to cater for different types of projects, the preferred model would be to at least – ‘automating’ is probably the wrong term – but to put them on rails, in a way that you know exactly what needs to go where for a different type of question. It’s kind of a decision tree in that way. So, that is the intention of what we are dealing with. Still, as I probably mentioned during the presentation, we’re building the blocks. What you’ve seen today are the building blocks of a much bigger plan, and we are almost like people sitting in a house while at the same time building the house. You don’t have the luxury of going out into a tent, building the house and then moving in – we’re just sitting here, and Jane comes with a different question, Anthony comes with questions, and we have questions and projects from all different parts of Swinburne and our partnership networks. And as we are going through these projects, we are putting these things together. So, the house is getting built over the lifetime of the work, and there are always drawbacks – a lot of rework needs to happen. But the advantage is that everything we’re building is 100 percent applied, because it’s driven by the usage of those things. You’re not building something and then, years on, starting to use it only to figure out, oh, that was a mistake. We find mistakes much earlier, and that in some ways saves money.
Prof Anthony McCosker:
I’m just conscious of time. We do have one very big question right at the end there from Fiona, but I think we don’t have quite enough time to answer that question, which is really about where this can lead us with really big issues around data leaks and data security. And I think that’s partly addressed by your approach to private and secure platforms, as well as open and public platforms for data sets. But I just want to thank you, Amir, for your time today, and thanks everyone for coming, and for the insightful questions. I hope that this has been a fruitful series for everyone. We’ve had some really great seminars, I think, in the social data in action seminar series, and this was a really great way to end it, because it was very practical. And these are projects that Amir and the team, and we all, are implementing, working with non-profits, the health sector and government, the public sector. So, thanks everyone for your involvement. Have a look for the videos via the Social Innovation Research Institute website and/or Swinburne Commons, as well as the ARC Centre of Excellence for Automated Decision-Making and Society’s YouTube channel. Please like and subscribe, and we hope to see you in further webinars. Thanks everyone. Thank you very much.
Assoc Prof Amir Aryani:
Thanks, Anthony. Thanks, Jane, and thank you everyone for joining the presentation.