[0:00:00] ANNOUNCER: Today, it's estimated there are over one billion websites on the internet. Much of this content is optimized to be viewed by human eyes, not consumed by machines. However, creating systems to automatically parse and structure the web greatly extends its utility, and paves the way for innovative solutions and applications. The industry of web scraping has emerged to do just that. However, many websites erect obstacles to hinder web scraping. This has created a new kind of arms race between developers and anti-scraping software. Bright Data has developed some of the most sophisticated consumer tools available to scrape public web data. Erez Naveh is an entrepreneur and former engineer at Meta. He is currently the VP of product at Bright Data. Erez joins us in this episode to talk about Bright Data's mission to structure the open web, and the toolkit they've developed to make this possible. Pawel is a tech lead and a software engineer with a background in launching products in startups and big companies. He's also the founder of flat.social. Check out the show notes to follow Pawel on Twitter or LinkedIn, or visit his personal website, pawel.io. [EPISODE] [0:01:20] PB: Hello, Erez. Welcome to Software Engineering Daily. [0:01:22] EN: Hello, and thank you for inviting me. [0:01:24] PB: It's great to have you here. So, let's start with the first question. What is Bright Data and what do you guys do? [0:01:29] EN: Bright Data is on a mission to structure the open web. Basically, it's a 360 solution for public web data collection, and the "public" part is a really important aspect of it. Everything from proxies and proxy management, web unlocking tools, advanced scraping solutions, and eventually, collected datasets in a marketplace that sells our own pre-collected data and third-party data. [0:01:58] PB: How did you get into building tech products? What were your first steps in the industry that led you to working at Bright Data? [0:02:06] EN: So, I started building websites in the year 2000. Then, I had a studio that formed around building web solutions. There was no mobile web at the time. I did that for eight years, seven years, something like that. One of the clients was MSN, in Israel, and they asked me to build a gaming platform for them, with multiplayer, one-on-one games. So, I did it. Then, it became its own company, and I recruited a team, got money from investors, and built a company around that. I did that for another eight years, and built that company, Come2Play, and we sold it to SciPlay, which is an American games company, social games. Then, I did a couple of things, and I got a call from Meta, from Facebook at the time. I led the research platform team for five years. Basically, it's a team that is in charge of market intelligence for Meta in the app world. So, all the competitors that have any app on your device, on your Apple or Android device. We grew it into three products. One is the market intelligence. The other one is a survey platform. All the surveys that are run by, I think, 900 researchers inside Meta, spread across all the products. Then, I also bootstrapped my own idea, my own product, which is now called Viewpoints by Meta. It's a crowdsourcing platform that allows Meta to run research, paid research. So, they pay you to participate. Anyone can participate in research, complete long surveys, and train ML models on voice, video, text, images, and so on. I did that for five years.
So, I was looking for new opportunities within Meta or outside, and I came across this opportunity at Bright Data. And I found it very interesting because it's a hyper-growth company. We have about 450 employees right now, and I really liked the challenge of building a [inaudible 0:04:31] culture. Like taking all I know from startups, all I know from a huge company like Facebook, and building a culture inside the company as a leader, that will allow it to grow into the size of Meta eventually. That's the challenge that I came here to do. [0:04:52] PB: What are some of the most useful lessons that you have taken both from building the games and then later working at Meta that you're currently applying at Bright Data? [0:05:02] EN: I think that the similarities between the games industry and Bright Data are everywhere, and also at Meta, where I built apps at the end. I built a consumer-facing app, the Viewpoints app, and internal products. It's exactly the same here. We have an external product for developers, and for data people, and also for business people that only want the data. They need to onboard, and they need to have no friction, and we need to keep them retained and happy, and not churned, and all of that. It's exactly like a game, in terms of making sure that you have the right mechanism in place and the right messaging, the right experience, and the right flow. To allow someone to go into that experience and have a good experience by onboarding and choosing the right solution, or having the right tutorial for the game, is exactly like having the right tutorial for how to use the product. So, there are many similarities in that, but also, a lot of similarities internally in the company culture: how do you collect the data and the information you need to understand the opportunities to optimize and to grow? And where are the opportunities for big bets? And then how do you plan them? And how do you execute them with your R&D partners, and with the design, and with the marketing, and everyone? So, that's the playbook that keeps on coming back with a little bit of flavor changes based on the size of the company, and the function that you have. Basically, the more you practice it, the better you become. [0:06:47] PB: Definitely, because I would imagine that gaming, technically, may be different. But still, the funnels of onboarding users, activating users, and the fundamentals remain the same. It doesn't matter which digital product you're building. [0:07:01] EN: Yes. Actually, I will say that games in general are the highest form of tech art that there is, and the most complex one, from a product perspective at least, not from a technical perspective. Because you need not only to work, as I said, on the onboarding, and on making the game fun and making it work. You have a whole other department. You have animation, illustrations, sound design, and you have an economy. I had an economy of one game with two million players a month, where I needed to give them virtual coins, and make sure that we have enough sinks that the money will flow in and out of. So, you're also managing an economy of a small country. You have all the challenges all together at the same time. [0:07:54] PB: Because I also worked as a software engineer in gaming, and the technical problems and the optimization issues that I ran into were nothing like the more kind of simple web projects that one would work on. [0:08:08] EN: Yes.
The optimizations are very, very different. Also, you have to support desktop, and mobile, and maybe even PlayStation or whatnot. So yes, the complexity is different. In Bright Data, it's very different because the complexity is scale. Now, we need to scale to support over 15,000 customers that are accessing the web through us, at enormous scale, and we need to be at the 99.9% uptime threshold. So, that's a huge challenge. [0:08:46] PB: It just comes from a different angle. [0:08:48] EN: Exactly. [0:08:49] PB: Than I would imagine in gaming. So, let's speak maybe a little bit about web scraping and the basics and fundamentals of web scraping. What is web data? Who uses it and why? [0:08:59] EN: So, when we talk about web data, we talk about public web data. And that's very important, because we want everyone to have access to everything that is publicly available. I think the easiest way to understand it is that if you open an incognito browser and go into any website, that's the public version of that website. Whatever they feed you in the HTML that lands, that's the public data that you can scrape. We don't allow, we don't want to support anything that is behind a login, behind registration, that is strictly for registered users. That's not public, accessible data for us. So, anything that you can access through an incognito window without logging in, that's public web data. [0:09:55] PB: How was this data scraped until now? [0:09:59] EN: So basically, you can start off with the basic, most basic scraping, which is opening your web browser, punching in the URL, and that's it. You get the HTML back. [0:10:15] PB: That's the organic scraping. [0:10:15] EN: Yes. Everyone is doing it all the time. You basically access it, and now we are talking about automating it and scaling it. So, you can do it with your own browser, right? You have many macros, plugins that you can automate, going into websites and downloading the HTML, and saving it. That's still a very, very common use case. There are many, many tools, really good tools, that allow you to do it as your own small-scale solution. But then, when you try to do it for a larger use case, even if it's your own, or a commercial one, you start getting into limitations, because that website that you are trying to scrape will start blocking you. The easiest way will be just by identifying your IP address and saying, "Oh, I see that you got here too many times. I will start blocking you in a variety of ways until you're completely blocked." That's the most basic need for a proxy, because then you say, "Okay, so I can't use only my IP address. I need several IP addresses. Maybe I need them to be even faster. So, I'll go and use Bright Data for that. I need a million IP addresses today to go and scrape a million pages at once, and I want that to happen in the next two minutes, and I get all the data at once." Okay, so I can do that. Then we can get into all the monitoring and limitations, making sure that you're not DDoS-ing the website, but we can touch on that later. But basically, that's the most basic use case for scraping at scale using many IP addresses. Then, there is the geolocation one. Geolocation, which means that I want to scrape the data of that website from a specific country.
Let's say I want to scrape a website in Germany or in France, and I want to use local IP addresses from France, because I want to make sure that I'll get the version for someone from France and not for someone in Italy, because websites are now very personal and adaptive to your location and your browser. That's why we also have solutions for mobile IP addresses, to make sure that you come across as a mobile user, which will give you, in many cases, a very different version of that website than if you would come from a desktop IP address. [0:12:52] PB: I think that this will also apply to search results, because the search results will be highly different depending on where the user is actually located. So, this is, I think, another great use case. [0:13:03] EN: Yes. We have a specific product for that. We just call it the SERP API. Search engine scraping, basically, and we have a playground for that, that anyone can access on the website. It's really cool. You go and say, okay, "I want to see search results for pizza, from mobile, from Italy." The geotargeting is also not bound only to a country. It can be also for a ZIP code. It can be for – we're working on geotargeting that is based on latitude and longitude. So, you'll have an IP address from that location, and you want to see the search results, because you'll get very different search results for a pizzeria if you're located in Milano than if you're located in Rome. The ads will be different, ad targeting will be different, search results will be different. The language will be different if you're looking at it as someone using English as the locale in the browser, or someone who's using German. So, it depends on so many parameters, and you can play with all of them in the playground and see the results change. [0:14:16] PB: Would you say that the scale and the geolocation are the biggest challenges for people who are trying to do web scraping today? For the developers who are building web scrapers, or would that be something else or something more? [0:14:28] EN: It all depends on the target domain. So, there are target domains where all you need is the volume of IP addresses from a specific geolocation. There, you will be able to do it and access the public web data just by using data center IP addresses, which are identified as a data center. Those are the cheapest IP addresses that you can buy access to, and that will be enough for many, many websites. But then you go into websites that are trying to be a walled garden, even on the public web data, and they try to keep free market competition from happening. I'll give just one example, which is the prominent one. The prominent use case will be price comparison. So, I have a website that is an e-commerce website, and I'm selling a product, and I want to make sure that I'm staying competitive in the market. I want to make sure that when I'm up against the big players, whether it's Amazon, Walmart, or whatever it is in my local market, and in many local markets, I'll stay competitive. I want to make sure that I'm not pricing it too high or too low. And maybe I want to price it high or low, it depends on my – but compared to what, right? Compared to the market. So, then I'll do some automation to scrape those websites on a daily basis, and maybe an hourly basis, depending on how much this product's price changes.
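To make the geotargeting idea discussed above concrete, here is a minimal Node.js sketch of routing a request through a country-targeted proxy gateway. The gateway host, port, zone name, and the country-suffix username convention are hypothetical placeholders invented for illustration, not a documented API; a real provider's values will differ.

    const { request, ProxyAgent } = require('undici');

    // Route a request through a proxy gateway, asking for an exit IP in a
    // given country. Host, port, and credential format are assumptions.
    async function fetchFromCountry(url, country) {
      const proxy = new ProxyAgent({
        uri: 'http://proxy.example.com:22225',
        // Hypothetical convention: targeting options encoded in the proxy username.
        token: 'Basic ' + Buffer.from(`customer-zone1-country-${country}:PASSWORD`).toString('base64'),
      });
      const { body } = await request(url, { dispatcher: proxy });
      return body.text(); // the HTML as served to a visitor in that country
    }

    fetchFromCountry('https://www.example.com/', 'fr').then((html) => console.log(html.length));

Swapping the country code here is the programmatic equivalent of the France-versus-Italy example above: the target site sees a local IP and returns its localized version of the page.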
I run into very hard blocking mechanisms from those big players that are using technology from the leading anti-bot vendors out there. So, then I will run into blocking issues, and fingerprinting issues, and IPs that will be blocked if they come from data centers, and I need to use ones from residential networks and so on. The solution is all – it depends on the website that you're trying to scrape. That's why we have so many solutions, and it can be overwhelming for someone that is just starting. Then, we can go into the recommendations I have and how to approach it, if you want, at a later stage. [0:16:48] PB: Definitely – when I've been looking at Bright Data, I could definitely see this powerhouse of features. There is not only one product, because you have proxies, or rather, many different types of proxies. You have different types of web scraping as well? [0:17:01] EN: Yes. That's my challenge. How to explain it to someone who's just entering the product, how to choose the right one. It's really challenging, because some developers are well-versed in the world of web scraping, and some are not, and they might choose the wrong product and get disappointed. [0:17:21] PB: So, let's maybe speak a little bit about the proxies. Why do we use those proxies when scraping data? And also, let's speak a little bit about the different types of proxies that you allow developers to use. Because you mentioned residential networks. You mentioned, I think, mobile networks as well. Could you tell me a little bit more about what types of proxies you offer? How do they differ? And why are the different ones useful for web scraping? [0:17:46] EN: Sure. So, everything can be useful for web scraping. But let's dive into each one and see why and how. But also remember that web scraping is only one use case, and price comparison is the main one. There's a long tail of use cases, and we just mentioned one of them. For instance, I'm located in India, and I need to manage an account on Instagram for my US customer. Instagram will block me with my India IP address, so I have to find a way to come out from an IP address that is local in the States, to manage my customer's Instagram account, right? So, I'll buy an IP address that can also be fixed, that will not change, to make sure that it's like my home address, my computer's address, and it's stable. That's another use case. But let's take the main use case of price comparison and look at the different solutions. As I said, the cheapest one, and the easiest one to start with, is just the data center IP address, and you can access it from our data centers around the world. Then, you have an ISP IP address, which means that an internet service provider provided us with a bunch of IP addresses, and we supply them to you. Then, there is the mobile IP address. So, we have mobile IP addresses around the world. Then, there is the residential IP address, which is our network of real users that gave us access to their IP addresses, so you can get access through their devices. Those are the four types of IP addresses. On top of that, we build smart solutions, and I'll explain why we did that. Because each one of the IP address types will help you overcome blocking issues that the anti-bot vendors provide to those websites. As I said, some don't have any.
So, you can access through data center IP addresses, and then you try to do it with a – maybe you need to come and scrape a mobile website, then you use the mobile IP address. Then, you try to overcome the challenges by using an IP address that looks more human-like, because you're trying to be seen as a normal user. You'll use the ISP address or the residential one. But even that can be not enough, because you will not come in with the right fingerprint. You don't look like a user that is using a normal browser, and maybe you need to have CAPTCHA solving, because they show you a CAPTCHA once to make sure that you are a human being, and so on. And then you need either to build those solutions in-house, or leverage our platform to do that. So, we have smart solutions on top of our network, for instance, the web unlocker. The web unlocker will give you a very simple API. You call the API with the URL. You can also add parameters, for instance, the country that you're trying to come from, and that's it. You will get the HTML back, without dealing with any blocking issue – not the fingerprinting, not the CAPTCHA solving. Anything that's related to blocking, we'll do it for you. We have an army of developers that are developing this solution, using, of course, machine learning, to make sure that we have proper pathways to unlock your website at scale, leveraging all IP addresses. So, you don't need to care about that. You don't need to care about managing the right IP address, the right unlocking solution. You just call the API and get the HTML back, so you can parse it and use the data as you wish. I also mentioned that we have the SERP API. The SERP API will do the same for search engines. So, it can be Bing, it can be Google, it can be DuckDuckGo, and so on. You just call the API with the parameters of which search term you want, which locale, is it mobile, whatever it is, and get a structured JSON file. You don't even need to parse it, because it will give you a parsed result. You can, of course, get HTML, but it will also give you a JSON if you prefer, that is already parsed and that you can use immediately, again, at scale. We have customers that have 200,000 keywords that they need to scrape every day, to make sure that they are positioned right in the search engine, because that's their livelihood. Without that, they will lose their position, they will lose millions of dollars, they will lose their jobs, and it's really important for them to stay on top of the search results. [0:22:51] PB: So, what you just mentioned – I would imagine that especially the search APIs would be incredibly useful both for SEO professionals or in-house SEO departments? And possibly, as well, for the companies who provide the tools for SEOs. Let's speak a little bit maybe for a moment about scale, because the amount of calls that you send per day would definitely raise a flag, I think, on Google's side, right? But you are able to circumvent that, and to provide this data. [0:23:23] EN: And the reason that we can do it is because just the residential network alone is built out of millions of IP addresses, and real humans are using those IP addresses.
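As a rough illustration of the two call patterns just described, here is a hedged Node.js sketch: one call sends a target URL to an unlocker-style endpoint and gets raw HTML back; the other sends a search query to a SERP-style endpoint and gets parsed JSON. The endpoint URLs, parameter names, and auth scheme are invented for this example and will differ in any real product.

    const { request } = require('undici');

    // Unlocker-style: hand over a URL (plus, say, a country), receive the HTML.
    // Retries, fingerprinting, and CAPTCHA avoidance happen on the server side.
    async function unlock(targetUrl, country = 'us') {
      const res = await request('https://unlocker.example.com/request', {
        method: 'POST',
        headers: { 'content-type': 'application/json', authorization: 'Bearer API_TOKEN' },
        body: JSON.stringify({ url: targetUrl, country }),
      });
      return res.body.text();
    }

    // SERP-style: hand over a query and locale, receive already-parsed JSON
    // instead of raw HTML, so no parsing step is needed on the client.
    async function serp(query, country = 'it') {
      const res = await request(
        `https://serp.example.com/search?q=${encodeURIComponent(query)}&country=${country}&format=json`,
        { headers: { authorization: 'Bearer API_TOKEN' } },
      );
      return res.body.json();
    }

A keyword-tracking job like the 200,000-keywords example above would simply fan out over serp() calls and store the returned JSON.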
It's you and me, who can install software and get paid for it, and share our IP address and access to the internet through our computer or through other devices. That's how we're able to overcome those limitations, because we have access to Google through those IP addresses. That's, of course, just one of the tools. We also need to have the right fingerprint, and the right – we need to look human when we come into those websites in order not to get blocked. [0:24:07] PB: Because you mentioned CAPTCHA, you mentioned the browser fingerprinting. Obviously, the proxy would also help you to go around any kind of rate limiting as well, because it's obviously a different IP number. But the rate limiting, well, depending on how it is implemented. So, this sounds like a constant race. A constant race for the R&D department that you have, and a constant race between two sides. One doesn't want to give access to the free public data, and the other one tries to get it. [0:24:36] EN: Yes. The cool thing about the web unlocking solution is that this race is run by us for many, many, many customers. While there are customers that are dealing with it alone, and it might work great for them if they're doing it for a specific vertical. Let's say that I'm in travel, and I have three destinations, three domains that I must scrape, and that's my livelihood. That's my business. Okay, so I need to be an expert in scraping those websites no matter what. I can't rely on a different company, another company, that will do it for me. But if it's a website that is very common, and everybody's accessing it, and we have solved it for 99% of the use cases, for those companies that are trying to scrape for e-commerce, for instance, then it will be much more cost-effective to use our solution. Because you won't need the department of researchers and machine learning engineers and data scientists and engineers to develop the unlocking solution that will allow you to do it at scale. That's why the unlocking solution is such a magical product, that allows you to call it with any URL and get back resolved HTML, and then parse it without dealing with anything. [0:26:01] PB: I would imagine. Even I sometimes probably had some issues with CAPTCHA. I have to look very close to the screen to try to figure out which one is the hydrant or which one is the monkey. [0:26:12] EN: But for us, if we got a CAPTCHA, we did something wrong. We need to go and scrape a website at scale and make sure that we don't get a CAPTCHA. CAPTCHA is the last resort. It means that we overused the IP address, and now we're starting to get blocked, in most cases. So, our goal is to not get a CAPTCHA. [0:26:33] PB: Yes, that makes sense. Because occasionally, even just through normal use of, let's say, Google search, I will still get CAPTCHAs. But that will probably be because maybe my IP number changed, or something else that got flagged by the algorithm. Also, for the most popular sources. So, imagine also a company that doesn't have a team of developers that want to do it, and they don't want to write any code at all. You also provide datasets, right? That are already ready for popular things. [0:27:01] EN: So, in the data line of products, we have three products. We have an IDE, a web scraping IDE, which is for developers, which allows you basically to run on our cloud solution 100%, without having anything on your end, any infrastructure, nothing.
You just write the code, leveraging our functions that help scrape the web. We're now introducing AI solutions that will even help you write the code, and we have templates for you to start off with – okay, we already solved that website, so you can have the code for that scraping and parsing, and access it. But that's for developers, and that will give you an unlimited amount of scale. So, you can scrape any website at any scale, without any infrastructure, get it as a dataset, and send it, deliver it, to any storage that you want, on a schedule, or with an API call, and you can also program your own parameters to pass to that API. So, that's a really, really good solution for developers. For data people, or even business people, we have the ready-made managed solution, which is basically – we said, okay, we see the list of websites that many of our customers need. Why would we ask everyone to scrape them all over again? Basically, it's not good for the website. It's not good for us, or the customer, or for the world, or for energy, or whatever. It's not good for anything. [0:28:35] PB: It's not good for electricity. [0:28:37] EN: Yes. Let's make sure that we are doing it wisely, really, really well, in high quality, for everyone. We allow you to access a dataset, filter it, even run SQL on it, to make sure that you can narrow it down to only the records that you need, and purchase only the records that you need. Subscribe to them. So, you can do it every day, every hour of the year, whatever you want, and pay only for what you need. But not only that, pay, I would say, a third of the cost that it would cost us to scrape it, because we are able to share the cost between many customers. So, it's the most affordable solution we can offer, by having a pre-collected dataset, because we are able to share that cost between many customers. So, it will be the most effective, cost-effective solution that you can ever find, instead of paying for the developer, and for overcoming the CAPTCHA and the blockage, and then dealing with the quality issues and verification, validation. You don't need to do anything. [0:29:45] PB: What are the most popular datasets that you currently offer? [0:29:48] EN: We offer a variety of datasets. I would say that it ranges from the popular e-commerce websites like Amazon, where we have 300 million products that are scraped. So, that's also an enormous scale. Even just the discovery process, to find 300 million products in order to filter them and just find what you need, is a huge process. Coming into a dataset like that, and being able to write a filter in a minute, and say, "I want only Panasonic TVs over 40-inch", and get those products, is an amazing achievement on its own. We also have the popular LinkedIn profiles and LinkedIn companies and Crunchbase companies. So, we have all the companies' data. We have jobs and hiring data. We have e-commerce product data. Those are the big bulk of categories. [0:30:47] PB: We spoke earlier about the big use case that you have, that is e-commerce. We also mentioned travel. I was wondering, what are the most common use cases? If you think from the point of view of scale, which one is the largest, and most challenging to accommodate, I would say? [0:31:02] EN: I think that, as you said, price comparison and e-commerce is the largest one. For a second, I would like to take us back to reality, and not the digital world, but the physical world.
In Israel now, we have the holiday – the New Year's, Rosh Hashanah, it's called. On the billboards, we have huge ads for the cheapest basket of supermarket groceries that you can find in the country. The top one and the second one are advertising all over the country that they have the cheapest basket of groceries that you can buy for the holiday. How do they know that they have the cheapest one? There's a common way to do it. The common way is a mystery shopper will go into the supermarket, buy 100 items, compare them to other supermarkets, and declare the winner. That mystery shopper is exactly what we do online. Instead of sending people and marking down those prices, we do it with our technical solution, and we go and scrape it, and parse it, and have the data, and compare one to another. Now, we have a price comparison. So, I'm touching a bit on ethics here. We're doing exactly what you can do in every supermarket, in every store, where you can go in and watch the prices, and watch how they have it on their shelf and everything. That's public, common knowledge, because it's out in the open and it's not locked away. That's what we do online as well. As you mentioned, price comparison is the leading one. But there's also the ad verification use case, which is a big one. SEO, which is a huge one. We have financial due diligence. So, reviewing a company's online assets – the hiring, product reviews, investors, whatnot – to make sure that this company is healthy. For the financial sector, for investing, it's another huge use case for us. [0:33:12] PB: Because you also mentioned the ethics, and the mystery shopper going into the shop and, with his own organic human eyes, scanning through all the prices, and then you just do it automatically. So, I was thinking that Bright Data, what you're describing, is an incredibly powerful tool. Because you have both the proxies, you have the web unlocker, you have the SERP API, and so on. So, it's an entire suite of very powerful tools to collect web data in a very efficient way. I was wondering, how do you ensure that it's used ethically? And how do you deal with developers who try to use Bright Data, for example, in non-ethical or harmful ways? [0:33:47] EN: We are fully aware of the damage that can be made by leveraging millions of IP addresses for harmful attacks, which is why we have a compliance department, a big one, and a KYC process, a know-your-customer process. A lot of technical features that allow us to prevent those harmful issues from ever happening. I'll give you an example. We only want you to access public web data, not behind a login. And we do that by limiting the URLs, and limiting the data points that you can scrape, and limiting what you can write on the web. Because we don't want you to write any bad reviews, or any reviews at all, using our IP addresses. [0:34:40] PB: Correct, because you can not only get the data, you can also create, for example, POST requests, or start adding comments and so on. [0:34:48] EN: Exactly. So, we need to make sure that you're not using our IP addresses to write any reviews on any website and leveraging it to do harm. So, when you ask us to gain access to the residential network, for instance, which is basically using a real human's IP address, his computer – if you somehow do something harmful, the FBI will come to his house, not to your house. That's his IP address. We must protect our end user. We must protect this IP address.
The first thing we do when you want access to this residential network, for instance, is meet us. Come to a meeting, a video meeting, provide the necessary documentation that will show which company you are – the company's paperwork – but also, let's discuss your use case. Let's make sure that we are aligned, in sync, about what can be done and what shouldn't be done. And the moment you do something that is harmful, you're off the network forever. That's really important for us. By the way, I'm sorry to say that it's not so important for our competitors. We have many competitors that don't have KYC, that don't have a compliance department, even when they say they do, and they don't have the technical features to prevent you from doing that. We want to be leading the industry in general, but also in the ethics department. And we went so far as to have a department, a lead, for data for good. With The Bright Initiative, it's basically an initiative that uses our superpowers for good, and we collaborate with the top universities in the world. All the top universities are working with us, with governments, with organizations, and we want to provide them with access to data that will support fighting human trafficking, helping education, fighting anti-Semitism. We have so many examples on our website under The Bright Initiative. And if you're listening, and you're one of those organizations, and you want access to information in order to do good with data – because data is the new gold, it's so powerful today – please reach out to us, because we are happy to collaborate and offer those services for free. [0:37:19] PB: Perfect. Yes, I think that's good. And also, I like the fact that you protect that one small residential IP address, even though you have such a number of them. And those people, by allowing other people that they don't know – the companies – into their router to make requests, also take a risk. [0:37:37] EN: Yes. They take a risk, and we have to protect them. We have to protect them, because that's our livelihood. We care about them more than they know, because we want to make sure that nobody will knock at their door, and then it's on us. [0:37:50] PB: Also, as you mentioned, the anti-Semitism on the internet. The amount of, I think, Facebook and Twitter comments, if you just go and look – there's a lot of things that literally just deserve to be reported to the police and are not. And the data just hangs over there, right? [0:38:03] EN: Exactly. You need to remember that most of those social networks don't have an API that will allow you to access the data, and to report issues at scale. It's up to the individuals to go on and report it. We allow them to scrape at scale and search for keywords, and make sure that they are alerted on time when such hate speech is being made online, and to report it, and deal with it. We have a few organizations here in Israel and others in the US and other places that we work with, and we provide free access to those resources. [0:38:42] PB: I agree, because I personally have reported a lot of stuff. And I was just wondering how it is even possible that something like this is still hanging around on the internet and nobody does anything. It's a reputable, big social network. So, let's maybe get into starting with Bright Data as a developer. So, imagine I'm a company. I have a developer at my company who would like to scrape some data. What is the developer experience like?
How do you start as a software developer with your product? Because we have already mentioned the proxies. We have mentioned the SERPs, we have mentioned the IDE for scraping, the web unlocking. It's a mouthful of a word. And so – [0:39:17] EN: Yes. A mouthful of a word that we invented. Unlocking is not an English word. [0:39:21] PB: It's not? [0:39:22] EN: No. [0:39:23] PB: Okay. So, as a developer, how do I get started? And, as you said, sometimes developers have problems in choosing the right tools for the job that they're given. So, how do I get started? How do I figure out which part of your solution, and which elements of your toolkit, do I need? [0:39:39] EN: That's a great question. Let's start from the basics. So, onboarding. While you onboard, we offer free credits for you. If you are an individual, you will get, I think, $5 or $10 free to use on the platform, so you can test anything that you want. And for companies, we give much more credit and time to test it and use it freely, and we encourage that, and we provide the best support in the industry, and maybe in other industries. We have an amazing support team. So, I highly recommend using that. Also, now, my department is working on improving the documentation to make sure that developers have amazing documentation to start with. But now, the big question is, which product should I start with? If you're a professional, that's less of an issue, but let's talk about you as – if you're working for a company, and you need now to scrape at scale, what should you do? Let's start from the most efficient one. It is hard for a developer to think like that, because it's not a developer tool. It's a pre-made dataset. As I mentioned, it's already there. You can filter it with a GUI. You go and just filter it by the product name, by hashtag, whatever you want, and get the data you need immediately, or subscribe to it, or ask us to get fresh data on the spot, which is priced a little bit differently. And that will be the most cost-effective, cost-efficient option. Everything is there, you don't need to develop anything, and that's amazing for the business, and you're done. You are magically done. But that's limited to a set list of, let's say, up to 100 top datasets, and it's going to be 300 by the end of the year, or maybe a little bit later. Then, if you want us to do it for you, we can even do it for a custom dataset. Let's say we don't have a pre-made, pre-collected dataset. You can request a custom one, and we'll manage it for you. We'll develop the scraping code, we'll manage it, we'll maintain the code if something breaks – because websites keep on changing – and that's a managed solution, and it's also very cost-effective. You don't need a developer on it. You just need your own time to tell us what you want to scrape, and what is the schema of the data that you want, and that's it, basically. But that's great when you have a need for data that is not core-crucial to your business. Because when it's core-crucial to your business, you want to make sure that you have more control. So, I will say that we have two parameters to look at when choosing a product. One is the level of control needed, and how crucial it is to your business, to make sure that you have full control of every little aspect. Of course, there is price as a parameter, and what type of team do you have. So, do you have engineering power? Do you have data science? What does your team look like?
So, if your team is just business – you're just a business person – then pre-made or managed custom datasets are perfect for you. If you have developer power, then you can deal with any of the products. If you have just data engineering or data science, you can also deal with the pre-collected custom dataset. But you might not be able to develop using the IDE, because you need to write in JavaScript, or with the proxies, where you need to do it manually. We haven't touched on a very interesting product that we released this year, which is the scraping browser, where you write scraping interactions using, let's say, Puppeteer. And then we scale it for you using a scraping browser API. You just call the API with your ready-made code on your end, in Puppeteer, let's say, or Selenium. Then, we scale it to whatever scale you want, using real browsers, on machines that are connected to our proxies and networks, and you don't need to maintain any of that, and we keep it up to date. [0:43:56] PB: Does it also include all the benefits that come from the web unlocking and the process – [0:44:02] EN: Yes. It's connected to the unlocking. And then you don't need to deal with the scaling of interactions, which is another huge use case in scraping. [0:44:12] PB: So, I was wondering, as a developer, right? So, I come in and I want to use your product. Because obviously, there are so many different aspects of the product. So, you have the ready datasets, which are kind of like a cache – you save electricity, you save money, share the cost between different people, and boom, you get it. Then, however, occasionally, you still need to write something. So, I think you were speaking about this scraping browser and how the scraping browser helps you to work with that. [0:44:35] EN: Yes. Let's stay with the control that you need and your capability. If you need a lot of control, then you need to go low-level, as low-level as you can. So, you will only use the proxy networks. If you have the capability of doing the unlocking and fingerprinting and CAPTCHA solving – everything you can do yourself, and you need to manage it yourself – then all you need is the IP addresses to make it scale, and you can deal with the data center, ISP, mobile, or residential ones. If you need us to help you with unlocking – because we have a huge department that is dealing with ML and research and everything, that is doing that – then you can use the web unlocking tool, and that's an API that you call and get the HTML back. Then, all you need to do is basically just run the API and do the parsing yourself. If you're targeting a search engine, then you don't even need to do that. You don't need to do the parsing, because we did the parsing for you, and you get the JSON file. That's in the proxy world. Then you can go to the IDE, because you say, I don't even want to manage my own infrastructure. I don't want to call your API. I want to write everything on the cloud, in the Bright Data IDE platform. Write it in JavaScript, leverage all the pre-made interactions – or not interactions, functions – that we have, run it at any scale, on any schedule, or with an API call that activates and initiates the data collection. Send it automatically to – deliver to any storage that I want, as a zip file to S3, with the naming conventions that I want, and that's it. So, I don't need to manage the infrastructure. I can focus only on the JavaScript code that will do the interactions or will do the parsing.
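The scraping-browser pattern described above boils down to connecting an ordinary Puppeteer script to a remote browser over the Chrome DevTools Protocol instead of launching one locally. A minimal sketch, with a placeholder WebSocket endpoint and credentials standing in for whatever a real provider issues:

    const puppeteer = require('puppeteer-core');

    async function scrapeTitle(url) {
      // Connect to a remote, provider-managed browser instead of launching one.
      // The endpoint and USER:PASSWORD credentials here are hypothetical.
      const browser = await puppeteer.connect({
        browserWSEndpoint: 'wss://USER:PASSWORD@browser.example.com:9222',
      });
      try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        return await page.title(); // any Puppeteer interaction works from here
      } finally {
        await browser.close(); // ends the remote session
      }
    }

The script itself is unchanged Puppeteer; only the connect step differs, which is what lets the same code run across many remote browsers without local infrastructure.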
So, the range is really wide, but it depends on the level of control that you need and your capabilities. [0:46:38] PB: I think it's very dependent, because as you said, your clients, or the people who use your product, can range from just a business person that wants to query the data, all the way down to a developer who's advanced, a web scraping expert, and just wants to use your scale. [0:46:53] EN: Yes. We have customers that range from a team of 70 people that are doing scraping for a huge business that is an e-commerce mogul, that needs to keep track of all the prices in the market, and has a huge scraping department with machine learning to also deal with the pricing issues and all of that, all the pricing logic. All the way to a single developer that doesn't have the capability to deal with all that unlocking himself, and do the research, and do the scaling, and have the infrastructure. And we have all those solutions for them. [0:47:26] PB: This 360 really makes sense, if you think about it as a 360 for all the types of customers, as well as for all the types of products. I was wondering, let's speak a little bit about your team and how you work together. How large is your team, and do you work within the office? Do you work remote? Are you just in Israel, or are you spread around a couple of countries? [0:47:46] EN: Great question. So, my team specifically is the product managers' team. We are eight PMs. We have two product designers. We have content strategists. But we work with engineering. So, we have about 100 engineers in the company that are writing all this magic behind the scenes. And they are spread around the world. We have a big team in Israel, but we also have a team in Europe, in Canada, in the US, and in many, many places, and they work remotely. And our company DNA is very specific, and it's optimized to be able to work remotely entirely. But I'm happy to say that my team is working hybrid. So, we're working like two, three days at home or in the office. I prefer it, actually. I prefer the hybrid approach. You can have both. And yes, we are constantly collaborating with all the teams. My team, especially, is in the center, working with marketing and sales, and account managers and support. R&D are our closest allies and partners, and we need to work with them closely on the roadmap and execution, make sure that everything goes according to plan, and if not, change it as fast as possible. [0:49:04] PB: Because you mentioned that you guys, within the product area, are kind of in the middle, and you're bridging and connecting the engineering with the actual user requirements, possibly with the requirements that come from sales and so on. [0:49:16] EN: Yes. So, we have a lot of requirements coming from everywhere. Basically, we have them coming from sales and account managers. From the data itself, we have all the analytics to make sure that we have that covered. But that will account for only 80% of what we do. The 20% is innovation, and innovation comes from what the clients, the customers, don't even know how to say that they want. You need to find it between the lines, in the stuff that they're doing, and know that if you can build it, it will solve a lot of issues for them. So, we need to keep our ears open for those 20% that will allow us to keep innovating and building the next product.
And I'm happy to say that we are leading the industry, and we have products that our competitors take a few years to build and come up with after we launch them. [0:50:10] PB: Because you said that you have around 100 people globally within the engineering team, I'm wondering, are you currently hiring? And if you are, what does the recruitment process look like? [0:50:21] EN: So yes, we are always hiring, especially good engineers. We have a very straightforward hiring process, which includes doing an exercise. What's not common about it is that you need to write in JavaScript, which, I think, is not common in the industry. So, that's a bit of a challenge, because you need to be well-versed in the language. But other than that, we look for amazing engineers that can fit the DNA. So, I would recommend everyone to read our publicly accessed – [0:51:00] PB: To scrape your public data. [0:51:02] EN: Exactly. You can scrape our DNA and read it before you apply, because it's very specific. We like to work very independently, and everyone is an owner of their own domain, and they need to work and lead that and come up with the solutions, and not only rely on discussions. We are a no-meeting company. There are meetings – three or four people are not allowed to do it. I'm exaggerating. You're allowed to meet. But we don't have recurring meetings. We don't have large meetings. We have one-on-ones once a week, and if I need you to answer a question, I ask it and we move forward. Because everyone is the owner of their task, of their project, of their future, and that's how we operate. So, we try not to lose time and speed over large meetings. [0:51:53] PB: So the process is very pragmatic and optimized to be efficient with people's time as well? [0:51:58] EN: Yes. I think that developers love not wasting time in large meetings. Now, I come from Facebook, where my time as a product manager was 90% of the time in meetings of seven to nine people, discussing things with everyone. I think that we wasted so much time for developers and designers that didn't have to be in the meeting. [0:52:21] PB: I think this touches on the concept of – Paul Graham wrote this famous essay about the maker's schedule and the manager's schedule, and how the makers, the people who actually create things and work – so, let's say, software engineers, artists, and so on – work on a completely different schedule, and how hard it is to mix those two worlds together occasionally, which you have to do to build a product. [0:52:40] EN: Yes. So, the developer's time is sacred to us. I truly believe that, and I want to make sure that we're not abusing their time in any way, on anything that is not useful. Because at the end of the day, to execute the roadmap, I can't do it alone. I can't do it at all. I can talk about it, but I can't do it. So, I need the help of the engineers. If I take their time on stuff that isn't needed, then I'm basically hurting myself. [0:53:09] PB: And I was wondering as well, what is next on your plate for Bright Data, in terms of what you're building and where you want to take the company? And how do you see Bright Data in the future? [0:53:18] EN: I think that the core challenge, and the thing that I'm most excited about, is integrating the AI elements into our product. And we found so many useful use cases that I believe that the next year, or the next couple of years, will be just figuring out how to leverage it and optimize it to the benefit and value of our customers.
For instance, I'll give you a really fun example. In the pre-collected datasets, you can press a button and add a new column. You can call it, let's say, material. You didn't have this column before. Now, you're looking at the e-commerce dataset that you just collected, or that was pre-collected, and you have a description of the product. And you want to extract only the materials, and another column for [inaudible 0:54:12]. The website didn't give you a material and not [inaudible 0:54:14]. Now, you can do it. You can press a button – it's like an AI-enhanced column – and you write exactly the prompt that will be run on the original description, for example, and say, "Extract only the color of the product", with the prompt and the rules that you want to have. Or only the materials, limited to those five materials, and if not, put "other". On the spot, in front of your eyes, it will happen. It will write line by line very, very fast, and it will give you a way to experiment and to do AI at scale, without training any model, instantly. It's really amazing how we moved from ML-trained models that took forever to make, to prompting and generating a new column instantly on any dataset that you just collected. [0:55:13] PB: It's also going to, I think, work much better with any kind of fuzzy data? So, if some term is a bit more potentially ambiguous, if there are any typos, or some other word is used within, let's say, for example, the description column? [0:55:25] EN: Yes. So, this enhanced column is something that I'm excited about, and we are now launching. And just imagine a code helper that will allow you to write better code for interactions, or for parsing, within our IDE, or – there are so many examples along the way. The last one will be insights. At the end of the day, it's not the data that you're looking for, it's the insights. And AI can help you extract the insights from that data, and we have a line of products for that. We have Bright Insights, which is now tailored just for e-commerce, but it will expand over time. Yes, I think it's the next frontier. At the end of the day, you're looking for insights, and we want to provide you as close as we can to those insights. [0:56:13] PB: It does make sense. Because you have the data, and the data has the information, right? Also, you mentioned in the beginning that you are planning a conference. Could you speak more about it? [0:56:24] EN: So, this ScrapeCon event is our first big online event, on November 7th. You are all invited to take part in it. With industry leaders – it's not only us talking. We invited our customers, we invited industry leaders, we invited influencers from that community to participate in the discussions, and keynote speakers. I think it will be extremely valuable to anyone in that domain who is looking to understand how others are looking at it, how they're solving big issues, and scaling the solutions. [0:57:00] PB: Amazing. How can I sign up for that event? Is it on your website? [0:57:04] EN: Yes. Just go to our website, brightdata.com, and look for the ScrapeCon event and register. [0:57:10] PB: Perfect. Thank you so much for your time, Erez. This was very interesting. [0:57:14] EN: Thank you very much for inviting me. [END]